Chapter 10. BlueStore


BlueStore is the back-end object store for the OSD daemons and puts objects directly on the block device.

Important

BlueStore provides a high-performance backend for OSD daemons in a production environment. By default, BlueStore is configured to be self-tuning. If you determine that your environment performs better with BlueStore tuned manually, please contact Red Hat support and share the details of your configuration to help us improve the auto-tuning capability. Red Hat looks forward to your feedback and appreciates your recommendations.

10.1. Ceph BlueStore

The following are some of the main features of using BlueStore:

Direct management of storage devices
BlueStore consumes raw block devices or partitions. This avoids any intervening layers of abstraction, such as local file systems like XFS, that might limit performance or add complexity.
Metadata management with RocksDB
BlueStore uses the RocksDB key-value database to manage internal metadata, such as the mapping from object names to block locations on a disk.
Full data and metadata checksumming
By default all data and metadata written to BlueStore is protected by one or more checksums. No data or metadata are read from disk or returned to the user without verification.
Efficient copy-on-write
The Ceph Block Device and Ceph File System snapshots rely on a copy-on-write clone mechanism that is implemented efficiently in BlueStore. This results in efficient I/O both for regular snapshots and for erasure coded pools which rely on cloning to implement efficient two-phase commits.
No large double-writes
BlueStore first writes any new data to unallocated space on a block device, and then commits a RocksDB transaction that updates the object metadata to reference the new region of the disk. It falls back to a write-ahead journaling scheme only when the write operation is below a configurable size threshold.
Multi-device support
BlueStore can use multiple block devices for storing different data. For example: Hard Disk Drive (HDD) for the data, Solid-state Drive (SSD) for metadata, Non-volatile Memory (NVM) or Non-volatile random-access memory (NVRAM) or persistent memory for the RocksDB write-ahead log (WAL). See Ceph BlueStore devices for details.
Efficient block device usage
Because BlueStore does not use any file system, it minimizes the need to clear the storage device cache.

10.2. Ceph BlueStore devices

This section explains what block devices the BlueStore back end uses.

BlueStore manages one, two, or three storage devices:

  • Primary
  • WAL
  • DB

In the simplest case, BlueStore consumes a single (primary) storage device. The storage device is partitioned into two parts that contain:

  • OSD metadata: A small partition formatted with XFS that contains basic metadata for the OSD. This data directory includes information about the OSD, such as its identifier, which cluster it belongs to, and its private keyring.
  • Data: A large partition occupying the rest of the device that is managed directly by BlueStore and that contains all of the OSD data. This primary device is identified by a block symbolic link in the data directory.

You can also use two additional devices:

  • A WAL (write-ahead-log) device: A device that stores the BlueStore internal journal or write-ahead log. It is identified by the block.wal symbolic link in the data directory. Consider using a WAL device only if the device is faster than the primary device, for example, when the WAL device is an SSD and the primary device is an HDD.
  • A DB device: A device that stores BlueStore internal metadata. The embedded RocksDB database puts as much metadata as it can on the DB device instead of on the primary device to improve performance. If the DB device is full, it starts adding metadata to the primary device. Consider using a DB device only if the device is faster than the primary device.

Warning

If you have less than a gigabyte of storage available on fast devices, Red Hat recommends using it as a WAL device. If you have more fast storage available, consider using it as a DB device. The BlueStore journal is always placed on the fastest device, so using a DB device provides the same benefit as the WAL device while also allowing additional metadata to be stored.

10.3. Ceph BlueStore caching

The BlueStore cache is a collection of buffers that, depending on configuration, can be populated with data as the OSD daemon reads from or writes to the disk. By default in Red Hat Ceph Storage, BlueStore caches on reads, but not on writes. This is because the bluestore_default_buffered_write option is set to false to avoid potential overhead associated with cache eviction.

If the bluestore_default_buffered_write option is set to true, data is written to the buffer first, and then committed to disk. Afterwards, a write acknowledgement is sent to the client, allowing subsequent reads faster access to the data already in cache, until that data is evicted.

Read-heavy workloads will not see an immediate benefit from BlueStore caching. As more reading is done, the cache will grow over time and subsequent reads will see an improvement in performance. How fast the cache populates depends on the BlueStore block and database disk type, and the client’s workload requirements.

Important

Please contact Red Hat support before enabling the bluestore_default_buffered_write option.

10.4. Sizing considerations for Ceph BlueStore

When mixing traditional and solid state drives using BlueStore OSDs, it is important to size the RocksDB logical volume (block.db) appropriately. Red Hat recommends that the RocksDB logical volume be no less than 4% of the block size with object, file and mixed workloads. Red Hat supports 1% of the BlueStore block size with RocksDB and OpenStack block workloads. For example, if the block size is 1 TB for an object workload, then at a minimum, create a 40 GB RocksDB logical volume.
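
The sizing rule above is simple arithmetic and can be sketched as follows. This is a minimal illustration, not a Red Hat tool: the function name and the workload labels ("object", "file", "mixed", "block") are hypothetical names chosen for this sketch, and decimal units (1 TB = 10^12 bytes) are assumed so that the result matches the 40 GB example in the text.

```python
# Percentages come from the guidance above; the dictionary keys are
# hypothetical labels invented for this sketch.
DB_PERCENT = {"object": 4, "file": 4, "mixed": 4, "block": 1}

TB = 10**12  # decimal units assumed, to match the 1 TB -> 40 GB example
GB = 10**9


def min_block_db_bytes(block_bytes: int, workload: str = "object") -> int:
    """Return the minimum block.db size in bytes for a BlueStore block size."""
    return block_bytes * DB_PERCENT[workload] // 100


if __name__ == "__main__":
    for workload in ("object", "block"):
        size_gb = min_block_db_bytes(1 * TB, workload) // GB
        print(f"{workload}: at least {size_gb} GB for a 1 TB block device")
```

For a 1 TB block device, the sketch reports 40 GB for an object workload and 10 GB for a block workload.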

When not mixing drive types, there is no requirement to have a separate RocksDB logical volume. BlueStore will automatically manage the sizing of RocksDB.

BlueStore’s cache memory is used for the key-value pair metadata for RocksDB, BlueStore metadata and object data.

Note

The BlueStore cache memory values are in addition to the memory footprint already being consumed by the OSD.

10.5. Tuning Ceph BlueStore using bluestore_min_alloc_size parameter

This procedure is for new or freshly deployed OSDs.

In BlueStore, the raw partition is allocated and managed in chunks of bluestore_min_alloc_size. By default, bluestore_min_alloc_size is 4096, equivalent to 4 KiB for HDDs and SSDs. The unwritten area in each chunk is filled with zeroes when it is written to the raw partition. This can lead to wasted unused space when not properly sized for your workload, for example when writing small objects.

It is best practice to set bluestore_min_alloc_size to match the smallest write so that this write amplification penalty can be avoided.
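
To make the space overhead concrete, the following sketch estimates how much zero padding a single write incurs for a given allocation size. It is a simplified model that assumes each write occupies whole chunks of bluestore_min_alloc_size; the function names are hypothetical and are not part of Ceph.

```python
def allocated_bytes(write_bytes: int, min_alloc_size: int = 4096) -> int:
    """Round a write up to a whole number of allocation chunks."""
    # The allocation size is expected to be a power of 2.
    if min_alloc_size <= 0 or min_alloc_size & (min_alloc_size - 1):
        raise ValueError("min_alloc_size must be a power of 2")
    chunks = -(-write_bytes // min_alloc_size)  # ceiling division
    return chunks * min_alloc_size


def padding_bytes(write_bytes: int, min_alloc_size: int = 4096) -> int:
    """Bytes of zero padding added to fill the last chunk."""
    return allocated_bytes(write_bytes, min_alloc_size) - write_bytes


if __name__ == "__main__":
    for alloc in (4096, 8192):
        pad = padding_bytes(1000, alloc)
        print(f"1000-byte write, alloc {alloc}: {pad} bytes of padding")
```

Under this model, a 1000-byte write wastes 3096 bytes with a 4 KiB allocation size and 7192 bytes with an 8 KiB allocation size, which is why small-object workloads are sensitive to this setting.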

Important

Changing the value of bluestore_min_alloc_size is not recommended. For any assistance, contact Red Hat support.

Note

The settings bluestore_min_alloc_size_ssd and bluestore_min_alloc_size_hdd are specific to SSDs and HDDs, respectively, but setting them is not necessary because setting bluestore_min_alloc_size overrides them.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Ceph monitors and managers are deployed in the cluster.
  • Servers or nodes that can be freshly provisioned as OSD nodes.
  • The admin keyring for the Ceph Monitor node, if you are redeploying an existing Ceph OSD node.

Procedure

  1. On the bootstrapped node, change the value of the bluestore_min_alloc_size parameter:

    Syntax

    ceph config set osd.OSD_ID bluestore_min_alloc_size_DEVICE_NAME VALUE

    Example

    [ceph: root@host01 /]# ceph config set osd.4 bluestore_min_alloc_size_hdd 8192

    This example sets bluestore_min_alloc_size_hdd for osd.4 to 8192 bytes, which is equivalent to 8 KiB.

    Note

    The selected value should be a power of 2.

  2. Restart the OSD’s service.

    Syntax

    systemctl restart SERVICE_ID

    Example

    [ceph: root@host01 /]# systemctl restart ceph-499829b4-832f-11eb-8d6d-001a4a000635@osd.4.service

Verification

  • Verify the setting using the ceph daemon command:

    Syntax

    ceph daemon osd.OSD_ID config get bluestore_min_alloc_size_DEVICE_NAME

    Example

    [ceph: root@host01 /]# ceph daemon osd.4 config get bluestore_min_alloc_size_hdd
    
    ceph daemon osd.4 config get bluestore_min_alloc_size
    {
        "bluestore_min_alloc_size": "8192"
    }

Additional Resources

  • For OSD removal and addition, see the Management of OSDs using the Ceph Orchestrator chapter in the Red Hat Ceph Storage Operations Guide and follow the links. For already deployed OSDs, you cannot modify the bluestore_min_alloc_size parameter so you have to remove the OSDs and freshly deploy them again.

10.6. Resharding the RocksDB database using the BlueStore admin tool

You can reshard the database with the BlueStore admin tool. It transforms BlueStore’s RocksDB database from one shape into another, split into several column families, without redeploying the OSDs. Column families have the same features as the whole database, but allow users to operate on smaller data sets and apply different options. Resharding leverages the different expected lifetimes of the stored keys. The keys are moved during the transformation without creating new keys or deleting existing keys.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • The object store configured as BlueStore.
  • OSD nodes deployed on the hosts.
  • Root-level access to all the hosts.
  • The ceph-common and cephadm packages installed on all the hosts.

Procedure

  1. Log into the cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Fetch the OSD_ID and the host details from the administration node:

    Example

    [ceph: root@host01 /]# ceph orch ps

  3. Log into the respective host as a root user and stop the OSD:

    Syntax

    cephadm unit --name OSD_ID stop

    Example

    [root@host02 ~]# cephadm unit --name osd.0 stop

  4. Enter the stopped OSD daemon container:

    Syntax

    cephadm shell --name OSD_ID

    Example

    [root@host02 ~]# cephadm shell --name osd.0

  5. Log into the cephadm shell and check the file system consistency:

    Syntax

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-OSD_ID/ fsck

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ fsck
    
    fsck success

  6. Check the sharding status of the OSD node:

    Syntax

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-OSD_ID/ show-sharding

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6/ show-sharding
    
    m(3) p(3,0-12) O(3,0-13) L P

  7. Run the ceph-bluestore-tool command to reshard. Red Hat recommends using the parameters as given in the command:

    Syntax

    ceph-bluestore-tool --log-level 10 -l log.txt --path /var/lib/ceph/osd/ceph-OSD_ID/ --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6/ --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
    
    reshard success

  8. To check the sharding status of the OSD node, run the show-sharding command:

    Syntax

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-OSD_ID/ show-sharding

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6/ show-sharding
    
    m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P

  9. Exit from the cephadm shell:

    Example

    [ceph: root@host02 /]# exit

  10. Log into the respective host as a root user and start the OSD:

    Syntax

    cephadm unit --name OSD_ID start

    Example

    [root@host02 ~]# cephadm unit --name osd.0 start
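
When comparing the show-sharding output before and after a reshard, it can help to split the sharding string into its entries. The sketch below is an illustration only: it parses just the syntax visible in the examples above (a name, an optional parenthesized shard count with an optional numeric range, and an optional =OPTIONS suffix), assumes entries are separated by whitespace, and does not interpret what the fields mean. The function and field names are hypothetical.

```python
import re

# One entry: NAME, optional "(COUNT[,LOW-HIGH])", optional "=OPTIONS".
ENTRY = re.compile(
    r"^(?P<name>[^\s(=]+)"
    r"(?:\((?P<count>\d+)(?:,(?P<low>\d+)-(?P<high>\d*))?\))?"
    r"(?:=(?P<options>.*))?$"
)


def parse_sharding(spec: str) -> list:
    """Split a sharding string into a list of dictionaries, one per entry."""
    entries = []
    for token in spec.split():
        match = ENTRY.match(token)
        if match is None:
            raise ValueError(f"unrecognized sharding entry: {token!r}")
        entries.append(match.groupdict())
    return entries


if __name__ == "__main__":
    before = "m(3) p(3,0-12) O(3,0-13) L P"
    after = "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"
    for old, new in zip(parse_sharding(before), parse_sharding(after)):
        if old != new:
            print(f"{old['name']}: options {old['options']} -> {new['options']}")
```

Run against the two example strings from the procedure, the only difference reported is the options suffix added to the O entry.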

10.7. The BlueStore fragmentation tool

As a storage administrator, you will want to periodically check the fragmentation level of your BlueStore OSDs. You can check fragmentation levels with one simple command for offline or online OSDs.

10.7.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • BlueStore OSDs.

10.7.2. What is the BlueStore fragmentation tool?

For BlueStore OSDs, the free space on the underlying storage device gets fragmented over time. Some fragmentation is normal, but excessive fragmentation causes poor performance.

The BlueStore fragmentation tool generates a score on the fragmentation level of the BlueStore OSD. This fragmentation score is given as a range, 0 through 1. A score of 0 means no fragmentation, and a score of 1 means severe fragmentation.

Table 10.1. Fragmentation scores' meaning

Score        Fragmentation Amount
0.0 - 0.4    None to tiny fragmentation.
0.4 - 0.7    Small and acceptable fragmentation.
0.7 - 0.9    Considerable, but safe fragmentation.
0.9 - 1.0    Severe fragmentation that causes performance issues.

Important

If you have severe fragmentation, and need some help in resolving the issue, contact Red Hat Support.
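
The score bands in Table 10.1 can be applied programmatically, for example when checking many OSDs. The following is a minimal sketch with a hypothetical function name. Because the table's ranges share their boundary values (0.4, 0.7, 0.9), the sketch assumes that a boundary value belongs to the higher band.

```python
def classify_fragmentation(score: float) -> str:
    """Map a BlueStore fragmentation score (0 through 1) to its band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("fragmentation score must be between 0 and 1")
    # Assumption: boundary values fall into the higher band.
    if score < 0.4:
        return "None to tiny fragmentation"
    if score < 0.7:
        return "Small and acceptable fragmentation"
    if score < 0.9:
        return "Considerable, but safe fragmentation"
    return "Severe fragmentation"


if __name__ == "__main__":
    for score in (0.018, 0.55, 0.8, 0.95):
        print(f"{score}: {classify_fragmentation(score)}")
```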

10.7.3. Checking for fragmentation

Checking the fragmentation level of BlueStore OSDs can be done either online or offline.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • BlueStore OSDs.

Online BlueStore fragmentation score

  1. Inspect a running BlueStore OSD process:

    1. Simple report:

      Syntax

      ceph daemon OSD_ID bluestore allocator score block

      Example

      [ceph: root@host01 /]# ceph daemon osd.123 bluestore allocator score block

    2. A more detailed report:

      Syntax

      ceph daemon OSD_ID bluestore allocator dump block

      Example

      [ceph: root@host01 /]# ceph daemon osd.123 bluestore allocator dump block

Offline BlueStore fragmentation score

  1. Stop the OSD service.

    Syntax

    systemctl stop SERVICE_ID

    Example

    [root@host01 ~]# systemctl stop ceph-110bad0a-bc57-11ee-8138-fa163eb9ffc2@osd.2.service

  2. Enter the stopped OSD daemon container to check the offline BlueStore OSD.

    Syntax

    cephadm shell --name osd.ID

    Example

    [root@host01 ~]# cephadm shell --name osd.2
    Inferring fsid 110bad0a-bc57-11ee-8138-fa163eb9ffc2
    Inferring config /var/lib/ceph/110bad0a-bc57-11ee-8138-fa163eb9ffc2/osd.2/config
    Using recent ceph image registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:09fc3e5baf198614d70669a106eb87dbebee16d4e91484375778d4adbccadacd

  3. Inspect the non-running BlueStore OSD process:

    1. For a simple report, run the following command:

      Syntax

      ceph-bluestore-tool --path PATH_TO_OSD_DATA_DIRECTORY --allocator block free-score

      Example

      [root@7fbd6c6293c0 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score
      block:
      {
          "fragmentation_rating": 0.018290238194701977
      }

    2. For a more detailed report, run the following command:

      Syntax

      ceph-bluestore-tool --path PATH_TO_OSD_DATA_DIRECTORY --allocator block free-dump

      Example

      [root@7fbd6c6293c0 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump
      block:
      {
          "capacity": 21470642176,
          "alloc_unit": 4096,
          "alloc_type": "hybrid",
          "alloc_name": "block",
          "extents": [
              {
                  "offset": "0x370000",
                  "length": "0x20000"
              },
              {
                  "offset": "0x3a0000",
                  "length": "0x10000"
              },
              {
                  "offset": "0x3f0000",
                  "length": "0x20000"
              },
              {
                  "offset": "0x460000",
                  "length": "0x10000"
              },
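
The extents in the detailed report carry hexadecimal offsets and lengths, which are easier to reason about once converted to bytes. The sketch below summarizes the four extents visible in the truncated example above. It assumes that each extent describes a free region of the device; the function name is hypothetical.

```python
# The four extents shown in the truncated example output above.
EXTENTS = [
    {"offset": "0x370000", "length": "0x20000"},
    {"offset": "0x3a0000", "length": "0x10000"},
    {"offset": "0x3f0000", "length": "0x20000"},
    {"offset": "0x460000", "length": "0x10000"},
]


def summarize_extents(extents: list) -> dict:
    """Return the extent count, total bytes, and largest extent in bytes."""
    lengths = [int(extent["length"], 16) for extent in extents]
    return {
        "count": len(lengths),
        "total_bytes": sum(lengths),
        "largest_bytes": max(lengths, default=0),
    }


if __name__ == "__main__":
    print(summarize_extents(EXTENTS))
```

For these four extents, the total is 393216 bytes (384 KiB) and the largest extent is 131072 bytes (128 KiB).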

10.8. Ceph BlueStore BlueFS

The BlueStore block database stores metadata as key-value pairs in a RocksDB database. The block database resides on a small BlueFS partition on the storage device. BlueFS is a minimal file system that is designed to hold the RocksDB files.

BlueFS files

There are 3 types of files that RocksDB produces:

  • Control files, for example CURRENT, IDENTITY, and MANIFEST-000011.
  • DB table files, for example 004112.sst.
  • Write ahead logs, for example 000038.log.

Additionally, there is an internal, hidden file that serves as the BlueFS replay log, ino 1. It works as the directory structure, file mapping, and operations log.

Fallback hierarchy

With BlueFS, it is possible to put any file on any device. Parts of a file can even reside on different devices, that is, WAL, DB, and SLOW. There is an order to where BlueFS puts files. A file is put on secondary storage only when primary storage is exhausted, and on tertiary storage only when secondary storage is exhausted.

The order for the specific files is:

  • Write ahead logs: WAL, DB, SLOW
  • Replay log ino 1: DB, SLOW
  • Control and DB files: DB, SLOW

    • Control and DB file order when running out of space: SLOW

      Important

      There is an exception to the control and DB file order. When RocksDB detects that you are running out of space on the DB file, it directly notifies you to put the file on the SLOW device.
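
The fallback hierarchy described above can be sketched as a lookup followed by a first-fit search. This is a simplified illustration and not BlueFS code: it places a whole file on a single device, whereas parts of a file can reside on different devices, and it ignores the RocksDB exception noted above. The file type keys and the function name are hypothetical.

```python
# Device preference per file type, in the order listed above.
FALLBACK_ORDER = {
    "wal": ["WAL", "DB", "SLOW"],        # write-ahead logs
    "replay_log": ["DB", "SLOW"],        # replay log ino 1
    "control_or_db": ["DB", "SLOW"],     # control and DB table files
}


def pick_device(file_type: str, free_bytes: dict, needed: int) -> str:
    """Return the first device in the fallback order with enough free space."""
    for device in FALLBACK_ORDER[file_type]:
        if free_bytes.get(device, 0) >= needed:
            return device
    raise RuntimeError("no device has enough free space")


if __name__ == "__main__":
    free = {"WAL": 0, "DB": 100, "SLOW": 1000}
    print(pick_device("wal", free, 50))            # WAL is full, falls back
    print(pick_device("control_or_db", free, 500))  # DB is too small
```

With the WAL device full, a write-ahead log lands on DB; a DB file larger than the remaining DB space lands on SLOW.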

10.8.1. Viewing the bluefs_buffered_io setting

As a storage administrator, you can view the current setting for the bluefs_buffered_io parameter.

The bluefs_buffered_io option is set to True by default for Red Hat Ceph Storage. This option enables BlueFS to perform buffered reads in some cases, and enables the kernel page cache to act as a secondary cache for reads such as RocksDB block reads.

Important

Changing the value of bluefs_buffered_io is not recommended. Before changing the bluefs_buffered_io parameter, contact your Red Hat Support account team.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor node.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. You can view the current value of the bluefs_buffered_io parameter in three different ways:

Method 1

  • View the value stored in the configuration database:

    Example

    [ceph: root@host01 /]# ceph config get osd bluefs_buffered_io

Method 2

  • View the value stored in the configuration database for a specific OSD:

    Syntax

    ceph config get OSD_ID bluefs_buffered_io

    Example

    [ceph: root@host01 /]# ceph config get osd.2 bluefs_buffered_io

Method 3

  • View the running value for an OSD where the running value is different from the value stored in the configuration database:

    Syntax

    ceph config show OSD_ID bluefs_buffered_io

    Example

    [ceph: root@host01 /]# ceph config show osd.3 bluefs_buffered_io

10.8.2. Viewing Ceph BlueFS statistics for Ceph OSDs

View BlueFS-related information about collocated and non-collocated Ceph OSDs with the bluefs stats command.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • The object store configured as BlueStore.
  • Root-level access to the OSD node.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. View the BlueStore OSD statistics:

    Syntax

    ceph daemon osd.OSD_ID bluefs stats

    Example for collocated OSDs

    [ceph: root@host01 /]# ceph daemon osd.1 bluefs stats
    1 : device size 0x3bfc00000 : using 0x1a428000(420 MiB)
    wal_total:0, db_total:15296836403, slow_total:0

    Example for non-collocated OSDs

    [ceph: root@host01 /]# ceph daemon osd.1 bluefs stats
    0 :
    1 : device size 0x1dfbfe000 : using 0x1100000(17 MiB)
    2 : device size 0x27fc00000 : using 0x248000(2.3 MiB)
    RocksDBBlueFSVolumeSelector: wal_total:0, db_total:7646425907, slow_total:10196562739, db_avail:935539507
    Usage matrix:
    DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
    LOG         0 B         4 MiB       0 B         0 B         0 B         756 KiB     1
    WAL         0 B         4 MiB       0 B         0 B         0 B         3.3 MiB     1
    DB          0 B         9 MiB       0 B         0 B         0 B         76 KiB      10
    SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
    TOTALS      0 B         17 MiB      0 B         0 B         0 B         0 B         12
    MAXIMUMS:
    LOG         0 B         4 MiB       0 B         0 B         0 B         756 KiB
    WAL         0 B         4 MiB       0 B         0 B         0 B         3.3 MiB
    DB          0 B         11 MiB      0 B         0 B         0 B         112 KiB
    SLOW        0 B         0 B         0 B         0 B         0 B         0 B
    TOTALS      0 B         17 MiB      0 B         0 B         0 B         0 B

    where:

    0: This refers to the dedicated WAL device, that is block.wal.

    1: This refers to the dedicated DB device, that is block.db.

    2: This refers to the main block device, that is block or slow.

    device size: The actual size of the device.

    using: The total usage. It is not restricted to BlueFS.

    Note

    The DB and WAL devices are used only by BlueFS. For the main device, usage from stored BlueStore data is also included. In the above example, 2.3 MiB is the data from BlueStore.

    wal_total, db_total, slow_total: These values reiterate the device values above.

    db_avail: This value represents how many bytes can be taken from the SLOW device if necessary.

    Usage matrix
    • The rows WAL, DB, SLOW: Describe where a specific file was intended to be put.
    • The row LOG: Describes the BlueFS replay log ino 1.
    • The columns WAL, DB, SLOW: Describe where the data is actually put. The values are in allocation units. WAL and DB have bigger allocation units for performance reasons.
    • The columns * / *: Relate to the virtual devices new-db and new-wal that are used for ceph-bluestore-tool. They should always show 0 B.
    • The column REAL: Shows the actual usage in bytes.
    • The column FILES: Shows the count of files.

    MAXIMUMS: This table captures the maximum value of each entry from the usage matrix.
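
The per-device lines at the top of the bluefs stats output report sizes in hexadecimal. The following sketch extracts them and converts them to bytes. It assumes the line layout shown in the non-collocated example above and is not an official parser; the regular expression and the function name are hypothetical.

```python
import re

# Matches lines such as "1 : device size 0x1dfbfe000 : using 0x1100000(17 MiB)".
DEVICE_LINE = re.compile(
    r"^\s*(\d+) : device size (0x[0-9a-fA-F]+) : using (0x[0-9a-fA-F]+)"
)

# Device indexes as described above.
DEVICE_NAMES = {0: "block.wal", 1: "block.db", 2: "block"}

SAMPLE = """\
0 :
1 : device size 0x1dfbfe000 : using 0x1100000(17 MiB)
2 : device size 0x27fc00000 : using 0x248000(2.3 MiB)
"""


def parse_devices(text: str) -> dict:
    """Return {device index: (size in bytes, used bytes)} for listed devices."""
    devices = {}
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            index = int(match.group(1))
            devices[index] = (int(match.group(2), 16), int(match.group(3), 16))
    return devices


if __name__ == "__main__":
    for index, (size, used) in parse_devices(SAMPLE).items():
        mib = used / 2**20
        print(f"{DEVICE_NAMES[index]}: {mib:.1f} MiB used of {size} bytes")
```

Device 0 has no size in the sample because no dedicated WAL device is present, so the sketch skips it.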
