Chapter 11. BlueStore


BlueStore is the back-end object store for the OSD daemons and puts objects directly on the block device.

Important

BlueStore provides a high-performance backend for OSD daemons in a production environment. By default, BlueStore is configured to be self-tuning. If you determine that your environment performs better with BlueStore tuned manually, please contact Red Hat support and share the details of your configuration to help us improve the auto-tuning capability. Red Hat looks forward to your feedback and appreciates your recommendations.

11.1. Ceph BlueStore

The following are some of the main features of using BlueStore:

Direct management of storage devices
BlueStore consumes raw block devices or partitions. This avoids any intervening layers of abstraction, such as local file systems like XFS, that might limit performance or add complexity.
Metadata management with RocksDB
BlueStore uses the RocksDB key-value database to manage internal metadata, such as the mapping from object names to block locations on a disk.
Full data and metadata checksumming
By default all data and metadata written to BlueStore is protected by one or more checksums. No data or metadata are read from disk or returned to the user without verification.
Inline compression
Data can be optionally compressed before being written to a disk.
Efficient copy-on-write
The Ceph Block Device and Ceph File System snapshots rely on a copy-on-write clone mechanism that is implemented efficiently in BlueStore. This results in efficient I/O both for regular snapshots and for erasure coded pools which rely on cloning to implement efficient two-phase commits.
No large double-writes
BlueStore first writes any new data to unallocated space on a block device, and then commits a RocksDB transaction that updates the object metadata to reference the new region of the disk. Only when the write operation is below a configurable size threshold does BlueStore fall back to a write-ahead journaling scheme.
Multi-device support
BlueStore can use multiple block devices for storing different data. For example: Hard Disk Drive (HDD) for the data, Solid-state Drive (SSD) for metadata, Non-volatile Memory (NVM) or Non-volatile random-access memory (NVRAM) or persistent memory for the RocksDB write-ahead log (WAL). See Ceph BlueStore devices for details.
Efficient block device usage
Because BlueStore does not use any file system, it minimizes the need to clear the storage device cache.
Allocation metadata
Allocation metadata is no longer stored as standalone objects in RocksDB, because the allocation information can be deduced from the aggregate allocation state of all onodes, which are already stored in RocksDB. BlueStore V3 code skips the RocksDB updates at allocation time and performs a full destage of the allocator object, with the entire OSD allocation state, in a single step during umount. This results in a 25% increase in IOPS and reduced latency for small random-write workloads; however, it prolongs the recovery time, usually by a few extra minutes, in failure cases where umount is not called, because all onodes must be iterated over to recreate the allocation metadata.
Cache age binning
Red Hat Ceph Storage associates items in the different caches with "age bins", which gives a view of the relative ages of all the cache items.
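
To confirm which object store a given OSD is using, you can query the OSD metadata. This is an illustrative check rather than part of a formal procedure; the OSD ID and the count shown are examples:

Example

[ceph: root@host01 /]# ceph osd metadata 0 | grep osd_objectstore
    "osd_objectstore": "bluestore",
[ceph: root@host01 /]# ceph osd count-metadata osd_objectstore
{
    "bluestore": 12
}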

11.2. Ceph BlueStore devices

BlueStore manages one, two, or three storage devices in the backend.

  • Primary
  • WAL
  • DB

In the simplest case, BlueStore consumes a single primary storage device. The storage device is normally used as a whole, occupying the full device that is managed by BlueStore directly. The primary device is identified by a block symlink in the data directory.

The data directory is a tmpfs mount which gets populated with all the common OSD files that hold information about the OSD, like the identifier, which cluster it belongs to, and its private keyring.
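
The following is an illustrative look at that data directory from inside an OSD container; the host, OSD ID, and exact file list are examples and can vary between releases:

Example

[root@host01 ~]# cephadm shell --name osd.0
[ceph: root@host01 /]# ls /var/lib/ceph/osd/ceph-0/
block  ceph_fsid  fsid  keyring  ready  type  whoami
[ceph: root@host01 /]# cat /var/lib/ceph/osd/ceph-0/type
bluestore

Here, block is the symbolic link to the primary device described above.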

The storage device is partitioned into two parts that contain:

  • OSD metadata: A small partition formatted with XFS that contains basic metadata for the OSD. This data directory includes information about the OSD, such as its identifier, which cluster it belongs to, and its private keyring.
  • Data: A large partition occupying the rest of the device that is managed directly by BlueStore and that contains all of the OSD data. This primary device is identified by a block symbolic link in the data directory.

You can also use two additional devices:

  • A WAL (write-ahead-log) device: A device that stores BlueStore internal journal or write-ahead log. It is identified by the block.wal symbolic link in the data directory. Consider using a WAL device only if the device is faster than the primary device. For example, when the WAL device uses an SSD disk and the primary device uses an HDD disk.
  • A DB device: A device that stores BlueStore internal metadata. The embedded RocksDB database puts as much metadata as it can on the DB device instead of on the primary device to improve performance. If the DB device is full, it starts adding metadata to the primary device. Consider using a DB device only if the device is faster than the primary device.
Warning

If you have less than a gigabyte of storage available on fast devices, Red Hat recommends using it as a WAL device. If you have more fast storage available, consider using it as a DB device. The BlueStore journal is always placed on the fastest device, so using a DB device provides the same benefit as a WAL device while also allowing additional metadata to be stored.
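
As an illustration of the device layout described above, the following ceph-volume command provisions a BlueStore OSD with its data on an HDD and its block.db and block.wal on partitions of a faster NVMe device. The device paths are examples only; adapt them to your hardware, and note that on cephadm-managed clusters you would normally express the same layout in an OSD service specification rather than running ceph-volume directly:

Example

[root@host01 ~]# ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2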

11.3. Ceph BlueStore caching

The BlueStore cache is a collection of buffers that, depending on configuration, can be populated with data as the OSD daemon reads from or writes to the disk. By default in Red Hat Ceph Storage, BlueStore caches on reads, but not on writes. This is because the bluestore_default_buffered_write option is set to false to avoid potential overhead associated with cache eviction.

If the bluestore_default_buffered_write option is set to true, data is written to the buffer first and then committed to disk. Afterward, a write acknowledgement is sent to the client, giving subsequent reads faster access to the data already in the cache, until that data is evicted.

Read-heavy workloads will not see an immediate benefit from BlueStore caching. As more reading is done, the cache will grow over time and subsequent reads will see an improvement in performance. How fast the cache populates depends on the BlueStore block and database disk type, and the client’s workload requirements.
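
You can confirm how caching is currently configured by querying the configuration database. This is an illustrative check; the value shown is the default described above:

Example

[ceph: root@host01 /]# ceph config get osd bluestore_default_buffered_write
false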

Important

Please contact Red Hat support before enabling the bluestore_default_buffered_write option.

Cache age binning

Red Hat Ceph Storage associates items in the different caches with "age bins", which gives a view of the relative ages of all the cache items. For example, when old onode entries are sitting in the BlueStore onode cache and a hot read workload occurs against a single large object, the priority cache for that OSD sorts the older onode entries into a lower priority level than the buffer cache data for the hot object. Although Ceph might, in general, heavily favor onodes at a given priority level, in this hot-workload scenario, older onodes can be assigned a lower priority level than the hot workload data, so that the buffer data memory request is fulfilled first.

11.4. Sizing considerations for Ceph BlueStore

When mixing traditional and solid state drives using BlueStore OSDs, it is important to size the RocksDB logical volume (block.db) appropriately. Red Hat recommends that the RocksDB logical volume be no less than 4% of the block size with object, file and mixed workloads. Red Hat supports 1% of the BlueStore block size with RocksDB and OpenStack block workloads. For example, if the block size is 1 TB for an object workload, then at a minimum, create a 40 GB RocksDB logical volume.

When not mixing drive types, there is no requirement to have a separate RocksDB logical volume. BlueStore will automatically manage the sizing of RocksDB.
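
To see how much of the block.db space an OSD is actually using, and whether RocksDB metadata has spilled over to the primary (slow) device, you can inspect the BlueFS counters in the OSD performance dump. This is an illustrative check; the values shown are examples:

Example

[ceph: root@host01 /]# ceph daemon osd.1 perf dump | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'
        "db_total_bytes": 15296836403,
        "db_used_bytes": 471859200,
        "slow_used_bytes": 0,

A non-zero slow_used_bytes value indicates that RocksDB data has spilled over from the block.db device to the primary device.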

BlueStore’s cache memory is used for the key-value pair metadata for RocksDB, BlueStore metadata, and object data.

Note

The BlueStore cache memory values are in addition to the memory footprint already being consumed by the OSD.
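
The total memory that an OSD, including its BlueStore caches, tries to stay within is governed by the osd_memory_target option, which is not specific to this chapter. The following illustrative query shows its stock default value in bytes:

Example

[ceph: root@host01 /]# ceph config get osd osd_memory_target
4294967296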

11.5. Tuning Ceph BlueStore using bluestore_min_alloc_size parameter

This procedure is for new or freshly deployed OSDs.

In BlueStore, the raw partition is allocated and managed in chunks of bluestore_min_alloc_size. By default, bluestore_min_alloc_size is 4096, equivalent to 4 KiB for HDDs and SSDs. The unwritten area in each chunk is filled with zeroes when it is written to the raw partition. This can lead to wasted unused space when not properly sized for your workload, for example when writing small objects.

It is best practice to set bluestore_min_alloc_size to match the smallest write so this write amplification penalty can be avoided.
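
Before changing anything, you can check the defaults that currently apply to your OSDs. This illustrative query shows the stock 4 KiB values described above:

Example

[ceph: root@host01 /]# ceph config get osd bluestore_min_alloc_size_hdd
4096
[ceph: root@host01 /]# ceph config get osd bluestore_min_alloc_size_ssd
4096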

Important

Changing the value of bluestore_min_alloc_size is not recommended. For any assistance, contact Red Hat support.

Note

The settings bluestore_min_alloc_size_ssd and bluestore_min_alloc_size_hdd are specific to SSDs and HDDs, respectively, but setting them is not necessary because setting bluestore_min_alloc_size overrides them.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Ceph monitors and managers are deployed in the cluster.
  • Servers or nodes that can be freshly provisioned as OSD nodes.
  • The admin keyring for the Ceph Monitor node, if you are redeploying an existing Ceph OSD node.

Procedure

  1. On the bootstrapped node, change the value of the bluestore_min_alloc_size parameter:

    Syntax

    ceph config set osd.OSD_ID bluestore_min_alloc_size_DEVICE_NAME VALUE

    Example

    [ceph: root@host01 /]# ceph config set osd.4 bluestore_min_alloc_size_hdd 8192

    You can see that bluestore_min_alloc_size_hdd is set to 8192 bytes, which is equivalent to 8 KiB.

    Note

    The selected value should be a power of 2.

  2. Restart the OSD’s service.

    Syntax

    systemctl restart SERVICE_ID

    Example

    [ceph: root@host01 /]# systemctl restart ceph-499829b4-832f-11eb-8d6d-001a4a000635@osd.4.service

Verification

  • Verify the setting using the ceph daemon command:

    Syntax

    ceph daemon osd.OSD_ID config get bluestore_min_alloc_size_DEVICE_NAME

    Example

    [ceph: root@host01 /]# ceph daemon osd.4 config get bluestore_min_alloc_size_hdd
    {
        "bluestore_min_alloc_size_hdd": "8192"
    }

Additional Resources

  • For OSD removal and addition, see the Management of OSDs using the Ceph Orchestrator chapter in the Red Hat Ceph Storage Operations Guide and follow the links. For already deployed OSDs, you cannot modify the bluestore_min_alloc_size parameter; you must remove the OSDs and deploy them again.

11.6. Resharding the RocksDB database using the BlueStore admin tool

You can reshard the database with the BlueStore admin tool. It transforms BlueStore’s RocksDB database from one shape to another, splitting it into several column families, without redeploying the OSDs. Column families have the same features as the whole database, but allow users to operate on smaller data sets and apply different options. Resharding leverages the different expected lifetimes of the stored keys. The keys are moved during the transformation without creating new keys or deleting existing keys.

There are two ways to reshard the OSD:

  1. Use the rocksdb-resharding.yml playbook.
  2. Manually reshard the OSDs.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • The object store configured as BlueStore.
  • OSD nodes deployed on the hosts.
  • Root-level access to all the hosts.
  • The ceph-common and cephadm packages installed on all the hosts.

11.6.1. Using the rocksdb-resharding.yml playbook

  1. As a root user, on the administration node, navigate to the cephadm folder where the playbook is installed:

    Example

    [root@host01 ~]# cd /usr/share/cephadm-ansible

  2. Run the playbook:

    Syntax

    ansible-playbook -i hosts rocksdb-resharding.yml -e osd_id=OSD_ID -e admin_node=HOST_NAME

    Example

    [root@host01 ~]# ansible-playbook -i hosts rocksdb-resharding.yml -e osd_id=7 -e admin_node=host03
    
    ...............
    TASK [stop the osd] ***********************************************************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:18 +0000 (0:00:00.037)       0:00:03.864 ****
    changed: [localhost -> host03]
    TASK [set_fact ceph_cmd] ******************************************************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:32 +0000 (0:00:14.128)       0:00:17.992 ****
    ok: [localhost -> host03]
    
    TASK [check fs consistency with fsck before resharding] ***********************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:32 +0000 (0:00:00.041)       0:00:18.034 ****
    ok: [localhost -> host03]
    
    TASK [show current sharding] **************************************************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:43 +0000 (0:00:11.053)       0:00:29.088 ****
    ok: [localhost -> host03]
    
    TASK [reshard] ****************************************************************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:45 +0000 (0:00:01.446)       0:00:30.534 ****
    ok: [localhost -> host03]
    
    TASK [check fs consistency with fsck after resharding] ************************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:46 +0000 (0:00:01.479)       0:00:32.014 ****
    ok: [localhost -> host03]
    
    TASK [restart the osd] ********************************************************************************************************************************************************************************************
    Wednesday 29 November 2023  11:25:57 +0000 (0:00:10.699)       0:00:42.714 ****
    changed: [localhost -> host03]

  3. Verify that the resharding is complete.

    1. Stop the OSD that is resharded:

      Example

      [ceph: root@host01 /]# ceph orch daemon stop osd.7

    2. Enter the OSD container:

      Example

      [root@host03 ~]# cephadm shell --name osd.7

    3. Check for resharding:

      Example

      [ceph: root@host03 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-7/ show-sharding
          m(3) p(3,0-12) O(3,0-13) L P

    4. Start the OSD:

      Example

      [ceph: root@host01 /]# ceph orch daemon start osd.7

11.6.2. Manually resharding the OSDs

  1. Log into the cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Fetch the OSD_ID and the host details from the administration node:

    Example

    [ceph: root@host01 /]# ceph orch ps

  3. Log into the respective host as a root user and stop the OSD:

    Syntax

    cephadm unit --name OSD_ID stop

    Example

    [root@host02 ~]# cephadm unit --name osd.0 stop

  4. Enter into the stopped OSD daemon container:

    Syntax

    cephadm shell --name OSD_ID

    Example

    [root@host02 ~]# cephadm shell --name osd.0

  5. From the cephadm shell, check the file system consistency:

    Syntax

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-OSD_ID/ fsck

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ fsck
    
    fsck success

  6. Check the sharding status of the OSD node:

    Syntax

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-OSD_ID/ show-sharding

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6/ show-sharding
    
    m(3) p(3,0-12) O(3,0-13) L P

  7. Run the ceph-bluestore-tool command to reshard. Red Hat recommends using the parameters as given in the command:

    Syntax

    ceph-bluestore-tool --log-level 10 -l log.txt --path /var/lib/ceph/osd/ceph-OSD_ID/ --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6/ --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
    
    reshard success

  8. To check the sharding status of the OSD node, run the show-sharding command:

    Syntax

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-OSD_ID/ show-sharding

    Example

    [ceph: root@host02 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6/ show-sharding
    
    m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P

  9. Exit from the cephadm shell:

    [ceph: root@host02 /]# exit
  10. Log into the respective host as a root user and start the OSD:

    Syntax

    cephadm unit --name OSD_ID start

    Example

    [root@host02 ~]# cephadm unit --name osd.0 start

11.7. The BlueStore fragmentation tool

As a storage administrator, you will want to periodically check the fragmentation level of your BlueStore OSDs. You can check fragmentation levels with one simple command for offline or online OSDs.

11.7.1. What is the BlueStore fragmentation tool?

For BlueStore OSDs, the free space gets fragmented over time on the underlying storage device. Some fragmentation is normal, but excessive fragmentation causes poor performance.

The BlueStore fragmentation tool generates a score on the fragmentation level of the BlueStore OSD. This fragmentation score is given as a range, 0 through 1. A score of 0 means no fragmentation, and a score of 1 means severe fragmentation.

Table 11.1. Fragmentation scores' meaning

Score       Fragmentation Amount

0.0 - 0.4   None to tiny fragmentation.

0.4 - 0.7   Small and acceptable fragmentation.

0.7 - 0.9   Considerable, but safe fragmentation.

0.9 - 1.0   Severe fragmentation that causes performance issues.

Important

If you have severe fragmentation, and need some help in resolving the issue, contact Red Hat Support.

11.7.2. Checking for fragmentation

Checking the fragmentation level of BlueStore OSDs can be done either online or offline.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • BlueStore OSDs.

Online BlueStore fragmentation score

  1. Inspect a running BlueStore OSD process:

    1. Simple report:

      Syntax

      ceph daemon osd.OSD_ID bluestore allocator score block

      Example

      [ceph: root@host01 /]# ceph daemon osd.123 bluestore allocator score block

    2. A more detailed report:

      Syntax

      ceph daemon osd.OSD_ID bluestore allocator dump block

      Example

      [ceph: root@host01 /]# ceph daemon osd.123 bluestore allocator dump block

Offline BlueStore fragmentation score

  1. Log into the BlueStore OSD container:

    Syntax

    cephadm shell --name osd.ID

    Example

    [root@host01 ~]# cephadm shell --name osd.2
    Inferring fsid 110bad0a-bc57-11ee-8138-fa163eb9ffc2
    Inferring config /var/lib/ceph/110bad0a-bc57-11ee-8138-fa163eb9ffc2/osd.2/config
    Using ceph image with id `17334f841482` and tag `ceph-7-rhel-9-containers-candidate-59483-20240301201929` created on 2024-03-01 20:22:41 +0000 UTC
    registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:09fc3e5baf198614d70669a106eb87dbebee16d4e91484375778d4adbccadacd

  2. Inspect the non-running BlueStore OSD process.

    1. For a simple report, run the following command:

      Syntax

      ceph-bluestore-tool --path PATH_TO_OSD_DATA_DIRECTORY --allocator block free-score

      Example

      [root@host01 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score
      block:
      {
          "fragmentation_rating": 0.018290238194701977
      }

    2. For a more detailed report, run the following command:

      Syntax

      ceph-bluestore-tool --path PATH_TO_OSD_DATA_DIRECTORY --allocator block free-dump

      Example

      [root@host01 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump
      block:
      {
          "capacity": 21470642176,
          "alloc_unit": 4096,
          "alloc_type": "hybrid",
          "alloc_name": "block",
          "extents": [
              {
                  "offset": "0x370000",
                  "length": "0x20000"
              },
              {
                  "offset": "0x3a0000",
                  "length": "0x10000"
              },
              {
                  "offset": "0x3f0000",
                  "length": "0x20000"
              },
              {
                  "offset": "0x460000",
                  "length": "0x10000"
              },

11.8. Ceph BlueStore BlueFS

The BlueStore block database stores metadata as key-value pairs in a RocksDB database. The block database resides on a small BlueFS partition on the storage device. BlueFS is a minimal file system that is designed to hold the RocksDB files.

BlueFS files

The following are the three types of files that RocksDB produces:

  • Control files, for example CURRENT, IDENTITY, and MANIFEST-000011.
  • DB table files, for example 004112.sst.
  • Write ahead logs, for example 000038.log.

Additionally, there is an internal, hidden file, ino 1, that serves as the BlueFS replay log; it works as the directory structure, file mapping, and operations log.
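
If you want to look at these RocksDB files directly, you can export the BlueFS contents of a stopped OSD with the bluefs-export command of ceph-bluestore-tool, which is described later in this chapter. The following is an illustrative run from inside an OSD container; the output directory and file names are examples:

Example

[ceph: root@host02 /]# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-0 --out-dir /tmp/bluefs-export
[ceph: root@host02 /]# ls /tmp/bluefs-export/db
004112.sst  CURRENT  IDENTITY  MANIFEST-000011
[ceph: root@host02 /]# ls /tmp/bluefs-export/db.wal
000038.log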

Fallback hierarchy

With BlueFS, it is possible to put any file on any device. Parts of a file can even reside on different devices, that is, on WAL, DB, and SLOW. There is an order to where BlueFS puts files: a file is put on secondary storage only when primary storage is exhausted, and on tertiary storage only when secondary storage is exhausted.

The order for the specific files is:

  • Write ahead logs: WAL, DB, SLOW
  • Replay log ino 1: DB, SLOW
  • Control and DB files: DB, SLOW

    • Control and DB file order when running out of space: SLOW

      Important

      There is an exception to the control and DB file order. When RocksDB detects that the DB device is running out of space, it directly notifies BlueFS to put the file on the SLOW device.

11.8.1. Viewing the bluefs_buffered_io setting

As a storage administrator, you can view the current setting for the bluefs_buffered_io parameter.

The option bluefs_buffered_io is set to True by default for Red Hat Ceph Storage. This option enables BlueFS to perform buffered reads in some cases, and it enables the kernel page cache to act as a secondary cache for reads such as RocksDB block reads.

Important

Changing the value of bluefs_buffered_io is not recommended. Before changing the bluefs_buffered_io parameter, contact your Red Hat Support account team.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor node.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. You can view the current value of the bluefs_buffered_io parameter in three different ways:

Method 1

  • View the value stored in the configuration database:

    Example

    [ceph: root@host01 /]# ceph config get osd bluefs_buffered_io

Method 2

  • View the value stored in the configuration database for a specific OSD:

    Syntax

    ceph config get OSD_ID bluefs_buffered_io

    Example

    [ceph: root@host01 /]# ceph config get osd.2 bluefs_buffered_io

Method 3

  • View the running value for an OSD where the running value is different from the value stored in the configuration database:

    Syntax

    ceph config show OSD_ID bluefs_buffered_io

    Example

    [ceph: root@host01 /]# ceph config show osd.3 bluefs_buffered_io

11.8.2. Viewing Ceph BlueFS statistics for Ceph OSDs

View BlueFS-related information about collocated and non-collocated Ceph OSDs with the bluefs stats command.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • The object store configured as BlueStore.
  • Root-level access to the OSD node.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. View the BlueStore OSD statistics:

    Syntax

    ceph daemon osd.OSD_ID bluefs stats

    Example for collocated OSDs

    [ceph: root@host01 /]# ceph daemon osd.1 bluefs stats
    
    1 : device size 0x3bfc00000 : using 0x1a428000(420 MiB)
    wal_total:0, db_total:15296836403, slow_total:0

    Example for non-collocated OSDs

    [ceph: root@host01 /]# ceph daemon osd.1 bluefs stats
    
    0 :
    1 : device size 0x1dfbfe000 : using 0x1100000(17 MiB)
    2 : device size 0x27fc00000 : using 0x248000(2.3 MiB)
    RocksDBBlueFSVolumeSelector: wal_total:0, db_total:7646425907, slow_total:10196562739, db_avail:935539507
    Usage matrix:
    DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
    LOG         0 B         4 MiB       0 B         0 B         0 B         756 KiB     1
    WAL         0 B         4 MiB       0 B         0 B         0 B         3.3 MiB     1
    DB          0 B         9 MiB       0 B         0 B         0 B         76 KiB      10
    SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
    TOTALS      0 B         17 MiB      0 B         0 B         0 B         0 B         12
    MAXIMUMS:
    LOG         0 B         4 MiB       0 B         0 B         0 B         756 KiB
    WAL         0 B         4 MiB       0 B         0 B         0 B         3.3 MiB
    DB          0 B         11 MiB      0 B         0 B         0 B         112 KiB
    SLOW        0 B         0 B         0 B         0 B         0 B         0 B
    TOTALS      0 B         17 MiB      0 B         0 B         0 B         0 B

    where:

    0: This refers to the dedicated WAL device, that is, block.wal.

    1: This refers to the dedicated DB device, that is, block.db.

    2: This refers to the main block device, that is, block or slow.

    device size: Represents the actual size of the device.

    using: Represents the total usage. It is not restricted to BlueFS.

    Note

    DB and WAL devices are used only by BlueFS. For the main device, usage from stored BlueStore data is also included. In the above example, 2.3 MiB is the data from BlueStore.

    wal_total, db_total, slow_total: These values reiterate the device values above.

    db_avail: This value represents how many bytes can be taken from SLOW device if necessary.

    Usage matrix
    • The rows WAL, DB, SLOW: Describe where a specific file was intended to be put.
    • The row LOG: Describes the BlueFS replay log ino 1.
    • The columns WAL, DB, SLOW: Describe where data is actually put. The values are in allocation units. WAL and DB have bigger allocation units for performance reasons.
    • The columns * / *: Relate to the virtual devices new-db and new-wal that are used by ceph-bluestore-tool. They should always show 0 B.
    • The column REAL: Shows actual usage in bytes.
    • The column FILES: Shows count of files.

    MAXIMUMS: This table captures the maximum value of each entry from the usage matrix.

11.9. Using the ceph-bluestore-tool

ceph-bluestore-tool is a utility to perform low-level administrative operations on a BlueStore instance.

The following commands are available with the ceph-bluestore-tool:

Syntax

ceph-bluestore-tool COMMAND [ --dev DEVICE … ] [ -i OSD_ID ] [ --path OSD_PATH ] [ --out-dir DIR ] [ --log-file | -l filename ] [ --deep ]

ceph-bluestore-tool fsck|repair --path OSD_PATH [ --deep ]

ceph-bluestore-tool qfsck --path OSD_PATH

ceph-bluestore-tool allocmap --path OSD_PATH

ceph-bluestore-tool restore_cfb --path OSD_PATH

ceph-bluestore-tool show-label --dev DEVICE …

ceph-bluestore-tool prime-osd-dir --dev DEVICE --path OSD_PATH

ceph-bluestore-tool bluefs-export --path OSD_PATH --out-dir DIR

ceph-bluestore-tool bluefs-bdev-new-wal --path OSD_PATH --dev-target NEW_DEVICE

ceph-bluestore-tool bluefs-bdev-new-db --path OSD_PATH --dev-target NEW_DEVICE

ceph-bluestore-tool bluefs-bdev-migrate --path OSD_PATH --dev-target NEW_DEVICE --devs-source DEVICE1 [--devs-source DEVICE2]

ceph-bluestore-tool free-dump|free-score --path OSD_PATH [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]

ceph-bluestore-tool reshard --path OSD_PATH --sharding NEW_SHARDING [ --sharding-ctrl CONTROL_STRING ]

ceph-bluestore-tool show-sharding --path OSD_PATH

Every BlueStore block device has a single block label at the beginning of the device. You can dump the contents of the label with:

ceph-bluestore-tool show-label --dev DEVICE

The main device contains a lot of metadata, including information that used to be stored in small files in the OSD data directory. The auxiliary devices (db and wal) only have the minimum required fields: OSD UUID, size, device type, and birth time.
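
The following is an illustrative show-label invocation with trimmed output; the device path, UUID, size, and timestamp are examples:

ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block
{
    "/var/lib/ceph/osd/ceph-0/block": {
        "osd_uuid": "2949b4d3-e9bb-4d56-94a1-1e2e5b2b5a71",
        "size": 21470642176,
        "btime": "2024-03-01T20:22:41.123456+0000",
        "description": "main"
    }
}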

Generate the content for an OSD data directory that can start up a BlueStore OSD with the prime-osd-dir command.

ceph-bluestore-tool prime-osd-dir --dev MAIN_DEVICE --path /var/lib/ceph/osd/ceph-ID

Table 11.2. ceph-bluestore-tool commands
Command and description

help

Show help

fsck [--deep]

Run a consistency check on BlueStore metadata. If --deep is specified (accepted values: on, off; yes, no; 1, 0; or true, false), also read all object data and verify checksums.

repair

Run a consistency check and repair any errors.

qfsck

Run a consistency check on BlueStore metadata, comparing allocator data with the ONodes state. The allocator data comes from the RocksDB CFB, when it exists; otherwise, it is taken from the allocation file.

allocmap

Performs the same check done by qfsck and then stores a new allocation-file. This command is disabled by default and requires a special build.

restore_cfb

Reverses changes done by the new NCB code (either through ceph restart or when running the allocmap command) and restores RocksDB B Column-Family (allocator-map).

bluefs-export

Export the contents of BlueFS to an output directory.

bluefs-bdev-sizes --path OSD_PATH

Print the device sizes, as understood by BlueFS, to stdout.

bluefs-bdev-expand --path OSD_PATH

Instruct BlueFS to check the size of its block devices and, if they have expanded, make use of the additional space. Note that only the new files created by BlueFS will be allocated on the preferred block device if it has enough free space, and the existing files that have spilled over to the slow device will be gradually removed when RocksDB performs compaction. In other words, if there is any data spilled over to the slow device, it will be moved to the fast device over time.

bluefs-bdev-new-wal --path OSD_PATH --dev-target NEW_DEVICE

Adds WAL device to BlueFS, fails if WAL device already exists.

bluefs-bdev-new-db --path OSD_PATH --dev-target NEW_DEVICE

Adds DB device to BlueFS, fails if DB device already exists.

bluefs-bdev-migrate --dev-target NEW_DEVICE --devs-source DEVICE1 [--devs-source DEVICE2]

Moves BlueFS data from the source device(s) to the target device; source devices (except the main one) are removed on success. The target device can be an already attached device or a new device. In the latter case, it is added to the OSD, replacing one of the source devices. The following replacement rules apply (in order of precedence, stopping at the first match): (1) if the source list has a DB volume, the target device replaces it; (2) if the source list has a WAL volume, the target device replaces it; (3) if the source list has only a slow volume, the operation is not permitted and requires explicit allocation through the new-db or new-wal command.

show-label --dev DEVICE […]

Show any device labels.

free-dump --path OSD_PATH [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]

Dump all free regions in allocator.

free-score --path OSD_PATH [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]

Give a [0-1] number that represents quality of fragmentation in allocator. 0 represents case when all free space is in one chunk. 1 represents worst possible fragmentation.

reshard --path OSD_PATH --sharding NEW_SHARDING [ --resharding-ctrl CONTROL_STRING ]

Changes the sharding of BlueStore’s RocksDB. Sharding is built on top of RocksDB column families. This option allows you to test the performance of a new sharding scheme without redeploying the OSD. Resharding is usually a long process, which involves walking through the entire RocksDB key space and moving some keys to different column families. The --resharding-ctrl option provides performance control over the resharding process. Interrupted resharding prevents the OSD from running, but it does not corrupt data. It is always possible to continue a previous resharding, or to select any other sharding scheme, including reverting to the original one. For more information about resharding, see the Manually resharding the OSDs section of Resharding the RocksDB database using the BlueStore admin tool.

show-sharding --path OSD_PATH

Show sharding that is currently applied to BlueStore’s RocksDB.

Table 11.3. ceph-bluestore-tool command options
Command option and description

--dev DEVICE

Add the device to the list of devices to consider.

-i OSD_ID

Operate as OSD OSD_ID. Connect to the monitor for OSD-specific options. If the monitor is unavailable, add --no-mon-config to read from ceph.conf instead.

--devs-source DEVICE

Add the device to the list of devices to consider as sources for migration.

--dev-target DEVICE

Specify the target device for the migrate operation, or the device to add as a new DB or WAL device.

--path OSD_PATH

Specify an OSD path. In most cases, the device list is inferred from the symlinks present in osd path. This is usually simpler than explicitly specifying the device(s) with --dev. This option is not necessary if -i osd_id is provided.

--out-dir DIR

Output directory for bluefs-export.

-l, --log-file LOG_FILE

The file to log to.

--log-level NUM

Debug log level. Default is 30 (extremely verbose), 20 is very verbose, 10 is verbose, and 1 is not very verbose.

--deep

Deep scrub/repair (read and validate object data, not just metadata).

--allocator NAME

Useful for free-dump and free-score actions. Selects allocator(s).

--resharding-ctrl CONTROL_STRING

Provides control over the resharding process. Specifies how often to refresh the RocksDB iterator and how large the commit batch should be before committing to RocksDB. The option format is: <iterator_refresh_bytes>/<iterator_refresh_keys>/<batch_commit_bytes>/<batch_commit_keys>

Default: 10000000/10000/1000000/1000

Procedure

  1. Stop the OSD before using the ceph-bluestore-tool.

    Syntax

    ceph orch daemon stop osd.ID

    Example

    [ceph: root@host01 /]# ceph orch daemon stop osd.2

  2. From the OSD node, log into the target OSD container.

    Syntax

    cephadm shell --name osd.ID

    Example

    [root@host01 ~]# cephadm shell --name osd.2

  3. Run the needed command.

    Example

    [ceph: root@host01 /]# ceph-bluestore-tool bluefs-bdev-new-wal --dev-target /dev/test/newdb --path /var/lib/ceph/osd/ceph-0

    Note

    This example shows adding a new WAL device.

  4. From the cephadm shell, restart the OSD.

    Syntax

    ceph orch daemon start osd.ID

    Example

    [ceph: root@host01 /]# ceph orch daemon start osd.2

Additional Resources

See BlueStore configuration options for more details about BlueStore configuration options.
