
Chapter 13. Crimson (Technology Preview)


As a storage administrator, you can learn about the Crimson project, an effort to build a replacement for the ceph-osd daemon that is suited to the new reality of low-latency, high-throughput persistent memory and NVMe technologies.

Important

The Crimson feature is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. See the support scope for Red Hat Technology Preview features for more details.

13.1. Crimson overview

Crimson is the code name for crimson-osd, which is the next-generation ceph-osd for multi-core scalability. It improves performance with fast network and storage devices, employing state-of-the-art technologies that include DPDK and SPDK. BlueStore continues to support HDDs and SSDs. Crimson aims to be backward compatible with the classic ceph-osd daemon.

Built on the SeaStar C++ framework, Crimson is a new implementation of the core Ceph object storage daemon (OSD) component and replaces ceph-osd. The crimson-osd daemon minimizes latency and CPU overhead. It uses high-performance asynchronous I/O and a new threading architecture that is designed to minimize context switches and inter-thread communication for each operation.

Important

For Red Hat Ceph Storage 8, you can test RADOS Block Device (RBD) workloads on replicated pools with Crimson only. Do not use Crimson for production data.

Crimson goals

Crimson OSD is a replacement for the OSD daemon with the following goals:

Minimize CPU overhead

  • Minimize cycles per IOP.
  • Minimize cross-core communication.
  • Minimize copies.
  • Bypass the kernel and avoid context switches.

Enable emerging storage technologies

  • Zoned namespaces
  • Persistent memory
  • Fast NVMe

13.2. Difference between Crimson and Classic Ceph OSD architecture

In a classic ceph-osd architecture, a messenger thread reads a client message from the wire and places the message in the OP queue. The osd-op thread-pool then picks up the message, creates a transaction, and queues it to BlueStore, the current default ObjectStore implementation. BlueStore’s kv_queue then picks up this transaction and anything else in the queue, synchronously waits for rocksdb to commit the transaction, and then places the completion callback in the finisher queue. The finisher thread then picks up the completion callback and queues the replies for the messenger thread to send.

Each of these actions requires inter-thread coordination over the contents of a queue. For PG state, more than one thread might need to access the internal metadata of any given PG, which requires taking a lock and leads to lock contention.

This lock contention, together with increased processor usage, scales rapidly with the number of tasks and cores, and every locking point might become a scaling bottleneck under certain scenarios. Moreover, these locks and queues incur latency costs even when uncontended. Because of this, thread pools and task queues deteriorate latency further, as the bookkeeping needed to delegate tasks between worker threads, together with locks, can force context switches.

Unlike the ceph-osd architecture, Crimson allows a single I/O operation to complete on a single core without context switches and without blocking, provided that the underlying storage operations do not require it. However, some operations still need to be able to wait for asynchronous processes to complete, possibly nondeterministically, depending on the state of the system, such as during recovery, or on the underlying device.

Crimson uses Seastar, a highly asynchronous C++ framework that generally pre-allocates one thread pinned to each core. Seastar divides work among those cores so that state can be partitioned between cores and locking can be avoided. With Seastar, I/O operations are partitioned among a group of threads based on the target object. Rather than splitting the stages of running an I/O operation among different groups of threads, all the pipeline stages run within a single thread. If an operation needs to block, the core’s Seastar reactor switches to another concurrent operation and makes progress.

Ideally, all the locks and context switches are no longer needed, as each running nonblocking task owns the CPU until it completes or cooperatively yields. No other thread can preempt the task in the meantime. If communication with other shards in the data path is not needed, the ideal performance scales linearly with the number of cores until the I/O device reaches its limit. This design fits the Ceph OSD well because, at the OSD level, PGs are a natural unit for sharding all I/O operations.

Unlike ceph-osd, crimson-osd does not daemonize itself even if the daemonize option is enabled. Do not daemonize crimson-osd since supported Linux distributions use systemd, which is able to daemonize the application. With sysvinit, use start-stop-daemon to daemonize crimson-osd.
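
For sysvinit-based systems, the following is a minimal, hedged sketch of such a start-stop-daemon invocation. The crimson-osd binary path and the -i 0 OSD ID are illustrative assumptions, not values documented here:

Example

# Hypothetical values; adjust the binary path and OSD ID for your installation.
[root@host01 ~]# start-stop-daemon --start --background --exec /usr/bin/crimson-osd -- -i 0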

ObjectStore backend

The crimson-osd daemon supports two categories of object store backends:

  • native
  • non-native

Native backends perform I/O operations using the Seastar reactor. These backends are tightly integrated with the Seastar framework and follow its design principles. SeaStore is the primary native object store for Crimson OSD. It is built with the Seastar framework and adheres to its asynchronous, shard-based architecture. BlueStore, which is the default object store used by the classic ceph-osd, is also adapted and supported with Crimson.

Note

Red Hat Ceph Storage 9.0 is the first Technology Preview release that supports deploying Crimson with SeaStore as the object store.

The following three ObjectStore backends are supported for Crimson:

  • AlienStore - Provides backward compatibility with the classic object store, that is, BlueStore.
  • CyanStore - A dummy backend for tests, implemented in volatile memory. This object store is modeled after the memstore in the classic OSD.
  • SeaStore - The new object store designed specifically for Crimson OSD. The paths toward multiple shard support are different depending on the specific goal of the backend.

The following are the other two classic OSD ObjectStore backends:

  • MemStore - An in-memory backend object store.
  • BlueStore - The object store used by the classic ceph-osd.

13.3. Crimson metrics

Crimson has three ways to report statistics and metrics:

  • PG stats reported to manager.
  • Prometheus text protocol.
  • The asock command.

PG stats reported to manager

Crimson collects the per-pg, per-pool, and per-osd stats in an MPGStats message, which is sent to the Ceph Managers.
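
After the statistics are reported, you can view the aggregated values with the usual Manager-backed commands. The following is an illustrative example only and is not specific to Crimson:

Example

[ceph: root@host01 /]# ceph pg stat
[ceph: root@host01 /]# ceph osd pool stats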

Prometheus text protocol

Configure the listening port and address by using the --prometheus-port command-line option.
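
For illustration, the following hedged sketch passes the option directly on the crimson-osd command line. The OSD ID option -i and the port value 9180 are assumptions used only as an example:

Example

# The OSD ID and the port value are illustrative assumptions.
[root@host01 ~]# crimson-osd -i 0 --prometheus-port 9180

The metrics are then served in Prometheus text format on the configured port.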

The asock command

An admin socket command is offered to dump metrics.

Syntax

ceph tell OSD_ID dump_metrics
ceph tell OSD_ID dump_metrics reactor_utilization

Example

[ceph: root@host01 /]# ceph tell osd.0 dump_metrics
[ceph: root@host01 /]# ceph tell osd.0 dump_metrics reactor_utilization

Here, reactor_utilization is an optional string to filter the dumped metrics by prefix.

13.4. Crimson configuration options

Run the crimson-osd --help-seastar command to list Seastar-specific command-line options. The following are the options that you can use to configure Crimson:

--crimson
Description: Start crimson-osd instead of ceph-osd.

--nodaemon
Description: Do not daemonize the service.

--redirect-output
Description: Redirect stdout and stderr to out/$type.$num.stdout.

--osd-args
Description: Pass extra command-line options to crimson-osd or ceph-osd. This option is useful for passing Seastar options to crimson-osd. For example, you can supply --osd-args "--memory 2G" to set the amount of memory to use.

--cyanstore
Description: Use CyanStore as the object store backend.

--bluestore
Description: Use the alienized BlueStore as the object store backend. BlueStore is the default object store backend.

--memstore
Description: Use the alienized MemStore as the object store backend.

--seastore
Description: Use SeaStore as the back-end object store.

--seastore-devs
Description: Specify the block device used by SeaStore.

--seastore-secondary-devs
Description: Optional. SeaStore supports multiple devices. Enable this feature by passing the block device to this option.

--seastore-secondary-devs-type
Description: Optional. Specify the type of secondary devices. When the secondary devices are slower than the main device passed to --seastore-devs, cold data in the faster device is evicted to the slower devices over time. Valid types include HDD, SSD (default), ZNS, and RANDOM_BLOCK_SSD. Note that secondary devices should not be faster than the main device.
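
The following is a hedged sketch of how several of these options might be combined when starting a development cluster. The use of the vstart.sh development script, the environment variables, and the device path are assumptions that are not documented here; adapt the invocation to your own tooling:

Example

# vstart.sh usage and the device path are illustrative assumptions.
[root@host01 ~]# MON=1 MGR=1 OSD=3 MDS=0 ../src/vstart.sh -n --without-dashboard \
  --crimson --seastore --seastore-devs /dev/nvme1n1 \
  --osd-args "--memory 2G"

Here, --crimson selects crimson-osd, --seastore and --seastore-devs select the SeaStore backend and its block device, and --osd-args forwards the Seastar memory option to crimson-osd.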

13.5. Configuring Crimson

Configure crimson-osd by installing a new storage cluster with the bootstrap option. You cannot upgrade an existing cluster to this configuration because Crimson is in an experimental phase. WARNING: Do not use production data, because doing so might result in data loss.

Prerequisites

  • An IP address for the first Ceph Monitor container, which is also the IP address for the first node in the storage cluster.
  • Login access to registry.redhat.io.
  • A minimum of 10 GB of free space for /var/lib/containers/.
  • Root-level access to all nodes.

Procedure

  1. While bootstrapping, use the --image flag to use a Crimson build.

    Example

    [root@host01 ~]# cephadm --image quay.ceph.io/ceph-ci/ceph:b682861f8690608d831f58603303388dd7915aa7-crimson bootstrap --mon-ip 10.1.240.54 --allow-fqdn-hostname --initial-dashboard-password Ceph_Crims

  2. Log in to the cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  3. Enable Crimson globally as an experimental feature.

    Example

    [ceph: root@host01 /]# ceph config set global 'enable_experimental_unrecoverable_data_corrupting_features' crimson

    Crimson is in a technology preview stage and is not suitable for production use.

  4. Enable the OSD Map flag.

    Example

    [ceph: root@host01 /]# ceph osd set-allow-crimson --yes-i-really-mean-it

    The monitor allows crimson-osd to boot only with the --yes-i-really-mean-it flag.

  5. Enable the Crimson parameter on the monitor so that the default pools are created as Crimson pools.

    Example

    [ceph: root@host01 /]#  ceph config set mon osd_pool_default_crimson true

    The crimson-osd daemon does not instantiate placement groups (PGs) for non-Crimson pools.

  6. Configure CPU allocation according to the resources available. See the example after the following note.

Note: It is recommended that the value of crimson_seastar_num_threads multiplied by the number of OSDs on each host be less than the number of CPU cores (nproc) on the host.
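
The following is a hedged sketch of this sizing rule, assuming a host with nproc equal to 8 and two Crimson OSDs; three reactor threads per OSD keeps 2 * 3 = 6 below the 8 available cores. The values are assumptions used only for illustration:

Example

[ceph: root@host01 /]# ceph config set osd crimson_seastar_num_threads 3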

13.6. Crimson configuration parameters

The following are the parameters that you can use to configure Crimson.

Table 13.1. Seastar CPU pinning configuration parameters

crimson_seastar_num_threads
Type: uint
Description: Specifies the number of threads used to serve Seastar reactors when CPU pinning is not enabled. This setting is overridden if crimson_seastar_cpu_cores is configured. Valid values range from 0 to 32.
Default: 0

crimson_seastar_cpu_cores
Type: string
Description: Specifies the CPU cores on which Seastar reactor threads run, using the cpuset(7) format. This setting overrides crimson_seastar_num_threads.
Default: 0
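
For example, the following hedged sketch pins the Seastar reactor threads to the first four CPU cores by using the cpuset(7) format. The core range 0-3 is an arbitrary illustration:

Example

[ceph: root@host01 /]# ceph config set osd crimson_seastar_cpu_cores 0-3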

Table 13.2. BlueStore CPU pinning configuration parameters

crimson_alien_op_num_threads
Type: uint
Description: Specifies the number of threads used to serve the alienized ObjectStore. This parameter is applicable only when BlueStore is in use.
Default: 6

crimson_alien_thread_cpu_cores
Type: string
Description: Specifies the CPU cores on which AlienStore threads run, using the cpuset(7) format. If not set, AlienStore threads are not pinned. This parameter is applicable only when BlueStore is in use.
Default: 0

Table 13.3. Crimson OSD configuration parameters

crimson_osd_objectstore
Type: string
Description: Specifies the backend type for a Crimson OSD, such as seastore or bluestore.
Default: None

crimson_osd_obc_lru_size
Type: uint
Description: Specifies the number of Object Contexts (OBCs) to cache.
Default: 512

crimson_osd_max_concurrent_ios
Type: uint
Description: Specifies the maximum number of concurrent I/O operations. A value of 0 allows unlimited operations.
Default: 0

crimson_osd_stat_interval
Type: int
Description: Specifies the interval, in seconds, for reporting OSD status. A value of 0 disables reporting.
Default: 0
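
As a hedged illustration of these options, the following sketch caps the number of concurrent I/O operations for Crimson OSDs. The value 256 is an arbitrary example, not a recommendation:

Example

[ceph: root@host01 /]# ceph config set osd crimson_osd_max_concurrent_ios 256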

Table 13.4. Crimson reactor configuration parameters

crimson_reactor_idle_poll_time_us
Type: uint
Description: Specifies Seastar’s reactor idle polling time, in microseconds, before the reactor returns to sleep. Longer polling times increase CPU usage.
Default: 200

crimson_reactor_io_latency_goal_ms
Type: float
Description: Specifies the maximum time, in milliseconds, that Seastar reactor I/O operations can take. If not set, the default is 1.5 * crimson_reactor_task_quota_ms. Increasing this value allows more I/O requests to be dispatched concurrently.
Default: 0

crimson_reactor_task_quota_ms
Type: float
Description: Specifies the maximum time, in milliseconds, that Seastar reactors wait between polls. Shorter wait times increase CPU utilization.
Default: 0.5

Table 13.5. SeaStore configuration parameters

seastore_segment_size
Type: size
Description: Specifies the segment size used by the SegmentManager.
Default: 64_M

seastore_device_size
Type: size
Description: Specifies the total size of the SegmentManager block file, if one is created.
Default: 50_G

seastore_block_create
Type: boolean
Description: Specifies whether to create the SegmentManager file if it does not already exist. See also seastore_device_size.
Default: True

seastore_journal_batch_capacity
Type: uint
Description: Specifies the maximum number of records allowed in a journal batch.
Default: 16

seastore_journal_batch_flush_size
Type: size
Description: Specifies the size threshold that triggers a forced flush of a journal batch.
Default: 16_M

seastore_journal_iodepth_limit
Type: uint
Description: Specifies the I/O depth limit for submitting journal records.
Default: 5

seastore_journal_batch_prefered_fullness
Type: float
Description: Specifies the record fullness threshold that triggers a flush of a journal batch.
Default: 0.95

seastore_default_max_object_size
Type: uint
Description: Specifies the default logical address space reserved for SeaStore object data.
Default: 16777216

seastore_default_object_metadata_reservation
Type: uint
Description: Specifies the default logical address space reserved for SeaStore object metadata.

13.7. Profiling Crimson

Profiling Crimson is a methodology for performance testing with Crimson. Two types of profiling are supported:

  • Flexible I/O (FIO) - The crimson-store-nbd tool exposes the configurable FuturizedStore internals as an NBD server for use with FIO.
  • Ceph Benchmarking Tool (CBT) - A testing harness, written in Python, for testing the performance of a Ceph cluster.

Procedure

  1. Install libnbd and compile FIO:

    Example

    [root@host01 ~]# dnf install libnbd
    [root@host01 ~]# git clone git://git.kernel.dk/fio.git
    [root@host01 ~]# cd fio
    [root@host01 ~]# ./configure --enable-libnbd
    [root@host01 ~]# make

  2. Build crimson-store-nbd:

    Example

    [root@host01 ~]# cd build
    [root@host01 ~]# ninja crimson-store-nbd

  3. Run the crimson-store-nbd server. To use a block device, specify the path to the raw device, such as /dev/nvme1n1, in place of the image file:

    Example

    [root@host01 ~]# export disk_img=/tmp/disk.img
    [root@host01 ~]# export unix_socket=/tmp/store_nbd_socket.sock
    [root@host01 ~]# rm -f $disk_img $unix_socket
    [root@host01 ~]# truncate -s 512M $disk_img
    [root@host01 ~]# ./bin/crimson-store-nbd \
      --device-path $disk_img \
      --smp 1 \
      --mkfs true \
      --type transaction_manager \
      --uds-path ${unix_socket} &
    Where:
    --smp is the number of CPU cores to use.
    --mkfs initializes the device first.
    --type is the backend.

  4. Create an FIO job named nbd.fio:

    Example

    [global]
    ioengine=nbd
    uri=nbd+unix:///?socket=${unix_socket}
    rw=randrw
    time_based
    runtime=120
    group_reporting
    iodepth=1
    size=512M
    
    [job0]
    offset=0

  5. Test the Crimson object store with the compiled FIO:

    Example

    [root@host01 ~]# ./fio nbd.fio

Ceph Benchmarking Tool (CBT)

Run the same test against two branches: one is main (master), and the other is a topic branch of your choice. Then compare the test results. Along with every test case, a set of rules is defined to check for performance regressions when two sets of test results are compared. If a possible regression is found, the rule and the corresponding test results are highlighted.

Procedure

  1. From the main branch and the topic branch, run make crimson-osd:

    Example

    [root@host01 ~]# git checkout master
    [root@host01 ~]# make crimson-osd
    [root@host01 ~]# ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
    [root@host01 ~]# git checkout topic
    [root@host01 ~]# make crimson-osd
    [root@host01 ~]# ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml

  2. Compare the test results:

    Example

    [root@host01 ~]# ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
