Appendix A. Ceph block device configuration reference
As a storage administrator, you can fine-tune the behavior of Ceph block devices through the available configuration options. Use this reference to look up the Ceph block device options and their defaults, including the default image creation options and the caching options.
Prerequisites
- A running Red Hat Ceph Storage cluster.
A.1. Block device default options
You can override the default settings for creating an image. By default, Ceph creates images with format 2 and no striping.
- rbd_default_format
- Description
- The default format (2) if no other format is specified. Format 1 is the original format for a new image, which is compatible with all versions of librbd and the kernel module, but does not support newer features like cloning. Format 2 is supported by librbd and the kernel module since version 3.11 (except for striping). Format 2 adds support for cloning and is more easily extensible to allow more features in the future.
- Type
- Integer
- Default
-
2
- rbd_default_order
- Description
- The default order if no other order is specified.
- Type
- Integer
- Default
-
22
- rbd_default_stripe_count
- Description
- The default stripe count if no other stripe count is specified. Changing the default value requires the striping v2 feature.
- Type
- 64-bit Unsigned Integer
- Default
-
0
- rbd_default_stripe_unit
- Description
- The default stripe unit if no other stripe unit is specified. Changing the unit from 0 (that is, the object size) requires the striping v2 feature.
- Type
- 64-bit Unsigned Integer
- Default
-
0
- rbd_default_features
- Description
- The default features enabled when creating a block device image. This setting only applies to format 2 images. The settings are:
1: Layering support. Layering enables you to use cloning.
2: Striping v2 support. Striping spreads data across multiple objects. Striping helps with parallelism for sequential read/write workloads.
4: Exclusive locking support. When enabled, it requires a client to get a lock on an object before making a write.
8: Object map support. Block devices are thin-provisioned, meaning that they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, or importing and exporting a sparsely populated image.
16: Fast-diff support. Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image and to calculate the actual data usage of a snapshot.
32: Deep-flatten support. Deep-flatten makes rbd flatten work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image still rely on the parent, so the parent cannot be deleted until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.
64: Journaling support. Journaling records all modifications to an image in the order they occur. This ensures that a crash-consistent mirror of the remote image is available locally.
The enabled features are the sum of the numeric settings.
- Type
- Integer
- Default
61
- layering, exclusive-lock, object-map, fast-diff, and deep-flatten are enabled.
Important: The current default setting is not compatible with the RBD kernel driver or with older RBD clients.
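Because the rbd_default_features value is the sum of the numeric settings, it can be read as a bit mask. The following is a minimal Python sketch of that arithmetic, not part of Ceph itself. The function names are hypothetical, and the short labels for striping and journaling are illustrative display names chosen for this example.

```python
# Feature values as listed in the rbd_default_features description.
# The labels are display names for this sketch only.
RBD_FEATURES = {
    1: "layering",
    2: "striping",
    4: "exclusive-lock",
    8: "object-map",
    16: "fast-diff",
    32: "deep-flatten",
    64: "journaling",
}


def decode_features(value: int) -> list[str]:
    """Return the labels of the features whose values are part of the sum."""
    return [name for bit, name in RBD_FEATURES.items() if value & bit]


def encode_features(names: list[str]) -> int:
    """Return the numeric setting for a collection of feature labels."""
    by_name = {name: bit for bit, name in RBD_FEATURES.items()}
    # set() guards against counting a feature twice.
    return sum(by_name[name] for name in set(names))


if __name__ == "__main__":
    # 61 = 1 + 4 + 8 + 16 + 32
    print(decode_features(61))
```

Decoding the documented default of 61 yields layering, exclusive-lock, object-map, fast-diff, and deep-flatten, which matches the list given for the default.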
- rbd_default_map_options
- Description
- Most of the options are useful mainly for debugging and benchmarking. See man rbd under Map Options for details.
- Type
- String
- Default
-
""
A.2. Block device general options
- rbd_op_threads
- Description
- The number of block device operation threads.
- Type
- Integer
- Default
-
1
Do not change the default value of rbd_op_threads because setting it to a number higher than 1 might cause data corruption.
- rbd_op_thread_timeout
- Description
- The timeout (in seconds) for block device operation threads.
- Type
- Integer
- Default
-
60
- rbd_non_blocking_aio
- Description
- If true, Ceph will process block device asynchronous I/O operations from a worker thread to prevent blocking.
- Type
- Boolean
- Default
-
true
- rbd_concurrent_management_ops
- Description
- The maximum number of concurrent management operations in flight (for example, deleting or resizing an image).
- Type
- Integer
- Default
-
10
- rbd_request_timed_out_seconds
- Description
- The number of seconds before a maintenance request times out.
- Type
- Integer
- Default
-
30
- rbd_clone_copy_on_read
- Description
- When set to true, copy-on-read cloning is enabled.
- Type
- Boolean
- Default
-
false
- rbd_enable_alloc_hint
- Description
- If true, allocation hinting is enabled, and the block device will issue a hint to the OSD back end to indicate the expected object size.
- Type
- Boolean
- Default
-
true
- rbd_skip_partial_discard
- Description
- If true, the block device will skip zeroing a range when trying to discard a range inside an object.
- Type
- Boolean
- Default
-
true
- rbd_tracing
- Description
- Set this option to true to enable the Linux Trace Toolkit Next Generation User Space Tracer (LTTng-UST) tracepoints. See Tracing RADOS Block Device (RBD) Workloads with the RBD Replay Feature for details.
- Type
- Boolean
- Default
-
false
- rbd_validate_pool
- Description
- Set this option to true to validate empty pools for RBD compatibility.
- Type
- Boolean
- Default
-
true
- rbd_validate_names
- Description
- Set this option to true to validate image specifications.
- Type
- Boolean
- Default
-
true
A.3. Block device caching options
The user space implementation of the Ceph block device, that is, librbd, cannot take advantage of the Linux page cache, so it includes its own in-memory caching, called RBD caching. Ceph block device caching behaves just like well-behaved hard disk caching. When the operating system sends a barrier or a flush request, all dirty data is written to the Ceph OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a virtual machine that properly sends flushes, that is, one running Linux kernel version 2.6.32 or higher. The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can coalesce contiguous requests for better throughput.
Ceph block devices support write-back caching. To enable write-back caching, add rbd_cache = true to the [client] section of the Ceph configuration file. Without caching, writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more than rbd_cache_max_dirty unflushed bytes. In this case, the write triggers write-back and blocks until enough bytes are flushed.
Ceph block devices also support write-through caching. You can set the size of the cache, and you can set targets and limits to switch from write-back caching to write-through caching. To enable write-through mode, set rbd_cache_max_dirty to 0. This means writes return only when the data is on disk on all replicas, but reads may come from the cache. The cache is in memory on the client, and each Ceph block device image has its own cache. Because the cache is local to the client, there is no coherency if other clients access the image. Running shared-disk file systems, such as GFS or OCFS, on top of Ceph block devices will not work with caching enabled.
The Ceph configuration settings for Ceph block devices must be set in the [client] section of the Ceph configuration file, which is /etc/ceph/ceph.conf by default.
The settings include:
- rbd_cache
- Description
- Enable caching for RADOS Block Device (RBD).
- Type
- Boolean
- Required
- No
- Default
-
true
- rbd_cache_size
- Description
- The RBD cache size in bytes.
- Type
- 64-bit Integer
- Required
- No
- Default
-
32 MiB
- rbd_cache_max_dirty
- Description
- The dirty limit in bytes at which the cache triggers write-back. If 0, the cache uses write-through caching.
- Type
- 64-bit Integer
- Required
- No
- Constraint
- Must be less than rbd_cache_size.
- Default
-
24 MiB
- rbd_cache_target_dirty
- Description
- The dirty target before the cache begins writing data to the data storage. Does not block writes to the cache.
- Type
- 64-bit Integer
- Required
- No
- Constraint
- Must be less than rbd_cache_max_dirty.
- Default
-
16 MiB
- rbd_cache_max_dirty_age
- Description
- The number of seconds dirty data is in the cache before write-back starts.
- Type
- Float
- Required
- No
- Default
-
1.0
- rbd_cache_max_dirty_object
- Description
- The dirty limit for objects. Set to 0 to calculate the limit automatically from rbd_cache_size.
- Type
- Integer
- Default
-
0
- rbd_cache_block_writes_upfront
- Description
- If true, it will block writes to the cache before the aio_write call completes. If false, it will block before the aio_completion is called.
- Type
- Boolean
- Default
-
false
- rbd_cache_writethrough_until_flush
- Description
- Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but safe setting in case VMs running on rbd are too old to send flushes, like the virtio driver in Linux before 2.6.32.
- Type
- Boolean
- Required
- No
- Default
-
true
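The constraints among the cache settings above can be checked before you write them to the configuration file. The following Python sketch is an illustration only and does not call any Ceph API. The function names are hypothetical, and skipping the rbd_cache_target_dirty check when rbd_cache_max_dirty is 0 is an assumption made for this example, because no target can be below a limit of 0.

```python
MIB = 1024 * 1024


def check_cache_settings(cache_size: int, max_dirty: int, target_dirty: int) -> list[str]:
    """Return the documented constraints that the given byte values violate."""
    problems = []
    if max_dirty >= cache_size:
        problems.append("rbd_cache_max_dirty must be less than rbd_cache_size")
    # Assumption for this sketch: in write-through mode (max_dirty == 0)
    # the target is not checked.
    if max_dirty > 0 and target_dirty >= max_dirty:
        problems.append("rbd_cache_target_dirty must be less than rbd_cache_max_dirty")
    return problems


def cache_mode(max_dirty: int) -> str:
    """A dirty limit of 0 selects write-through caching, otherwise write-back."""
    return "write-through" if max_dirty == 0 else "write-back"


if __name__ == "__main__":
    # Documented defaults: 32 MiB cache, 24 MiB dirty limit, 16 MiB dirty target.
    print(check_cache_settings(32 * MIB, 24 * MIB, 16 * MIB))
    print(cache_mode(24 * MIB))
```

The sketch does not model rbd_cache_writethrough_until_flush, which keeps the cache in write-through mode until the first flush request regardless of the dirty limit.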
A.4. Block device parent and child read options
- rbd_balance_snap_reads
- Description
- Ceph typically reads objects from the primary OSD. Since reads are immutable, you may enable this feature to balance snap reads between the primary OSD and the replicas.
- Type
- Boolean
- Default
-
false
- rbd_localize_snap_reads
- Description
- Whereas rbd_balance_snap_reads randomizes the replica for reading a snapshot, enabling rbd_localize_snap_reads makes the block device look to the CRUSH map to find the closest or local OSD for reading the snapshot.
- Type
- Boolean
- Default
-
false
- rbd_balance_parent_reads
- Description
- Ceph typically reads objects from the primary OSD. Since reads are immutable, you may enable this feature to balance parent reads between the primary OSD and the replicas.
- Type
- Boolean
- Default
-
false
- rbd_localize_parent_reads
- Description
- Whereas rbd_balance_parent_reads randomizes the replica for reading a parent, enabling rbd_localize_parent_reads makes the block device look to the CRUSH map to find the closest or local OSD for reading the parent.
- Type
- Boolean
- Default
-
true
A.5. Block device read ahead options
RBD supports read-ahead/prefetching to optimize small, sequential reads. This should normally be handled by the guest OS in the case of a VM, but boot loaders may not issue efficient reads. Read-ahead is automatically disabled if caching is disabled.
- rbd_readahead_trigger_requests
- Description
- Number of sequential read requests necessary to trigger read-ahead.
- Type
- Integer
- Required
- No
- Default
-
10
- rbd_readahead_max_bytes
- Description
- Maximum size of a read-ahead request. If zero, read-ahead is disabled.
- Type
- 64-bit Integer
- Required
- No
- Default
-
512 KiB
- rbd_readahead_disable_after_bytes
- Description
- After this many bytes have been read from an RBD image, read-ahead is disabled for that image until it is closed. This allows the guest OS to take over read-ahead once it is booted. If zero, read-ahead stays enabled.
- Type
- 64-bit Integer
- Required
- No
- Default
-
50 MiB
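The rules stated in this section for when read-ahead applies can be summarized in a small decision function. The following Python sketch is a simplified model of the documented rules, not the librbd implementation. The class and function names are hypothetical, and the exact boundary conditions (for example, whether reaching the byte threshold exactly disables read-ahead) are assumptions.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass
class ReadaheadSettings:
    trigger_requests: int = 10            # rbd_readahead_trigger_requests
    max_bytes: int = 512 * KIB            # rbd_readahead_max_bytes
    disable_after_bytes: int = 50 * MIB   # rbd_readahead_disable_after_bytes


def readahead_active(settings: ReadaheadSettings, cache_enabled: bool,
                     sequential_requests: int, bytes_read: int) -> bool:
    """Return True if, under the documented rules, read-ahead would apply."""
    if not cache_enabled:
        return False  # read-ahead is disabled when caching is disabled
    if settings.max_bytes == 0:
        return False  # a zero maximum size disables read-ahead
    if settings.disable_after_bytes != 0 and bytes_read >= settings.disable_after_bytes:
        return False  # the guest OS is expected to take over after this point
    # Enough sequential reads must have been seen to trigger read-ahead.
    return sequential_requests >= settings.trigger_requests


if __name__ == "__main__":
    print(readahead_active(ReadaheadSettings(), True, 10, 0))
```

With the default settings, the model reports read-ahead as active after 10 sequential requests and inactive once 50 MiB have been read from the image.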
A.6. Block device blocklist options
- rbd_blocklist_on_break_lock
- Description
- Whether to blocklist clients whose lock was broken.
- Type
- Boolean
- Default
-
true
- rbd_blocklist_expire_seconds
- Description
- The number of seconds to blocklist - set to 0 for OSD default.
- Type
- Integer
- Default
-
0
A.7. Block device journal options
- rbd_journal_order
- Description
- The number of bits to shift to compute the journal object maximum size. The value is between 12 and 64.
- Type
- 32-bit Unsigned Integer
- Default
-
24
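As the description of rbd_journal_order states, the journal object maximum size is computed by a bit shift. The following Python sketch shows that arithmetic for the default value. The function name is hypothetical, and the range check simply mirrors the bounds given in the description.

```python
def journal_object_max_size(order: int) -> int:
    """Return the journal object maximum size in bytes for a given order."""
    if not 12 <= order <= 64:
        raise ValueError("rbd_journal_order must be between 12 and 64")
    # Shifting 1 left by `order` bits gives 2**order bytes.
    return 1 << order


if __name__ == "__main__":
    size = journal_object_max_size(24)
    print(size, "bytes =", size // (1024 * 1024), "MiB")
```

For the default order of 24, the result is 16777216 bytes, which is 16 MiB.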
- rbd_journal_splay_width
- Description
- The number of active journal objects.
- Type
- 32-bit Unsigned Integer
- Default
-
4
- rbd_journal_commit_age
- Description
- The commit time interval in seconds.
- Type
- Double Precision Floating Point Number
- Default
-
5
- rbd_journal_object_flush_interval
- Description
- The maximum number of pending commits per journal object.
- Type
- Integer
- Default
-
0
- rbd_journal_object_flush_bytes
- Description
- The maximum number of pending bytes per journal object.
- Type
- Integer
- Default
-
0
- rbd_journal_object_flush_age
- Description
- The maximum time interval in seconds for pending commits.
- Type
- Double Precision Floating Point Number
- Default
-
0
- rbd_journal_pool
- Description
- Specifies a pool for journal objects.
- Type
- String
- Default
-
""
A.8. Block device configuration override options
Block device configuration override options for global and pool levels.
Global level
Available keys
rbd_qos_bps_burst
- Description
- The desired burst limit of IO bytes.
- Type
- Integer
- Default
-
0
rbd_qos_bps_limit
- Description
- The desired limit of IO bytes per second.
- Type
- Integer
- Default
-
0
rbd_qos_iops_burst
- Description
- The desired burst limit of IO operations.
- Type
- Integer
- Default
-
0
rbd_qos_iops_limit
- Description
- The desired limit of IO operations per second.
- Type
- Integer
- Default
-
0
rbd_qos_read_bps_burst
- Description
- The desired burst limit of read bytes.
- Type
- Integer
- Default
-
0
rbd_qos_read_bps_limit
- Description
- The desired limit of read bytes per second.
- Type
- Integer
- Default
-
0
rbd_qos_read_iops_burst
- Description
- The desired burst limit of read operations.
- Type
- Integer
- Default
-
0
rbd_qos_read_iops_limit
- Description
- The desired limit of read operations per second.
- Type
- Integer
- Default
-
0
rbd_qos_write_bps_burst
- Description
- The desired burst limit of write bytes.
- Type
- Integer
- Default
-
0
rbd_qos_write_bps_limit
- Description
- The desired limit of write bytes per second.
- Type
- Integer
- Default
-
0
rbd_qos_write_iops_burst
- Description
- The desired burst limit of write operations.
- Type
- Integer
- Default
-
0
rbd_qos_write_iops_limit
- Description
- The desired limit of write operations per second.
- Type
- Integer
- Default
-
0
The above keys can be used for the following:
rbd config global set CONFIG_ENTITY KEY VALUE
- Description
- Set a global level configuration override.
rbd config global get CONFIG_ENTITY KEY
- Description
- Get a global level configuration override.
rbd config global list CONFIG_ENTITY
- Description
- List the global level configuration overrides.
rbd config global remove CONFIG_ENTITY KEY
- Description
- Remove a global level configuration override.
Pool level
rbd config pool set POOL_NAME KEY VALUE
- Description
- Set a pool level configuration override.
rbd config pool get POOL_NAME KEY
- Description
- Get a pool level configuration override.
rbd config pool list POOL_NAME
- Description
- List the pool level configuration overrides.
rbd config pool remove POOL_NAME KEY
- Description
- Remove a pool level configuration override.
CONFIG_ENTITY is global, client, or client id. KEY is the config key. VALUE is the config value. POOL_NAME is the name of the pool.
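If you script these overrides, it can help to assemble the rbd config argument list in one place. The following Python sketch builds the set commands shown above without running them. The helper names and the pool name mypool are hypothetical, and the key check only covers the keys listed in this section, which is a restriction of this example.

```python
# Keys listed under "Available keys" in this section.
LISTED_KEYS = {
    f"rbd_qos_{prefix}{unit}_{kind}"
    for prefix in ("", "read_", "write_")
    for unit in ("bps", "iops")
    for kind in ("burst", "limit")
}


def _check_key(key: str) -> None:
    if key not in LISTED_KEYS:
        raise ValueError(f"{key} is not one of the keys listed in this section")


def global_set_command(config_entity: str, key: str, value: int) -> list[str]:
    """Build: rbd config global set CONFIG_ENTITY KEY VALUE"""
    _check_key(key)
    return ["rbd", "config", "global", "set", config_entity, key, str(value)]


def pool_set_command(pool_name: str, key: str, value: int) -> list[str]:
    """Build: rbd config pool set POOL_NAME KEY VALUE"""
    _check_key(key)
    return ["rbd", "config", "pool", "set", pool_name, key, str(value)]


if __name__ == "__main__":
    # "mypool" is a placeholder pool name for this example.
    print(" ".join(pool_set_command("mypool", "rbd_qos_iops_limit", 1000)))
```

The returned list can be passed to subprocess.run on a host where the rbd command is installed and the client has access to the storage cluster.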
A.9. Block device input and output options
General input and output options for Red Hat Ceph Storage.
rbd_compression_hint
- Description
- Hint to send to the OSDs on write operations. If set to compressible and the OSD bluestore_compression_mode setting is passive, the OSD attempts to compress data. If set to incompressible and the OSD bluestore_compression_mode setting is aggressive, the OSD will not attempt to compress data.
- Type
- Enum
- Required
- No
- Default
-
none
- Values
-
none, compressible, incompressible
rbd_read_from_replica_policy
- Description
- Policy for determining which OSD receives read operations. If set to default, each PG’s primary OSD will always be used for read operations. If set to balance, read operations will be sent to a randomly selected OSD within the replica set. If set to localize, read operations will be sent to the closest OSD as determined by the CRUSH map and the crush_location configuration option, where the crush_location is denoted using key=value. The key aligns with the CRUSH map keys.
Note: This feature requires the storage cluster to be configured with a minimum compatible OSD release of the latest version of Red Hat Ceph Storage.
- Type
- Enum
- Required
- No
- Default
-
default
- Values
-
default, balance, localize
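The interaction between rbd_compression_hint and the OSD bluestore_compression_mode setting described above can be written as a small decision function. The following Python sketch is an illustration, not OSD code, and the function name is hypothetical. This section documents only two combinations: compressible with passive, and incompressible with aggressive. The behavior for the remaining combinations is an assumption based on the general meaning of the passive and aggressive modes.

```python
HINTS = ("none", "compressible", "incompressible")


def osd_attempts_compression(hint: str, mode: str) -> bool:
    """Return True if the OSD would attempt to compress a write."""
    if hint not in HINTS:
        raise ValueError(f"unknown rbd_compression_hint: {hint}")
    if mode == "passive":
        # Documented: compress when the write is hinted as compressible.
        # Assumed: other hints are not compressed in this mode.
        return hint == "compressible"
    if mode == "aggressive":
        # Documented: do not compress when the write is hinted as incompressible.
        # Assumed: other hints are compressed in this mode.
        return hint != "incompressible"
    raise ValueError(f"mode not covered by this sketch: {mode}")


if __name__ == "__main__":
    for mode in ("passive", "aggressive"):
        for hint in HINTS:
            print(mode, hint, osd_attempts_compression(hint, mode))
```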