Chapter 4. Configuring a Cluster
The initial configuration of a production cluster is identical to configuring a proof-of-concept system. The only material difference is that the initial deployment uses production-grade hardware. First, follow the Requirements for Installing Red Hat Ceph Storage chapter of the Red Hat Ceph Storage 4 Installation Guide and execute the appropriate steps for each node. The following sections provide additional guidance relevant to production clusters.
4.1. Naming Hosts
When naming hosts, consider their use case and performance profile. For example, if the hosts will store client data, consider naming them according to their hardware configuration and performance profile:
- data-ssd-1, data-ssd-2
- hot-storage-1, hot-storage-2
- sata-1, sata-2
- sas-ssd-1, sas-ssd-2
The naming convention may make it easier to manage the cluster and troubleshoot hardware issues as they arise.
If the host contains hardware for multiple use cases—for example, the host contains SSDs for data, SAS drives with SSDs for journals, and SATA drives with co-located journals for cold storage—choose a generic name for the host. For example:
- osd-node-1
- osd-node-2
Generic host names can be extended when using logical host names in the CRUSH hierarchy as needed. For example:
- osd-node-1-ssd
- osd-node-1-sata
- osd-node-1-sas-ssd
- osd-node-1-bucket-index
- osd-node-2-ssd
- osd-node-2-sata
- osd-node-2-sas-ssd
- osd-node-2-bucket-index
See Using Logical Host Names in a CRUSH Map for additional details.
4.2. Tuning the Kernel
Production clusters benefit from tuning the operating system, specifically limits and memory allocation. Ensure that adjustments are set for all nodes within the cluster. Consult Red Hat support for additional guidance.
4.2.1. Reserving Free Memory for OSDs
To help prevent insufficient memory-related errors during OSD memory allocation requests, set the os_tuning_params option in the group_vars/all.yml file on the ceph-ansible node. This option specifies the amount of physical memory to keep in reserve. The recommended settings are based on the amount of system RAM. For example:
For 64GB RAM, reserve 1GB.
vm.min_free_kbytes = 1048576
For 128GB RAM, reserve 2GB.
vm.min_free_kbytes = 2097152
For 256GB RAM, reserve 3GB.
vm.min_free_kbytes = 3145728
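In ceph-ansible, os_tuning_params is expressed as a list of sysctl name/value pairs in group_vars/all.yml. The following is a minimal sketch for a node with 64 GB of RAM; verify the exact structure against the sample file shipped with your ceph-ansible version:

os_tuning_params:
  - { name: vm.min_free_kbytes, value: 1048576 }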
4.2.2. Increasing File Descriptors
The Ceph Object Gateway may hang if it runs out of file descriptors. Modify /etc/security/limits.conf on Ceph Object Gateway nodes to increase the file descriptors for the Ceph Object Gateway. For example:
ceph soft nofile unlimited
4.2.3. Adjusting ulimit on Large Clusters
For system administrators that will run Ceph administrator commands on large clusters—for example, 1024 OSDs or more—create an /etc/security/limits.d/50-ceph.conf file on each node that will run administrator commands with the following contents:
<username> soft nproc unlimited
Replace <username> with the name of the non-root account that will run Ceph administrator commands.
The root user’s ulimit is already set to "unlimited" by default on Red Hat Enterprise Linux.
4.3. Configuring Ansible Groups
This procedure is only pertinent for deploying Ceph using Ansible. The ceph-ansible package is already configured with a default osds group. If the cluster will only have one use case and storage policy, proceed with the procedure documented in the Installing a Red Hat Ceph Storage Cluster section of the Red Hat Ceph Storage Installation Guide. If the cluster will support multiple use cases and storage policies, create a group for each one. For each use case, copy /usr/share/ceph-ansible/group_vars/osds.sample to a file named for the group. For example, if the storage cluster has IOPS-optimized, throughput-optimized, and capacity-optimized use cases, create separate files representing the groups for each use case:
cd /usr/share/ceph-ansible/group_vars/
cp osds.sample osds-iops
cp osds.sample osds-throughput
cp osds.sample osds-capacity
Then, configure each file according to the use case.
Once the group variable files are configured, edit the site.yml file to ensure that it includes each new group. For example:
- hosts: osds-iops
  gather_facts: false
  become: True
  roles:
    - ceph-osd
- hosts: osds-throughput
  gather_facts: false
  become: True
  roles:
    - ceph-osd
- hosts: osds-capacity
  gather_facts: false
  become: True
  roles:
    - ceph-osd
Finally, in the /etc/ansible/hosts file, place the OSD nodes associated with a group under the corresponding group name. For example:
[osds-iops]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

[osds-throughput]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"

[osds-capacity]
<ceph-host-name> devices="[ '<device_1>', '<device_2>' ]"
4.4. Configuring Ceph
Generally, administrators should configure the Red Hat Ceph Storage cluster before the initial deployment using the Ceph Ansible configuration files found in the /usr/share/ceph-ansible/group_vars directory.
As indicated in the Installing a Red Hat Ceph Storage Cluster section of the Red Hat Ceph Storage installation guide:
- For Monitors, create a mons.yml file from the sample.mons.yml file.
- For OSDs, create an osds.yml file from the sample.osds.yml file.
- For the cluster, create an all.yml file from the sample.all.yml file.
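A minimal shell sketch of this step follows. The sample file names shown mirror the references above; confirm the exact names present in the group_vars directory of your ceph-ansible installation, as they can differ between releases.

cd /usr/share/ceph-ansible/group_vars/
cp sample.mons.yml mons.yml
cp sample.osds.yml osds.yml
cp sample.all.yml all.yml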
Modify the settings as directed in the Installation Guide.
Also refer to Installing the Ceph Object Gateway, and create an rgws.yml file from the sample.rgws.yml file.
NOTE
The settings in the foregoing files may take precedence over settings in ceph_conf_overrides.
To configure settings with no corresponding values in the mons.yml, osds.yml, or rgws.yml file, add configuration settings to the ceph_conf_overrides section of the all.yml file. For example:
ceph_conf_overrides:
  global:
    osd_pool_default_pg_num: <number>
See the Configuration File Structure for details on configuration file sections.
There are syntactic differences between specifying a Ceph configuration setting in an Ansible configuration file and how it renders in the Ceph configuration file.
In RHCS version 3.1 and earlier, the Ceph configuration file uses ini-style notation. Sections like [global] in the Ceph configuration file should be specified as global:, indented on their own lines. It is also possible to specify configuration sections for specific daemon instances. For example, placing osd.1: in the ceph_conf_overrides section of the all.yml file will render as [osd.1] in the Ceph configuration file, and the settings under that section will apply to osd.1 only.
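For illustration, a sketch of a daemon-specific override in all.yml follows; the setting and value are placeholders chosen only for the example:

ceph_conf_overrides:
  osd.1:
    osd_scrub_begin_hour: 23

This would render in the generated Ceph configuration file along the lines of:

[osd.1]
osd_scrub_begin_hour = 23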
Ceph configuration settings SHOULD contain dashes (-) or underscores (_) rather than spaces, and should terminate with a colon (:), not an equal sign (=).
Before deploying the Ceph cluster, consider the following configuration settings. When setting Ceph configuration settings, Red Hat recommends setting the values in the ceph-ansible configuration files, which will generate a Ceph configuration file automatically.
4.4.1. Setting the Journal Size
Set the journal size for the Ceph cluster. Configuration tools such as Ansible may have a default value. Generally, the journal size should be at least twice the product of the synchronization interval and the slower of the disk and network throughput.
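As an illustrative calculation only, with assumed values rather than recommendations: given a sustained throughput of 100 MB/s and a synchronization interval of 5 seconds, the journal should be at least 2 x 100 MB/s x 5 s = 1000 MB. The osd_journal_size setting is expressed in megabytes, so the corresponding Ceph configuration file entry might look like:

[osd]
osd_journal_size = 1024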
For details, see the Journal Settings section in the Configuration Guide for Red Hat Ceph Storage 4.
4.4.2. Adjusting Backfill & Recovery Settings
I/O is negatively impacted by both backfilling and recovery operations, leading to poor performance and unhappy end users. To help accommodate I/O demand during a cluster expansion or recovery, set the following options and values in the Ceph Configuration file:
[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
4.4.3. Adjusting the Cluster Map Size
For Red Hat Ceph Storage version 2 and earlier, when the cluster has thousands of OSDs, download the cluster map and check its file size. By default, the ceph-osd daemon caches 500 previous osdmaps. Even with deduplication, the map may consume a lot of memory per daemon. Tuning the cache size in the Ceph configuration file may help reduce memory consumption significantly. For example:
[global]
osd_map_message_max = 10

[osd]
osd_map_cache_size = 20
osd_map_max_advance = 10
osd_map_share_max_epochs = 10
osd_pg_epoch_persisted_max_stale = 10
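To check the on-disk size of the current OSD map, something like the following can be used; this is a sketch, and the output path is an arbitrary choice:

ceph osd getmap -o /tmp/osdmap.bin
ls -lh /tmp/osdmap.bin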
For Red Hat Ceph Storage version 3 and later, the ceph-manager daemon handles PG queries, so the cluster map should not impact performance.
4.4.4. Adjusting Scrubbing
By default, Ceph performs light scrubbing daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that PGs are storing the same object data. Over time, disk sectors can go bad irrespective of object sizes and checksums. Deep scrubbing compares an object’s contents with those of its replicas to ensure that the actual contents are the same. In this respect, deep scrubbing ensures data integrity in the manner of fsck, but the procedure imposes an I/O penalty on the cluster. Even light scrubbing can impact I/O.
The default settings may allow Ceph OSDs to initiate scrubbing at inopportune times such as peak operating times or periods with heavy loads. End users may experience latency and poor performance when scrubbing operations conflict with end user operations.
To prevent end users from experiencing poor performance, Ceph provides a number of scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours. For details, see the Scrubbing the OSD section in the Red Hat Ceph Storage Configuration Guide.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours. For example:
[osd]
osd_scrub_begin_hour = 23   #23:01H, or 11:01PM.
osd_scrub_end_hour = 6      #06:01H, or 6:01AM.
If time constraints aren’t an effective method of determining a scrubbing schedule, consider using the osd_scrub_load_threshold option. The default value is 0.5, but it could be modified for low load conditions. For example:
[osd]
osd_scrub_load_threshold = 0.25
4.4.5. Increase objecter_inflight_ops
In RHCS 3.0 and earlier releases, consider increasing objecter_inflight_ops to the default size for version 3.1 and later releases to improve scalability. For example:
objecter_inflight_ops = 24576
4.4.6. Increase rgw_thread_pool_size
In RHCS 3.0 and earlier releases, consider increasing rgw_thread_pool_size to the default size for version 3.1 and later releases to improve scalability. For example:
rgw_thread_pool_size = 512
4.4.7. Adjusting Garbage Collection Settings
The Ceph Object Gateway allocates storage for new and overwritten objects immediately. Additionally, the parts of a multi-part upload also consume some storage.
The Ceph Object Gateway purges the storage space used for deleted objects after deleting the objects from the bucket index. Similarly, the Ceph Object Gateway will delete data associated with a multi-part upload after the multi-part upload completes or when the upload has gone inactive or failed to complete for a configurable amount of time. The process of purging the deleted object data from the Red Hat Ceph Storage cluster is known as garbage collection (GC).
To view the objects awaiting garbage collection, run the following command:
radosgw-admin gc list
Garbage collection is a background activity that executes continuously or during times of low loads, depending upon how the storage administrator configures the Ceph Object Gateway. By default, the Ceph Object Gateway conducts garbage collection operations continuously. Since garbage collection operations are a normal function of the Ceph Object Gateway, especially with object delete operations, objects eligible for garbage collection exist most of the time.
Some workloads can temporarily or permanently outpace the rate of garbage collection activity. This is especially true of delete-heavy workloads, where many objects get stored for a short period of time and then deleted. For these types of workloads, storage administrators can increase the priority of garbage collection operations relative to other operations with the following configuration parameters:
- The rgw_gc_obj_min_wait configuration option is the minimum length of time, in seconds, to wait before purging a deleted object’s data. The default value is two hours, or 7200 seconds. The object is not purged immediately, because a client might be reading the object. Under delete-heavy workloads, this setting can leave too much storage consumed or a large number of deleted objects waiting to be purged. Red Hat recommends not setting this value below 30 minutes, or 1800 seconds.
- The rgw_gc_processor_period configuration option is the garbage collection cycle run time, that is, the amount of time between the start of consecutive runs of garbage collection threads. If garbage collection runs longer than this period, the Ceph Object Gateway will not wait before running a garbage collection cycle again.
- The rgw_gc_max_concurrent_io configuration option specifies the maximum number of concurrent IO operations that the gateway garbage collection thread will use when purging deleted data. Under delete-heavy workloads, consider increasing this setting to a larger number of concurrent IO operations.
- The rgw_gc_max_trim_chunk configuration option specifies the maximum number of keys to remove from the garbage collector log in a single operation. Under delete-heavy operations, consider increasing the maximum number of keys so that more objects are purged during each garbage collection operation, as shown in the sketch after this list.
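The following is a minimal sketch of raising these options for a delete-heavy workload through the ceph_conf_overrides section of all.yml. The values are illustrative assumptions, not Red Hat recommendations, and should be validated against the workload:

ceph_conf_overrides:
  global:
    rgw_gc_max_concurrent_io: 20
    rgw_gc_max_trim_chunk: 32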
Starting with Red Hat Ceph Storage 4.1, offloading the index object’s OMAP from the garbage collection log helps lessen the performance impact of garbage collection activities on the storage cluster. Some new configuration parameters have been added to Ceph Object Gateway to tune the garbage collection queue, as follows:
- The rgw_gc_max_deferred_entries_size configuration option sets the maximum size of deferred entries in the garbage collection queue.
- The rgw_gc_max_queue_size configuration option sets the maximum queue size used for garbage collection. This value should not be greater than osd_max_object_size minus rgw_gc_max_deferred_entries_size minus 1 KB.
- The rgw_gc_max_deferred configuration option sets the maximum number of deferred entries stored in the garbage collection queue.
These garbage collection configuration parameters are for Red Hat Ceph Storage 4 and higher.
In testing, with an evenly balanced delete-write workload, such as 50% delete and 50% write operations, the storage cluster filled completely in 11 hours because Ceph Object Gateway garbage collection failed to keep pace with the delete operations. The cluster status switches to the HEALTH_ERR state if this happens. Aggressive settings for parallel garbage collection tunables significantly delayed the onset of storage cluster fill in testing and can be helpful for many workloads. Typical real-world storage cluster workloads are not likely to cause a storage cluster fill primarily due to garbage collection.
Administrators may also modify Ceph configuration settings at runtime after deployment. See Setting a Specific Configuration Setting at Runtime for details.
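For instance, a sketch of adjusting one of the recovery settings discussed above at runtime with injectargs; the value shown is only an example:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'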