Chapter 2. Host Recommendations
Generally, we recommend running Ceph daemons of a specific type on a host configured for that type of daemon. We recommend using other hosts for processes that utilize your data cluster (e.g., OpenStack or CloudStack).
How you select and configure a Ceph OSD host depends largely on how you intend to use the OSDs on it (e.g., for OpenStack volumes and images, for an S3 gateway, or for a fast SSD pool or cache tier). See the Ceph Storage Strategies Guide for details about defining storage strategies for your Ceph use cases, and use these recommendations to help define your host requirements.
2.1. CPU
Ceph OSDs run the storage cluster service, calculate data placement with CRUSH, replicate data, and maintain their own copy of the cluster map. Ceph OSDs that host erasure-coded pools will use more CPU than Ceph OSDs that host replicated pools. Therefore, OSD hosts should have a reasonable amount of processing power, sized for the storage strategies you intend to use. Monitors simply maintain a master copy of the cluster map, so they are not CPU intensive.
You must also consider whether the host machine will run CPU-intensive processes in addition to Ceph daemons. For example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure that these other processes leave sufficient processing power for Ceph daemons. We recommend running additional CPU-intensive processes on separate hosts.
2.2. RAM
Ceph monitors must be capable of serving their data quickly, so they need a reasonable amount of RAM, for example, 1 GB of RAM per daemon instance. OSDs need around 2 GB or more of RAM per daemon. Generally, more RAM is better.
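As a rough illustration of the per-daemon figures above, the following Python sketch adds up a lower bound for the RAM needed by the Ceph daemons on one host. The function name and its parameters are hypothetical, the defaults simply restate the 1 GB per monitor and 2 GB per OSD guidance, and the result does not include memory for the operating system or other processes.

```python
def estimate_daemon_ram_gb(osd_count, monitor_count=0,
                           gb_per_osd=2.0, gb_per_monitor=1.0):
    """Return a rough lower bound on RAM (GB) for the Ceph daemons on a host.

    The per-daemon defaults restate the guidance in this section; they are
    starting points, not measured values.
    """
    if osd_count < 0 or monitor_count < 0:
        raise ValueError("daemon counts must be non-negative")
    return osd_count * gb_per_osd + monitor_count * gb_per_monitor


if __name__ == "__main__":
    # Hypothetical host running 12 OSD daemons and no monitor.
    print(estimate_daemon_ram_gb(osd_count=12))
    # Hypothetical host running a single monitor.
    print(estimate_daemon_ram_gb(osd_count=0, monitor_count=1))
```

Treat the result as a floor: since more RAM is generally better, round up rather than down when sizing a host.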
2.3. Data Storage
Plan your data storage configuration carefully. There are significant cost and performance tradeoffs to consider when planning for data storage. Simultaneous OS operations and simultaneous requests for read and write operations from multiple daemons against a single drive can slow performance considerably.
Ceph can operate with heterogeneous systems. CRUSH supports weighting for different sized drives (e.g., 1TB and 3TB) and primary affinity (the likelihood that an OSD will be used as a primary) to address the performance issues introduced by dissimilar hardware in the same pool. However, using homogeneous configurations for the OSDs assigned to a pool is recommended.
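To make the idea of capacity-based weighting concrete, here is a minimal Python sketch. It assumes the common convention of a weight of 1.0 per terabyte and assumes that data is distributed roughly in proportion to weight. Both are simplifications, and the OSD names and helper functions are hypothetical rather than part of any Ceph API.

```python
def capacity_weights(capacities_tb, tb_per_weight_unit=1.0):
    """Map each OSD to a weight proportional to its capacity in terabytes."""
    if tb_per_weight_unit <= 0:
        raise ValueError("tb_per_weight_unit must be positive")
    return {osd: round(tb / tb_per_weight_unit, 2)
            for osd, tb in capacities_tb.items()}


def expected_data_share(weights):
    """Approximate the fraction of data each OSD receives from its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {osd: weight / total for osd, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical pool mixing one 1 TB drive and one 3 TB drive.
    weights = capacity_weights({"osd.0": 1.0, "osd.1": 3.0})
    print(weights)
    print(expected_data_share(weights))
```

In this sketch the 3 TB drive receives roughly three times the data, and therefore roughly three times the I/O, of the 1 TB drive. That uneven load is one reason homogeneous hardware within a pool is easier to reason about.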
2.3.1. Identical Configurations
We recommend creating pools and defining CRUSH hierarchies such that the OSD hardware within the pool is identical. That is:
- Same controller
- Same drive size
- Same RPMs
- Same seek times
- Same I/O
- Same network throughput
- Same journal configuration
Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines troubleshooting.
2.3.2. Journaling
There are also file system limitations to consider: btrfs is not quite stable enough for production, but it has the ability to journal and write data simultaneously, whereas XFS and ext4 do not.
Since Ceph has to write all data to the journal before it can send an ACK (for XFS and ext4 at least), it is important to keep journal performance and OSD performance in balance.
2.3.3. Hard Disk Drives
OSDs should have plenty of hard disk drive space for object data. We recommend a minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage of larger disks. We recommend dividing the price of the hard disk drive by the number of gigabytes to arrive at a cost per gigabyte, because larger drives may have a significant impact on the cost per gigabyte. For example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the 1 terabyte disks would increase the cost per gigabyte by about 50% (0.0732 / 0.0488 = 1.5), rendering your cluster substantially less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon you will need, especially during rebalancing, backfilling and recovery. Red Hat typically recommends a baseline of 16GB of RAM, with an additional 2GB of RAM per OSD.
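The cost-per-gigabyte comparison can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is hypothetical, the prices are the example figures from the text, and 1 TB is counted as 1024 GB as in the example. Working from the unrounded values, the 1 terabyte drive costs 50% more per gigabyte; dividing the rounded figures ($0.07 / $0.05) instead gives 40%.

```python
def cost_per_gb(price_usd, capacity_tb, gb_per_tb=1024):
    """Return the price per gigabyte for a drive of the given capacity."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return price_usd / (capacity_tb * gb_per_tb)


if __name__ == "__main__":
    small = cost_per_gb(75.00, 1)    # 1 TB drive at $75.00
    large = cost_per_gb(150.00, 3)   # 3 TB drive at $150.00
    premium = small / large - 1      # relative extra cost of the 1 TB drive
    print(f"1 TB: ${small:.4f}/GB, 3 TB: ${large:.4f}/GB, premium: {premium:.0%}")
```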
Running multiple OSDs on a single disk—irrespective of partitions—is NOT a good idea.
Running an OSD and a monitor on a single disk—irrespective of partitions—is NOT a good idea.
Storage drives are subject to limitations on seek time, access time, read and write times, as well as total throughput. These physical limitations affect overall system performance—especially during recovery. We recommend using a dedicated drive for the operating system and software, and one drive for each Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running an operating system, multiple OSDs, and/or multiple journals on the same drive. Since the cost of troubleshooting performance issues on a small cluster likely exceeds the cost of the extra disk drives, you can accelerate your cluster design planning by avoiding the temptation to overtax the OSD storage drives.
You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely lead to resource contention and diminish the overall throughput. You may store a journal and object data on the same drive, but this may increase the time it takes to journal a write and ACK to the client. Ceph must write to the journal before it can ACK the write. The btrfs filesystem can write journal data and object data simultaneously, whereas XFS and ext4 cannot.
Ceph best practices dictate that you should run operating systems, OSD data and OSD journals on separate drives. SSDs for operating system drives are preferred.
2.3.4. Avoid RAID
Ceph replicates or erasure codes objects, so RAID adds redundant protection, reduces available capacity, and is therefore an unnecessary expense. A degraded RAID will also have a negative impact on performance. If you have systems with RAID controllers, configure them for RAID 0 (JBOD).
2.3.5. Solid State Drives
One opportunity for performance improvement is to use solid-state drives (SSDs) to reduce random access time and read latency while accelerating throughput. SSDs often cost more than 10x as much per gigabyte when compared to a hard disk drive, but SSDs often exhibit access times that are at least 100x faster than a hard disk drive.
SSDs do not have moving mechanical parts so they aren’t necessarily subject to the same types of limitations as hard disk drives. SSDs do have significant limitations though. When evaluating SSDs, it is important to consider the performance of sequential reads and writes. An SSD that has 400MB/s sequential write throughput may have much better performance than an SSD with 120MB/s of sequential write throughput when storing multiple journals for multiple OSDs.
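One way to reason about sequential write throughput when an SSD hosts several journals is to divide the SSD's sustained sequential write rate by the write rate each journal is expected to absorb. The Python sketch below does exactly that. The 100 MB/s per-journal figure is an assumption chosen for illustration (roughly the sequential write rate of a single hard disk drive), and the function name is hypothetical.

```python
def max_journals_per_ssd(ssd_seq_write_mb_s, per_journal_write_mb_s):
    """Return how many journals an SSD can host without becoming the bottleneck.

    Assumes every journal may be written at its full rate at the same time,
    which is a pessimistic but safe planning assumption.
    """
    if ssd_seq_write_mb_s < 0:
        raise ValueError("ssd_seq_write_mb_s must be non-negative")
    if per_journal_write_mb_s <= 0:
        raise ValueError("per_journal_write_mb_s must be positive")
    return int(ssd_seq_write_mb_s // per_journal_write_mb_s)


if __name__ == "__main__":
    # The two SSDs from the text, with an assumed 100 MB/s per journal.
    print(max_journals_per_ssd(400, 100))
    print(max_journals_per_ssd(120, 100))
```

Under this assumption the 400MB/s SSD can keep up with four journals while the 120MB/s SSD can keep up with only one. Measured sustained write rates from your own test configuration should replace the example numbers.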
We recommend exploring the use of SSDs to improve performance. However, before making a significant investment in SSDs, we strongly recommend both reviewing the performance metrics of an SSD and testing the SSD in a test configuration to gauge performance.
Since SSDs have no moving mechanical parts, it makes sense to use them in the areas of Ceph that do not use a lot of storage space (e.g., journals or cache-tiers). Relatively inexpensive SSDs may appeal to your sense of economy. Use caution. Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There are a few important performance considerations for journals and SSDs:
- Write-intensive semantics: Journaling involves write-intensive semantics, so you should ensure that the SSD you choose to deploy will perform equal to or better than a hard disk drive when writing data. Inexpensive SSDs may introduce write latency even as they accelerate access time, because sometimes high performance hard drives can write as fast or faster than some of the more economical SSDs available on the market!
- Sequential Writes: When you store multiple journals on an SSD you must consider the sequential write limitations of the SSD too, since they may be handling requests to write to multiple OSD journals simultaneously.
- Partition Alignment: A common problem with SSD performance is that people like to partition drives as a best practice, but they often overlook proper partition alignment with SSDs, which can cause SSDs to transfer data much more slowly. Ensure that SSD partitions are properly aligned.
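The partition alignment consideration above can be checked arithmetically: a partition is aligned when its starting byte offset is a multiple of the chosen alignment boundary. The following Python sketch assumes a 1 MiB boundary and 512-byte logical sectors. Both are common conventions rather than values taken from this guide, the correct boundary depends on the specific SSD, and the function name is hypothetical.

```python
MIB = 1024 * 1024


def is_partition_aligned(start_sector, sector_size_bytes=512,
                         alignment_bytes=MIB):
    """Return True if the partition's first byte falls on an alignment boundary."""
    if start_sector < 0 or sector_size_bytes <= 0 or alignment_bytes <= 0:
        raise ValueError("invalid sector or alignment value")
    return (start_sector * sector_size_bytes) % alignment_bytes == 0


if __name__ == "__main__":
    # Start sector 2048 with 512-byte sectors is exactly 1 MiB into the drive.
    print(is_partition_aligned(2048))
    # Start sector 63, a legacy default, is not on a 1 MiB boundary.
    print(is_partition_aligned(63))
```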
While SSDs are cost prohibitive for object storage, OSDs may see a significant performance improvement by storing an OSD’s journal on an SSD and the OSD’s object data on a separate hard disk drive. The osd journal configuration setting defaults to /var/lib/ceph/osd/$cluster-$id/journal. You can mount this path to an SSD or to an SSD partition so that it is not merely a file on the same drive as the object data.
2.3.6. Controllers
Disk controllers also have a significant impact on write throughput. Carefully consider your selection of disk controllers to ensure that they do not create a performance bottleneck.
2.3.7. Additional Considerations
Multiple OSDs per one host
You can run multiple OSDs per host, but ensure that the sum of the total throughput of your OSD hard disks does not exceed the network bandwidth required to service a client’s need to read or write data.
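A quick way to sanity-check this is to compare the summed disk throughput against the host's network link. The Python sketch below is a simplified model: it uses decimal units (1 Gbit/s = 125 MB/s), ignores protocol overhead and replication traffic, and the disk and link figures are hypothetical examples.

```python
def network_capacity_mb_s(link_gbit_s):
    """Convert a link speed in Gbit/s to MB/s using decimal units."""
    return link_gbit_s * 1000 / 8


def disks_fit_network(disk_throughputs_mb_s, link_gbit_s):
    """Return True if the combined disk throughput does not exceed the link."""
    return sum(disk_throughputs_mb_s) <= network_capacity_mb_s(link_gbit_s)


if __name__ == "__main__":
    # Hypothetical host: 12 drives at about 100 MB/s each.
    disks = [100] * 12
    print(disks_fit_network(disks, link_gbit_s=10))
    print(disks_fit_network(disks, link_gbit_s=1))
```

In this example twelve drives total 1200 MB/s, which fits within a 10 Gbit/s link (1250 MB/s) but far exceeds a 1 Gbit/s link (125 MB/s).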
Also, consider what percentage of the overall data the cluster stores on each host. If the percentage on a particular host is large and the host fails, it can lead to problems such as exceeding the full ratio, which causes Ceph to halt operations as a safety precaution that prevents data loss.
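The effect of a host failure on cluster utilization can be estimated with a simple model. The Python sketch below assumes that all raw data, including replicas, is recovered onto the surviving hosts and treats the cluster as a single pool of capacity. In practice an individual OSD can reach the threshold earlier. The host names, capacities, and the 0.95 threshold are assumptions for illustration, so check the full ratio configured for your own cluster.

```python
def utilization_after_host_failure(host_capacity_tb, used_raw_tb, failed_host):
    """Return raw utilization once the failed host's capacity is removed."""
    if failed_host not in host_capacity_tb:
        raise KeyError(f"unknown host: {failed_host}")
    remaining = sum(capacity for host, capacity in host_capacity_tb.items()
                    if host != failed_host)
    if remaining <= 0:
        raise ValueError("no capacity remains after the failure")
    return used_raw_tb / remaining


def exceeds_full_ratio(host_capacity_tb, used_raw_tb, failed_host,
                       full_ratio=0.95):
    """Return True if losing the host pushes utilization above full_ratio."""
    after = utilization_after_host_failure(host_capacity_tb, used_raw_tb,
                                           failed_host)
    return after > full_ratio


if __name__ == "__main__":
    # Hypothetical cluster: four hosts with 40 TB of raw capacity each.
    hosts = {"host-a": 40, "host-b": 40, "host-c": 40, "host-d": 40}
    print(exceeds_full_ratio(hosts, used_raw_tb=110, failed_host="host-a"))
    print(exceeds_full_ratio(hosts, used_raw_tb=120, failed_host="host-a"))
```

With 110 TB stored, losing one host raises utilization from about 69% to about 92%, which stays under the assumed threshold. With 120 TB stored, the same failure fills the remaining capacity completely.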
Use battery backups or separate power feed to racks
Red Hat strongly recommends using separate power feeds to racks or battery backups for monitors to prevent data loss in case of a power outage. If the majority of Ceph monitors do not start when the power is restored, quorum cannot be obtained and the Ceph cluster is unable to recover.
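The majority requirement can be expressed as a small calculation: quorum needs more than half of the configured monitors to be running. The Python sketch below is illustrative only, and the function names are hypothetical.

```python
def monitors_needed_for_quorum(total_monitors):
    """Return the smallest number of monitors that forms a strict majority."""
    if total_monitors < 1:
        raise ValueError("total_monitors must be at least 1")
    return total_monitors // 2 + 1


def has_quorum(total_monitors, running_monitors):
    """Return True if the running monitors form a majority of all monitors."""
    return running_monitors >= monitors_needed_for_quorum(total_monitors)


if __name__ == "__main__":
    # With three monitors, two must come back after a power outage.
    print(monitors_needed_for_quorum(3))
    print(has_quorum(total_monitors=3, running_monitors=1))
```

Note that four monitors still need three running, so they tolerate only one failure, the same as three monitors. This is why placing monitors on independent power feeds matters more than simply adding another monitor on the same feed.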
Use SSDs for Monitor stores
The Monitor stores can generate a significant amount of I/O operations, so solid-state drives (SSDs) are an ideal storage medium for them.
To ensure data integrity during power loss, you must disable all caches or safeguard them by hardware mechanisms like battery backup units or super capacitors coupled with non-volatile stores.