Chapter 2. General principles for selecting hardware
As a storage administrator, you must select the appropriate hardware for running a production Red Hat Ceph Storage cluster. When selecting hardware for Red Hat Ceph Storage, review the following general principles. These principles help you save time and money, avoid common mistakes, and achieve a more effective solution.
2.1. Prerequisites
- A planned use for Red Hat Ceph Storage.
2.2. Identify performance use case
One of the most important steps in a successful Ceph deployment is identifying a price-to-performance profile suitable for the cluster’s use case and workload. It is important to choose the right hardware for the use case. For example, choosing IOPS-optimized hardware for a cold storage application increases hardware costs unnecessarily, whereas choosing capacity-optimized hardware for its more attractive price point in an IOPS-intensive workload will likely lead to users complaining about slow performance.
The primary use cases for Ceph are:
- IOPS optimized: IOPS-optimized deployments are suitable for cloud computing operations, such as running MySQL or MariaDB instances as virtual machines on OpenStack. IOPS-optimized deployments require higher-performance storage, such as 15k RPM SAS drives and a separate flash-based BlueStore metadata device, to handle frequent write operations. Some high-IOPS scenarios use all-flash storage to improve IOPS and total throughput.
- Throughput optimized: Throughput-optimized deployments are suitable for serving up significant amounts of data, such as graphic, audio, and video content. Throughput-optimized deployments require networking hardware, controllers, and hard disk drives with acceptable total throughput characteristics. In cases where write performance is a requirement, a flash-based BlueStore metadata device substantially improves write performance.
- Capacity optimized: Capacity-optimized deployments are suitable for storing significant amounts of data as inexpensively as possible. Capacity-optimized deployments typically trade performance for a more attractive price point. For example, capacity-optimized deployments often use slower and less expensive SATA drives.
This document provides examples of Red Hat tested hardware suitable for these use cases.
2.3. Consider storage density
Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts to maintain high availability in the event of hardware faults. Balance storage density considerations with the need to rebalance the cluster in the event of hardware faults. A common hardware selection mistake is to use very high storage density in small clusters, which can overload networking during backfill and recovery operations.
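The trade-off between density and recovery load can be made concrete with a rough sketch. It assumes data is spread uniformly across hosts and that the data from a failed host is re-created evenly on the surviving hosts; the function name and the host counts are hypothetical and serve only as an illustration.

```python
def recovery_load_per_host_tb(num_hosts: int, data_per_host_tb: float) -> float:
    """Approximate TB each surviving host must absorb when one host fails.

    Assumes uniform data distribution and an even spread of recovery
    traffic across the remaining hosts (a simplification).
    """
    if num_hosts < 2:
        raise ValueError("need at least two hosts to recover from a host failure")
    return data_per_host_tb / (num_hosts - 1)


# The same 720 TB of stored data, packed two different ways (illustrative numbers).
dense = recovery_load_per_host_tb(num_hosts=4, data_per_host_tb=180.0)
spread = recovery_load_per_host_tb(num_hosts=16, data_per_host_tb=45.0)

print(f"4 dense hosts:    {dense:.1f} TB per surviving host")
print(f"16 smaller hosts: {spread:.1f} TB per surviving host")
```

Under these assumptions, each surviving host in the small, dense cluster has to receive far more data after a failure, which is the traffic that can overload networking during backfill and recovery.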
2.4. Identical hardware configuration
Create pools and define CRUSH hierarchies such that the OSD hardware within the pool is identical:
- Same controller.
- Same drive size.
- Same RPMs.
- Same seek times.
- Same I/O.
- Same network throughput.
Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines troubleshooting.
When using multiple storage devices, the order of the devices might change during a reboot. To troubleshoot this issue, see Change order of Storage devices during reboot.
2.5. Network considerations
Carefully consider bandwidth requirements for the cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-cluster traffic.
Red Hat recommends using 10 Gbps Ethernet for Ceph production deployments. 1 Gbps Ethernet is not suitable for production storage clusters.
In the case of a drive failure, replicating 1 TB of data across a 1 Gbps network takes approximately 3 hours, and 3 TB takes approximately 9 hours. 3 TB is the typical drive configuration. By contrast, with a 10 Gbps network, the replication times would be approximately 20 minutes and 1 hour respectively. Remember that when an OSD fails, the cluster recovers by replicating the data it contained to other OSDs within the pool.
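The replication times above follow from simple arithmetic on data size and link speed. The sketch below shows that calculation. The `efficiency` parameter is an assumption introduced here to account for protocol overhead and competing traffic; it is not a figure from this guide.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link of link_gbps Gbps.

    efficiency (0 < efficiency <= 1) is an assumed fraction of line rate
    that recovery traffic actually achieves.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = data_tb * 1e12 * 8  # decimal terabytes converted to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for data_tb in (1, 3):
    for link_gbps in (1, 10):
        hours = transfer_hours(data_tb, link_gbps, efficiency=0.75)
        print(f"{data_tb} TB over {link_gbps} Gbps: about {hours:.1f} hours")
```

At full line rate, 1 TB over 1 Gbps takes a little over 2 hours. With an assumed efficiency of 75%, the results land close to the approximate times quoted above.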
The failure of a larger domain such as a rack means that the storage cluster will utilize considerably more bandwidth. Storage administrators usually prefer that a cluster recovers as quickly as possible.
At a minimum, use a single 10 Gbps Ethernet link for storage hardware. If the Ceph nodes have many drives each, add more 10 Gbps Ethernet links for connectivity and throughput.
Set up front and backside networks on separate NICs.
Ceph supports a public (front-side) network and a cluster (back-side) network. The public network handles client traffic and communication with Ceph monitors. The cluster (back-side) network handles OSD heartbeats, replication, backfilling and recovery traffic.
Red Hat recommends allocating bandwidth to the cluster (back-side) network such that it is a multiple of the front-side network, using osd_pool_default_size as the basis for your multiple on replicated pools. Red Hat also recommends running the public and cluster networks on separate NICs.
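As a rough sizing aid, the sketch below reads the recommendation literally: multiply the public network bandwidth by the replica count (`osd_pool_default_size`) to get a target for the cluster network. The function names and the choice to multiply by the full replica count are assumptions made for illustration, not a documented formula.

```python
import math


def cluster_network_gbps(public_gbps: float, osd_pool_default_size: int) -> float:
    """Target cluster-network bandwidth as a multiple of the public network.

    Uses the replica count as the multiplier (one interpretation of the
    recommendation above).
    """
    if osd_pool_default_size < 1:
        raise ValueError("osd_pool_default_size must be at least 1")
    return public_gbps * osd_pool_default_size


def links_needed(required_gbps: float, link_gbps: float = 10.0) -> int:
    """Number of links of link_gbps each needed to reach required_gbps."""
    return math.ceil(required_gbps / link_gbps)


target = cluster_network_gbps(public_gbps=10.0, osd_pool_default_size=3)
print(f"Cluster network target: {target:.0f} Gbps "
      f"({links_needed(target)} x 10 Gbps links)")
```

For a 10 Gbps public network and a replica count of 3, this reading gives a 30 Gbps target for the cluster network.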
When building a storage cluster consisting of multiple racks (common for large storage implementations), consider utilizing as much network bandwidth as possible between switches in a "fat tree" design for optimal performance. A typical 10 Gbps Ethernet switch has 48 10 Gbps ports and four 40 Gbps ports. Use the 40 Gbps ports on the spine for maximum throughput. Alternatively, consider aggregating unused 10 Gbps ports with QSFP+ and SFP+ cables into more 40 Gbps ports to connect to another rack and spine routers.
For network optimization, Red Hat recommends using jumbo frames for a better CPU/bandwidth ratio, and a non-blocking network switch back-plane. Red Hat Ceph Storage requires the same MTU value throughout all networking devices in the communication path, end-to-end for both public and cluster networks. Verify that the MTU value is the same on all nodes and networking equipment in the environment before using a Red Hat Ceph Storage cluster in production.
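One way to check MTU consistency is to collect the MTU value from every node and network device, then compare the values against the intended setting. The sketch below only performs the comparison. The device names are hypothetical, the value 9000 is a commonly used jumbo frame size rather than a requirement from this guide, and how the values are collected is left to your environment.

```python
def find_mtu_mismatches(mtus: dict[str, int], expected_mtu: int) -> dict[str, int]:
    """Return the devices whose MTU differs from expected_mtu."""
    return {name: mtu for name, mtu in mtus.items() if mtu != expected_mtu}


# Hypothetical inventory gathered from nodes and switches in the path.
collected = {
    "osd-node-1": 9000,
    "osd-node-2": 9000,
    "mon-node-1": 9000,
    "tor-switch-1": 1500,
}

mismatches = find_mtu_mismatches(collected, expected_mtu=9000)
if mismatches:
    for name, mtu in mismatches.items():
        print(f"{name}: MTU {mtu} does not match the expected value")
else:
    print("All devices report the expected MTU")
```

A single device left at a smaller MTU, such as the switch in this example, is enough to break the end-to-end requirement.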
Additional Resources
- See the Verifying and configuring the MTU value section in the Red Hat Ceph Storage Configuration Guide for more details.
2.6. Avoid using RAID solutions
Ceph can replicate or erasure code objects. RAID duplicates this functionality on the block level and reduces available capacity. Consequently, RAID is an unnecessary expense. Additionally, a degraded RAID will have a negative impact on performance.
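The capacity cost of layering RAID under Ceph replication can be estimated with a short calculation. This is a minimal sketch: the drive counts, the replica count of 3, and the RAID efficiency of 0.5 (mirrored pairs) are illustrative assumptions, and the function name is hypothetical.

```python
def usable_tb(raw_tb: float, replicas: int, raid_efficiency: float = 1.0) -> float:
    """Usable capacity after RAID overhead and Ceph replication.

    raid_efficiency is the fraction of raw capacity that RAID leaves
    available (1.0 means no RAID, 0.5 means mirrored pairs).
    """
    if replicas < 1 or not 0 < raid_efficiency <= 1:
        raise ValueError("invalid replicas or raid_efficiency")
    return raw_tb * raid_efficiency / replicas


raw = 12 * 4.0  # twelve 4 TB drives in one node (illustrative)
print(f"Replication only:     {usable_tb(raw, replicas=3):.1f} TB usable")
print(f"Mirroring underneath: {usable_tb(raw, replicas=3, raid_efficiency=0.5):.1f} TB usable")
```

In this example the RAID layer halves the usable capacity while protecting data that Ceph already replicates.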
Red Hat recommends that each hard drive be exported separately from the RAID controller as a single volume with write-back caching enabled.
This requires a battery-backed or non-volatile flash memory device on the storage controller. It is important to make sure the battery is working, as most controllers disable write-back caching if the memory on the controller can be lost as a result of a power failure. Periodically check the batteries and replace them if necessary, as they degrade over time. See the storage controller vendor’s documentation for details. Typically, the storage controller vendor provides storage management utilities to monitor and adjust the storage controller configuration without any downtime.
Using Just a Bunch of Drives (JBOD) in independent drive mode with Ceph is supported when using all Solid State Drives (SSDs), or for configurations with high numbers of drives per controller, for example, 60 drives attached to one controller. In this scenario, write-back caching can become a source of I/O contention. Since JBOD disables write-back caching, it is ideal in this scenario. One advantage of using JBOD mode is the ease of adding or replacing drives, because a drive is exposed to the operating system immediately after it is physically plugged in.
2.7. Summary of common mistakes when selecting hardware
- Repurposing underpowered legacy hardware for use with Ceph.
- Using dissimilar hardware in the same pool.
- Using 1 Gbps networks instead of 10 Gbps or greater.
- Neglecting to set up both public and cluster networks.
- Using RAID instead of JBOD.
- Selecting drives on a price basis without regard to performance or throughput.
- Having a disk controller with insufficient throughput characteristics.
Use the examples of Red Hat tested configurations for different workloads in this document to avoid some of these hardware selection mistakes.
2.8. Additional Resources
- Supported configurations article on the Red Hat Customer Portal.