Chapter 3. Networking Recommendations
Carefully consider bandwidth requirements for your cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-cluster traffic.
On smaller clusters, 1Gbps networks may be suitable for normal operating conditions, but not for heavy loads or failure recovery scenarios. In the case of a drive failure, replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TB (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour respectively. Remember that when an OSD fails, the cluster will recover by replicating the data it contained to other OSDs within the pool. The fraction of the cluster's data that must be re-replicated is approximately:
failed OSD(s) / total OSDs
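For example, in a hypothetical cluster of 30 OSDs with 3TB drives, losing a host that contains 3 OSDs means roughly 3/30, or 10%, of the cluster's data must be re-replicated. If those drives were full, that is about 9TB of recovery traffic, which at an effective 10Gbps (roughly 1.25GB/s) amounts to on the order of two hours of raw transfer, before accounting for protocol overhead and recovery throttling.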
The failure of a larger domain such as a rack means that your cluster will utilize considerably more bandwidth. Administrators usually prefer that a cluster recovers as quickly as possible.
At a minimum, a single 10Gbps Ethernet link should be used for storage hardware. If your Ceph nodes have many drives each, add additional 10Gbps Ethernet links for connectivity and throughput.
Ceph supports a public (front-side) network and a cluster (back-side) network. The public network handles client traffic and communication with Ceph monitors. The cluster (back-side) network handles OSD heartbeats, replication, backfilling and recovery traffic. We recommend allocating bandwidth to the cluster (back-side) network as a multiple of the public (front-side) network bandwidth, using osd pool default size as the basis for the multiple. We also recommend running the public and cluster networks on separate NICs.
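For example, if osd pool default size is set to 3 (a common replication level, used here purely for illustration), each client write received on the public network generates roughly two additional replica writes on the cluster network, while reads generate no replica traffic; provisioning the cluster network with two to three times the public network bandwidth is therefore a reasonable starting point.
osd pool default size = 3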
If you are building a cluster consisting of multiple racks (common for large clusters), consider utilizing as much network bandwidth between switches as possible in a "fat tree" design for optimal performance. A typical 10Gbps Ethernet switch has 48 10Gbps ports and four 40Gbps ports. If you use only one 40Gbps port for inter-switch connectivity, only four servers can run at full speed without oversubscribing that uplink (i.e., 4 x 10Gbps). Use your 40Gbps ports for maximum throughput. If you have unused 10Gbps ports, you can aggregate them (with QSFP+ to 4x SFP+ cables) into additional 40Gbps ports to connect to other racks and to spine routers.
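As a quick oversubscription check for such a switch: 48 x 10Gbps = 480Gbps of server-facing capacity against 4 x 40Gbps = 160Gbps of uplink capacity, a 3:1 oversubscription ratio even with all four 40Gbps uplinks in use.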
For network optimization, we recommend using jumbo frames for a better CPU-to-bandwidth ratio. We also recommend a non-blocking network switch backplane.
You may deploy a Ceph cluster across geographic regions; however, this is NOT RECOMMENDED unless you use a dedicated network connection between datacenters. Ceph prefers consistency and acknowledges writes synchronously. Using the internet (packet-switched with many hops) between geographically separate datacenters will introduce significant write latency.
3.1. Settings
You may specify multiple IP addresses and subnets for your public and cluster networks in your Ceph configuration file. For example:
public network = {ip-address}/{netmask} [, {ip-address}/{netmask}]
cluster network = {ip-address}/{netmask} [, {ip-address}/{netmask}]
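For instance, assuming a hypothetical layout in which client traffic uses the 192.168.0.0/24 subnet and replication traffic uses a separate 192.168.1.0/24 subnet:
public network = 192.168.0.0/24
cluster network = 192.168.1.0/24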
Ensure that the IP addresses/subnets within the public network can reach each other, and that the IP addresses/subnets within the cluster network can reach each other. We recommend keeping the cluster network separate from the public network and not connected to the internet to prevent DDoS attacks from crippling heartbeats, replication, backfilling and recovery.
You may use IPv6 addresses; however, you must enable daemons to bind to them first. For example, you may enable IPv6 in your Ceph configuration file:
ms bind ipv6 = true
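With binding enabled, IPv6 subnets may then be used in the network settings; for example, with hypothetical unique local address prefixes:
public network = fd00:aaaa::/64
cluster network = fd00:bbbb::/64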
Monitors use port 6789 by default. Ensure you have the port open for each monitor host. Each Ceph OSD Daemon on a Ceph Node may use up to three ports, beginning at port 6800:
- One for talking to clients and monitors.
- One for sending data to other OSDs (replication, backfill and recovery).
- One for heartbeating.
You need to open at least three ports per OSD beginning at port 6800 on a Ceph node to ensure that the OSDs can peer. The port for talking to monitors and clients must be open on the public (front-side) network. The ports for sending data to other OSDs and heartbeating must be open on the cluster (back-side) network.
If you want to use a different port range than 6800:7100 for Ceph daemons, you must adjust the following settings in your Ceph configuration file:
ms bind port min = {min-port-num}
ms bind port max = {max-port-num}
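For example, because each OSD daemon may use several ports, a node hosting a large number of OSDs might reserve a wider range than the default 6800:7100; the values below are purely illustrative:
ms bind port min = 6800
ms bind port max = 7300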
Ceph monitors bind to port 6789 by default. If you want to use a different port number than 6789, you may specify the IP address and port in your Ceph configuration file. For example:
[mon.{mon_name}]
host = {hostname}
mon addr = {ip_address}:{port}
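For instance, a hypothetical monitor named a, running on a host named mon-host01 and listening on a non-default port, might be configured as follows:
[mon.a]
host = mon-host01
mon addr = 192.168.0.101:6788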
We further recommend installing NTP on your Ceph nodes, especially Ceph monitor nodes. Without clock synchronization, clock drift may prevent the monitors from agreeing on the state of the cluster, which means that clients lose access to data until quorum is re-established.
As a best practice, we also recommend a 1GbE copper interface for an IPMI network.