Chapter 2. Storage Cluster Quick Start
This Quick Start sets up a Red Hat Ceph Storage cluster using ceph-deploy on your Calamari admin node. Create a small Ceph cluster so you can explore Ceph functionality. As a first exercise, create a Ceph Storage Cluster with one Ceph Monitor and some Ceph OSD Daemons, each on separate nodes. Once the cluster reaches an active + clean state, you can use the cluster.
2.1. Executing ceph-deploy
When you execute ceph-deploy to install Red Hat Ceph Storage, ceph-deploy retrieves Ceph packages from the /opt/calamari/ directory on the Calamari administration host. To do so, ceph-deploy needs to read the .cephdeploy.conf file created by the ice_setup utility. Therefore, execute ceph-deploy from the local working directory created in the Create a Working Directory section, for example ~/ceph-config/:
cd ~/ceph-config
Execute ceph-deploy commands as a regular user, not as root or by using sudo. The Create a Ceph Deploy User and Enable Password-less SSH steps enable ceph-deploy to execute as root without sudo and without connecting to Ceph nodes as the root user. You might still need to execute ceph CLI commands as root or by using sudo.
2.2. Create a Cluster
If at any point you run into trouble and you want to start over, execute the following to purge the configuration:
ceph-deploy purge <ceph-node> [<ceph-node>]
ceph-deploy purgedata <ceph-node> [<ceph-node>]
ceph-deploy forgetkeys
If you execute the foregoing procedure, you must re-install Ceph.
On your Calamari admin node, from the directory you created for holding your configuration details, perform the following steps using ceph-deploy.
Create the cluster:
ceph-deploy new <initial-monitor-node(s)>
For example:
ceph-deploy new node1
Check the output of ceph-deploy with ls and cat in the current directory. You should see a Ceph configuration file, a monitor secret keyring, and a log file of the ceph-deploy procedures.
2.3. Modify the Ceph Configuration File
At this stage, you may begin editing your Ceph configuration file (ceph.conf).
If you choose not to use ceph-deploy, you will have to deploy Ceph manually or configure a deployment tool (for example, Chef, Juju, or Puppet) to perform each operation that ceph-deploy performs for you. To deploy Ceph manually, please see our Knowledgebase article.
Add the public_network and cluster_network settings under the [global] section of your Ceph configuration file.
public_network = <ip-address>/<netmask>
cluster_network = <ip-address>/<netmask>
These settings distinguish which network is public (front-side) and which network is for the cluster (back-side). Ensure that your nodes have interfaces configured for these networks. We do not recommend using the same NIC for the public and cluster networks. Please see the Network Configuration Settings for details on the public and cluster networks.
Turn on IPv6 if you intend to use it.
ms_bind_ipv6 = true
Please see Bind for more details.
Add or adjust the osd_journal_size setting under the [global] section of your Ceph configuration file.
osd_journal_size = 10000
We recommend a general setting of 10 GB (the value is specified in megabytes). Ceph’s default osd_journal_size is 0, so you will need to set this in your ceph.conf file. A journal size should be twice the product of the filestore_max_sync_interval option and the expected throughput. The expected throughput number should take into account the expected disk throughput (i.e., sustained data transfer rate) and the network throughput. For example, a 7200 RPM disk will likely have approximately 100 MB/s. Taking the min() of the disk and network throughput should provide a reasonable expected throughput. Please see Journal Settings for more details.
Set the number of copies to store (default is 3) and the default minimum required to write data when in a degraded state (default is 2) under the [global] section of your Ceph configuration file. We recommend the default values for production clusters.
osd_pool_default_size = 3
osd_pool_default_min_size = 2
For a quick start, you may wish to set osd_pool_default_size to 2 and osd_pool_default_min_size to 1 so that you can achieve an active+clean state with only two OSDs.
These settings establish the networking bandwidth requirements for the cluster network, and the ability to write data with eventual consistency (i.e., you can write data to a cluster in a degraded state if it has min_size copies of the data already). Please see Settings for more details.
Set a CRUSH leaf type to the largest serviceable failure domain for your replicas under the [global] section of your Ceph configuration file. The default value is 1, or host, which means that CRUSH will map replicas to OSDs on separate hosts. For example, if you want to make three object replicas, and you have three racks of chassis/hosts, you can set osd_crush_chooseleaf_type to 3, and CRUSH will place each copy of an object on OSDs in different racks.
osd_crush_chooseleaf_type = 3
The default CRUSH hierarchy types are:
- type 0 osd
- type 1 host
- type 2 chassis
- type 3 rack
- type 4 row
- type 5 pdu
- type 6 pod
- type 7 room
- type 8 datacenter
- type 9 region
- type 10 root
Please see Settings for more details.
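To make the link between a failure domain name and the numeric osd_crush_chooseleaf_type value concrete, here is a minimal Python sketch built from the default hierarchy listed above. The dictionary name CRUSH_TYPES and the helper chooseleaf_type_id are illustrative only and are not part of any Ceph tool.

```python
# Default CRUSH hierarchy types, copied from the list above.
CRUSH_TYPES = {
    "osd": 0,
    "host": 1,
    "chassis": 2,
    "rack": 3,
    "row": 4,
    "pdu": 5,
    "pod": 6,
    "room": 7,
    "datacenter": 8,
    "region": 9,
    "root": 10,
}


def chooseleaf_type_id(failure_domain: str) -> int:
    """Return the numeric value to assign to osd_crush_chooseleaf_type."""
    try:
        return CRUSH_TYPES[failure_domain]
    except KeyError:
        raise ValueError(f"unknown CRUSH bucket type: {failure_domain!r}") from None


# Print the ceph.conf line for a rack-level failure domain.
print(f"osd_crush_chooseleaf_type = {chooseleaf_type_id('rack')}")
```

If you customize the CRUSH types in your own map, the numbers may differ, so treat the table as valid for the default hierarchy only.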
Set max_open_files so that Ceph will set the maximum open file descriptors at the OS level to help prevent Ceph OSD Daemons from running out of file descriptors.
max_open_files = 131072
Please see the General Configuration Reference for more details.
In summary, your initial Ceph configuration file should have at least the following settings with appropriate values assigned after the = sign:
[global]
fsid = <cluster-id>
mon_initial_members = <hostname>[, <hostname>]
mon_host = <ip-address>[, <ip-address>]
public_network = <network>[, <network>]
cluster_network = <network>[, <network>]
ms_bind_ipv6 = [true | false]
max_open_files = 131072
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = <n>
filestore_xattr_use_omap = true
osd_pool_default_size = <n>  # Write an object n times.
osd_pool_default_min_size = <n>  # Allow writing n copies in a degraded state.
osd_crush_chooseleaf_type = <n>
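As a worked example of the journal sizing rule described in this section (twice the product of filestore_max_sync_interval and the expected throughput), the following Python sketch computes a minimum osd_journal_size in megabytes. The function name is hypothetical, and the inputs are assumptions: a sync interval of 5 seconds, the 100 MB/s disk figure mentioned above, and roughly 1250 MB/s for a 10 Gb/s network link.

```python
import math


def min_journal_size_mb(sync_interval_s: float,
                        disk_mb_per_s: float,
                        network_mb_per_s: float) -> int:
    """Apply the rule: 2 * sync interval * min(disk, network) throughput."""
    expected_throughput = min(disk_mb_per_s, network_mb_per_s)
    return math.ceil(2 * sync_interval_s * expected_throughput)


# Assumed inputs: 5 s sync interval, 100 MB/s disk, ~1250 MB/s network.
size_mb = min_journal_size_mb(5, 100, 1250)
print(f"osd_journal_size = {size_mb}")  # 2 * 5 * 100 = 1000
```

Under these assumptions the rule yields 1000 MB, so the general setting of 10000 used above sits well above the computed minimum for a single spinning disk.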
2.4. Install Ceph with the ISO
To install Ceph from a local repository, use the --repo argument first to ensure that ceph-deploy is pointing to the .cephdeploy.conf file generated by ice_setup (for example, in the exemplary ~/ceph-config directory, the /root directory, or ~). Otherwise, you may not receive packages from the local repository. Use --release=<daemon-name> to specify the daemon package you wish to install. Then, install the packages. Ideally, you should run ceph-deploy from the directory where you keep your configuration (for example, the exemplary ~/ceph-config) so that you can maintain a {cluster-name}.log file with all the commands you have executed with ceph-deploy.
ceph-deploy install --repo --release=[ceph-mon|ceph-osd] <ceph-node> [<ceph-node> ...]
ceph-deploy install --<daemon> <ceph-node> [<ceph-node> ...]
For example:
ceph-deploy install --repo --release=ceph-mon monitor1 monitor2 monitor3
ceph-deploy install --mon monitor1 monitor2 monitor3
ceph-deploy install --repo --release=ceph-osd srv1 srv2 srv3
ceph-deploy install --osd srv1 srv2 srv3
The ceph-deploy utility will install the appropriate Ceph daemon on each node.
If you use ceph-deploy purge, you must re-execute this step to re-install Ceph.
2.5. Install Ceph by Using CDN
When installing Ceph on remote nodes from the CDN (not ISO), you must specify which Ceph daemon you wish to install on the node by passing one of --mon or --osd to ceph-deploy.
ceph-deploy install [--mon|--osd] <ceph-node> [<ceph-node> ...]
For example:
ceph-deploy install --mon monitor1 monitor2 monitor3
ceph-deploy install --osd srv1 srv2 srv3
If you use ceph-deploy purge, you must re-execute this step to re-install Ceph.
2.6. Install ceph-selinux
With Red Hat Ceph Storage 1.3.2 or later, a new ceph-selinux package can be installed on Ceph nodes. This package provides SELinux support for Ceph, so SELinux no longer needs to be in permissive or disabled mode.
Once installed, ceph-selinux adds the SELinux policy for Ceph and also relabels files on the cluster accordingly. Ceph processes are labeled with the ceph_exec_t SELinux context.
To install ceph-selinux, use the following command:
ceph-deploy pkg --install ceph-selinux <nodes>
For example:
ceph-deploy pkg --install ceph-selinux node1 node2 node3
All Ceph daemons will be down while the ceph-selinux package is being installed, so your cluster will not be able to serve any data during that time. This operation is necessary in order to update the metadata of the files located on the underlying file system and to make Ceph daemons run with the correct context. This operation may take several minutes depending on the size and speed of the underlying storage.
If SELinux was in permissive mode, run the following command as root to set it to enforcing again:
# setenforce 1
To configure SELinux persistently, modify the /etc/selinux/config configuration file.
For more information about SELinux, see the SELinux User’s and Administrator’s Guide for Red Hat Enterprise Linux 7.
2.7. Add Initial Monitors
Add the initial monitor(s) and gather the keys.
ceph-deploy mon create-initial
Once you complete the process, your local directory should have the following keyrings:
- <cluster-name>.client.admin.keyring
- <cluster-name>.bootstrap-osd.keyring
- <cluster-name>.bootstrap-mds.keyring
- <cluster-name>.bootstrap-rgw.keyring
2.8. Connect Monitor Hosts to Calamari
Once you have added the initial monitor(s), you need to connect the monitor hosts to Calamari. From your admin node, execute:
ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' <ceph-node> [<ceph-node> ...]
For example, using the exemplary node1 from above, you would execute:
ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' node1
If you expand your monitor cluster with additional monitors, you will have to connect the hosts that contain them to Calamari, too.
2.9. Make your Calamari Admin Node a Ceph Admin Node
After you create your initial monitors, you can use the Ceph CLI to check on your cluster. However, you have to specify the monitor and admin keyring each time, with the path to the directory holding your configuration. You can simplify your CLI usage by making the admin node a Ceph admin client.
You will also need to install ceph-common on the Calamari node. ceph-deploy install --cli does this.
ceph-deploy install --cli <node-name>
ceph-deploy admin <node-name>
For example:
ceph-deploy install --cli admin-node
ceph-deploy admin admin-node
The ceph-deploy utility will copy the ceph.conf and ceph.client.admin.keyring files to the /etc/ceph directory. When ceph-deploy is talking to the local admin host (admin-node), it must be reachable by its hostname (e.g., hostname -s). If necessary, modify /etc/hosts to add the name of the admin host. If you do not have an /etc/ceph directory, you should install ceph-common.
You may then use the Ceph CLI.
Once you have added your new Ceph monitors, Ceph will begin synchronizing the monitors and forming a quorum. You can check the quorum status by executing the following as root:
# ceph quorum_status --format json-pretty
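If you want to check the result programmatically, the JSON output can be parsed with a few lines of Python. The sketch below is a rough illustration, not a supported tool: the function name is hypothetical, the key names quorum_names and monmap.mons are assumed from typical quorum_status output and should be verified against your release, and SAMPLE is a hand-written, heavily trimmed stand-in for real output.

```python
import json


def quorum_summary(status_json: str) -> dict:
    """Summarize which known monitors are currently in quorum."""
    status = json.loads(status_json)
    in_quorum = status.get("quorum_names", [])
    known = [mon["name"] for mon in status.get("monmap", {}).get("mons", [])]
    return {
        "in_quorum": in_quorum,
        "missing": sorted(set(known) - set(in_quorum)),
        # A quorum requires a strict majority of the known monitors.
        "has_majority": len(in_quorum) > len(known) // 2,
    }


# Hypothetical, trimmed sample for a single-monitor cluster (node1).
SAMPLE = '{"quorum_names": ["node1"], "monmap": {"mons": [{"name": "node1"}]}}'
print(quorum_summary(SAMPLE))
```

In practice you would feed the function the text produced by the command above instead of the hand-written sample.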
Your cluster will not achieve an active + clean state until you add enough OSDs to hold all object replicas. The OSDs must also be spread across enough CRUSH failure domains.
2.10. Adjust CRUSH Tunables
Red Hat Ceph Storage CRUSH tunables default to bobtail, which refers to an older release of Ceph. This setting guarantees that older Ceph clusters are compatible with older Linux kernels. However, if you run a Ceph cluster on Red Hat Enterprise Linux 7, reset CRUSH tunables to optimal. As root, execute the following:
# ceph osd crush tunables optimal
See the CRUSH Tunables chapter in the Storage Strategies guides for details on the CRUSH tunables.
2.11. Add OSDs
Before creating OSDs, consider the following:
- We recommend using the XFS file system, which is the default file system.
Use the default XFS file system options that the ceph-deploy utility uses to format the OSD disks. Deviating from the default values can cause stability problems with the storage cluster.
For example, setting the directory block size higher than the default value of 4096 bytes can cause memory allocation deadlock errors in the file system. For more details, view the Red Hat Knowledgebase article regarding these errors.
- Red Hat recommends using SSDs for journals. It is common to partition SSDs to serve multiple OSDs. Ensure that the number of SSD partitions does not exceed the SSD’s sequential write limits. Also, ensure that SSD partitions are properly aligned, or their write performance will suffer.
Red Hat recommends deleting the partition table of a Ceph OSD drive by using the ceph-deploy disk zap command before executing the ceph-deploy osd prepare command:
ceph-deploy disk zap <ceph_node>:<disk_device>
For example:
ceph-deploy disk zap node2:/dev/sdb
From your administration node, use ceph-deploy osd prepare to prepare the OSDs:
ceph-deploy osd prepare <ceph_node>:<disk_device> [<ceph_node>:<disk_device>]
For example:
ceph-deploy osd prepare node2:/dev/sdb
The prepare command creates two partitions on a disk device: one partition is for OSD data, and the other is for the journal.
Once you prepare OSDs, activate the OSDs:
ceph-deploy osd activate <ceph_node>:<data_partition>
For example:
ceph-deploy osd activate node2:/dev/sdb1
In the ceph-deploy osd activate command, specify a particular disk partition, for example /dev/sdb1.
It is also possible to use a disk device that is wholly formatted without a partition table. In that case, a partition on an additional disk must be used to serve as the journal store:
ceph-deploy osd activate <ceph_node>:<disk_device>:<journal_partition>
In the following example, sdd is a spinning hard drive that Ceph uses entirely for OSD data. ssdb1 is a partition of an SSD drive, which Ceph uses to store the journal for the OSD:
ceph-deploy osd activate node{2,3,4}:sdd:ssdb1
To achieve the active + clean state, you must add at least as many OSDs as the osd pool default size = <n> parameter specifies in the Ceph configuration file.
For information on creating encrypted OSD nodes, see the Encrypted OSDs subsection in the Adding OSDs by Using ceph-deploy section in the Administration Guide for Red Hat Ceph Storage 2.
2.12. Connect OSD Hosts to Calamari
Once you have added the initial OSDs, you need to connect the OSD hosts to Calamari.
ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' <ceph-node> [<ceph-node> ...]
For example, using the exemplary node2, node3 and node4 from above, you would execute:
ceph-deploy calamari connect --master '<FQDN for the Calamari admin node>' node2 node3 node4
As you expand your cluster with additional OSD hosts, you will have to connect the hosts that contain them to Calamari, too.
2.13. Create a CRUSH Hierarchy
You can run a Ceph cluster with a flat node-level hierarchy (default). This is NOT RECOMMENDED. We recommend adding named buckets of various types to your default CRUSH hierarchy. This will allow you to establish a larger-grained failure domain, usually consisting of racks, rows, rooms and data centers.
ceph osd crush add-bucket <bucket-name> <bucket-type>
For example:
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket row1 row
ceph osd crush add-bucket rack1 rack
ceph osd crush add-bucket rack2 rack
ceph osd crush add-bucket rack3 rack
Then, place the buckets into a hierarchy:
ceph osd crush move dc1 root=default
ceph osd crush move room1 datacenter=dc1
ceph osd crush move row1 room=room1
ceph osd crush move rack1 row=row1
ceph osd crush move node2 rack=rack1
2.14. Add OSD Hosts/Chassis to the CRUSH Hierarchy
Once you have added OSDs and created a CRUSH hierarchy, add the OSD hosts/chassis to the CRUSH hierarchy so that CRUSH can distribute objects across failure domains. For example:
ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=row1 rack=rack1 host=node2
ceph osd crush set osd.1 1.0 root=default datacenter=dc1 room=room1 row=row1 rack=rack2 host=node3
ceph osd crush set osd.2 1.0 root=default datacenter=dc1 room=room1 row=row1 rack=rack3 host=node4
The foregoing example uses three different racks for the exemplary hosts (assuming that is how they are physically configured). Since the exemplary Ceph configuration file specified "rack" as the largest failure domain by setting osd_crush_chooseleaf_type = 3, CRUSH can write each object replica to an OSD residing in a different rack. With osd_pool_default_min_size = 2 and sufficient storage capacity, the Ceph cluster can continue operating if an entire rack fails (e.g., failure of a power distribution unit or rack router).
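The reasoning about rack failure can be checked with a small Python sketch. It models only the counting argument from the paragraph above (surviving replicas compared against min_size) and not the actual CRUSH algorithm. The function names and the placement lists are illustrative assumptions.

```python
def surviving_replicas(replica_racks: list, failed_rack: str) -> int:
    """Count the replicas that do not live in the failed rack."""
    return sum(1 for rack in replica_racks if rack != failed_rack)


def can_serve_io(replica_racks: list, failed_rack: str, min_size: int) -> bool:
    """I/O continues while at least min_size replicas remain available."""
    return surviving_replicas(replica_racks, failed_rack) >= min_size


# One replica per rack, as in the example hierarchy (size 3, min_size 2).
spread = ["rack1", "rack2", "rack3"]
print(can_serve_io(spread, "rack1", min_size=2))   # two replicas survive

# Hypothetical bad placement: two replicas share rack1.
packed = ["rack1", "rack1", "rack2"]
print(can_serve_io(packed, "rack1", min_size=2))   # only one replica survives
```

The second case shows why the failure domain matters: with two replicas in the same rack, losing that rack drops the object below min_size.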
2.15. Check CRUSH Hierarchy
Check your work to ensure that the CRUSH hierarchy is accurate.
ceph osd tree
If you are not satisfied with the results of your CRUSH hierarchy, you may move any component of your hierarchy with the move command.
ceph osd crush move <bucket-to-move> <bucket-type>=<parent-bucket>
If you want to remove a bucket (node) or OSD (leaf) from the CRUSH hierarchy, use the remove command:
ceph osd crush remove <bucket-name>
2.16. Check Cluster Health
To ensure that the OSDs in your cluster are peering properly, execute:
ceph health
You may also check on the health of your cluster using the Calamari dashboard.
2.17. List and Create a Pool
You can manage pools using Calamari, or using the Ceph command line. Verify that you have pools for writing and reading data:
ceph osd lspools
You can bind to any of the pools listed using the admin user and client.admin key. To create a pool, use the following syntax:
ceph osd pool create <pool-name> <pg-num> [<pgp-num>] [replicated] [crush-ruleset-name]
For example:
ceph osd pool create mypool 512 512 replicated replicated_ruleset
To find the rule set names available, execute ceph osd crush rule list. To calculate the pg-num and pgp-num values, see the Ceph Placement Groups (PGs) per Pool Calculator.
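For a rough first estimate before consulting the calculator, a commonly cited rule of thumb from the upstream Ceph documentation is to take (number of OSDs x 100) / replica count and round up to the next power of two. The Python sketch below implements only that heuristic, as an assumption. The calculator considers more factors, such as multiple pools and their expected share of data, so treat the result as a starting point. The function name and the default target of 100 PGs per OSD are illustrative.

```python
def suggested_pg_num(num_osds: int, pool_size: int,
                     target_pgs_per_osd: int = 100) -> int:
    """Round (OSDs * target) / replicas up to the next power of two."""
    raw = (num_osds * target_pgs_per_osd) / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


# Hypothetical cluster: 9 OSDs, 3 replicas -> 300, rounded up to 512.
print(suggested_pg_num(9, 3))
```

For a single pool, pgp-num is normally set to the same value as pg-num, as in the mypool example above.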
2.18. Storing and Retrieving Object Data
To perform storage operations with Ceph Storage Cluster, all Ceph clients regardless of type must:
- Connect to the cluster.
- Create an I/O context to a pool.
- Set an object name.
- Execute a read or write operation for the object.
The Ceph Client retrieves the latest cluster map and the CRUSH algorithm calculates how to map the object to a placement-group, and then calculates how to assign the placement group to a Ceph OSD Daemon dynamically. Client types such as Ceph Block Device and the Ceph Object Gateway perform the last two steps transparently.
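The four steps listed above map closely onto the Python librados binding (the rados module). The sketch below is illustrative and makes several assumptions: the binding is installed, /etc/ceph/ceph.conf and the admin keyring are readable, and a pool named data exists. The helper store_and_fetch is a hypothetical name. It takes an already connected cluster handle so that the connection setup, shown in comments, stays separate from the I/O steps.

```python
def store_and_fetch(cluster, pool_name: str, object_name: str, data: bytes) -> bytes:
    """Write an object and read it back through an I/O context.

    `cluster` is expected to be a connected rados.Rados handle (step 1).
    """
    ioctx = cluster.open_ioctx(pool_name)        # step 2: I/O context to a pool
    try:
        ioctx.write_full(object_name, data)      # steps 3 and 4: name + write
        return ioctx.read(object_name)           # step 4: read the object back
    finally:
        ioctx.close()


# Usage on a node with the rados binding and an admin keyring (assumed):
#
#   import rados
#   cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
#   cluster.connect()
#   print(store_and_fetch(cluster, "data", "test-object-1", b"hello"))
#   cluster.shutdown()
```

The client never names an OSD: it supplies only the pool and object name, and placement is computed from the cluster map as described above.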
To find the object location, all you need is the object name and the pool name. For example:
ceph osd map <poolname> <object-name>
The rados CLI tool in the following example is for Ceph administrators only.
Exercise: Locate an Object
As an exercise, let's create an object. Specify an object name, a path to a test file containing some object data, and a pool name using the rados put command on the command line. For example:
echo <Test-data> > testfile.txt
rados put <object-name> <file-path> --pool=<pool-name>
rados put test-object-1 testfile.txt --pool=data
To verify that the Ceph Storage Cluster stored the object, execute the following:
rados -p data ls
Now, identify the object location:
ceph osd map <pool-name> <object-name>
ceph osd map data test-object-1
Ceph should output the object’s location. For example:
osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]
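To read the fields of this line in a script, a regular expression is enough. The following Python sketch is written against the sample output shown above only. The exact format can differ between Ceph releases, so the pattern and the function name parse_osd_map are assumptions to adapt rather than a stable interface.

```python
import re

# Pattern written against the sample line above; other releases may differ.
OSD_MAP_RE = re.compile(
    r"osdmap e(?P<epoch>\d+) pool '(?P<pool>[^']+)' \((?P<pool_id>\d+)\) "
    r"object '(?P<object>[^']+)' -> pg (?P<pg_hash>\S+) \((?P<pg_id>[^)]+)\) "
    r"-> up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_osd_map(line: str) -> dict:
    """Extract epoch, pool, object, placement group and OSD sets."""
    match = OSD_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized 'ceph osd map' output")

    def to_ids(text: str) -> list:
        return [int(part) for part in text.split(",") if part]

    return {
        "epoch": int(match["epoch"]),
        "pool": match["pool"],
        "object": match["object"],
        "pg": match["pg_id"],
        "up": to_ids(match["up"]),
        "acting": to_ids(match["acting"]),
    }


sample = ("osdmap e537 pool 'data' (0) object 'test-object-1' "
          "-> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]")
print(parse_osd_map(sample))
```

In the sample, the object maps to placement group 0.4, and OSDs 1 and 0 form both the up set and the acting set.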
To remove the test object, simply delete it using the rados rm command. For example:
rados rm test-object-1 --pool=data
As the cluster size changes, the object location may change dynamically. One benefit of Ceph’s dynamic rebalancing is that Ceph relieves you from having to perform the migration manually.