Chapter 1. Managing the storage cluster size
As a storage administrator, you can manage the storage cluster size by adding or removing Ceph Monitors or OSDs as storage capacity expands or shrinks.
If you are bootstrapping a storage cluster for the first time, see the Red Hat Ceph Storage 3 Installation Guide for Red Hat Enterprise Linux or Ubuntu.
1.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
1.2. Ceph Monitors
Ceph monitors are light-weight processes that maintain a master copy of the cluster map. All Ceph clients contact a Ceph monitor and retrieve the current copy of the cluster map, enabling clients to bind to a pool and read and write data.
Ceph monitors use a variation of the Paxos protocol to establish consensus about maps and other critical information across the cluster. Due to the nature of Paxos, Ceph requires a majority of monitors to be running in order to establish a quorum and reach consensus.
Red Hat requires at least three monitors on separate hosts to receive support for a production cluster.
Red Hat recommends deploying an odd number of monitors. An odd number of monitors has higher resiliency to failures than an even number. For example, to maintain a quorum on a two-monitor deployment, Ceph cannot tolerate any failures; with three monitors, one failure; with four monitors, one failure; with five monitors, two failures. This is why an odd number is advisable. In summary, Ceph needs a majority of monitors to be running and able to communicate with each other: two out of three, three out of four, and so on.
For an initial deployment of a multi-node Ceph storage cluster, Red Hat requires three monitors, increasing the number two at a time if a valid need for more than three monitors exists.
Since monitors are light-weight, it is possible to run them on the same host as OpenStack nodes. However, Red Hat recommends running monitors on separate hosts.
Red Hat does NOT support collocating Ceph Monitors and OSDs on the same node. Doing this can have a negative impact on storage cluster performance.
Red Hat ONLY supports collocating Ceph services in containerized environments.
When you remove monitors from a storage cluster, consider that Ceph monitors use the Paxos protocol to establish a consensus about the master storage cluster map. You must have a sufficient number of monitors to establish a quorum.
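Before removing a Ceph Monitor, it can be useful to confirm how many monitors are currently in the storage cluster and whether they form a quorum. For example, from any node with the admin keyring:
[root@monitor ~]# ceph mon stat
[root@monitor ~]# ceph quorum_status --format json-pretty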
Additional Resources
- See the Red Hat Ceph Storage Supported configurations Knowledgebase article for all the supported Ceph configurations.
1.2.1. Preparing a new Ceph Monitor node
When adding a new Ceph Monitor to a storage cluster, deploy it on a separate node. The node hardware must be uniform for all monitor nodes in the storage cluster.
Prerequisites
- Network connectivity.
- Having root access to the new node.
- Review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
Procedure
- Add the new node to the server rack.
- Connect the new node to the network.
- Install either Red Hat Enterprise Linux 7 or Ubuntu 16.04 on the new node.
Install NTP and configure a reliable time source:
[root@monitor ~]# yum install ntp
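On Ubuntu nodes, the equivalent step would typically be:
[user@monitor ~]$ sudo apt-get install ntp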
If using a firewall, open TCP port 6789:
Red Hat Enterprise Linux
[root@monitor ~]# firewall-cmd --zone=public --add-port=6789/tcp
[root@monitor ~]# firewall-cmd --zone=public --add-port=6789/tcp --permanent
Ubuntu
iptables -I INPUT 1 -i $NIC_NAME -p tcp -s $IP_ADDR/$NETMASK_PREFIX --dport 6789 -j ACCEPT
Ubuntu example
[user@monitor ~]$ sudo iptables -I INPUT 1 -i enp6s0 -p tcp -s 192.168.0.11/24 --dport 6789 -j ACCEPT
1.2.2. Adding a Ceph Monitor using Ansible
Red Hat recommends adding two monitors at a time to maintain an odd number of monitors. For example, if you have three monitors in the storage cluster, Red Hat recommends expanding it to five monitors.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Having root access to the new nodes.
Procedure
Add the new Ceph Monitor nodes to the /etc/ansible/hosts Ansible inventory file, under a [mons] section:
Example
[mons]
monitor01
monitor02
monitor03
$NEW_MONITOR_NODE_NAME
$NEW_MONITOR_NODE_NAME
Verify that Ansible can contact the Ceph nodes:
# ansible all -m ping
Change directory to the Ansible configuration directory:
# cd /usr/share/ceph-ansible
Run the Ansible playbook:
$ ansible-playbook site.yml
If adding new monitors to a containerized deployment of Ceph, run the site-docker.yml playbook:
$ ansible-playbook site-docker.yml
- After the Ansible playbook finishes, the new monitor nodes will be in the storage cluster.
1.2.3. Adding a Ceph Monitor using the command-line interface
Red Hat recommends adding two monitors at a time to maintain an odd number of monitors. For example, if you have three monitors in the storage cluster, Red Hat recommends expanding to five monitors.
Red Hat recommends only running one Ceph monitor daemon per node.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Having root access to a running Ceph Monitor node and to the new monitor nodes.
Procedure
Enable the Red Hat Ceph Storage 3 Monitor repository.
Red Hat Enterprise Linux
[root@monitor ~]# subscription-manager repos --enable=rhel-7-server-rhceph-3-mon-els-rpms
Ubuntu
[user@monitor ~]$ sudo bash -c 'umask 0077; echo deb https://$CUSTOMER_NAME:$CUSTOMER_PASSWORD@rhcs.download.redhat.com/3-updates/Tools $(lsb_release -sc) main | tee /etc/apt/sources.list.d/Tools.list'
[user@monitor ~]$ sudo bash -c 'wget -O - https://www.redhat.com/security/fd431d51.txt | apt-key add -'
Install the ceph-mon package on the new Ceph Monitor nodes:
Red Hat Enterprise Linux
[root@monitor ~]# yum install ceph-mon
Ubuntu
[user@monitor ~]$ sudo apt-get install ceph-mon
To ensure the storage cluster identifies the monitor on start or restart, add the monitor’s IP address to the Ceph configuration file.
Add the new monitors to the mon_host setting in the [mon] or [global] section of the Ceph configuration file on an existing monitor node in the storage cluster. The mon_host setting is a list of DNS-resolvable host names or IP addresses, separated by "," or ";" or " ". Optionally, you can also create a specific section in the Ceph configuration file for the new monitor nodes:
Syntax
[mon]
mon host = $MONITOR_IP:$PORT $MONITOR_IP:$PORT ... $NEW_MONITOR_IP:$PORT
or
[mon.$MONITOR_ID]
host = $MONITOR_ID
mon addr = $MONITOR_IP
To make the monitors part of the initial quorum group, you must also add the host name to the mon_initial_members parameter in the [global] section of the Ceph configuration file.
Example
[global]
mon initial members = node1 node2 node3 node4 node5
...
[mon]
mon host = 192.168.0.1:6789 192.168.0.2:6789 192.168.0.3:6789 192.168.0.4:6789 192.168.0.5:6789
...
[mon.node4]
host = node4
mon addr = 192.168.0.4

[mon.node5]
host = node5
mon addr = 192.168.0.5
Important: Production storage clusters REQUIRE at least three monitors set in mon_initial_members and mon_host to ensure high availability. If a storage cluster with only one initial monitor adds two more monitors, but does not add them to mon_initial_members and mon_host, the failure of the initial monitor will cause the storage cluster to lock up. If the monitors you are adding are replacing monitors that are part of mon_initial_members and mon_host, the new monitors must be added to mon_initial_members and mon_host too.
Copy the updated Ceph configuration file to all Ceph nodes and Ceph clients:
Syntax
scp /etc/ceph/$CLUSTER_NAME.conf $TARGET_NODE_NAME:/etc/ceph
Example
[root@monitor ~]# scp /etc/ceph/ceph.conf node4:/etc/ceph
Create the monitor’s data directory on the new monitor nodes:
Syntax
mkdir /var/lib/ceph/mon/$CLUSTER_NAME-$MONITOR_ID
Example
[root@monitor ~]# mkdir /var/lib/ceph/mon/ceph-node4
Create a temporary directory on a running monitor node and on the new monitor nodes to keep the files needed for this procedure. This directory should be different from the monitor’s default directory created in the previous step, and can be removed after all the steps are completed:
Syntax
mkdir $TEMP_DIRECTORY
Example
[root@monitor ~]# mkdir /tmp/ceph
Copy the admin key from a running monitor node to the new monitor nodes so that you can run ceph commands:
Syntax
scp /etc/ceph/$CLUSTER_NAME.client.admin.keyring $TARGET_NODE_NAME:/etc/ceph
Example
[root@monitor ~]# scp /etc/ceph/ceph.client.admin.keyring node4:/etc/ceph
From a running monitor node, retrieve the monitor keyring:
Syntax
ceph auth get mon. -o /$TEMP_DIRECTORY/$KEY_FILE_NAME
Example
[root@monitor ~]# ceph auth get mon. -o /tmp/ceph/ceph_keyring.out
From a running monitor node, retrieve the monitor map:
Syntax
ceph mon getmap -o /$TEMP_DIRECTORY/$MONITOR_MAP_FILE
Example
[root@monitor ~]# ceph mon getmap -o /tmp/ceph/ceph_mon_map.out
Copy the collected monitor data to the new monitor nodes:
Syntax
scp -r /tmp/ceph $TARGET_NODE_NAME:/tmp/ceph
Example
[root@monitor ~]# scp -r /tmp/ceph node4:/tmp/ceph
Prepare the new monitors' data directory from the data you collected earlier. You must specify the path to the monitor map to retrieve quorum information from the monitors, along with their fsid. You must also specify a path to the monitor keyring:
Syntax
ceph-mon -i $MONITOR_ID --mkfs --monmap /$TEMP_DIRECTORY/$MONITOR_MAP_FILE --keyring /$TEMP_DIRECTORY/$KEY_FILE_NAME
Example
[root@monitor ~]# ceph-mon -i node4 --mkfs --monmap /tmp/ceph/ceph_mon_map.out --keyring /tmp/ceph/ceph_keyring.out
For storage clusters with custom names, add the following line to the appropriate file:
Red Hat Enterprise Linux
[root@monitor ~]# echo "CLUSTER=<custom_cluster_name>" >> /etc/sysconfig/ceph
Ubuntu
[user@monitor ~]$ sudo bash -c 'echo "CLUSTER=<custom_cluster_name>" >> /etc/default/ceph'
Update the owner and group permissions on the new monitor nodes:
Syntax
chown -R $OWNER:$GROUP $DIRECTORY_PATH
Example
[root@monitor ~]# chown -R ceph:ceph /var/lib/ceph/mon
[root@monitor ~]# chown -R ceph:ceph /var/log/ceph
[root@monitor ~]# chown -R ceph:ceph /var/run/ceph
[root@monitor ~]# chown -R ceph:ceph /etc/ceph
Enable and start the ceph-mon process on the new monitor nodes:
Syntax
systemctl enable ceph-mon.target
systemctl enable ceph-mon@$MONITOR_ID
systemctl start ceph-mon@$MONITOR_ID
Example
[root@monitor ~]# systemctl enable ceph-mon.target
[root@monitor ~]# systemctl enable ceph-mon@node4
[root@monitor ~]# systemctl start ceph-mon@node4
Additional Resources
- See the Enabling the Red Hat Ceph Storage Repositories section in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
1.2.4. Removing a Ceph Monitor using Ansible
To remove a Ceph Monitor with Ansible, use the shrink-mon.yml playbook.
Prerequisites
- An Ansible administration node.
- A running Red Hat Ceph Storage cluster deployed by Ansible.
Procedure
Change to the /usr/share/ceph-ansible/ directory:
[user@admin ~]$ cd /usr/share/ceph-ansible
Copy the shrink-mon.yml playbook from the infrastructure-playbooks directory to the current directory:
[root@admin ceph-ansible]# cp infrastructure-playbooks/shrink-mon.yml .
Run the shrink-mon.yml playbook for either normal or containerized deployments of Red Hat Ceph Storage:
[user@admin ceph-ansible]$ ansible-playbook shrink-mon.yml -e mon_to_kill=<hostname> -u <ansible-user>
Replace:
- <hostname> with the short host name of the Monitor node. To remove more Monitors, separate their host names with commas.
- <ansible-user> with the name of the Ansible user.
For example, to remove a Monitor that is located on a node with the host name monitor1:
[user@admin ceph-ansible]$ ansible-playbook shrink-mon.yml -e mon_to_kill=monitor1 -u user
- Remove the Monitor entry from all Ceph configuration files in the cluster.
Ensure that the Monitor has been successfully removed.
[root@monitor ~]# ceph -s
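You can also list the monitors that remain in the monitor map explicitly, for example:
[root@monitor ~]# ceph mon dump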
Additional Resources
- For more information on installing Red Hat Ceph Storage, see the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
1.2.5. Removing a Ceph Monitor using the command-line interface
Removing a Ceph Monitor involves removing a ceph-mon daemon from the storage cluster and updating the storage cluster map.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Having root access to the monitor node.
Procedure
Stop the monitor service:
Syntax
systemctl stop ceph-mon@$MONITOR_ID
Example
[root@monitor ~]# systemctl stop ceph-mon@node3
Remove the monitor from the storage cluster:
Syntax
ceph mon remove $MONITOR_ID
Example
[root@monitor ~]# ceph mon remove node3
- Remove the monitor entry from the Ceph configuration file, by default /etc/ceph/ceph.conf.
Redistribute the Ceph configuration file to all remaining Ceph nodes in the storage cluster:
Syntax
scp /etc/ceph/$CLUSTER_NAME.conf $USER_NAME@$TARGET_NODE_NAME:/etc/ceph/
Example
[root@monitor ~]# scp /etc/ceph/ceph.conf root@node1:/etc/ceph/
Containers only. Disable the monitor service:
Note: Perform steps 5-9 only if using containers.
Syntax
systemctl disable ceph-mon@$MONITOR_ID
Example
[root@monitor ~]# systemctl disable ceph-mon@node3
Containers only. Remove the service from systemd:
[root@monitor ~]# rm /etc/systemd/system/ceph-mon@.service
Containers only. Reload the systemd manager configuration:
[root@monitor ~]# systemctl daemon-reload
Containers only. Reset the state of the failed monitor unit:
[root@monitor ~]# systemctl reset-failed
Containers only. Remove the
ceph-mon
RPM:[root@monitor ~]# docker exec node3 yum remove ceph-mon
Archive the monitor data:
Syntax
mv /var/lib/ceph/mon/$CLUSTER_NAME-$MONITOR_ID /var/lib/ceph/mon/removed-$CLUSTER_NAME-$MONITOR_ID
Example
[root@monitor ~]# mv /var/lib/ceph/mon/ceph-node3 /var/lib/ceph/mon/removed-ceph-node3
Delete the monitor data:
Syntax
rm -r /var/lib/ceph/mon/$CLUSTER_NAME-$MONITOR_ID
Example
[root@monitor ~]# rm -r /var/lib/ceph/mon/ceph-node3
Additional Resources
- See the Knowledgebase solution How to re-deploy Ceph Monitor in a director deployed Ceph cluster for more information.
1.2.6. Removing a Ceph Monitor from an unhealthy storage cluster
This procedure removes a ceph-mon daemon from an unhealthy storage cluster, that is, a storage cluster that has placement groups persistently not in the active+clean state.
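If the monitors still form a quorum, you can review the overall health and the placement group states to confirm this condition, for example:
[root@monitor ~]# ceph health detail
[root@monitor ~]# ceph pg stat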
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Having root access to the monitor node.
- At least one running Ceph Monitor node.
Procedure
Identify a surviving monitor and log in to that node:
[root@monitor ~]# ceph mon dump
[root@monitor ~]# ssh $MONITOR_HOST_NAME
Stop the ceph-mon daemon and extract a copy of the monmap file:
Syntax
systemctl stop ceph-mon@$MONITOR_ID
ceph-mon -i $MONITOR_ID --extract-monmap $TEMPORARY_PATH
Example
[root@monitor ~]# systemctl stop ceph-mon@node1
[root@monitor ~]# ceph-mon -i node1 --extract-monmap /tmp/monmap
Remove the non-surviving monitor(s):
Syntax
monmaptool $TEMPORARY_PATH --rm $MONITOR_ID
Example
[root@monitor ~]# monmaptool /tmp/monmap --rm node2
Inject the monitor map, with the non-surviving monitor(s) removed, into the surviving monitor:
Syntax
ceph-mon -i $MONITOR_ID --inject-monmap $TEMPORARY_PATH
Example
[root@monitor ~]# ceph-mon -i node1 --inject-monmap /tmp/monmap
1.3. Ceph OSDs
When a Red Hat Ceph Storage cluster is up and running, you can add OSDs to the storage cluster at runtime.
A Ceph OSD generally consists of one ceph-osd daemon for one storage drive and its associated journal within a node. If a node has multiple storage drives, then map one ceph-osd daemon for each drive.
Red Hat recommends checking the capacity of a cluster regularly to see if it is reaching the upper end of its storage capacity. As a storage cluster reaches its near full ratio, add one or more OSDs to expand the storage cluster’s capacity.
When you want to reduce the size of a Red Hat Ceph Storage cluster or replace the hardware, you can also remove an OSD at runtime. If the node has multiple storage drives, you might also need to remove one of the ceph-osd daemons for that drive. Generally, it’s a good idea to check the capacity of the storage cluster to see if you are reaching the upper end of its capacity. Ensure that when you remove an OSD, the storage cluster is not at its near full ratio.
Do not let a storage cluster reach the full ratio before adding an OSD. OSD failures that occur after the storage cluster reaches the near full ratio can cause the storage cluster to exceed the full ratio. Ceph blocks write access to protect the data until you resolve the storage capacity issues. Do not remove OSDs without considering the impact on the full ratio first.
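To check how close the storage cluster is to the near full ratio, you can review the overall capacity and the configured ratios, for example:
[root@monitor ~]# ceph df
[root@monitor ~]# ceph osd dump | grep ratio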
1.3.1. Ceph OSD node configuration
Ceph OSDs and their supporting hardware should be similarly configured as a storage strategy for the pool(s) that will use the OSDs. Ceph prefers uniform hardware across pools for a consistent performance profile. For best performance, consider a CRUSH hierarchy with drives of the same type or size. See the Storage Strategies guide for more details.
If you add drives of dissimilar size, then you will need to adjust their weights accordingly. When you add the OSD to the CRUSH map, consider the weight for the new OSD. Hard drive capacity grows approximately 40% per year, so newer OSD nodes might have larger hard drives than older nodes in the storage cluster, that is, they might have a greater weight.
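For example, to review the current CRUSH hierarchy and then adjust the CRUSH weight of an OSD backed by a larger drive, you might run commands similar to the following; the OSD ID and weight shown here are only illustrative:
[root@monitor ~]# ceph osd tree
[root@monitor ~]# ceph osd crush reweight osd.4 1.8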
Before doing a new installation, review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
1.3.2. Mapping a container OSD ID to a drive
Sometimes it is necessary to identify which drive a containerized OSD is using. For example, if an OSD has an issue you might need to know which drive it uses to verify the drive status. Also, for a non-containerized OSD you reference the OSD ID to start and stop it, but to start and stop a containerized OSD you must reference the drive it uses.
Prerequisites
- A running Red Hat Ceph Storage cluster in a containerized environment.
- Having root access to the container host.
Procedure
Find a container name. For example, to identify the drive associated with osd.5, open a terminal on the container node where osd.5 is running, and then run docker ps to list all containers:
Example
[root@ceph3 ~]# docker ps
CONTAINER ID   IMAGE                                                      COMMAND            CREATED             STATUS              PORTS   NAMES
3a866f927b74   registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "/entrypoint.sh"   About an hour ago   Up About an hour            ceph-osd-ceph3-sdd
91f3d4829079   registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "/entrypoint.sh"   22 hours ago        Up 22 hours                 ceph-osd-ceph3-sdb
73dfe4021a49   registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "/entrypoint.sh"   7 days ago          Up 7 days                   ceph-osd-ceph3-sdf
90f6d756af39   registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "/entrypoint.sh"   7 days ago          Up 7 days                   ceph-osd-ceph3-sde
e66d6e33b306   registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "/entrypoint.sh"   7 days ago          Up 7 days                   ceph-mgr-ceph3
733f37aafd23   registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "/entrypoint.sh"   7 days ago          Up 7 days                   ceph-mon-ceph3
Use docker exec to run ceph-volume lvm list on any OSD container name from the previous output:
Example
[root@ceph3 ~]# docker exec ceph-osd-ceph3-sdb ceph-volume lvm list
====== osd.5 =======

  [journal]    /dev/journals/journal1

      journal uuid              C65n7d-B1gy-cqX3-vZKY-ZoE0-IEYM-HnIJzs
      osd id                    1
      cluster fsid              ce454d91-d748-4751-a318-ff7f7aa18ffd
      type                      journal
      osd fsid                  661b24f8-e062-482b-8110-826ffe7f13fa
      data uuid                 SlEgHe-jX1H-QBQk-Sce0-RUls-8KlY-g8HgcZ
      journal device            /dev/journals/journal1
      data device               /dev/test_group/data-lv2
      devices                   /dev/sda

  [data]    /dev/test_group/data-lv2

      journal uuid              C65n7d-B1gy-cqX3-vZKY-ZoE0-IEYM-HnIJzs
      osd id                    1
      cluster fsid              ce454d91-d748-4751-a318-ff7f7aa18ffd
      type                      data
      osd fsid                  661b24f8-e062-482b-8110-826ffe7f13fa
      data uuid                 SlEgHe-jX1H-QBQk-Sce0-RUls-8KlY-g8HgcZ
      journal device            /dev/journals/journal1
      data device               /dev/test_group/data-lv2
      devices                   /dev/sdb
From this output you can see that osd.5 is associated with /dev/sdb.
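Because containerized OSDs are managed by drive name rather than by OSD ID, you would then reference that drive when starting or stopping the daemon. A minimal sketch, assuming the systemd unit naming used by containerized deployments (ceph-osd@$DRIVE_NAME):
[root@ceph3 ~]# systemctl stop ceph-osd@sdb
[root@ceph3 ~]# systemctl start ceph-osd@sdb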
Additional Resources
- See Replacing a failed OSD disk for more information.
1.3.3. Adding a Ceph OSD using Ansible with the same disk topology
For Ceph OSDs with the same disk topology, Ansible will add the same number of OSDs as other OSD nodes using the same device paths specified in the devices: section of the /usr/share/ceph-ansible/group_vars/osds file.
The new Ceph OSD node(s) will have the same configuration as the rest of the OSDs.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
- Having root access to the new nodes.
- The same number of OSD data drives as other OSD nodes in the storage cluster.
Procedure
Add the Ceph OSD node(s) to the /etc/ansible/hosts file, under the [osds] section:
Example
[osds]
...
osd06
$NEW_OSD_NODE_NAME
Verify that Ansible can reach the Ceph nodes:
[user@admin ~]$ ansible all -m ping
Navigate to the Ansible configuration directory:
[user@admin ~]$ cd /usr/share/ceph-ansible
Copy the add-osd.yml file to the /usr/share/ceph-ansible/ directory:
[user@admin ceph-ansible]$ sudo cp infrastructure-playbooks/add-osd.yml .
Run the Ansible playbook for either normal or containerized deployments of Ceph:
[user@admin ceph-ansible]$ ansible-playbook add-osd.yml
Note: When adding an OSD, if the playbook fails with PGs were not reported as active+clean, configure the following variables in the all.yml file to adjust the retries and delay:
# OSD handler checks
handler_health_osd_check_retries: 50
handler_health_osd_check_delay: 30
1.3.4. Adding a Ceph OSD using Ansible with different disk topologies
For Ceph OSDs with different disk topologies, there are two approaches for adding the new OSD node(s) to an existing storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
- Having root access to the new nodes.
Procedure
First Approach
Add the new Ceph OSD node(s) to the /etc/ansible/hosts file, under the [osds] section:
Example
[osds]
...
osd06
$NEW_OSD_NODE_NAME
Create a new file for each new Ceph OSD node added to the storage cluster, under the /etc/ansible/host_vars/ directory:
Syntax
touch /etc/ansible/host_vars/$NEW_OSD_NODE_NAME
Example
[root@admin ~]# touch /etc/ansible/host_vars/osd07
Edit the new file, and add the devices: and dedicated_devices: sections to the file. Under each of these sections, add a -, a space, and then the full path to the block device names for this OSD node:
Example
devices:
  - /dev/sdc
  - /dev/sdd
  - /dev/sde
  - /dev/sdf

dedicated_devices:
  - /dev/sda
  - /dev/sda
  - /dev/sdb
  - /dev/sdb
Verify that Ansible can reach all the Ceph nodes:
[user@admin ~]$ ansible all -m ping
Change directory to the Ansible configuration directory:
[user@admin ~]$ cd /usr/share/ceph-ansible
Copy the add-osd.yml file to the /usr/share/ceph-ansible/ directory:
[user@admin ceph-ansible]$ sudo cp infrastructure-playbooks/add-osd.yml .
Run the Ansible playbook:
[user@admin ceph-ansible]$ ansible-playbook add-osd.yml
Second Approach
Add the new OSD node name to the /etc/ansible/hosts file, and use the devices and dedicated_devices options, specifying the different disk topology:
Example
[osds]
...
osd07 devices="['/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf']" dedicated_devices="['/dev/sda', '/dev/sda', '/dev/sdb', '/dev/sdb']"
Verify that Ansible can reach all the Ceph nodes:
[user@admin ~]$ ansible all -m ping
Change directory to the Ansible configuration directory:
[user@admin ~]$ cd /usr/share/ceph-ansible
Copy the add-osd.yml file to the /usr/share/ceph-ansible/ directory:
[user@admin ceph-ansible]$ sudo cp infrastructure-playbooks/add-osd.yml .
Run the Ansible playbook:
[user@admin ceph-ansible]$ ansible-playbook add-osd.yml
1.3.5. Adding a Ceph OSD using the command-line interface
Here is the high-level workflow for manually adding an OSD to a Red Hat Ceph Storage cluster:
- Install the ceph-osd package and create a new OSD instance.
- Prepare and mount the OSD data and journal drives.
- Add the new OSD node to the CRUSH map.
- Update the owner and group permissions.
- Enable and start the ceph-osd daemon.
The ceph-disk command is deprecated. The ceph-volume command is now the preferred method for deploying OSDs from the command-line interface. Currently, the ceph-volume command only supports the lvm plugin. Red Hat will provide examples throughout this guide using both commands as a reference, allowing time for storage administrators to convert any custom scripts that rely on ceph-disk to ceph-volume instead.
See the Red Hat Ceph Storage Administration Guide for more information on using the ceph-volume command.
For custom storage cluster names, use the --cluster $CLUSTER_NAME option with the ceph and ceph-osd commands.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide for Red Hat Enterprise Linux or Ubuntu.
- Having root access to the new nodes.
Procedure
Enable the Red Hat Ceph Storage 3 OSD software repository.
Red Hat Enterprise Linux
[root@osd ~]# subscription-manager repos --enable=rhel-7-server-rhceph-3-osd-els-rpms
Ubuntu
[user@osd ~]$ sudo bash -c 'umask 0077; echo deb https://customername:customerpasswd@rhcs.download.redhat.com/3-updates/Tools $(lsb_release -sc) main | tee /etc/apt/sources.list.d/Tools.list'
[user@osd ~]$ sudo bash -c 'wget -O - https://www.redhat.com/security/fd431d51.txt | apt-key add -'
Create the /etc/ceph/ directory:
# mkdir /etc/ceph
On the new OSD node, copy the Ceph administration keyring and configuration files from one of the Ceph Monitor nodes:
Syntax
scp $USER_NAME@$MONITOR_HOST_NAME:/etc/ceph/$CLUSTER_NAME.client.admin.keyring /etc/ceph
scp $USER_NAME@$MONITOR_HOST_NAME:/etc/ceph/$CLUSTER_NAME.conf /etc/ceph
Example
[root@osd ~]# scp root@node1:/etc/ceph/ceph.client.admin.keyring /etc/ceph/
[root@osd ~]# scp root@node1:/etc/ceph/ceph.conf /etc/ceph/
Install the ceph-osd package on the new Ceph OSD node:
Red Hat Enterprise Linux
[root@osd ~]# yum install ceph-osd
Ubuntu
[user@osd ~]$ sudo apt-get install ceph-osd
Decide if you want to collocate a journal or use a dedicated journal for the new OSDs.
Note: The --filestore option is required.
For OSDs with a collocated journal:
Syntax
[root@osd ~]# ceph-disk --setuser ceph --setgroup ceph prepare --filestore /dev/$DEVICE_NAME
Examples
[root@osd ~]# ceph-disk --setuser ceph --setgroup ceph prepare --filestore /dev/sda
For OSDs with a dedicated journal:
Syntax
[root@osd ~]# ceph-disk --setuser ceph --setgroup ceph prepare --filestore /dev/$DEVICE_NAME /dev/$JOURNAL_DEVICE_NAME
or
[root@osd ~]# ceph-volume lvm prepare --filestore --data /dev/$DEVICE_NAME --journal /dev/$JOURNAL_DEVICE_NAME
Examples
[root@osd ~]# ceph-disk --setuser ceph --setgroup ceph prepare --filestore /dev/sda /dev/sdb
[root@osd ~]# ceph-volume lvm prepare --filestore --data /dev/vg00/lvol1 --journal /dev/sdb
Set the noup option:
[root@osd ~]# ceph osd set noup
Activate the new OSD:
Syntax
[root@osd ~]# ceph-disk activate /dev/$DEVICE_NAME
or
[root@osd ~]# ceph-volume lvm activate --filestore $OSD_ID $OSD_FSID
Example
[root@osd ~]# ceph-disk activate /dev/sda
[root@osd ~]# ceph-volume lvm activate --filestore 0 6cc43680-4f6e-4feb-92ff-9c7ba204120e
Add the OSD to the CRUSH map:
Syntax
ceph osd crush add $OSD_ID $WEIGHT [$BUCKET_TYPE=$BUCKET_NAME ...]
Example
[root@osd ~]# ceph osd crush add 4 1 host=node4
Note: If you specify more than one bucket, the command places the OSD into the most specific bucket out of those you specified, and it moves the bucket underneath any other buckets you specified.
Note: You can also edit the CRUSH map manually. See the Editing a CRUSH map section in the Storage Strategies guide for Red Hat Ceph Storage 3.
Important: If you specify only the root bucket, then the OSD attaches directly to the root, but the CRUSH rules expect OSDs to be inside of the host bucket.
Unset the noup option:
[root@osd ~]# ceph osd unset noup
Update the owner and group permissions for the newly created directories:
Syntax
chown -R $OWNER:$GROUP $PATH_TO_DIRECTORY
Example
[root@osd ~]# chown -R ceph:ceph /var/lib/ceph/osd
[root@osd ~]# chown -R ceph:ceph /var/log/ceph
[root@osd ~]# chown -R ceph:ceph /var/run/ceph
[root@osd ~]# chown -R ceph:ceph /etc/ceph
If you use clusters with custom names, then add the following line to the appropriate file:
Red Hat Enterprise Linux
[root@osd ~]# echo "CLUSTER=$CLUSTER_NAME" >> /etc/sysconfig/ceph
Ubuntu
[user@osd ~]$ sudo bash -c 'echo "CLUSTER=$CLUSTER_NAME" >> /etc/default/ceph'
Replace $CLUSTER_NAME with the custom cluster name.
To ensure that the new OSD is up and ready to receive data, enable and start the OSD service:
Syntax
systemctl enable ceph-osd@$OSD_ID
systemctl start ceph-osd@$OSD_ID
Example
[root@osd ~]# systemctl enable ceph-osd@4
[root@osd ~]# systemctl start ceph-osd@4
1.3.6. Removing a Ceph OSD using Ansible
At times, you might need to scale down the capacity of a Red Hat Ceph Storage cluster. To remove an OSD from a Red Hat Ceph Storage cluster using Ansible, run either the shrink-osd.yml or the shrink-osd-ceph-disk.yml playbook, depending on which OSD scenario is used. If osd_scenario is set to collocated or non-collocated, then use the shrink-osd-ceph-disk.yml playbook. If osd_scenario is set to lvm, then use the shrink-osd.yml playbook.
Removing an OSD from the storage cluster will destroy all the data contained on that OSD.
Prerequisites
- A running Red Hat Ceph Storage deployed by Ansible.
- A running Ansible administration node.
- Root-level access to the Ansible administration node.
Procedure
Change to the /usr/share/ceph-ansible/ directory:
[user@admin ~]$ cd /usr/share/ceph-ansible
- Copy the admin keyring from /etc/ceph/ on the Ceph Monitor node to the node that includes the OSD that you want to remove.
Copy the appropriate playbook from the infrastructure-playbooks directory to the current directory:
[root@admin ceph-ansible]# cp infrastructure-playbooks/shrink-osd.yml .
or
[root@admin ceph-ansible]# cp infrastructure-playbooks/shrink-osd-ceph-disk.yml .
For bare-metal or containers deployments, run the appropriate Ansible playbook:
Syntax
ansible-playbook shrink-osd.yml -e osd_to_kill=$ID -u $ANSIBLE_USER
or
ansible-playbook shrink-osd-ceph-disk.yml -e osd_to_kill=$ID -u $ANSIBLE_USER
Replace:
- $ID with the ID of the OSD. To remove more OSDs, separate the OSD IDs with a comma.
- $ANSIBLE_USER with the name of the Ansible user.
Example
[user@admin ceph-ansible]$ ansible-playbook shrink-osd.yml -e osd_to_kill=1 -u user
or
[user@admin ceph-ansible]$ ansible-playbook shrink-osd-ceph-disk.yml -e osd_to_kill=1 -u user
Verify that the OSD has been successfully removed:
[root@mon ~]# ceph osd tree
Additional Resources
- See the Red Hat Ceph Storage Installation Guide for Red Hat Enterprise Linux or Ubuntu for details.
1.3.7. Removing a Ceph OSD using the command-line interface
Removing an OSD from a storage cluster involves updating the cluster map, removing its authentication key, removing the OSD from the OSD map, and removing the OSD from the ceph.conf file. If the node has multiple drives, you might need to remove an OSD for each drive by repeating this procedure.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Enough available OSDs so that the storage cluster is not at its near full ratio.
- Having root access to the OSD node.
Procedure
Disable and stop the OSD service:
Syntax
systemctl disable ceph-osd@$OSD_ID
systemctl stop ceph-osd@$OSD_ID
Example
[root@osd ~]# systemctl disable ceph-osd@4
[root@osd ~]# systemctl stop ceph-osd@4
Once the OSD is stopped, it is down.
Remove the OSD from the storage cluster:
Syntax
ceph osd out $OSD_ID
Example
[root@osd ~]# ceph osd out 4
Important: Once the OSD is out, Ceph will start rebalancing and copying data to other OSDs in the storage cluster. Red Hat recommends waiting until the storage cluster becomes active+clean before proceeding to the next step. To observe the data migration, run the following command:
[root@monitor ~]# ceph -w
Remove the OSD from the CRUSH map so that it no longer receives data.
Syntax
ceph osd crush remove $OSD_NAME
Example
[root@osd ~]# ceph osd crush remove osd.4
Note: You can also decompile the CRUSH map, remove the OSD from the device list, remove the device as an item in the host bucket, or remove the host bucket. If it is in the CRUSH map and you intend to remove the host, recompile the map and set it. See the Storage Strategies Guide for details.
Remove the OSD authentication key:
Syntax
ceph auth del osd.$OSD_ID
Example
[root@osd ~]# ceph auth del osd.4
Remove the OSD:
Syntax
ceph osd rm $OSD_ID
Example
[root@osd ~]# ceph osd rm 4
Edit the storage cluster’s configuration file, by default /etc/ceph/ceph.conf, and remove the OSD entry, if it exists:
Example
[osd.4]
host = $HOST_NAME
- Remove the reference to the OSD in the /etc/fstab file, if the OSD was added manually.
Copy the updated configuration file to the /etc/ceph/ directory of all other nodes in the storage cluster:
Syntax
scp /etc/ceph/$CLUSTER_NAME.conf $USER_NAME@$HOST_NAME:/etc/ceph/
Example
[root@osd ~]# scp /etc/ceph/ceph.conf root@node4:/etc/ceph/
1.3.8. Replacing a journal using the command-line interface
The procedure to replace a journal when the journal and data devices are on the same physical device, for example when using osd_scenario: collocated, requires replacing the whole OSD. However, on an OSD where the journal is on a separate physical device from the data device, for example when using osd_scenario: non-collocated, you can replace just the journal device.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A new partition or storage device.
Procedure
Set the cluster to noout to prevent backfilling:
[root@osd1 ~]# ceph osd set noout
Stop the OSD where the journal will be changed:
[root@osd1 ~]# systemctl stop ceph-osd@$OSD_ID
Flush the journal on the OSD:
[root@osd1 ~]# ceph-osd -i $OSD_ID --flush-journal
Remove the old journal partition to prevent partition UUID collision with the new partition:
sgdisk --delete=$OLD_PART_NUM -- $OLD_DEV_PATH
Replace:
- $OLD_PART_NUM with the partition number of the old journal device.
- $OLD_DEV_PATH with the path to the old journal device.
Example
[root@osd1 ~]# sgdisk --delete=1 -- /dev/sda
Create the new journal partition on the new device. This sgdisk command will use the next available partition number automatically:
sgdisk --new=0:0:$JOURNAL_SIZE -- $NEW_DEV_PATH
Replace:
- $JOURNAL_SIZE with the journal size appropriate for the environment, for example 10240M.
- $NEW_DEV_PATH with the path to the device to be used for the new journal.
Note: The minimum default size for a journal is 5 GB. Values over 10 GB are typically not needed. For additional details, contact Red Hat Support.
Example
[root@osd1 ~]# sgdisk --new=0:0:10240M -- /dev/sda
Set the proper parameters on the new partition:
sgdisk --change-name=0:"ceph journal" --partition-guid=0:$OLD_PART_UUID --typecode=0:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- $NEW_DEV_PATH
Replace:
- $OLD_PART_UUID with the UUID in the journal_uuid file for the relevant OSD. For example, for OSD 0, use the UUID in /var/lib/ceph/osd/ceph-0/journal_uuid.
- $NEW_DEV_PATH with the path to the device to be used for the new journal.
Example
[root@osd1 ~]# sgdisk --change-name=0:"ceph journal" --partition-guid=0:a1279726-a32d-4101-880d-e8573bb11c16 --typecode=0:097c058d-0758-4199-a787-ce9bacb13f48 --mbrtogpt -- /dev/sda
After running the above sgdisk commands, the new journal partition is prepared for Ceph and the journal can be created on it.
Important: This command cannot be combined with the partition creation command due to limitations in sgdisk causing the partition to not be created correctly.
Create the new journal:
[root@osd1 ~]# ceph-osd -i $OSD_ID --mkjournal
Start the OSD:
[root@osd1 ~]# systemctl start ceph-osd@$OSD_ID
Remove the
noout
flag on the OSDs:[root@osd1 ~]# ceph osd unset noout
Confirm the journal is associated with the correct device:
[root@osd1 ~]# ceph-disk list
1.3.9. Observing the data migration
When you add or remove an OSD to the CRUSH map, Ceph begins rebalancing the data by migrating placement groups to the new or existing OSD(s).
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Recently added or removed an OSD.
Procedure
To observe the data migration:
[root@monitor ~]# ceph -w
- Watch as the placement group states change from active+clean to active, some degraded objects, and finally active+clean when migration completes.
- To exit the utility, press Ctrl + C.
1.4. Recalculating the placement groups
Placement groups (PGs) define the spread of any pool data across the available OSDs. A placement group is built upon the redundancy algorithm chosen for the pool. For 3-way replication, the redundancy is defined to use three different OSDs. For erasure-coded pools, the number of OSDs to use is defined by the number of chunks.
When defining a pool, the number of placement groups defines the granularity with which the data is spread across all available OSDs. The higher the number, the better the equalization of the capacity load. However, because handling placement groups is also important when reconstructing data, the number must be chosen carefully up front. To support the calculation, a tool, the PG calculator, is available.
During the lifetime of a storage cluster, a pool may grow above the initially anticipated limits. With the growing number of drives, a recalculation is recommended. The number of placement groups per OSD should be around 100. When adding more OSDs to the storage cluster, the number of PGs per OSD lowers over time. Starting with 120 drives in the storage cluster and setting the pg_num of the pool to 4000 results in 100 PGs per OSD, given a replication factor of three. Over time, when growing to ten times the number of OSDs, the number of PGs per OSD goes down to ten only. Because a small number of PGs per OSD tends to result in unevenly distributed capacity, consider adjusting the PGs per pool.
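As a rough check of the arithmetic above, the number of PGs per OSD is approximately the pool’s pg_num multiplied by the replication factor and divided by the number of OSDs:
# 120 OSDs, pg_num = 4000, replication factor 3
echo $(( 4000 * 3 / 120 ))     # 100 PGs per OSD
# The same pool after growing to 1200 OSDs
echo $(( 4000 * 3 / 1200 ))    # 10 PGs per OSD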
Adjusting the number of placement groups can be done online. Recalculating is not only a recalculation of the PG numbers, but also involves data relocation, which is a lengthy process. However, data availability is maintained at all times.
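For example, increasing the placement group count for an existing pool is done by raising the pg_num and pgp_num values for that pool; the pool name and target value here are only illustrative:
[root@monitor ~]# ceph osd pool set $POOL_NAME pg_num 4096
[root@monitor ~]# ceph osd pool set $POOL_NAME pgp_num 4096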
Very high numbers of PGs per OSD should be avoided, because reconstruction of all PGs on a failed OSD starts at once. A high number of IOPS is required to perform reconstruction in a timely manner, which might not be available. This would lead to deep I/O queues and high latency, rendering the storage cluster unusable or resulting in long healing times.
Additional Resources
- See the PG calculator for calculating the values by a given use case.
- See the Erasure Code Pools chapter in the Red Hat Ceph Storage Strategies Guide for more information.
1.5. Using the Ceph Manager balancer module
The balancer is a module for Ceph Manager that optimizes the placement of placement groups, or PGs, across OSDs in order to achieve a balanced distribution, either automatically or in a supervised fashion.
Prerequisites
- A running Red Hat Ceph Storage cluster
Start the balancer
Ensure the balancer module is enabled:
[root@mon ~]# ceph mgr module enable balancer
Turn on the balancer module:
[root@mon ~]# ceph balancer on
Modes
There are currently two supported balancer modes:
crush-compat: The CRUSH compat mode uses the compat weight-set feature, introduced in Ceph Luminous, to manage an alternative set of weights for devices in the CRUSH hierarchy. The normal weights should remain set to the size of the device to reflect the target amount of data that you want to store on the device. The balancer then optimizes the weight-set values, adjusting them up or down in small increments in order to achieve a distribution that matches the target distribution as closely as possible. Because PG placement is a pseudorandom process, there is a natural amount of variation in the placement; by optimizing the weights, the balancer counteracts that natural variation.
This mode is fully backwards compatible with older clients. When an OSDMap and CRUSH map are shared with older clients, the balancer presents the optimized weights as the real weights.
The primary restriction of this mode is that the balancer cannot handle multiple CRUSH hierarchies with different placement rules if the subtrees of the hierarchy share any OSDs. Because this configuration makes managing space utilization on the shared OSDs difficult, it is generally not recommended. As such, this restriction is normally not an issue.
upmap: Starting with Luminous, the OSDMap can store explicit mappings for individual OSDs as exceptions to the normal CRUSH placement calculation. These upmap entries provide fine-grained control over the PG mapping. This mode optimizes the placement of individual PGs in order to achieve a balanced distribution. In most cases, this distribution is "perfect," with an equal number of PGs on each OSD (+/-1 PG, as they might not divide evenly).
Important: Using upmap requires that all clients be running Red Hat Ceph Storage 3.x or later, and Red Hat Enterprise Linux 7.5 or later.
To allow use of this feature, you must tell the cluster that it only needs to support luminous or later clients with:
[root@admin ~]# ceph osd set-require-min-compat-client luminous
This command fails if any pre-luminous clients or daemons are connected to the monitors.
Due to a known issue, kernel CephFS clients report themselves as jewel clients. To work around this issue, use the --yes-i-really-mean-it flag:
You can check what client versions are in use with:
[root@admin ~]# ceph features
Warning: In Red Hat Ceph Storage 3.x, the upmap feature is only supported for use by the Ceph Manager balancer module for balancing of PGs as the cluster is used. Manual rebalancing of PGs using the upmap feature is not supported in Red Hat Ceph Storage 3.x.
The default mode is crush-compat. The mode can be changed with:
[root@mon ~]# ceph balancer mode upmap
or:
[root@mon ~]# ceph balancer mode crush-compat
Status
The current status of the balancer can be checked at any time with:
[root@mon ~]# ceph balancer status
Automatic balancing
By default, when turning on the balancer module, automatic balancing is used:
[root@mon ~]# ceph balancer on
The balancer can be turned back off again with:
[root@mon ~]# ceph balancer off
This will use the crush-compat mode, which is backward compatible with older clients and will make small changes to the data distribution over time to ensure that OSDs are equally utilized.
Throttling
No adjustments will be made to the PG distribution if the cluster is degraded, for example, if an OSD has failed and the system has not yet healed itself.
When the cluster is healthy, the balancer throttles its changes such that the percentage of PGs that are misplaced, or need to be moved, is below a threshold of 5% by default. This percentage can be adjusted using the max_misplaced setting. For example, to increase the threshold to 7%:
[root@mon ~]# ceph config-key set mgr/balancer/max_misplaced .07
Supervised optimization
The balancer operation is broken into a few distinct phases:
- Building a plan.
- Evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a plan.
- Executing the plan.
To evaluate and score the current distribution:
[root@mon ~]# ceph balancer eval
To evaluate the distribution for a single pool:
[root@mon ~]# ceph balancer eval <pool-name>
To see greater detail for the evaluation:
[root@mon ~]# ceph balancer eval-verbose ...
To generate a plan using the currently configured mode:
[root@mon ~]# ceph balancer optimize <plan-name>
Replace <plan-name> with a custom plan name.
To see the contents of a plan:
[root@mon ~]# ceph balancer show <plan-name>
To discard old plans:
[root@mon ~]# ceph balancer rm <plan-name>
To see currently recorded plans use the status command:
[root@mon ~]# ceph balancer status
To calculate the quality of the distribution that would result after executing a plan:
[root@mon ~]# ceph balancer eval <plan-name>
To execute the plan:
[root@mon ~]# ceph balancer execute <plan-name>
Note: Only execute the plan if it is expected to improve the distribution. After execution, the plan will be discarded.
1.6. Additional Resources
- See the Placement Groups (PGs) chapter in the Red Hat Ceph Storage Strategies Guide for more information.