Chapter 2. The Ceph File System Metadata Server
As a storage administrator, you can learn about the different states of the Ceph File System (CephFS) Metadata Server (MDS), CephFS MDS ranking mechanics, how to configure the MDS standby daemon, and cache size limits. Knowing these concepts enables you to configure the MDS daemons for a storage environment.
2.1. Prerequisites
- A running and healthy Red Hat Ceph Storage cluster.
- Installation of the Ceph Metadata Server daemons (ceph-mds). See the Management of MDS service using the Ceph Orchestrator section in the Red Hat Ceph Storage File System Guide for details on configuring MDS daemons.
2.2. Metadata Server daemon states
The Metadata Server (MDS) daemons operate in two states:
- Active - manages metadata for files and directories stored on the Ceph File System.
- Standby - serves as a backup, and becomes active when an active MDS daemon becomes unresponsive.
By default, a Ceph File System uses only one active MDS daemon. However, systems with many clients benefit from multiple active MDS daemons.
You can configure the file system to use multiple active MDS daemons so that you can scale metadata performance for larger workloads. The active MDS daemons dynamically share the metadata workload when metadata load patterns change. Note that systems with multiple active MDS daemons still require standby MDS daemons to remain highly available.
What Happens When the Active MDS Daemon Fails
When the active MDS becomes unresponsive, a Ceph Monitor daemon waits a number of seconds equal to the value specified in the mds_beacon_grace option. If the active MDS is still unresponsive after the specified time period has passed, the Ceph Monitor marks the MDS daemon as laggy. One of the standby daemons becomes active, depending on the configuration.
To change the value of mds_beacon_grace, add this option to the Ceph configuration file and specify the new value.
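On clusters that use the centralized configuration database, the same change can usually be made with ceph config set instead of editing the configuration file. The following is a minimal sketch under that assumption. The value of 30 seconds is purely illustrative, and the global scope is an assumption chosen so that the Ceph Monitors and the MDS daemons read the same value.

```shell
# Illustrative value only: choose a grace period that suits your cluster.
GRACE_SECONDS=30

# Build the command first so that it can be reviewed before it runs.
# Assumption: the option is set at global scope so that Monitors and MDS daemons agree.
SET_CMD="ceph config set global mds_beacon_grace ${GRACE_SECONDS}"
echo "${SET_CMD}"

# Run the command only where the ceph CLI is available, for example inside the Cephadm shell.
if command -v ceph >/dev/null 2>&1; then
    ${SET_CMD}
    # Read the value back as seen by the Ceph Monitors.
    ceph config get mon mds_beacon_grace
fi
```

Printing the command before running it makes it easy to review the change on a production cluster.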
2.3. Metadata Server ranks
Each Ceph File System (CephFS) has a number of ranks. By default, there is one rank, and rank numbering starts at zero.
Ranks define how the metadata workload is shared between multiple Metadata Server (MDS) daemons. The number of ranks is the maximum number of MDS daemons that can be active at one time. Each MDS daemon handles a subset of the CephFS metadata that is assigned to that rank.
Each MDS daemon initially starts without a rank. The Ceph Monitor assigns a rank to the daemon. The MDS daemon can only hold one rank at a time. Daemons only lose ranks when they are stopped.
The max_mds setting controls how many ranks will be created.
The actual number of ranks in the CephFS is only increased if a spare daemon is available to accept the new rank.
Rank States
Ranks can be:
- Up - A rank that is assigned to the MDS daemon.
- Failed - A rank that is not associated with any MDS daemon.
- Damaged - A rank that is damaged; its metadata is corrupted or missing. Damaged ranks are not assigned to any MDS daemons until the operator fixes the problem and uses the ceph mds repaired command on the damaged rank.
2.4. Metadata Server cache size limits
You can limit the size of the Ceph File System (CephFS) Metadata Server (MDS) cache by:
- A memory limit: Use the mds_cache_memory_limit option. Red Hat recommends a value between 8 GB and 64 GB for mds_cache_memory_limit. Setting more cache can cause issues with recovery. This limit is approximately 66% of the desired maximum memory use of the MDS.
Important: Red Hat recommends using memory limits instead of inode count limits.
- Inode count: Use the mds_cache_size option. By default, limiting the MDS cache by inode count is disabled.
In addition, you can specify a cache reservation by using the mds_cache_reservation option for MDS operations. The cache reservation is limited as a percentage of the memory or inode limit and is set to 5% by default. The intent of this parameter is to have the MDS maintain an extra reserve of memory for its cache for new metadata operations to use. As a consequence, the MDS should in general operate below its memory limit because it will recall old state from clients to drop unused metadata in its cache.
The mds_cache_reservation option replaces the mds_health_cache_threshold option in all situations, except when MDS nodes send a health alert to the Ceph Monitors indicating the cache is too large. By default, mds_health_cache_threshold is 150% of the maximum cache size.
Be aware that the cache limit is not a hard limit. Potential bugs in the CephFS client or MDS, or misbehaving applications, might cause the MDS to exceed its cache size. The mds_health_cache_threshold option configures the storage cluster health warning message, so that operators can investigate why the MDS cannot shrink its cache.
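The relationship between the memory limit and the total memory use of the MDS can be worked through with simple arithmetic. The following sketch assumes a hypothetical MDS that should use at most about 16 GiB of memory and derives the cache limit as roughly 66% of that figure. The variable names and the 16 GiB target are illustrative, not recommendations.

```shell
# Hypothetical target: the MDS daemon should use at most about 16 GiB of memory.
MDS_MAX_MEMORY_GIB=16

# The cache limit is roughly 66% of the desired maximum memory use (integer arithmetic).
CACHE_LIMIT_BYTES=$(( MDS_MAX_MEMORY_GIB * 1024 * 1024 * 1024 * 66 / 100 ))
echo "mds_cache_memory_limit=${CACHE_LIMIT_BYTES}"

# Apply the limit only where the ceph CLI is available, for example inside the Cephadm shell.
if command -v ceph >/dev/null 2>&1; then
    ceph config set mds mds_cache_memory_limit "${CACHE_LIMIT_BYTES}"
fi
```

With these inputs the computed limit is a little over 11 GB, which falls inside the recommended range of 8 GB to 64 GB.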
Additional Resources
- See the Metadata Server daemon configuration reference section in the Red Hat Ceph Storage File System Guide for more information.
2.5. File system affinity
You can configure a Ceph File System (CephFS) to prefer a particular Ceph Metadata Server (MDS) over another Ceph MDS. For example, you might have an MDS running on newer, faster hardware that you want to prefer over a standby MDS running on older, possibly slower hardware. You can specify this preference by setting the mds_join_fs option, which enforces this file system affinity.
Ceph Monitors give preference to MDS standby daemons with mds_join_fs equal to the name of the file system with the failed rank. The standby-replay daemons are selected before choosing another standby daemon. If no standby daemon exists with the mds_join_fs option, then the Ceph Monitors choose an ordinary standby for replacement or any other available standby as a last resort. The Ceph Monitors periodically examine Ceph File Systems to see if a standby with a stronger affinity is available to replace a Ceph MDS that has a lower affinity.
Additional Resources
- See the Configuring file system affinity section in the Red Hat Ceph Storage File System Guide for details.
2.6. Management of MDS service using the Ceph Orchestrator
As a storage administrator, you can use the Ceph Orchestrator with Cephadm in the backend to deploy the MDS service. By default, a Ceph File System (CephFS) uses only one active MDS daemon. However, systems with many clients benefit from multiple active MDS daemons.
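This section covers the following administrative tasks:
- Deploying the MDS service using the command line interface.
- Deploying the MDS service using the service specification.
- Removing the MDS service using the Ceph Orchestrator.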
2.6.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to all the nodes.
- Hosts are added to the cluster.
- All manager, monitor, and OSD daemons are deployed.
2.6.2. Deploying the MDS service using the command line interface
Using the Ceph Orchestrator, you can deploy the Metadata Server (MDS) service using the placement specification in the command line interface. Ceph File System (CephFS) requires one or more MDS daemons.
Ensure you have at least two pools, one for Ceph file system (CephFS) data and one for CephFS metadata.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager, monitor, and OSD daemons are deployed.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
- There are two ways of deploying MDS daemons using placement specification:
Method 1
Use ceph fs volume to create the MDS daemons. This creates the CephFS volume and the pools associated with the CephFS, and also starts the MDS service on the hosts.
Syntax
ceph fs volume create FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"
Note: By default, replicated pools are created for this command.
Example
[ceph: root@host01 /]# ceph fs volume create test --placement="2 host01 host02"
Method 2
Create the pools, CephFS, and then deploy MDS service using placement specification:
Create the pools for CephFS:
Syntax
ceph osd pool create DATA_POOL [PG_NUM]
ceph osd pool create METADATA_POOL [PG_NUM]
Example
[ceph: root@host01 /]# ceph osd pool create cephfs_data 64
[ceph: root@host01 /]# ceph osd pool create cephfs_metadata 64
Typically, the metadata pool can start with a conservative number of Placement Groups (PGs) because it generally has far fewer objects than the data pool. You can increase the number of PGs if needed. The pool sizes range from 64 PGs to 512 PGs. Size the data pool in proportion to the number and sizes of files you expect in the file system.
Important: For the metadata pool, consider using:
- A higher replication level because any data loss to this pool can make the whole file system inaccessible.
- Storage with lower latency such as Solid-State Drive (SSD) disks because this directly affects the observed latency of file system operations on clients.
Create the file system for the data pools and metadata pools:
Syntax
ceph fs new FILESYSTEM_NAME METADATA_POOL DATA_POOL
Example
[ceph: root@host01 /]# ceph fs new test cephfs_metadata cephfs_data
Deploy the MDS service using the ceph orch apply command:
Syntax
ceph orch apply mds FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"
Example
[ceph: root@host01 /]# ceph orch apply mds test --placement="2 host01 host02"
Verification
List the service:
Example
[ceph: root@host01 /]# ceph orch ls
Check the CephFS status:
Example
[ceph: root@host01 /]# ceph fs ls
[ceph: root@host01 /]# ceph fs status
List the hosts, daemons, and processes:
Syntax
ceph orch ps --daemon_type=DAEMON_NAME
Example
[ceph: root@host01 /]# ceph orch ps --daemon_type=mds
Additional Resources
- See the Red Hat Ceph Storage File System Guide for more information about creating the Ceph File System (CephFS).
- For information on pools, see Pools.
2.6.3. Deploying the MDS service using the service specification
Using the Ceph Orchestrator, you can deploy the MDS service using the service specification.
Ensure you have at least two pools, one for the Ceph File System (CephFS) data and one for the CephFS metadata.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager, monitor, and OSD daemons are deployed.
Procedure
Create the mds.yaml file:
Example
[root@host01 ~]# touch mds.yaml
Edit the mds.yaml file to include the following details:
Syntax
service_type: mds
service_id: FILESYSTEM_NAME
placement:
  hosts:
  - HOST_NAME_1
  - HOST_NAME_2
  - HOST_NAME_3
Example
service_type: mds
service_id: fs_name
placement:
  hosts:
  - host01
  - host02
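Because YAML is sensitive to indentation, it can help to generate the specification file from a script rather than typing it by hand. The following is a minimal sketch that writes the example specification with a here-document. The FS_NAME and SPEC_FILE variables are hypothetical names introduced for illustration, and the host names are the ones used in the example.

```shell
# Hypothetical values: replace the file system name and the hosts with your own.
FS_NAME="fs_name"
SPEC_FILE="mds.yaml"

# Write the service specification with the indentation that YAML requires.
cat > "${SPEC_FILE}" <<EOF
service_type: mds
service_id: ${FS_NAME}
placement:
  hosts:
  - host01
  - host02
EOF

# Print the file so that the result can be reviewed before it is applied.
cat "${SPEC_FILE}"
```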
Mount the YAML file under a directory in the container:
Example
[root@host01 ~]# cephadm shell --mount mds.yaml:/var/lib/ceph/mds/mds.yaml
Navigate to the directory:
Example
[ceph: root@host01 /]# cd /var/lib/ceph/mds/
Deploy MDS service using service specification:
Syntax
ceph orch apply -i FILE_NAME.yaml
Example
[ceph: root@host01 mds]# ceph orch apply -i mds.yaml
After the MDS service is deployed and functional, create the CephFS:
Syntax
ceph fs new CEPHFS_NAME METADATA_POOL DATA_POOL
Example
[ceph: root@host01 /]# ceph fs new test metadata_pool data_pool
Verification
List the service:
Example
[ceph: root@host01 /]# ceph orch ls
List the hosts, daemons, and processes:
Syntax
ceph orch ps --daemon_type=DAEMON_NAME
Example
[ceph: root@host01 /]# ceph orch ps --daemon_type=mds
Additional Resources
- See the Red Hat Ceph Storage File System Guide for more information about creating the Ceph File System (CephFS).
2.6.4. Removing the MDS service using the Ceph Orchestrator
You can remove the service using the ceph orch rm command. Alternatively, you can remove the file system and the associated pools.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to all the nodes.
- Hosts are added to the cluster.
- At least one MDS daemon deployed on the hosts.
Procedure
- There are two ways of removing MDS daemons from the cluster:
Method 1
Remove the CephFS volume, associated pools, and the services:
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Set the configuration parameter mon_allow_pool_delete to true:
Example
[ceph: root@host01 /]# ceph config set mon mon_allow_pool_delete true
Remove the file system:
Syntax
ceph fs volume rm FILESYSTEM_NAME --yes-i-really-mean-it
Example
[ceph: root@host01 /]# ceph fs volume rm cephfs-new --yes-i-really-mean-it
This command removes the file system and its data and metadata pools. It also tries to remove the MDS using the enabled ceph-mgr Orchestrator module.
Method 2
Use the ceph orch rm command to remove the MDS service from the entire cluster:
List the service:
Example
[ceph: root@host01 /]# ceph orch ls
Remove the service:
Syntax
ceph orch rm SERVICE_NAME
Example
[ceph: root@host01 /]# ceph orch rm mds.test
Verification
List the hosts, daemons, and processes:
Syntax
ceph orch ps
Example
[ceph: root@host01 /]# ceph orch ps
Additional Resources
- See the Deploying the MDS service using the command line interface section in the Red Hat Ceph Storage Operations Guide for more information.
- See the Deploying the MDS service using the service specification section in the Red Hat Ceph Storage Operations Guide for more information.
2.7. Configuring file system affinity
Set the Ceph File System (CephFS) affinity for a particular Ceph Metadata Server (MDS).
Prerequisites
- A healthy and running Ceph File System.
- Root-level access to a Ceph Monitor node.
Procedure
Check the current state of a Ceph File System:
Example
[root@mon ~]# ceph fs dump
dumped fsmap epoch 399
...
Filesystem 'cephfs01' (27)
...
e399
max_mds 1
in 0
up {0=20384}
failed
damaged
stopped
...
[mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
Standby daemons:
[mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
Set the file system affinity:
Syntax
ceph config set STANDBY_DAEMON mds_join_fs FILE_SYSTEM_NAME
Example
[root@mon ~]# ceph config set mds.b mds_join_fs cephfs01
After a Ceph MDS failover event, the file system favors the standby daemon for which the affinity is set.
Example
[root@mon ~]# ceph fs dump
dumped fsmap epoch 405
e405
...
Filesystem 'cephfs01' (27)
...
max_mds 1
in 0
up {0=10420}
failed
damaged
stopped
...
[mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
Standby daemons:
[mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]
The mds.b daemon now has join_fscid=27 in the file system dump output.
Important: If a file system is in a degraded or undersized state, then no failover will occur to enforce the file system affinity.
Additional Resources
- See the File system affinity section in the Red Hat Ceph Storage File System Guide for more details.
2.8. Configuring multiple active Metadata Server daemons
Configure multiple active Metadata Server (MDS) daemons to scale metadata performance for large systems.
Do not convert all standby MDS daemons to active ones. A Ceph File System (CephFS) requires at least one standby MDS daemon to remain highly available.
Prerequisites
- Ceph administration capabilities on the MDS node.
- Root-level access to a Ceph Monitor node.
Procedure
Set the max_mds parameter to the desired number of active MDS daemons:
Syntax
ceph fs set NAME max_mds NUMBER
Example
[root@mon ~]# ceph fs set cephfs max_mds 2
This example increases the number of active MDS daemons to two in the CephFS called cephfs.
Note: Ceph only increases the actual number of ranks in the CephFS if a spare MDS daemon is available to take the new rank.
Verify the number of active MDS daemons:
Syntax
ceph fs status NAME
Example
[root@mon ~]# ceph fs status cephfs
cephfs - 0 clients
======
+------+--------+-------+---------------+-------+-------+--------+--------+
| RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
+------+--------+-------+---------------+-------+-------+--------+--------+
|  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
|  1   | active | node2 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
+------+--------+-------+---------------+-------+-------+--------+--------+
+-----------------+----------+-------+-------+
|       POOL      |   TYPE   |  USED | AVAIL |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  4638 | 26.7G |
|   cephfs_data   |   data   |    0  | 26.7G |
+-----------------+----------+-------+-------+
+-------------+
| STANDBY MDS |
+-------------+
|    node3    |
+-------------+
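Because at least one MDS daemon should stay in standby, the largest sensible max_mds value depends on how many MDS daemons are deployed. The following sketch illustrates that arithmetic. The max_active_mds function is a hypothetical helper written for this example, not a Ceph command, and it only prints the resulting ceph fs set command instead of running it.

```shell
# Return the largest max_mds value that still leaves the wanted number of standby daemons.
# Arguments: total deployed MDS daemons, standby daemons to keep (both are example inputs).
max_active_mds() {
    total="$1"
    standby="$2"
    active=$(( total - standby ))
    # A file system always needs at least one active MDS daemon.
    if [ "${active}" -lt 1 ]; then
        active=1
    fi
    echo "${active}"
}

# Example: three MDS daemons are deployed and one is kept as a standby.
MAX_MDS=$(max_active_mds 3 1)
echo "ceph fs set cephfs max_mds ${MAX_MDS}"
```

With three daemons and one standby, the helper yields two active daemons, which matches the layout in the status output above: node1 and node2 are active, and node3 is the standby.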
Additional Resources
- See the Metadata Server daemons states section in the Red Hat Ceph Storage File System Guide for more details.
- See the Decreasing the number of active MDS Daemons section in the Red Hat Ceph Storage File System Guide for more details.
- See the Managing Ceph users section in the Red Hat Ceph Storage Administration Guide for more details.
2.9. Configuring the number of standby daemons
Each Ceph File System (CephFS) can specify the required number of standby daemons to be considered healthy. This number also includes the standby-replay daemon waiting for a rank failure.
Prerequisites
- Root-level access to a Ceph Monitor node.
Procedure
Set the expected number of standby daemons for a particular CephFS:
Syntax
ceph fs set FS_NAME standby_count_wanted NUMBER
Note: Setting NUMBER to zero disables the daemon health check.
Example
[root@mon ~]# ceph fs set cephfs standby_count_wanted 2
This example sets the expected standby daemon count to two.
2.10. Configuring the standby-replay Metadata Server
Configure each Ceph File System (CephFS) by adding a standby-replay Metadata Server (MDS) daemon. Doing this reduces failover time if the active MDS becomes unavailable.
This specific standby-replay daemon follows the active MDS’s metadata journal. The standby-replay daemon is only used by the active MDS of the same rank, and is not available to other ranks.
If using standby-replay, then every active MDS must have a standby-replay daemon.
Prerequisites
- Root-level access to a Ceph Monitor node.
Procedure
Set the standby-replay for a particular CephFS:
Syntax
ceph fs set FS_NAME allow_standby_replay 1
Example
[root@mon ~]# ceph fs set cephfs allow_standby_replay 1
In this example, the Boolean value is 1, which enables the standby-replay daemons to be assigned to the active Ceph MDS daemons.
Additional Resources
- See the Using the ceph mds fail command section in the Red Hat Ceph Storage File System Guide for details.
2.11. Ephemeral pinning policies
An ephemeral pin is a static partition of subtrees, and can be set with a policy using extended attributes. A policy can automatically set ephemeral pins on directories. When an ephemeral pin is set on a directory, the directory is automatically assigned to a particular rank so that pinned directories are uniformly distributed across all Ceph MDS ranks. The rank is determined by a consistent hash of the directory's inode number. Ephemeral pins do not persist when the directory's inode is dropped from the file system cache. When failing over a Ceph Metadata Server (MDS), the ephemeral pin is recorded in its journal so the Ceph MDS standby server does not lose this information.
Note: Installation of the attr package is a prerequisite for the ephemeral pinning policies.
There are two types of policies for using ephemeral pins:
- Distributed
This policy enforces that all of a directory's immediate children must be ephemerally pinned. For example, use a distributed policy to spread users' home directories across the entire Ceph File System cluster. Enable this policy by setting the ceph.dir.pin.distributed extended attribute.
setfattr -n ceph.dir.pin.distributed -v 1 DIRECTORY_PATH
- Random
This policy enforces a chance that any descendant subdirectory might be ephemerally pinned. You can customize the percentage of directories that can be ephemerally pinned. Enable this policy by setting the ceph.dir.pin.random extended attribute to a percentage. Red Hat recommends setting this percentage to a value smaller than 1% (0.01). Having too many subtree partitions can cause slow performance. You can set the maximum percentage by setting the mds_export_ephemeral_random_max Ceph MDS configuration option. The parameters mds_export_ephemeral_distributed and mds_export_ephemeral_random are already enabled.
setfattr -n ceph.dir.pin.random -v PERCENTAGE DIRECTORY_PATH
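As an illustration of the two policies, the following sketch builds one command for each. The mount point /mnt/cephfs, the home and tmp directories, and the value 0.001 are all assumptions made for this example. The commands run only if the directories exist and the setfattr tool is installed; otherwise they are just printed.

```shell
# Hypothetical mount point and directories: adjust them to your CephFS client mount.
CEPHFS_MOUNT="/mnt/cephfs"
HOME_DIR="${CEPHFS_MOUNT}/home"
TMP_DIR="${CEPHFS_MOUNT}/tmp"
# 0.001 corresponds to about 0.1% of descendant directories, below the 1% (0.01) guidance.
RANDOM_FRACTION="0.001"

# Build both commands first so that they can be reviewed before they run.
DISTRIBUTED_CMD="setfattr -n ceph.dir.pin.distributed -v 1 ${HOME_DIR}"
RANDOM_CMD="setfattr -n ceph.dir.pin.random -v ${RANDOM_FRACTION} ${TMP_DIR}"
echo "${DISTRIBUTED_CMD}"
echo "${RANDOM_CMD}"

# Apply the policies only when the directories exist and setfattr is installed.
if command -v setfattr >/dev/null 2>&1 && [ -d "${HOME_DIR}" ] && [ -d "${TMP_DIR}" ]; then
    ${DISTRIBUTED_CMD}
    ${RANDOM_CMD}
fi
```

The distributed policy suits a directory such as home, where each immediate child is a natural unit of work, while the random policy spreads load in deep trees without pinning every directory.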
Additional Resources
- See the Manually pinning directory trees to a particular rank section in the Red Hat Ceph Storage File System Guide for details on manually setting pins.
2.12. Manually pinning directory trees to a particular rank
Sometimes it might be desirable to override the dynamic balancer with explicit mappings of metadata to a particular Ceph Metadata Server (MDS) rank. You can do this manually to evenly spread the load of an application or to limit the impact of users' metadata requests on the Ceph File System cluster. Manually pinning a directory is also known as setting an export pin, which you do by setting the ceph.dir.pin extended attribute.
A directory's export pin is inherited from its closest parent directory, but can be overridden by setting an export pin on that directory. Setting an export pin on a directory affects all of its sub-directories, for example:
[root@client ~]# mkdir -p a/b
[root@client ~]# setfattr -n ceph.dir.pin -v 1 a/
[root@client ~]# setfattr -n ceph.dir.pin -v 0 a/b
In this example, the directories a/ and a/b both start without an export pin. After the first setfattr command, both directories are pinned to rank 1. After the second setfattr command, a/b is pinned to rank 0, while a/ and the rest of its sub-directories remain pinned to rank 1.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph File System.
- Root-level access to the CephFS client.
- Installation of the attr package.
Procedure
Set the export pin on a directory:
Syntax
setfattr -n ceph.dir.pin -v RANK PATH_TO_DIRECTORY
Example
[root@client ~]# setfattr -n ceph.dir.pin -v 2 cephfs/home
Additional Resources
- See the Ephemeral pinning policies section in the Red Hat Ceph Storage File System Guide for details on automatically setting pins.
2.13. Decreasing the number of active Metadata Server daemons
Decrease the number of active Ceph File System (CephFS) Metadata Server (MDS) daemons.
Prerequisites
- The rank that you will remove must be active first, meaning that you must have the same number of MDS daemons as specified by the max_mds parameter.
- Root-level access to a Ceph Monitor node.
Procedure
Check that the number of active MDS daemons matches the value of the max_mds parameter:
Syntax
ceph fs status NAME
Example
[root@mon ~]# ceph fs status cephfs
cephfs - 0 clients
+------+--------+-------+---------------+-------+-------+--------+--------+
| RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
+------+--------+-------+---------------+-------+-------+--------+--------+
|  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
|  1   | active | node2 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
+------+--------+-------+---------------+-------+-------+--------+--------+
+-----------------+----------+-------+-------+
|       POOL      |   TYPE   |  USED | AVAIL |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  4638 | 26.7G |
|   cephfs_data   |   data   |    0  | 26.7G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|    node3    |
+-------------+
On a node with administration capabilities, change the max_mds parameter to the desired number of active MDS daemons:
Syntax
ceph fs set NAME max_mds NUMBER
Example
[root@mon ~]# ceph fs set cephfs max_mds 1
Wait for the storage cluster to stabilize to the new max_mds value by watching the Ceph File System status.
Verify the number of active MDS daemons:
Syntax
ceph fs status NAME
Example
[root@mon ~]# ceph fs status cephfs
cephfs - 0 clients
+------+--------+-------+---------------+-------+-------+--------+--------+
| RANK | STATE  |  MDS  |    ACTIVITY   |  DNS  |  INOS |  DIRS  |  CAPS  |
+------+--------+-------+---------------+-------+-------+--------+--------+
|  0   | active | node1 | Reqs:    0 /s |   10  |   12  |   12   |   0    |
+------+--------+-------+---------------+-------+-------+--------+--------+
+-----------------+----------+-------+-------+
|       POOL      |   TYPE   |  USED | AVAIL |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  4638 | 26.7G |
|   cephfs_data   |   data   |    0  | 26.7G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|    node3    |
|    node2    |
+-------------+
Additional Resources
- See the Metadata Server daemons states section in the Red Hat Ceph Storage File System Guide.
- See the Configuring multiple active Metadata Server daemons section in the Red Hat Ceph Storage File System Guide.
2.14. Additional Resources
- See the Red Hat Ceph Storage Installation Guide for details on installing a Red Hat Ceph Storage cluster.