Chapter 3. New features
This section lists all major updates, enhancements, and new features introduced in this release of Red Hat Ceph Storage.
3.1. The Cephadm utility
Users can now configure various NFS options in idmap.conf
With this enhancement, users can configure NFS options, such as "Domain", "Nobody-User", and "Nobody-Group", in idmap.conf.
Client IP restriction is now possible over the new haproxy protocol mode for NFS
Previously, client IP restriction did not work in setups using haproxy over NFS.
With this enhancement, Cephadm-deployed NFS supports the haproxy protocol. If users add enable_haproxy_protocol: True to both their ingress and haproxy specification or pass --ingress-mode haproxy-protocol to the ceph nfs cluster create command, the NFS daemon will make use of the haproxy protocol.
Users must now enter a username and password to access the Grafana API URL
Previously, anyone who could connect to the Grafana API URL would have access to it without needing any credentials.
With this enhancement, Cephadm deployed Grafana is set up with a username and password for users to access the Grafana API URL.
Ingress service with NFS backend can now be set up to use only keepalived to create a virtual IP (VIP) for the NFS daemon to bind to, without the HAProxy layer involved
With this enhancement, an ingress service with an NFS backend can be set up to use only keepalived to create a virtual IP for the NFS daemon to bind to, without the HAProxy layer involved. This is useful when the NFS daemon is moved around and clients should not need to use a different IP to connect to it.
Cephadm deploys keepalived to set up a VIP and then has the NFS daemon bind to that VIP. This can also be set up using the NFS module via the ceph nfs cluster create command, using the flags --ingress --ingress-mode keepalive-only --virtual-ip <VIP>.
The ingress specification file looks as follows and includes the keepalive_only: true setting:
service_type: ingress
service_id: nfs.nfsganesha
service_name: ingress.nfs.nfsganesha
placement:
  count: 1
  label: foo
spec:
  backend_service: nfs.nfsganesha
  frontend_port: 12049
  monitor_port: 9049
  virtual_ip: 10.8.128.234/24
  virtual_interface_networks: 10.8.128.0/24
  keepalive_only: true
The NFS specification looks as follows and includes the virtual_ip field, which should match the VIP in the ingress specification:
networks:
  - 10.8.128.0/21
service_type: nfs
service_id: nfsganesha
placement:
  count: 1
  label: foo
spec:
  virtual_ip: 10.8.128.234
  port: 2049
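The requirement that the NFS virtual_ip matches the ingress VIP can be checked before both specifications are applied. The following is a minimal Python sketch and not part of Cephadm: the helper name vip_matches and the dictionary layout are assumptions made for illustration, and the values are taken from the two specifications above.

```python
import ipaddress


def vip_matches(ingress_spec: dict, nfs_spec: dict) -> bool:
    """Return True when the NFS virtual_ip equals the ingress VIP address.

    Hypothetical helper for illustration only; it is not a Cephadm API.
    """
    # The ingress VIP carries a prefix length (for example 10.8.128.234/24),
    # so ip_interface() is used to separate the address from the prefix.
    ingress_vip = ipaddress.ip_interface(ingress_spec["spec"]["virtual_ip"]).ip
    # The NFS virtual_ip is a bare address without a prefix length.
    nfs_vip = ipaddress.ip_address(nfs_spec["spec"]["virtual_ip"])
    return ingress_vip == nfs_vip


# Only the fields relevant to the check are reproduced here.
ingress = {
    "service_type": "ingress",
    "spec": {"virtual_ip": "10.8.128.234/24", "keepalive_only": True},
}
nfs = {
    "service_type": "nfs",
    "spec": {"virtual_ip": "10.8.128.234", "port": 2049},
}

print(vip_matches(ingress, nfs))
```

A mismatch, such as an NFS virtual_ip of 10.8.128.235, makes the helper return False.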
The HAProxy daemon binds to its front-end port only on the VIP created by the accompanying keepalived
With this enhancement, the Cephadm-deployed HAProxy daemon binds to its front-end port only on the VIP created by the accompanying keepalived, rather than on 0.0.0.0. This allows other services, such as an NFS daemon, to potentially bind to port 2049 on other IPs on the same node.
HAProxy health check interval for ingress service is now customizable
Previously, in some cases, the two-second default health check interval was too frequent and caused unnecessary traffic.
With this enhancement, the HAProxy health check interval for the ingress service is customizable. By applying an ingress specification that includes the health_check_interval field, the HAProxy configuration generated by Cephadm for each HAProxy daemon for the service will include that value for the health check interval.
Ingress specification file:
service_type: ingress
service_id: rgw.my-rgw
placement:
  hosts: ['ceph-mobisht-7-1-07lum9-node2', 'ceph-mobisht-7-1-07lum9-node3']
spec:
  backend_service: rgw.my-rgw
  virtual_ip: 10.0.208.0/22
  frontend_port: 8000
  monitor_port: 1967
  health_check_interval: 3m
Valid units for the interval are:
- us: microseconds
- ms: milliseconds
- s: seconds
- m: minutes
- h: hours
- d: days
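To show how these unit suffixes map to durations, the following Python sketch converts a health_check_interval value into seconds. It is illustrative only and is not how Cephadm or HAProxy parses the field: the function name interval_to_seconds is hypothetical, and rejecting values without a unit suffix is an assumption of this sketch.

```python
import re

# Seconds per unit, following the list of valid units above.
UNIT_SECONDS = {
    "us": 1e-6,
    "ms": 1e-3,
    "s": 1.0,
    "m": 60.0,
    "h": 3600.0,
    "d": 86400.0,
}

# A whole number followed by exactly one of the valid unit suffixes.
_INTERVAL_RE = re.compile(r"^(\d+)(us|ms|s|m|h|d)$")


def interval_to_seconds(value: str) -> float:
    """Convert a value such as '3m' into seconds (hypothetical helper)."""
    match = _INTERVAL_RE.match(value)
    if match is None:
        # Assumption: this sketch requires an explicit unit suffix.
        raise ValueError(f"invalid health check interval: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_SECONDS[unit]


print(interval_to_seconds("3m"))
```

For the 3m value in the ingress specification above, the sketch yields 180 seconds.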
Grafana now binds to an IP within a specific network on a host, rather than always binding to 0.0.0.0
With this enhancement, when a Grafana specification file includes both a networks section listing the network on which Grafana should bind to an IP, and only_bind_port_on_networks: true in the "spec" section, Cephadm configures the Grafana daemon to bind to an IP within that network rather than 0.0.0.0. This enables users to use the same port that Grafana uses for another service, but on a different IP on the host. If the specification update does not cause the daemons to be moved, ceph orch redeploy grafana can be run to pick up the changes to the settings.
Grafana specification file:
service_type: grafana
service_name: grafana
placement:
  count: 1
networks: 192.168.122.0/24
spec:
  anonymous_access: true
  protocol: https
  only_bind_port_on_networks: true
All bootstrap CLI parameters are now made available for usage in the cephadm-ansible module
Previously, only a subset of the bootstrap CLI parameters was available, which limited the module usage.
With this enhancement, all bootstrap CLI parameters are made available for usage in the cephadm-ansible module.
Prometheus scrape configuration is added to the nfs-ganesha exporter
With this enhancement, the Prometheus scrape configuration is added to the nfs-ganesha exporter. This allows the metrics exposed by the nfs-ganesha Prometheus exporter to be scraped into the Prometheus instance running in Ceph, where they can be further consumed by Grafana dashboards.
Prometheus now binds to an IP within a specific network on a host, rather than always binding to 0.0.0.0
With this enhancement, when a Prometheus specification file includes both a networks section listing the network on which Prometheus should bind to an IP, and only_bind_port_on_networks: true in the "spec" section, Cephadm configures the Prometheus daemon to bind to an IP within that network rather than 0.0.0.0. This enables users to use the same port that Prometheus uses for another service, but on a different IP on the host. If the specification update does not cause the daemons to be moved, ceph orch redeploy prometheus can be run to pick up the changes to the settings.
Prometheus specification file:
service_type: prometheus
service_name: prometheus
placement:
  count: 1
networks:
  - 10.0.208.0/22
spec:
  only_bind_port_on_networks: true
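The idea behind only_bind_port_on_networks, binding to a host address that falls inside one of the listed networks instead of 0.0.0.0, can be sketched in Python as follows. This is a conceptual illustration, not Cephadm's implementation: the function name pick_bind_address and the example host addresses are hypothetical, and returning None when no address matches is an assumption of this sketch.

```python
import ipaddress
from typing import Iterable, Optional


def pick_bind_address(host_ips: Iterable[str], networks: Iterable[str]) -> Optional[str]:
    """Return the first host IP that lies inside one of the given networks.

    Hypothetical helper; what Cephadm does when nothing matches is not
    covered by this sketch, so None is returned in that case.
    """
    nets = [ipaddress.ip_network(net) for net in networks]
    for ip in host_ips:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in nets):
            return ip
    return None


# Example host with two addresses; only one is inside 10.0.208.0/22.
host_ips = ["192.168.122.15", "10.0.208.17"]
print(pick_bind_address(host_ips, ["10.0.208.0/22"]))
```

With the networks value from the Prometheus specification above, the sketch selects 10.0.208.17, which leaves the same port free on 192.168.122.15 for another service.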
Users can now mount snapshots (exports within .snap directory)
With this enhancement, users can mount snapshots (exports within the .snap directory) to view them in read-only (RO) mode. NFS exports created with the NFS MGR module now include the cmount_path setting (this cannot be configured and should be left as "/"), which allows snapshots to be mounted.
Zonegroup hostnames can now be set using the specification file provided in the ceph rgw realm bootstrap… command
With this release, in continuation of the automation of the Ceph Object Gateway multi-site setup, users can now set zonegroup hostnames through the initial specification file passed in the bootstrap command ceph rgw realm bootstrap… instead of requiring additional steps.
For example:
zonegroup_hostnames:
  - host1
  - host2
If users add the above section to the "specification" section of the Ceph Object gateway specification file passed in the realm bootstrap command, Cephadm will automatically add those hostnames to the zonegroup defined in the specification after the Ceph Object gateway module finishes creation of the realm/zonegroup/zone. Note that this may take a few minutes to occur depending on what other activity the Cephadm module is currently completing.
3.2. Ceph Dashboard
CephFS snapshot schedules management on the Ceph dashboard
Previously, CephFS snapshot schedules could only be managed through the command-line interface.
With this enhancement, CephFS snapshot schedules can be listed, created, edited, activated, deactivated, and removed from the Ceph dashboard.
Ceph dashboard now supports NFSv3-based exports
With this enhancement, support is enabled for NFSv3-based export management in the Ceph dashboard.
Ability to manage Ceph users for CephFS is added
With this enhancement, the ability to manage the Ceph users for CephFS is added. This provides the ability to manage the users' permissions for volumes, subvolume groups, and subvolumes from the File System view.
A new API endpoint for multi-site sync status is added
Previously, multi-site sync status was available only via the CLI command.
With this enhancement, multi-site sync status is available through an API in the Ceph dashboard. The new API endpoint for multi-site sync status is api/rgw/multisite/sync_status.
Improved monitoring of NVMe-oF gateway
With this enhancement, to improve monitoring of the NVMe-oF gateway, alerts for the NVMe-oF gateway are added based on the metrics it emits, and metrics from the embedded Prometheus exporter in the NVMe-oF gateway are scraped.
CephFS clone management in Ceph dashboard
With this enhancement, CephFS clone management functionality is provided in the Ceph dashboard. Users can create and delete subvolume clones through the Ceph dashboard.
CephFS snapshot management in Ceph dashboard
With this enhancement, CephFS snapshot management functionality is provided in the Ceph dashboard. Users can create and delete subvolume snapshots through the Ceph dashboard.
Labeled Performance Counters per user/bucket
With this enhancement, users can not only obtain information on the operations happening per Ceph Object Gateway node, but can also view the Ceph Object Gateway performance counters per-user and per-bucket in the Ceph dashboard.
Labeled Sync Performance Counters into Prometheus
With this enhancement, users can gather real-time information from Prometheus about the replication health between zones for increased observability of the Ceph Object Gateway multi-site sync operations.
Add and edit bucket in Ceph dashboard
With this enhancement, as part of the Ceph Object Gateway improvements to the Ceph dashboard, the capability to apply, list, and edit buckets from the Ceph dashboard is added.
- ACL (Public, Private)
- Tags (adding/removing)
Add, List, Delete, and Apply bucket policies in Ceph dashboard
With this enhancement, as part of the Ceph Object Gateway improvements to the Ceph dashboard, the capability to add, list, delete, and apply bucket policies from the Ceph dashboard is added.
3.3. Ceph File System
MDS dynamic metadata balancer is off by default
Previously, poor balancer behavior would fragment trees in undesirable ways by increasing the max_mds file system setting.
With this enhancement, the MDS dynamic metadata balancer is off by default. Operators must turn on the balancer explicitly to use it.
CephFS supports quiescing of subvolumes or directory trees
Previously, multiple clients would interleave reads and writes across a consistent snapshot barrier where out-of-band communication existed between clients. This communication led to clients wrongly believing they have reached a checkpoint that is mutually recoverable via a snapshot.
With this enhancement, CephFS supports quiescing of subvolumes or directory trees to enable the execution of crash-consistent snapshots. Clients are now forced to quiesce all I/O before the MDS executes the snapshot. This enforces a checkpoint across all clients of the subtree.
MDS Resident Set Size (RSS) performance counter is tracked with a higher priority
With this enhancement, the MDS Resident Set Size performance counter is tracked with a higher priority to allow callers to consume its value to generate useful warnings. This allows Rook to identify the MDS RSS and act accordingly.
Laggy clients are now evicted only if there are no laggy OSDs
Previously, monitoring performance dumps from the MDS would sometimes show that the OSDs were laggy (objecter.op_laggy and objecter.osd_laggy), causing laggy clients (dirty data could not be flushed for cap revokes).
With this enhancement, if the defer_client_eviction_on_laggy_osds option is set to true and a client gets laggy because of a laggy OSD, then client eviction will not take place until the OSDs are no longer laggy.
cephfs-mirror daemon exports snapshot synchronization performance counters via perf dump command
The ceph-mds daemon exports per-client performance counters, which are included in the already existing perf dump command.
A new dump dir command is introduced to dump the directory information
With this enhancement, the dump dir command is introduced to dump the directory information and print the output.
Snapshot scheduling support for subvolumes
With this enhancement, snapshot scheduling support is provided for subvolumes. All snapshot scheduling commands accept --subvol and --group arguments to refer to appropriate subvolumes and subvolume groups. If a subvolume is specified without a subvolume group argument, then the default subvolume group is considered. Also, a valid path need not be specified when referring to subvolumes; a placeholder string is sufficient due to the nature of the argument parsing employed.
Example
# ceph fs snap-schedule add - 15m --subvol sv1 --group g1
# ceph fs snap-schedule status - --subvol sv1 --group g1
Ceph commands that add or modify MDS caps give an explanation about why the MDS caps passed by user were rejected
Previously, Ceph commands that add or modify MDS caps printed "Error EINVAL: mds capability parse failed, stopped at 'allow w' of 'allow w'".
With this enhancement, the commands give an explanation about why the MDS caps passed by user were rejected and print "Error EINVAL: Permission flags in MDS caps must start with 'r' or 'rw' or be '*' or 'all'".
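The rule stated in the new error message can be illustrated with a short Python sketch. It only mirrors the wording of the message and is not Ceph's actual capability parser, which handles more than this check. The function name check_mds_perm_flags is hypothetical.

```python
def check_mds_perm_flags(flags: str) -> None:
    """Raise ValueError if the permission flags break the stated rule.

    Hypothetical helper that mirrors the error message only; it does not
    validate any flags that may follow the leading 'r' or 'rw'.
    """
    if flags in ("*", "all"):
        return
    # Starting with 'r' covers both the 'r' and the 'rw' cases.
    if flags.startswith("r"):
        return
    raise ValueError(
        "Permission flags in MDS caps must start with 'r' or 'rw' "
        "or be '*' or 'all'"
    )


# 'w' alone, as in the rejected caps 'allow w', fails the check.
for candidate in ("r", "rw", "*", "all", "w"):
    try:
        check_mds_perm_flags(candidate)
        print(candidate, "accepted")
    except ValueError as err:
        print(candidate, "rejected:", err)
```

In this sketch, only the last candidate, w, is rejected.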
3.4. Ceph Object Gateway
Admin interface is now added to manage bucket notification
Previously, the S3 REST APIs were used to manage bucket notifications. However, if an admin wanted to override them, there was no easy way to do that over the radosgw-admin tool.
With this enhancement, an admin interface with the following commands is added to manage bucket notifications:
radosgw-admin notification get --bucket <bucket name> --notification-id <notification id>
radosgw-admin notification list --bucket <bucket name>
radosgw-admin notification rm --bucket <bucket name> [--notification-id <notification id>]
RGW labeled user and bucket operation counters are now in different sections when the ceph counter dump command is run
Previously, all RGW labeled operation counters were in the rgw_op section of the output of the ceph counter dump command but would either have a user label or a bucket label.
With this enhancement, RGW labeled user and bucket operation counters are in the rgw_op_per_user or rgw_op_per_bucket sections respectively when the ceph counter dump command is executed.
Users can now place temporary files into a directory using the -t command-line option
Previously, the /usr/bin/rgw-restore-bucket-index tool just used /tmp and that directory sometimes did not have enough free space to hold all the temporary files.
With this enhancement, the user can specify a directory into which the temporary files can be placed using the -t command-line option and will be notified if they run out of space, thereby knowing what adjustments to make to re-run the tool. Also, users can periodically check if the tool’s temporary files have exhausted the available space on the file system where the temporary files are present.
Copying of encrypted objects using copy-object APIs is now supported
Previously, in Ceph Object Gateway, copying of encrypted objects using copy-object APIs was unsupported since the inception of its server-side encryption support.
With this enhancement, copying of encrypted objects using copy-object APIs is supported and workloads that rely on copy-object operations can also use server-side encryption.
A new Ceph Object Gateway admin-ops capability is added to allow reading user metadata but not their associated authorization keys
With this enhancement, a new Ceph Object Gateway admin-ops capability is added to allow reading Ceph Object Gateway user metadata but not the associated authorization keys. This reduces the privileges of automation and reporting tools and prevents them from impersonating users or viewing their keys.
Cloud Transition: add new supported S3-compatible platforms
With this release, to be able to move object storage to the cloud or other on-premise S3 endpoints, the current lifecycle transition and storage class model is extended. S3-compatible platforms, such as IBM Cloud Object Store (COS) and IBM Storage Ceph are now supported for the cloud archival feature.
NFS with RGW backend
With this release, NFS with Ceph Object Gateway backend is made generally available (GA) again with the existing functionalities.
3.5. Multi-site Ceph Object Gateway
A retry mechanism is introduced in the radosgw-admin sync status command
Previously, when the multi-site sync sent requests to a remote zone, it used a round-robin strategy to choose one of its zone endpoints. If that endpoint was not available, the HTTP client logic used by the radosgw-admin sync status command did not provide a retry mechanism and therefore reported an input/output error.
With this enhancement, a retry mechanism is introduced in the sync status command so that, if the chosen endpoint is unavailable, a different endpoint is selected to serve the request.
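The behavior described above, falling back to another zone endpoint when the chosen one is unavailable, can be sketched generically in Python. This is not the radosgw-admin implementation: the function request_with_failover, the fake_send stand-in, and the endpoint URLs are all hypothetical and exist only to illustrate the idea.

```python
from typing import Callable, Optional, Sequence


def request_with_failover(
    endpoints: Sequence[str],
    send: Callable[[str], str],
    start: int = 0,
) -> str:
    """Try each endpoint once, beginning at index `start`.

    Hypothetical helper: `send` performs the request and is expected to
    raise ConnectionError when an endpoint is unavailable.
    """
    last_error: Optional[Exception] = None
    for offset in range(len(endpoints)):
        endpoint = endpoints[(start + offset) % len(endpoints)]
        try:
            return send(endpoint)
        except ConnectionError as err:
            # Remember the failure and move on to the next endpoint.
            last_error = err
    raise ConnectionError("no zone endpoint was reachable") from last_error


def fake_send(endpoint: str) -> str:
    """Stand-in for an HTTP request; the first endpoint is treated as down."""
    if endpoint == "http://rgw-a.example:8080":
        raise ConnectionError("endpoint down")
    return f"sync status from {endpoint}"


zone_endpoints = ["http://rgw-a.example:8080", "http://rgw-b.example:8080"]
print(request_with_failover(zone_endpoints, fake_send))
```

Without the loop, the first failure would surface directly to the caller, which corresponds to the input/output error reported before this enhancement.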
NewerNoncurrentVersions, ObjectSizeGreaterThan, and ObjectSizeLessThan filters are added to the lifecycle
With this enhancement, support for the NewerNoncurrentVersions, ObjectSizeGreaterThan, and ObjectSizeLessThan filters is added to the lifecycle.
User S3 replication APIs are now supported
With this enhancement, user S3 replication APIs are now supported. With these APIs, users can set replication policies at bucket-level. The API is extended to include additional parameters to specify source and destination zone names.
Bucket Granular Sync Replication GA (Part 3)
With this release, the ability to replicate a bucket or a group of buckets to a different Red Hat Ceph Storage cluster is added with bucket granular support. The usability requirements are the same as for Ceph Object Gateway multi-site.
Red Hat Ceph Storage now supports object storage archive zones
Object storage archive zones were previously available as limited release. This enhancement provides full availability for new and existing customers in production environments. The archive zone receives all objects from the production zones and keeps every version for every object, providing the user with an object catalogue that contains the full history of the object. This provides a secured object storage deployment that guarantees data retrieval even if the object/buckets in the production zones have been lost or compromised.
For more information, see Configuring the archive zone in the Object Gateway Guide.
3.6. RADOS
Setting the noautoscale flag on/off retains each pool’s original autoscale mode configuration
Previously, the pg_autoscaler did not persist each pool’s autoscale mode configuration when the noautoscale flag was set. Due to this, whenever the noautoscale flag was set, the autoscale mode had to be set for each pool repeatedly.
With this enhancement, the pg_autoscaler module persists the individual pool configuration for the autoscaler mode after the noautoscale flag is set. Setting the noautoscale flag on/off still retains each pool’s original autoscale mode configuration.
reset_purged_snaps_last OSD command is introduced
With this enhancement, the reset_purged_snaps_last OSD command is introduced to resolve cases in which the purged_snaps keys (PSN) are missing in the OSD and exist in the monitor. The purged_snaps_last value is zeroed and, as a result, the monitor shares all its purged_snaps information with the OSD on the next boot.
BlueStore’s RocksDB compression enabled
With this enhancement, to ensure that the metadata (especially OMAP) takes less space, RocksDB configuration is modified to enable internal compression of its data.
As a result:
- database size is smaller
- write amplification during compaction is smaller
- average I/O is higher
- CPU usage is higher
OSD is now more resilient to fatal corruption
Previously, special OSD layer object "superblock" would be overwritten due to being located at the beginning of the disk, resulting in a fatal corruption.
With this enhancement, the OSD "superblock" is redundant: it migrates on disk and a copy of it is stored in the database. As a result, the OSD is now more resilient to fatal corruption.
3.7. RADOS Block Devices (RBD)
Improved rbd_diff_iterate2() API performance
Previously, RBD diff-iterate was not guaranteed to execute locally if exclusive lock was available when diffing against the beginning of time (fromsnapname == NULL) in fast-diff mode (whole_object == true with fast-diff image feature enabled and valid).
With this enhancement, rbd_diff_iterate2() API performance is improved, thereby increasing the performance for QEMU live disk synchronization and backup use cases, where the fast-diff image feature is enabled.