Chapter 7. Known issues
This section documents known issues found in this release of Red Hat Ceph Storage.
7.1. cephadm utility
Get to know the known issues for the cephadm utility found in this release.
NFS daemon fails to start for NFSv3
The NFS daemon fails to start when the rpcbind and rpc.statd services are missing or not running. These services are required for NFSv3, and by default, cephadm creates the NFS service for both NFSv3 and NFSv4 protocols. When these services are unavailable, the NFS daemon does not come online and emits the Cannot register NFS V3 on TCP error.
As a workaround, install the rpcbind and rpc.statd packages and start the services. After these services are running, the NFS daemon starts successfully.
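The workaround can be sketched as follows on each host that runs the NFS daemon; the package and unit names (rpcbind, nfs-utils, rpc-statd) assume a RHEL host:

```shell
# Install the NFSv3 helper services (on RHEL, nfs-utils provides rpc.statd)
dnf install -y rpcbind nfs-utils

# Start both services now and enable them on boot
systemctl enable --now rpcbind rpc-statd

# Confirm both services are active before redeploying the NFS service
systemctl is-active rpcbind rpc-statd
```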
Grafana certificate does not migrate during upgrade
When you upgrade from Red Hat Ceph Storage 8.1 to 9.0, the existing user-signed Grafana certificate is not migrated. Instead, Grafana switches to a cephadm-signed certificate. As a result, duplicate certificate entries may appear, and certificate-related health warnings can persist. Manual reconfiguration is required if you want to use custom TLS certificates.
Data services remain unaffected.
To work without custom TLS certificates, you can continue using the cephadm-signed certificate.
As a workaround, to use custom TLS certificates, complete the following steps:
1. Change the Grafana specification to use certificate_source: reference.
2. Use certmgr to upload a valid user-signed certificate and key for each host.
3. Run the ceph orch reconfig grafana command.
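Step 1 can be sketched as a service specification; the file name is illustrative, and you apply it with ceph orch apply:

```yaml
# grafana.yaml - hypothetical Grafana specification for step 1;
# certificate_source: reference tells cephadm to use certificates
# uploaded through certmgr instead of generating its own
service_type: grafana
service_name: grafana
spec:
  certificate_source: reference
```

After applying the specification, upload the user-signed certificate and key for each host with certmgr, and then run the ceph orch reconfig grafana command.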
Management gateway does not open HTTPS port during deployment
When the management gateway (mgmt-gateway) is deployed with default settings and firewalld is active, the default HTTPS port (443) is not opened in firewalld. The gateway listens on port 443 and is reachable locally, but remote access to the dashboard fails until the firewall is manually adjusted.
As a workaround, use one of the following options:
- Explicitly configure a port for mgmt-gateway by using the --port option or by setting spec.port. This ensures that cephadm opens the correct port in firewalld.
- Manually open HTTPS (port 443) in firewalld and make the change permanent. For example:

firewall-cmd --permanent --add-service=https
firewall-cmd --reload

You can also open the port directly with firewall-cmd --permanent --add-port=443/tcp.
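The first option can be sketched as a service specification; the host name is a placeholder, and you apply the file with ceph orch apply:

```yaml
# mgmt-gateway.yaml - hypothetical specification that sets spec.port
# explicitly so that cephadm opens the matching port in firewalld
service_type: mgmt-gateway
placement:
  hosts:
    - host01
spec:
  port: 443
```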
Cephadm operations may fail when interactive shell aliases are present
In Red Hat Ceph Storage 7, cephadm uses the shell mv command on remote hosts. If the cephadm SSH user has interactive aliases such as mv='mv -i' (and similar for rm or cp), these aliases trigger prompts and block cephadm operations. As a result, commands like ceph orch upgrade, cephadm bootstrap, or adding hosts may hang or fail because mv waits for user confirmation instead of running non-interactively.
To avoid this issue, remove or disable the interactive aliases for mv, rm, and cp for the cephadm SSH user. For example, comment them out in .bashrc or define them only for interactive shells, and then rerun the cephadm operation.
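One way to keep the aliases for interactive use only is to guard them in the .bashrc file of the cephadm SSH user; this fragment is a sketch that assumes a bash login shell:

```shell
# ~/.bashrc: define the interactive aliases only when the shell is
# interactive ($- contains "i"), so non-interactive SSH sessions
# opened by cephadm never see them
case $- in
    *i*)
        alias mv='mv -i'
        alias cp='cp -i'
        alias rm='rm -i'
        ;;
esac
```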
Promtail image remains visible after migration to Alloy
During the transition from Promtail to Alloy, cephadm continues to register the Promtail container image to maintain backward compatibility and ensure a smooth migration path. As a result, Promtail still appears in the cephadm list-images output after upgrading, even though Alloy is the new default. The behavior is intentional to prevent breaking log collection on clusters that have not fully migrated.
No workaround is required. Ignore the Promtail image entry during the supported transition phase. If log collection has fully migrated to Alloy and is verified, you can optionally remove legacy Promtail daemons and images manually. This cleanup is not required for cluster operation.
Bugzilla:2418617
Ceph build
Get to know the known issues for Ceph build found in this release.
HAProxy deployment fails when QAT is enabled with ingress
Deploying HAProxy with the QAT feature enabled fails on Red Hat Ceph Storage 9.0 container images when using the ingress feature.
This occurs because HAProxy no longer supports ssl_engine in default builds. In addition, newer OpenSSL versions have removed the legacy engine used by QAT, making them incompatible. Attempts to use older OpenSSL versions or build a QAT provider for newer versions also lead to compatibility issues.
As a result, HAProxy cannot run with QAT enabled, and deployment fails.
There is no way to enable QAT with HAProxy. To continue using HAProxy without QAT, set the following in the ingress service specification:

haproxy_qat_support: false
ssl: true
QAT cannot be used for TLS offload or acceleration together with SSL enabled
Enabling QAT on HAProxy with SSL enabled injects legacy OpenSSL engine directives. The legacy engine path breaks the TLS handshake and emits a tlsv1 alert internal error. With the handshake broken, TLS termination fails.
As a workaround, disable QAT in HAProxy so that the TLS handshake succeeds.
Set the following in the ingress service specification:
- haproxy_qat_support: false
- ssl: true
As a result, QAT is disabled and the HAProxy TLS works as expected.
Under heavy connection rates, CPU usage may be higher than with QAT-offloaded handshakes.
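The two settings above belong in the ingress service specification. The following sketch shows where they fit; the service ID, host, ports, and virtual IP are placeholders:

```yaml
# ingress.yaml - hypothetical ingress specification with QAT disabled
# and SSL termination kept in HAProxy; apply with ceph orch apply -i
service_type: ingress
service_id: rgw.default
placement:
  hosts:
    - host01
spec:
  backend_service: rgw.default
  virtual_ip: 192.0.2.10/24
  frontend_port: 443
  monitor_port: 1967
  ssl: true
  haproxy_qat_support: false
```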
7.2. Ceph Dashboard
Get to know the known issues for Ceph Dashboard found in this release.
Active alert displays even when Prometheus module is active
In some cases, the Ceph Dashboard shows an active alert for CephMgrPrometheusModuleInactive even though the Prometheus module is enabled. This can happen due to a cluster misconfiguration that causes the Ceph target to go down, falsely triggering the alert.
The alert remains visible unless silenced, even when the Prometheus module is functioning correctly.
As a workaround, suppress the alert: in the Ceph Dashboard, select the CephMgrPrometheusModuleInactive alert and create a silence.
For more information, see Managing alerts on the Ceph dashboard.
Dashboard cannot delete non-default zone groups or zones
Users cannot delete non-default zone groups or zones from the Ceph Dashboard. Attempts to delete them fail.
As a workaround, delete non-default zone groups and zones through the command-line interface by using the appropriate radosgw-admin commands.
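A minimal command sequence for the workaround, with placeholder names in capitals, might look like this:

```shell
# Delete the non-default zone, then its zone group
radosgw-admin zone delete --rgw-zone=ZONE_NAME
radosgw-admin zonegroup delete --rgw-zonegroup=ZONEGROUP_NAME

# In a multi-site configuration, commit the change to the period
radosgw-admin period update --commit
```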
7.3. Ceph File System (CephFS)
Get to know the known issues for Ceph File System (CephFS) found in this release.
Subvolume operations delayed due to GIL contention during asynchronous cloning
When the asynchronous cloner in the volumes module (mgr/volumes) uses the CephFS Python binding, it invokes the Ceph client library API while holding the Python Global Interpreter Lock (GIL). During asynchronous clone operations, the GIL remains locked for an extended period, which prevents other CephFS subvolume operations such as create and delete from acquiring the GIL in time. As a result, customers may experience delayed responses when performing subvolume operations.
As a workaround, temporarily pause cloning to allow other subvolume operations to proceed.
This workaround is not practical in most production environments and should be used only in exceptional cases.
7.4. Ceph Object Gateway
Get to know the known issues for Ceph Object Gateway found in this release.
Lifecycle processing stuck in PROCESSING state for a given bucket
If a Ceph Object Gateway server is unexpectedly restarted while lifecycle processing is in progress for a given bucket, that bucket does not resume lifecycle processing for at least two scheduling cycles and remains in the PROCESSING state. This behavior is expected: it prevents multiple Ceph Object Gateway instances or threads from processing the same bucket simultaneously, which is especially important when debugging in production.
Currently there is no workaround.
Ceph Object Gateway services down after upgrade
After upgrading, Ceph Object Gateway services may fail to start. The service fails to start because the rgw service now enforces the rgw_realm configuration but no realm exists in the Ceph Object Gateway configuration. As a result, the following symptoms occur:
- The Ceph Object Gateway logs show the following error: rgw main: failed to load zone: (2) No such file or directory
- The ceph orch ps | grep rgw output displays the Ceph Object Gateway in an error state.
- Ceph Object Gateways are missing from the ceph versions output.
As a workaround, remove the rgw_realm entry and restart all Ceph Object Gateway services.
Verify whether the Ceph Object Gateways are configured with no realm while the Ceph configuration database specifies a realm.
1. Check the Ceph Object Gateway realm list:

radosgw-admin realm list

The following is an example with an empty realm list:

[ceph: root@host01 /]# radosgw-admin realm list
{
    "default_info": "",
    "realms": []
}

2. Check the Ceph configuration database:

ceph config dump | egrep "^WHO|rgw_realm"

Example:

[ceph: root@host01 /]# ceph config dump | egrep "^WHO|rgw_realm"
WHO          MASK  LEVEL     OPTION     VALUE
xxxxx.yyyyy        advanced  rgw_realm  default

If step 1 shows an empty realm list and step 2 shows an rgw_realm entry, continue to step 3. If the two outputs do not match this pattern, contact Support.
3. Remove the rgw_realm entry from the Ceph configuration database:

ceph config rm xxxxx.yyyyy rgw_realm

4. Restart all Ceph Object Gateway services:

ceph orch restart rgw
7.5. Ceph Object Gateway multi-site
Get to know the known issues for Ceph Object Gateway multi-site found in this release.
Sync failure occurs after renaming a zone or zone group
Renaming a zone or zone group in the Primary zone in the master_zonegroup can cause sync failures. When sync failures occur, the following sync status error may be emitted and further sync operations are affected:
failed to retrieve sync info: (2200) Unknown error 2200
As a workaround, before renaming a zone or zone group in the master_zonegroup, remove the old zone or zone group name from the Ceph configuration file. For more information, see Renaming a zone group and Removing a zone from a zone group.
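For reference, the rename itself can be sketched as follows, with placeholder names in capitals; run it only after removing the old name from the Ceph configuration file as described above:

```shell
# Rename the zone group and zone
radosgw-admin zonegroup rename --rgw-zonegroup=OLD_NAME --zonegroup-new-name=NEW_NAME
radosgw-admin zone rename --rgw-zone=OLD_NAME --zone-new-name=NEW_NAME

# Commit the updated period so the change propagates to other zones
radosgw-admin period update --commit
```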
Secondary site continues to display old zone group name after rename
In some cases, when a zone group is renamed on the Primary site, the Secondary site may still display the old zone group name. This occurs because the old name is not removed from the .rgw.root pool after the rename operation.
As a result, both the old and new zone groups appear under the radosgw-admin zonegroup list command, and sync operations may be impacted.
As a workaround, complete the following steps:
1. Verify that the new zone group name exists:

radosgw-admin zonegroup list

2. List the .rgw.root pool and locate the old zone group name:

rados -p .rgw.root ls

The old name appears in the format zonegroups_names.OLD_ZONEGROUP_NAME.
3. Remove the old zone group name from the pool:

rados -p .rgw.root rm zonegroups_names.OLD_ZONEGROUP_NAME
Removing the old zone group name restores normal sync operations.
Multi-site lifecycle expiration does not clean OLH entries in versioned buckets
Multi-site lifecycle expiration may fail to remove object log header (OLH) entries in versioned buckets. The system leaves stale data in the bucket index. This issue occurs when lifecycle expiration runs on multi-site deployments for versioned buckets. As a result, stale OLH entries remain after object deletion. This causes bucket index bloat and may impact bucket operations for customers.
As a workaround, administrators can manually detect and repair the affected buckets.
1. Detect stale entries:

radosgw-admin bucket check olh --dump-keys --bucket=BUCKET_NAME --hide-progress

2. Repair the bucket index:

radosgw-admin bucket check olh --fix --bucket=BUCKET_NAME
After repair, stale entries are purged.
7.6. Ceph Block Device (RBD)
Get to know the known issues for Ceph Block Device found in this release.
Kernel client does not support pg-upmap-primary
The kernel client currently does not support the pg-upmap-primary feature. As a result, users may encounter issues when attempting to mount images or filesystems using the kernel client in environments where pg-upmap-primary is configured.
If issues occur during mounting with the kernel client, verify that they are caused by this missing feature support.
1. Confirm that your cluster contains pg-upmap-primary mappings:

ceph osd dump | grep "pg_upmap_primary"

2. Check the kernel log for the following error message:

$ dmesg | tail
[73393.901029] libceph: mon2 (1)10.64.24.186:6789 feature set mismatch, my 2f018fb87aa4aafe < server's 2f018fb8faa4aafe, missing 80000000
[73393.901037] libceph: mon2 (1)10.64.24.186:6789 missing required protocol features

These errors confirm that the cluster is using features that the kernel client does not currently support.
- If this error message is not emitted, contact Support.
- If this error message is emitted, continue by removing the related mappings.
As a workaround, remove the related pg-upmap-primary mappings:
1. If you use the balancer module, change the mode back to one that does not use pg-upmap-primary. This prevents additional mappings from being created:

ceph balancer mode upmap

2. Remove all pg-upmap-primary mappings:

ceph osd rm-pg-upmap-primary-all
7.7. RADOS
Get to know the known issues for RADOS found in this release.
Placement groups are not scaled down in upmap-read and read balancer modes
Currently, pg_upmap_primary entries are not properly removed for placement groups (PGs) that are pending merge, for example, when the bulk flag is removed on a pool or in any other case where the number of PGs in a pool decreases. As a result, the PG scale-down process gets stuck and the number of PGs in the affected pool does not decrease as expected.
As a workaround, remove the pg_upmap_primary entries in the OSD map of the affected pool. To view the entries, run the ceph osd dump command, and then run ceph osd rm-pg-upmap-primary PG_ID for each PG in the affected pool.
After using the workaround, the PG scale-down process resumes as expected.
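The workaround can be sketched as a small loop; the pool ID is a placeholder, and the sketch assumes that pg_upmap_primary lines in the ceph osd dump output carry the PG ID in the second field:

```shell
# Remove every pg_upmap_primary entry that belongs to pool 12 (placeholder ID)
POOL_ID=12
for pgid in $(ceph osd dump | awk '$1 == "pg_upmap_primary" {print $2}'); do
    case "$pgid" in
        "$POOL_ID".*) ceph osd rm-pg-upmap-primary "$pgid" ;;
    esac
done
```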