Chapter 7. Known issues
This section documents known issues found in this release of Red Hat Ceph Storage.
7.1. cephadm utility
Get to know the known issues for the cephadm utility found in this release.
NFS daemon fails to start for NFSv3
The NFS daemon fails to start when the rpcbind and rpc.statd services are missing or not running. These services are required for NFSv3, and by default, cephadm creates the NFS service for both NFSv3 and NFSv4 protocols. When these services are unavailable, the NFS daemon does not come online and emits the Cannot register NFS V3 on TCP error.
As a workaround, install the rpcbind and rpc.statd packages and start the services. After these services are running, the NFS daemon starts successfully.
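The workaround can be sketched as follows on each host that runs the NFS daemon; the package and unit names (rpcbind, nfs-utils, rpc-statd) assume a RHEL host:

```shell
# Install the NFSv3 helper services (on RHEL, nfs-utils provides rpc.statd)
dnf install -y rpcbind nfs-utils

# Start both services now and enable them on boot
systemctl enable --now rpcbind rpc-statd

# Confirm both services are active before redeploying the NFS service
systemctl is-active rpcbind rpc-statd
```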
Grafana certificate does not migrate during upgrade
When you upgrade from Red Hat Ceph Storage 8.1 to 9.0, the existing user-signed Grafana certificate is not migrated. Instead, Grafana switches to a cephadm-signed certificate. As a result, duplicate certificate entries may appear, and certificate-related health warnings can persist. Manual reconfiguration is required if you want to use custom TLS certificates.
Data services remain unaffected.
To work without custom TLS certificates, you can continue using the cephadm-signed certificate.
As a workaround, to use custom TLS certificates, complete the following steps:
1. Change the Grafana specification to use certificate_source: reference.
2. Use certmgr to upload a valid user-signed certificate and key for each host.
3. Run the ceph orch reconfig grafana command.
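Step 1 can be sketched as a service specification; the file name is illustrative, and you apply it with ceph orch apply:

```yaml
# grafana.yaml - hypothetical Grafana specification for step 1;
# certificate_source: reference tells cephadm to use certificates
# uploaded through certmgr instead of generating its own
service_type: grafana
service_name: grafana
spec:
  certificate_source: reference
```

After applying the specification, upload the user-signed certificate and key for each host with certmgr, and then run the ceph orch reconfig grafana command.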
Management gateway does not open HTTPS port during deployment
When the management gateway (mgmt-gateway) is deployed with default settings and firewalld is active, the default HTTPS port (443) is not opened in firewalld. The gateway listens on port 443 and is reachable locally, but remote access to the dashboard fails until the firewall is manually adjusted.
As a workaround, use one of the following options:
- Explicitly configure a port for mgmt-gateway by using the --port option or by setting spec.port. This ensures that cephadm opens the correct port in firewalld.
- Manually open HTTPS (port 443) in firewalld and make the change permanent. For example:

firewall-cmd --permanent --add-service=https
firewall-cmd --reload

You can also open the port directly with firewall-cmd --permanent --add-port=443/tcp.
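The first option can be sketched as a service specification; the host name is a placeholder, and you apply the file with ceph orch apply:

```yaml
# mgmt-gateway.yaml - hypothetical specification that sets spec.port
# explicitly so that cephadm opens the matching port in firewalld
service_type: mgmt-gateway
placement:
  hosts:
    - host01
spec:
  port: 443
```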
Cephadm operations may fail when interactive shell aliases are present
In Red Hat Ceph Storage 7, cephadm uses the shell mv command on remote hosts. If the cephadm SSH user has interactive aliases such as mv='mv -i' (and similar for rm or cp), these aliases trigger prompts and block cephadm operations. As a result, commands like ceph orch upgrade, cephadm bootstrap, or adding hosts may hang or fail because mv waits for user confirmation instead of running non-interactively.
To avoid this issue, remove or disable the interactive aliases for mv, rm, and cp for the cephadm SSH user. For example, comment them out in .bashrc or define them only for interactive shells, and then rerun the cephadm operation.
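One way to keep the aliases for interactive use only is to guard them in the .bashrc file of the cephadm SSH user; this fragment is a sketch that assumes a bash login shell:

```shell
# ~/.bashrc: define the interactive aliases only when the shell is
# interactive ($- contains "i"), so non-interactive SSH sessions
# opened by cephadm never see them
case $- in
    *i*)
        alias mv='mv -i'
        alias cp='cp -i'
        alias rm='rm -i'
        ;;
esac
```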
Promtail image remains visible after migration to Alloy
During the transition from Promtail to Alloy, cephadm continues to register the Promtail container image to maintain backward compatibility and ensure a smooth migration path. As a result, Promtail still appears in the cephadm list-images output after upgrading, even though Alloy is the new default. The behavior is intentional to prevent breaking log collection on clusters that have not fully migrated.
No workaround is required. Ignore the Promtail image entry during the supported transition phase. If log collection has fully migrated to Alloy and is verified, you can optionally remove legacy Promtail daemons and images manually. This cleanup is not required for cluster operation.
Bugzilla:2418617
Ceph build
Get to know the known issues for Ceph build found in this release.
HAProxy deployment fails when QAT is enabled with ingress
Deploying HAProxy with the QAT feature enabled fails on Red Hat Ceph Storage 9.0 container images when using the ingress feature.
This occurs because HAProxy no longer supports ssl_engine in default builds. In addition, newer OpenSSL versions have removed the legacy engine used by QAT, making them incompatible. Attempts to use older OpenSSL versions or build a QAT provider for newer versions also lead to compatibility issues.
As a result, HAProxy cannot run with QAT enabled, and deployment fails.
There is no way to enable QAT with HAProxy. To continue using HAProxy without QAT, set the following in the ingress service specification:

haproxy_qat_support: false
ssl: true
QAT cannot be used for TLS offload or acceleration together with SSL enabled
Enabling QAT on HAProxy with SSL enabled injects legacy OpenSSL engine directives. The legacy engine path breaks the TLS handshake and emits a tlsv1 alert internal error. With the handshake broken, TLS termination fails.
As a workaround, disable QAT in HAProxy so that the TLS handshake succeeds.
Set the following in the ingress service specification:
- haproxy_qat_support: false
- ssl: true
As a result, QAT is disabled and the HAProxy TLS works as expected.
Under heavy connection rates, CPU usage may be higher than with QAT-offloaded handshakes.
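The two settings above belong in the ingress service specification. The following sketch shows where they fit; the service ID, host, ports, and virtual IP are placeholders:

```yaml
# ingress.yaml - hypothetical ingress specification with QAT disabled
# and SSL termination kept in HAProxy; apply with ceph orch apply -i
service_type: ingress
service_id: rgw.default
placement:
  hosts:
    - host01
spec:
  backend_service: rgw.default
  virtual_ip: 192.0.2.10/24
  frontend_port: 443
  monitor_port: 1967
  ssl: true
  haproxy_qat_support: false
```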
7.2. Ceph Dashboard
Get to know the known issues for Ceph Dashboard found in this release.
Active alert displays even when Prometheus module is active
In some cases, the Ceph Dashboard shows an active alert for CephMgrPrometheusModuleInactive even though the Prometheus module is enabled. This can happen due to a cluster misconfiguration that causes the Ceph target to go down, falsely triggering the alert.
The alert remains visible unless silenced, even when the Prometheus module is functioning correctly.
As a workaround, suppress the alert: in the Ceph Dashboard, select the CephMgrPrometheusModuleInactive alert and create a silence.
For more information, see Managing alerts on the Ceph dashboard.
Dashboard cannot delete non-default zone groups or zones
Users cannot delete non-default zone groups or zones from the Ceph Dashboard. Attempts to delete them fail.
As a workaround, delete non-default zone groups and zones through the command-line interface by using the appropriate radosgw-admin commands.
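A minimal command sequence for the workaround, with placeholder names in capitals, might look like this:

```shell
# Delete the non-default zone, then its zone group
radosgw-admin zone delete --rgw-zone=ZONE_NAME
radosgw-admin zonegroup delete --rgw-zonegroup=ZONEGROUP_NAME

# In a multi-site configuration, commit the change to the period
radosgw-admin period update --commit
```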
7.3. Ceph File System (CephFS)
Get to know the known issues for Ceph File System (CephFS) found in this release.
Subvolume operations delayed due to GIL contention during asynchronous cloning
When the asynchronous cloner in the volumes module (mgr/volumes) uses the CephFS Python binding, it invokes the Ceph client library API while holding the Python Global Interpreter Lock (GIL). During asynchronous clone operations, the GIL remains locked for an extended period, which prevents other CephFS subvolume operations such as create and delete from acquiring the GIL in time. As a result, customers may experience delayed responses when performing subvolume operations.
As a workaround, temporarily pause cloning to allow other subvolume operations to proceed.
This workaround is not practical in most production environments and should be used only in exceptional cases.
7.4. Ceph Object Gateway
Get to know the known issues for Ceph Object Gateway found in this release.
Lifecycle processing stuck in PROCESSING state for a given bucket
If a Ceph Object Gateway server is unexpectedly restarted while lifecycle processing is in progress for a given bucket, that bucket does not resume lifecycle processing for at least two scheduling cycles and remains in the PROCESSING state. This behavior is expected: it prevents multiple Ceph Object Gateway instances or threads from processing the same bucket simultaneously, which is especially important when debugging in production.
Currently there is no workaround.
Ceph Object Gateway services down after upgrade
After upgrading, Ceph Object Gateway services may fail to start. The service fails to start because the rgw service now enforces the rgw_realm configuration but no realm exists in the Ceph Object Gateway configuration. As a result, the following symptoms occur:
- The Ceph Object Gateway logs show the following error: rgw main: failed to load zone: (2) No such file or directory
- The ceph orch ps | grep rgw output displays the Ceph Object Gateway in an error state.
- Ceph Object Gateways are missing from the ceph versions output.
As a workaround, remove the rgw_realm entry and restart all Ceph Object Gateway services.
Verify whether the Ceph Object Gateways are configured with no realm while the Ceph configuration database specifies a realm.
1. Check the Ceph Object Gateway realm list:

radosgw-admin realm list

The following is an example with an empty realm list:

[ceph: root@host01 /]# radosgw-admin realm list
{
    "default_info": "",
    "realms": []
}

2. Check the Ceph configuration database:

ceph config dump | egrep "^WHO|rgw_realm"

Example:

[ceph: root@host01 /]# ceph config dump | egrep "^WHO|rgw_realm"
WHO          MASK  LEVEL     OPTION     VALUE
xxxxx.yyyyy        advanced  rgw_realm  default

If step 1 shows an empty realm list and step 2 shows an rgw_realm entry, continue to step 3. If the two outputs do not match this pattern, contact Support.
3. Remove the rgw_realm entry from the Ceph configuration database:

ceph config rm xxxxx.yyyyy rgw_realm

4. Restart all Ceph Object Gateway services:

ceph orch restart rgw
7.5. Ceph Object Gateway multi-site
Get to know the known issues for Ceph Object Gateway multi-site found in this release.
Sync failure occurs after renaming a zone or zone group
Renaming a zone or zone group in the Primary zone in the master_zonegroup can cause sync failures. When sync failures occur, the following sync status error may be emitted and further sync operations are affected:
failed to retrieve sync info: (2200) Unknown error 2200
As a workaround, before renaming a zone or zone group in the master_zonegroup, remove the old zone or zone group name from the Ceph configuration file. For more information, see Renaming a zone group and Removing a zone from a zone group.
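For reference, the rename itself can be sketched as follows, with placeholder names in capitals; run it only after removing the old name from the Ceph configuration file as described above:

```shell
# Rename the zone group and zone
radosgw-admin zonegroup rename --rgw-zonegroup=OLD_NAME --zonegroup-new-name=NEW_NAME
radosgw-admin zone rename --rgw-zone=OLD_NAME --zone-new-name=NEW_NAME

# Commit the updated period so the change propagates to other zones
radosgw-admin period update --commit
```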
Secondary site continues to display old zone group name after rename
In some cases, when a zone group is renamed on the Primary site, the Secondary site may still display the old zone group name. This occurs because the old name is not removed from the .rgw.root pool after the rename operation.
As a result, both the old and new zone groups appear under the radosgw-admin zonegroup list command, and sync operations may be impacted.
As a workaround, complete the following steps:
1. Verify that the new zone group name exists:

radosgw-admin zonegroup list

2. List the .rgw.root pool and locate the old zone group name:

rados -p .rgw.root ls

The old name appears in the format zonegroups_names.OLD_ZONEGROUP_NAME.
3. Remove the old zone group name from the pool:

rados -p .rgw.root rm zonegroups_names.OLD_ZONEGROUP_NAME
Removing the old zone group name restores normal sync operations.
Multi-site lifecycle expiration does not clean OLH entries in versioned buckets
Multi-site lifecycle expiration may fail to remove object log header (OLH) entries in versioned buckets. The system leaves stale data in the bucket index. This issue occurs when lifecycle expiration runs on multi-site deployments for versioned buckets. As a result, stale OLH entries remain after object deletion. This causes bucket index bloat and may impact bucket operations for customers.
As a workaround, administrators can manually detect and repair the affected buckets.
1. Detect stale entries:

radosgw-admin bucket check olh --dump-keys --bucket=BUCKET_NAME --hide-progress

2. Repair the bucket index:

radosgw-admin bucket check olh --fix --bucket=BUCKET_NAME
After repair, stale entries are purged.
7.6. Ceph Block Device (RBD)
Get to know the known issues for Ceph Block Device found in this release.
Kernel client does not support pg-upmap-primary
The kernel client currently does not support the pg-upmap-primary feature. As a result, users may encounter issues when attempting to mount images or filesystems using the kernel client in environments where pg-upmap-primary is configured.
If issues occur during mounting with the kernel client, verify that they are caused by this missing feature support.
1. Confirm that your cluster contains pg-upmap-primary mappings:

ceph osd dump | grep "pg_upmap_primary"

2. Check the kernel log for the following error message:

$ dmesg | tail
[73393.901029] libceph: mon2 (1)10.64.24.186:6789 feature set mismatch, my 2f018fb87aa4aafe < server's 2f018fb8faa4aafe, missing 80000000
[73393.901037] libceph: mon2 (1)10.64.24.186:6789 missing required protocol features

These errors confirm that the cluster is using features that the kernel client does not currently support.
- If this error message is not emitted, contact Support.
- If this error message is emitted, continue by removing the related mappings.
As a workaround, remove the related pg-upmap-primary mappings:
1. If you use the balancer module, change the mode back to one that does not use pg-upmap-primary. This prevents additional mappings from being created:

ceph balancer mode upmap

2. Remove all pg-upmap-primary mappings:

ceph osd rm-pg-upmap-primary-all
7.7. RADOS
Get to know the known issues for RADOS found in this release.
Placement groups are not scaled down in upmap-read and read balancer modes
Currently, pg_upmap_primary entries are not properly removed for placement groups (PGs) that are pending merge, for example, when the bulk flag is removed on a pool or in any other case where the number of PGs in a pool decreases. As a result, the PG scale-down process gets stuck and the number of PGs in the affected pool does not decrease as expected.
As a workaround, remove the pg_upmap_primary entries in the OSD map of the affected pool. To view the entries, run the ceph osd dump command, and then run ceph osd rm-pg-upmap-primary PG_ID for each PG in the affected pool.
After using the workaround, the PG scale-down process resumes as expected.
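The workaround can be sketched as a small loop; the pool ID is a placeholder, and the sketch assumes that pg_upmap_primary lines in the ceph osd dump output carry the PG ID in the second field:

```shell
# Remove every pg_upmap_primary entry that belongs to pool 12 (placeholder ID)
POOL_ID=12
for pgid in $(ceph osd dump | awk '$1 == "pg_upmap_primary" {print $2}'); do
    case "$pgid" in
        "$POOL_ID".*) ceph osd rm-pg-upmap-primary "$pgid" ;;
    esac
done
```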