Chapter 4. Bug fixes


This section describes bugs with significant impact on users that were fixed in this release of Red Hat Ceph Storage. In addition, the section includes descriptions of fixed known issues found in previous versions.

4.1. Ceph manager plugins

Python tasks no longer wait for the GIL

Previously, the Ceph manager daemon held the Python global interpreter lock (GIL) during some RPCs with the Ceph MDS. As a result, other Python tasks were starved while waiting for the GIL.

With this fix, the GIL is released during all libcephfs/librbd calls and other Python tasks may acquire the GIL normally.

Bugzilla:2219093

4.2. The Cephadm utility

cephadm detects duplicated hostnames and no longer adds the same host to a cluster twice

Previously, cephadm treated a host's short name and its FQDN as two separate hosts, causing the same host to be added to a cluster twice.

With this fix, cephadm recognizes when a short name and an FQDN refer to the same host, and does not add the host to the cluster a second time.
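The comparison can be pictured with a minimal Python sketch. This is not cephadm's code: the function names and the matching rule (a bare short name matches an FQDN whose first label is the same) are assumptions made for illustration.

```python
def same_host(name_a: str, name_b: str) -> bool:
    """Illustrative rule: a short name matches an FQDN with the same first label."""
    a = name_a.strip().lower().rstrip(".")
    b = name_b.strip().lower().rstrip(".")
    if a == b:
        return True
    # Only match when at least one side is a bare short name; two different
    # FQDNs that merely share a first label are treated as different hosts.
    a_is_short, b_is_short = "." not in a, "." not in b
    return (a_is_short or b_is_short) and a.split(".")[0] == b.split(".")[0]


def add_host(inventory: list, new_host: str) -> bool:
    """Append new_host unless an equivalent name is already in the inventory."""
    if any(same_host(existing, new_host) for existing in inventory):
        return False
    inventory.append(new_host)
    return True
```

With an inventory that already holds `host01`, calling `add_host(inventory, "host01.example.com")` returns `False` and leaves the inventory unchanged.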

Bugzilla:2049445

cephadm no longer reports that a non-existing label is removed from the host

Previously, cephadm did not verify that a label existed before removing it from a host. As a result, the ceph orch host label rm command reported that a label was removed from the host even when the label did not exist, for example, when the label was misspelled.

With this fix, the command tells the user clearly whether the specified label was removed.
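A minimal Python sketch of an existence check before removal follows. The data structure, function name, and message strings are hypothetical and do not reproduce the real output of ceph orch host label rm.

```python
def remove_label(host_labels: dict, host: str, label: str) -> str:
    """Remove label from host and report what actually happened."""
    labels = host_labels.get(host)
    if labels is None:
        return f"Host {host} not found"
    if label not in labels:
        # The check that was missing: never claim success for a label that
        # was not set on the host, such as a misspelled one.
        return f"Host {host} does not have label '{label}'. Nothing to remove."
    labels.remove(label)
    return f"Removed label {label} from host {host}"
```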

Bugzilla:2113901

The keepalive daemons communicate and enter the main/primary state

Previously, keepalive configurations were populated with IPs that matched the host IP reported by the ceph orch host ls command. As a result, if the VIP was configured on a different subnet than the listed host IP, the keepalive daemons could not communicate, and every keepalive daemon entered the primary state.

With this fix, the IPs of keepalive peers in the keepalive configuration are chosen to match the subnet of the VIP. The keepalive daemons can now communicate even if the VIP is in a different subnet than the host IP reported by the ceph orch host ls command. In this case, only one keepalive daemon enters the primary state.
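Choosing a peer address from the VIP's subnet can be sketched with Python's standard ipaddress module. The function name and inputs are assumptions for illustration; cephadm's actual selection logic may differ.

```python
import ipaddress
from typing import Optional


def pick_peer_ip(host_ips: list, vip_cidr: str) -> Optional[str]:
    """Return the first host IP that lies in the same subnet as the VIP."""
    # "10.0.1.100/24" yields the network 10.0.1.0/24.
    vip_network = ipaddress.ip_interface(vip_cidr).network
    for ip in host_ips:
        if ipaddress.ip_address(ip) in vip_network:
            return ip
    return None  # no address on this host shares the VIP's subnet
```

For a host with addresses 192.168.0.11 and 10.0.1.11 and a VIP of 10.0.1.100/24, the sketch selects 10.0.1.11 rather than the first address listed.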

Bugzilla:2222010

Stopped crash daemons now have the correct state

Previously, when a crash daemon stopped, the return code gave an error state, rather than the expected stopped state, causing systemd to think that the service had failed.

With this fix, the return code gives the expected stopped state.

Bugzilla:2126465

HAProxy now binds to the frontend port on the VIP

Previously, in Cephadm, multiple ingress services could not be deployed on the same host with the same frontend port because the port binding occurred across all host networks.

With this fix, multiple ingress services can now be present on the same host with the same frontend port as long as the services use different VIPs and different monitoring ports are set for the ingress service in the specification.
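The coexistence condition stated above can be written as a small Python check. This is a sketch only: the dictionary keys are chosen for illustration and are not a statement of the ingress specification format.

```python
def can_coexist(a: dict, b: dict) -> bool:
    """Check whether two ingress services can run on the same host.

    Encodes the condition from the text: a shared frontend port is fine as
    long as the VIPs differ and the monitoring ports differ.
    """
    if a["monitor_port"] == b["monitor_port"]:
        return False
    if a["frontend_port"] == b["frontend_port"] and a["virtual_ip"] == b["virtual_ip"]:
        return False
    return True
```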

Bugzilla:2231452

4.3. Ceph File System

User-space Ceph File System (CephFS) works as expected post upgrade

Previously, the user-space CephFS client would sometimes crash during a cluster upgrade. This would occur due to stale feature bits on the MDS side that were held on the user-space side.

With this fix, the user-space CephFS client has updated MDS feature bits, which allows clients to work as expected after a cluster upgrade.

Bugzilla:2247174

Blocklist and evict client for large session metadata

Previously, large client metadata buildup in the MDS would sometimes cause the MDS to switch to read-only mode.

With this fix, the client that is causing the buildup is blocklisted and evicted, allowing the MDS to work as expected.

Bugzilla:2238663

Deadlocks no longer occur between the unlink and reintegration requests

Previously, commits that fixed an async dirop bug introduced a regression that caused deadlocks between unlink and reintegration requests.

With this fix, those commits are reverted and deadlocks between unlink and reintegration requests no longer occur.

Bugzilla:2228635

Client always sends a caps revocation acknowledgement to the MDS daemon

Previously, if a client released the caps and removed the inode while the MDS daemon was sending it a caps revocation request, the client dropped the request without replying, but the MDS daemon still waited for a caps revocation acknowledgement from the client. As a result, the MDS daemon kept waiting even though no caps revocation was needed, which caused a warning in the MDS daemon health status.

With this fix, the client always sends a caps revocation acknowledgement to the MDS daemon, even when the inode no longer exists, and the MDS daemon no longer gets stuck.

Bugzilla:2228000

MDS locks are obtained in the correct order

Previously, the MDS would acquire metadata tree locks in the wrong order, causing create and getattr RPC requests to deadlock.

With this fix, locks are obtained in the correct order in MDS and the requests no longer deadlock.
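The general technique of avoiding deadlock through a consistent lock order can be sketched in Python. The ordering rule the MDS actually uses is not described here; sorting by a hypothetical inode number is an assumption chosen only to show the idea.

```python
import threading


class Node:
    """Hypothetical metadata tree node with its own lock."""

    def __init__(self, ino: int):
        self.ino = ino
        self.lock = threading.Lock()


def lock_in_order(*nodes: Node) -> list:
    """Acquire every node lock in one global order (ascending ino).

    If all requests follow the same order, two requests can never each hold
    a lock that the other is waiting for.
    """
    ordered = sorted(set(nodes), key=lambda n: n.ino)
    for node in ordered:
        node.lock.acquire()
    return ordered


def unlock(ordered: list) -> None:
    """Release the locks in reverse acquisition order."""
    for node in reversed(ordered):
        node.lock.release()
```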

Bugzilla:2235338

CephFS MDS skips sending split_realms information

Previously, the CephFS MDS incorrectly sent split_realms information that kclient could not decode correctly. As a result, clients disregarded the split_realms information and treated it as a corrupted snaptrace.

With this fix, split_realms are not sent to kclient and no crashes take place.

Bugzilla:2228003

Snapshot data is no longer lost after setting writing flags

Previously, in clients, if the writing flag was set to ‘1’ when the Fb caps were used, the client would, if there were any dirty caps, skip ahead and reuse the existing capsnap, which is incorrect. As a result, two consecutive snapshots would be overwritten and data would be lost.

With this fix, the writing flags are correctly set and no snapshot data is lost.

Bugzilla:2224241

Thread renaming no longer fails

Previously, in a few rare cases, if another thread tried to look up the dst dentry during a rename, it could get an inconsistent result in which both the src dentry and the dst dentry linked to the same inode simultaneously. Due to this, the rename request would fail because two different dentries were linked to the same inode.

With this fix, the thread waits for the rename to finish and everything works as expected.

Bugzilla:2227987

Revocation requests no longer get stuck

Previously, if clients released the corresponding caps and sent out the cap update request with the old seq before the revoke request, which increases the seq, was sent out, the MDS would skip checking the seq and the cap calculation. As a result, the revocation requests would be stuck indefinitely and the MDS would raise warnings that clients were not responding to revocation requests.

With this fix, an acknowledgement is always sent for revocation requests and they no longer get stuck.

Bugzilla:2227992

Errors are handled gracefully in MDLog::_recovery_thread

Previously, a write would fail if the MDS was already blocklisted due to the fs fail issued by the QA tests. For instance, the QA test test_rebuild_moved_file (tasks/data-scan) would fail due to this reason.

With this fix, the write failures are gracefully handled in MDLog::_recovery_thread.

Bugzilla:2228358

Ceph client now verifies the cause of lagging before sending out an alarm

Previously, Ceph would sometimes send out false alerts warning of laggy OSDs, for example, X client(s) laggy due to laggy OSDs. These alerts were sent out without verifying that the lagginess was actually due to the OSD and not due to some other cause.

With this fix, the X client(s) laggy due to laggy OSDs message is only sent out if both some clients and an OSD are laggy.
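The condition can be expressed as a short Python sketch. The function name and the way the counts are obtained are assumptions; only the message text comes from the description above.

```python
from typing import Optional


def laggy_warning(laggy_clients: int, laggy_osds: int) -> Optional[str]:
    """Return the warning only when clients and at least one OSD are laggy."""
    if laggy_clients > 0 and laggy_osds > 0:
        return f"{laggy_clients} client(s) laggy due to laggy OSDs"
    # Laggy clients without a laggy OSD point to some other cause, so no
    # OSD-related warning is raised.
    return None
```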

Bugzilla:2247187

4.4. Ceph Dashboard

Grafana panels for performance of daemons in the Ceph Dashboard now show correct data

Previously, the exporter labels were not compatible with the queries used in the Grafana dashboard. Due to this, the Grafana panels for Ceph daemon performance were empty in the Ceph Dashboard.

With this fix, the label names are made compatible with the Grafana dashboard queries and the Grafana panels for performance of daemons show correct data.

Bugzilla:2241309

Editing of the layering and deep-flatten features is disabled on the Dashboard

Previously, the Ceph dashboard allowed editing the layering and deep-flatten features, which are immutable, resulting in an error - rbd: failed to update image features: (22) Invalid argument.

With this fix, editing the layering and deep-flatten features is disabled and everything works as expected.

Bugzilla:2166708

ceph_daemon label is added to the labeled performance counters in Ceph exporter

Previously, the ceph_daemon label was not added to the labeled performance counters in Ceph exporter.

With this fix, the ceph_daemon label is added to the labeled performance counters in Ceph exporter. The ceph_daemon label is now present on all Ceph daemon performance metrics, and the instance_id label is present on Ceph Object Gateway performance metrics.

Bugzilla:2240972

Protecting snapshot is enabled only if layering for its parent image is enabled

Previously, snapshot protection was enabled even if layering was disabled for the parent image. This caused errors when trying to protect the snapshot of an image for which layering was disabled.

With this fix, snapshot protection is enabled only if layering is enabled for the parent image.

Bugzilla:2166705

Newly added host details are now visible on the cluster expansion review page

Previously, users could not see the information about the hosts that were added in the previous step.

With this fix, hosts that were added in the previous step are now visible on the cluster expansion review page.

Bugzilla:2232567

Ceph Object Gateway page now loads properly on the Ceph dashboard

Previously, an incorrect regex match caused the dashboard to break when trying to load the Ceph Object Gateway page. The page would not load with specific configurations, such as rgw_frontends set to beast port=80 ssl_port=443.

With this fix, the regex matching in the codebase is updated and the Ceph Object Gateway page loads without any issues.
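To illustrate the kind of parsing involved, here is a minimal Python sketch that extracts both port values from an rgw_frontends string. The pattern and function name are hypothetical and are not the dashboard's actual regex.

```python
import re

# \b keeps "port" from matching inside "ssl_port": "_" is a word character,
# so there is no word boundary between "_" and "p".
_PORT_RE = re.compile(r"\b(ssl_port|port)=(\d+)")


def parse_frontend_ports(frontends: str) -> dict:
    """Map each port option found in the frontends string to its number."""
    return {key: int(value) for key, value in _PORT_RE.findall(frontends)}
```

For `"beast port=80 ssl_port=443"` this returns `{"port": 80, "ssl_port": 443}`; a string without either option returns an empty dictionary rather than raising.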

Bugzilla:2238470

4.5. Ceph Object Gateway

Ceph Object Gateway daemon no longer crashes where phoneNumbers.addr is NULL

Previously, due to a syntax error, the query for select * from s3object[*].phonenumbers where phoneNumbers.addr is NULL; would cause the Ceph Object Gateway daemon to crash.

With this fix, the wrong syntax is identified and reported, and no longer causes the daemon to crash.

Bugzilla:2230234

Ceph Object Gateway daemon no longer crashes with cast( trim) queries

Previously, because trim skipped type checking in the query select cast( trim( leading 132140533849470.72 from _3 ) as float) from s3object;, the Ceph Object Gateway daemon would crash.

With this fix, the type is checked, and a wrong type is identified and reported, and no longer causes the daemon to crash.

Bugzilla:2248866

Ceph Object Gateway daemon no longer crashes with “where” clause in an s3select JSON query

Previously, due to a syntax error, an s3select JSON query with a “where” clause would cause the Ceph Object Gateway daemon to crash.

With this fix, the wrong syntax is identified and reported, and no longer causes the daemon to crash.

Bugzilla:2225434

Ceph Object Gateway daemon no longer crashes with s3 select phonenumbers.type query

Previously, due to a syntax error, the query for select phonenumbers.type from s3object[*].phonenumbers; would cause the Ceph Object Gateway daemon to crash.

With this fix, the wrong syntax is identified and reported, and no longer causes the daemon to crash.

Bugzilla:2230230

Ceph Object Gateway daemon validates arguments and no longer crashes

Previously, due to an operator with missing arguments, the daemon would crash when trying to access the nonexistent arguments.

With this fix, the daemon validates the number of arguments per operator and no longer crashes.
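Validating the argument count before evaluating an operator can be sketched in Python as follows. The operator table, arities, and error type are hypothetical and do not reflect the s3select engine's internals.

```python
class QuerySyntaxError(ValueError):
    """Hypothetical error reported to the client instead of crashing."""


# Hypothetical table: operator name -> (minimum, maximum) argument count.
OPERATOR_ARITY = {"substring": (2, 3), "date_diff": (3, 3), "utcnow": (0, 0)}


def validate_call(operator: str, args: list) -> None:
    """Reject a call whose argument count does not fit the operator."""
    name = operator.lower()
    if name not in OPERATOR_ARITY:
        raise QuerySyntaxError(f"unknown operator: {operator}")
    low, high = OPERATOR_ARITY[name]
    if not low <= len(args) <= high:
        raise QuerySyntaxError(
            f"{operator} expects between {low} and {high} arguments, got {len(args)}"
        )
```

Running the check before evaluation means that later code never indexes into arguments that do not exist.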

Bugzilla:2230233

Ceph Object Gateway daemon no longer crashes with the trim command

Previously, because trim skipped type checking in the query select trim(LEADING '1' from '111abcdef111') from s3object;, the Ceph Object Gateway daemon would crash.

With this fix, the type is checked, and a wrong type is identified and reported, and no longer causes the daemon to crash.

Bugzilla:2248862

Ceph Object Gateway daemon no longer crashes if a big value is entered

Previously, because the entered value was too large, the query select DATE_DIFF(SECOND, utcnow(),date_add(year,1111111111111111111, utcnow())) from s3object; would cause the Ceph Object Gateway daemon to crash.

With this fix, the condition that caused the crash is identified and an error is reported.
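The idea of rejecting an out-of-range date calculation with an error instead of failing can be sketched in Python. The function and error names are hypothetical, and Python's own year limits (MINYEAR to MAXYEAR) stand in for whatever limits the SQL engine enforces.

```python
import calendar
from datetime import MAXYEAR, MINYEAR, datetime


class QueryRangeError(ValueError):
    """Hypothetical error reported to the client instead of crashing."""


def add_years(ts: datetime, years: int) -> datetime:
    """Add a number of years, rejecting results outside the supported range."""
    new_year = ts.year + years
    if not MINYEAR <= new_year <= MAXYEAR:
        raise QueryRangeError(f"date_add: resulting year {new_year} is out of range")
    day = ts.day
    # February 29 does not exist in a non-leap target year; clamp to the 28th.
    if ts.month == 2 and ts.day == 29 and not calendar.isleap(new_year):
        day = 28
    return ts.replace(year=new_year, day=day)
```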

Bugzilla:2245145

Ceph Object Gateway now parses the CSV objects without processing failures

Previously, Ceph Object Gateway failed to properly parse CSV objects. When the process failed, the requests would stop without a proper error message.

With this fix, the CSV parser works as expected and processes the CSV objects with no failures.

Bugzilla:2241907

Object version instance IDs beginning with a hyphen are restored

Previously, when restoring the index on a versioned bucket, object versions with an instance ID beginning with a hyphen would not be properly restored into the bucket index.

With this fix, instance IDs beginning with a hyphen are now recognized and restored into the bucket index, as expected.

Bugzilla:2247138

Multi-delete function notifications work as expected

Previously, due to internal errors, such as a race condition in the code, the Ceph Object Gateway would crash or react unexpectedly when multi-delete functions were performed and the notifications were set for bucket deletions.

With this fix, notifications for multi-delete functions work as expected.

Bugzilla:2239173

RADOS object multipart upload workflows complete properly

Previously, in some cases, RADOS objects that were part of a multipart upload workflow and had been created on a previous upload would cause certain parts to not complete or to stop in the middle of the upload.

With this fix, all parts are uploaded correctly when the multipart upload workflow completes.

Bugzilla:2008835

Users belonging to a different tenant than the bucket owner can now manage notifications

Previously, a user that belonged to a different tenant than the bucket owner was not able to manage notifications, for example, to modify, get, or delete them.

With this fix, any user with the correct permissions can manage the notifications for the buckets.

Bugzilla:2180415

Ability to perform NFS setattr on buckets is removed

Previously, changing the attributes stored on a bucket that was exported as an NFS directory triggered an inconsistency in the Ceph Object Gateway bucket information cache. Due to this, subsequent accesses to the bucket via NFS failed.

With this fix, the ability to perform NFS setattr on buckets is removed and attempts to perform NFS setattr on a bucket, for example, chown on the directory, have no effect.

Note

This might change in future releases.

Bugzilla:2241145

Testing for reshardable bucket layouts is added to prevent crashes

Previously, with the added bucket layout code to enable dynamic bucket resharding with multi-site, there was no check to verify if the bucket layout supported resharding during dynamic, immediate, or rescheduled resharding. Due to this, the Ceph Object Gateway daemon would crash in case of dynamic bucket resharding and the radosgw-admin command would crash in case of immediate or scheduled resharding.

With this fix, a test for reshardable bucket layouts is added and the crashes no longer occur. When immediate and scheduled resharding occurs, an error message is displayed. When dynamic bucket resharding occurs, the bucket is skipped.

Bugzilla:2242987

The user modify --placement-id command can now be used with an empty --storage-class argument

Previously, if the --storage-class argument was not used when running the user modify --placement-id command, the command would fail.

With this fix, the --storage-class argument can be left empty without causing the command to fail.

Bugzilla:2228157

Initialization now only unregisters watches that were previously registered

Previously, in some cases, an error in initialization could cause an attempt to unregister a watch that was never registered. This would result in some command line tools crashing unpredictably.

With this fix, only previously registered watches are unregistered.

Bugzilla:2224078

Multi-site replication now maintains consistent states between zones and prevents overwriting deleted objects

Previously, a race condition in multi-site replication would allow objects that should be deleted to be copied back from another site, resulting in an inconsistent state between zones. As a result, the zone that received the workload ended up with some objects still present that should have been deleted.

With this fix, a custom header is added to pass the destination zone’s trace string and is then checked against the object’s replication trace. If there is a match, a 304 response is returned, preventing the full sync from overwriting a deleted object.
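The trace check described above can be sketched in Python. The function name and the representation of the replication trace as a list of zone trace strings are assumptions; the header itself is not named in this document and is therefore left out.

```python
def sync_fetch_status(object_trace: list, dest_zone_trace: str) -> int:
    """Decide the HTTP status for a sync fetch of one object.

    object_trace: zone trace strings the object has already passed through.
    dest_zone_trace: trace string of the zone asking for the object, as
    passed by the requester.
    """
    if dest_zone_trace and dest_zone_trace in object_trace:
        # The destination has already seen this object, so copying it back
        # could resurrect something the destination deleted.
        return 304
    return 200
```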

Bugzilla:2219427

The memory footprint of Ceph Object Gateway has been significantly reduced

Previously, in some cases, a memory leak associated with Lua scripting integration caused excessive RGW memory growth.

With this fix, the leak is fixed and the memory footprint for Ceph Object Gateway is significantly reduced.

Bugzilla:2032001

Bucket index performance no longer impacted during versioned object operations

Previously, in some cases, space leaks would occur and reduce bucket index performance. This was caused by a race condition related to updates of object logical head (OLH), which relates to versioned bucket current version calculations during updates.

With this fix, logic errors in OLH update operations are fixed and space is no longer being leaked during versioned object operations.

Bugzilla:2219467

Delete markers are working correctly with the LC rule

Previously, an optimization attempted to reuse a sal object handle. Due to this, delete markers were not being generated as expected.

With this fix, the change to reuse the sal object handle for get-object-attributes is reverted and delete markers are created correctly.

Bugzilla:2248116

SQL engine no longer causes Ceph Object Gateway crash with illegal calculations

Previously, in some cases, the SQL engine would throw an exception that was not handled, causing a Ceph Object Gateway crash. This was caused by an illegal SQL calculation in a date-time operation.

With this fix, the exception is handled with an emitted error message, instead of crashing.

Bugzilla:2246150

The select trim (LEADING '1' from '111abcdef111') from s3object; query now works when capitals are used in query

Previously, if LEADING or TRAILING was written in all capitals, the string would not be read properly, causing a float type to be referred to as a string type and leading to wrong output.

With this fix, type checking is introduced before completing the query, and LEADING and TRAILING work whether written in capitals or in lower case.

Bugzilla:2245575

JSON parsing now works for select _1.authors.name from s3object[*] limit 1 query

Previously, an anonymous array given in the select _1.authors.name from s3object[*] limit 1 query would produce the wrong output value.

With this fix, JSON parsing works, even if an anonymous array is provided to the query.

Bugzilla:2236462

4.6. Multi-site Ceph Object Gateway

Client no longer resets the connection for an incorrect Content-Length header field value

Previously, when returning an error page to the client, for example, for a 404 or 403 condition, the </body> and </html> closing tags were missing, although their presence was accounted for in the response’s Content-Length header field value. Due to this, depending on the client, the TCP connection between the client and the Rados Gateway would be closed by an RST packet from the client on account of the incorrect Content-Length header field value, instead of by a FIN packet as under normal circumstances.

With this fix, the </body> and </html> closing tags are sent to the client under all the required conditions. The value of the Content-Length header field correctly represents the length of the data sent to the client, and the client no longer resets the connection because of an incorrect Content-Length value.
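One way to keep the header and the body from drifting apart is to compute Content-Length from the exact bytes that are sent. The following Python sketch shows the idea; the function name and page layout are hypothetical and unrelated to the gateway's actual error page.

```python
def build_error_response(status: int, reason: str) -> tuple:
    """Build headers and body for an HTML error page.

    Content-Length is derived from the encoded body, closing tags included,
    so it always equals the number of bytes written to the client.
    """
    body = (
        f"<html><head><title>{status} {reason}</title></head>"
        f"<body><h1>{status} {reason}</h1></body></html>"
    ).encode("utf-8")
    headers = {
        "Content-Type": "text/html",
        "Content-Length": str(len(body)),
    }
    return headers, body
```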

Bugzilla:2189412

Sync notifications are sent with the correct object size

Previously, when an object was synced between zones, and sync notifications were configured, the notification was sent with zero as the size of the object.

With this fix, sync notifications are sent with the correct object size.

Bugzilla:2238921

Multi-site sync properly filters and checks according to allowed zones and filters

Previously, when using the multi-site sync policy, certain commands, such as radosgw-admin sync status, would not filter restricted zones or empty sync group names. The lack of filtering caused the output of these commands to be misleading.

With this fix, restricted zones are no longer checked or reported and empty sync group names are filtered out of the status results.

Bugzilla:2159966

4.7. RADOS

The ceph version command no longer returns the empty version list

Previously, if the MDS daemon was not deployed in the cluster, the ceph version command returned an empty version list for MDS daemons, which suggested a version inconsistency. This list should not be shown if the daemon is not deployed in the cluster.

With this fix, the daemon version information is skipped if the daemon version map is empty and the ceph version command returns the version information only for the Ceph daemons which are deployed in the cluster.
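Skipping empty version maps can be sketched in Python as a simple filter. The function name, the input shape (daemon type mapped to a version-string count), and the sample version string are assumptions for illustration.

```python
def summarize_versions(per_daemon: dict) -> dict:
    """Keep only daemon types that have at least one reported version."""
    # An empty map means the daemon type is not deployed, so it is omitted
    # rather than shown as an empty, inconsistent-looking entry.
    return {daemon: versions for daemon, versions in per_daemon.items() if versions}
```

Given `{"mon": {"ceph version X": 3}, "mds": {}}`, the result contains only the `mon` entry.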

Bugzilla:2110933

ms_osd_compression_algorithm now displays the correct value

Previously, ms_osd_compression_algorithm displayed a list of algorithms instead of the single default value.

With this fix, only the default value is displayed for ms_osd_compression_algorithm.

Bugzilla:2155380

MGR no longer disconnects from the cluster without retries

Previously, during network issues, the MGR would disconnect from the cluster without retries and the authentication of monclient would fail.

With this fix, retries are added in scenarios where hunting and connection would both fail.

Bugzilla:2106031

Increased timeout retry value for client_mount_timeout

Previously, due to the mishandling of the client_mount_timeout configurable, the timeout for authenticating a client to monitors could reach up to 10 retries, disregarding its high default value of 5 minutes.

With this fix, the previous single-retry behavior of the configurable is restored and the authentication timeout works as expected.

Bugzilla:2233800

4.8. RBD Mirroring

Demoted mirror snapshot is removed following the promotion of the image

Previously, due to an implementation defect, the demoted mirror snapshots would not be removed following the promotion of the image, whether on the secondary image or on the primary image. Due to this, demoted mirror snapshots would pile up and consume storage space.

With this fix, the implementation defect is fixed and the appropriate demoted mirror snapshot is removed following the promotion of the image.

Bugzilla:2237304

Non-primary images are now deleted when the primary image is deleted

Previously, a race condition in the rbd-mirror daemon image replayer prevented a non-primary image from being deleted when the primary was deleted. Due to this, the non-primary image would not be deleted and the storage space was used.

With this fix, the rbd-mirror image replayer is modified to eliminate the race condition. Non-primary images are now deleted when the primary image is deleted.

Bugzilla:2230056

The librbd client correctly propagates the block-listing error to the caller

Previously, when the rbd_support module’s RADOS client was block-listed, the module’s mirror_snapshot_schedule handler would not always shut down correctly. The handler’s librbd client would not propagate the block-list error, thereby stalling the handler’s shutdown. This led to the mirror_snapshot_schedule handler and the rbd_support module failing to recover automatically from repeated client block-listing. The rbd_support module stopped scheduling mirror snapshots after its client was repeatedly block-listed.

With this fix, the race in the librbd client between its exclusive lock acquisition and handling of block-listing is fixed. This allows the librbd client to propagate the block-listing error correctly to the caller, for example, the mirror_snapshot_schedule handler, while waiting to acquire an exclusive lock. The mirror_snapshot_schedule handler and the rbd_support module automatically recover from repeated client block-listing.

Bugzilla:2237303


© 2024 Red Hat, Inc.