Chapter 6. Bug fixes
This section describes bugs with significant user impact, which were fixed in this release of Red Hat Ceph Storage. In addition, the section includes descriptions of fixed known issues found in previous versions.
6.1. The Cephadm utility
Container process number limit set to max
Previously, the process number limit of 2048 on the containers prevented new processes from being forked beyond that limit.
With this release, the process number limit is set to max, which allows you to create as many LUNs as required per target. However, the number is still limited by the server resources.
Unavailable devices are no longer passed when creating OSDs in a batch
Previously, devices with GPT headers were not marked as unavailable. Cephadm would attempt to create OSDs on those devices, along with other valid devices, in a single batch. Because OSDs cannot be created on devices with GPT headers, the entire batch failed and no OSDs were created.
With this fix, unavailable devices are no longer passed when creating OSDs in a batch and having devices with GPT headers no longer blocks creating OSDs on valid devices.
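The filtering step described above can be sketched as follows. This is a minimal illustration, not the cephadm implementation; the `Device` class and the `batch_candidates` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    """Hypothetical device record; `available` is False for GPT-labelled disks."""
    path: str
    available: bool


def batch_candidates(devices: List[Device]) -> List[str]:
    """Return only the device paths that may be passed to a batch OSD creation."""
    # Unavailable devices are dropped up front so they cannot fail the batch.
    return [dev.path for dev in devices if dev.available]
```

With this shape, one device with a GPT header only reduces the batch by one entry instead of blocking OSD creation on the remaining valid devices.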
Users providing the --format argument with unsupported formats no longer receive a traceback
Previously, the orchestrator would throw an exception whenever it received a --format argument that it did not support, causing users who passed --format with unsupported formats to receive a traceback.
With this fix, unsupported formats are now properly handled and users providing an unsupported format get a message explaining that the format is unsupported.
The ceph-common packages can now be installed without dependency errors
Previously, after upgrading Red Hat Ceph Storage 4 to Red Hat Ceph Storage 5, a few packages were left behind, which caused dependency errors.
With this fix, the leftover Red Hat Ceph Storage 4 packages are removed and the ceph-common packages can now be installed during preflight playbook execution without any errors.
The tcmu-runner daemons are no longer reported as stray daemons
Previously, tcmu-runner daemons were not actively tracked by cephadm as they were considered part of iSCSI. This resulted in tcmu-runner daemons getting reported as stray daemons since cephadm was not tracking them.
With this fix, when a tcmu-runner daemon matches up with a known iSCSI daemon, it is not marked as a stray daemon.
Users can re-add a host with an active manager without an explicit IP
Previously, whenever cephadm attempted to resolve the IP address of the current host from within a container, there was a chance of it resolving to a loopback address. An explicit IP was required if the user wished to re-add the host with the active Ceph Manager, and users would receive an error message if they did not provide it.
With the current fix, cephadm reuses the old IP when re-adding the host if it is not explicitly provided and name resolution returns a loopback address. Users can now re-add the host with the active manager without an explicit IP.
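The fallback rule described above can be sketched as follows. This is a hedged illustration of the decision only; `choose_host_addr` and its parameters are hypothetical names and do not reflect the actual cephadm code.

```python
import ipaddress
from typing import Optional


def choose_host_addr(explicit: Optional[str], resolved: str,
                     previous: Optional[str]) -> str:
    """Pick the address to record when a host is re-added (hypothetical helper)."""
    if explicit:
        # An explicitly provided address always wins.
        return explicit
    if previous and ipaddress.ip_address(resolved).is_loopback:
        # Name resolution returned a loopback address, so reuse the
        # address that was stored for this host before.
        return previous
    return resolved
```

For example, `choose_host_addr(None, "127.0.0.1", "10.0.0.5")` falls back to the previously stored `10.0.0.5`.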
cephadm verifies if the fsid of the daemon it was inferring a config from matches the expected fsid
Previously, in cephadm, there was no check to verify if the fsid of the daemon it was inferring a configuration from matched the expected fsid. Due to this, if users had a /var/lib/ceph/FSID/DAEMON_NAME directory with an fsid other than the expected one, the configuration from that daemon directory would still be inferred.
With this fix, checking is done to verify if the fsid matches what is expected and users no longer get a “failed to probe daemons or devices” error.
cephadm supports copying client keyrings with different names
Previously, cephadm would enforce a file name at the destination when copying the client keyring ceph.keyring.
With the current fix, cephadm supports copying the client keyring with a different name, eliminating the issue of automatic renaming when copied.
Users can bootstrap a cluster with multiple public networks with the -c ceph.conf option
Previously, cephadm would not parse multiple public networks during bootstrap when they were provided as part of the -c ceph.conf option. Due to this, it was not possible to bootstrap a cluster with multiple public networks.
With the current fix, the public network field is correctly parsed from the provided ceph.conf file and can now be used to populate the public_network mon config field, enabling the user to bootstrap a cluster providing multiple public networks by using the -c ceph.conf option.
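As an illustration of the parsing described above, the following sketch reads a comma-separated public network value from the [global] section of a ceph.conf-style file. It is an assumption-laden example built on Python's standard configparser module, not the cephadm parser; the `public_networks` helper is a hypothetical name.

```python
import configparser
from typing import List


def public_networks(conf_text: str) -> List[str]:
    """Return the public networks listed in the [global] section, if any."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("global"):
        return []
    for key, value in parser.items("global"):
        # Option names may be written with spaces or underscores.
        if key.replace(" ", "_") == "public_network":
            return [net.strip() for net in value.split(",") if net.strip()]
    return []
```

Joining the returned list with commas gives a single value of the kind that could populate the public_network field.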
Setting up an MDS service with a numeric service ID throws an error to alert the user
Previously, setting up an MDS service with a numeric service ID would result in the MDS daemons crashing.
With this fix, if an attempt is made to create an MDS service with a numeric service ID, an error is immediately thrown to warn users not to use a numeric service ID.
The ceph orch redeploy mgr command redeploys the active Manager daemon last
Previously, the ceph orch redeploy mgr command would cause the Ceph Manager daemons to continually redeploy themselves without clearing the scheduled redeploy action, which would result in the Ceph Manager daemons endlessly flapping.
With this release, the ordering of the redeployment was adjusted so that the active manager daemon is always redeployed last and the command ceph orch redeploy mgr now only redeploys each Ceph Manager once.
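The ordering rule can be sketched as a simple reordering of the daemon list. This is only an illustration of the "active manager last" idea; `redeploy_order` and the daemon names are hypothetical.

```python
from typing import List


def redeploy_order(mgr_daemons: List[str], active: str) -> List[str]:
    """Return the daemons in redeploy order, with the active manager last."""
    standbys = [name for name in mgr_daemons if name != active]
    # The active manager goes last, so the process driving the redeploy
    # is not restarted before the standbys are done.
    return standbys + ([active] if active in mgr_daemons else [])
```

Each daemon appears exactly once in the result, which matches the stated outcome that each Ceph Manager is redeployed only once.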
Adopting clusters with a custom name is now supported
Previously, adopting Ceph OSD containers from a Ceph cluster with a custom name failed as cephadm would not propagate custom clusters in the unit.run file.
With this release, cephadm changes the LVM metadata and enforces the default cluster name “Ceph”, so adopting a cluster with a custom cluster name works as expected.
cephadm no longer adds docker.io to the image name provided for the ceph orch upgrade start command
Previously, cephadm would add docker.io to any image from an unqualified registry. As a result, it was impossible to pass an image from an unqualified registry, such as a local registry, to upgrade, as it would fail to pull this image.
Starting with Red Hat Ceph Storage 5.2, docker.io is no longer added to the image name, unless the name is a match for an upstream ceph image such as ceph/ceph:v17. On running the ceph orch upgrade command, users can pass images from local registries and Cephadm can upgrade to that image.
This is ONLY applicable to upgrades starting from 5.2. Upgrading from 5.1 to 5.2 is still affected by this issue.
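The new naming rule can be sketched as below. The text does not define exactly what counts as "a match for an upstream ceph image", so this example assumes, purely for illustration, that it means a name starting with `ceph/`; `normalize_image_name` is a hypothetical helper, not a cephadm function.

```python
def normalize_image_name(image: str) -> str:
    """Prefix docker.io only for names that look like upstream ceph images."""
    # Assumption: "upstream ceph image" is approximated by the "ceph/" prefix.
    if image.startswith("ceph/"):
        return "docker.io/" + image
    # Anything else, such as an image from a local registry, is left untouched.
    return image
```

Under this assumption, `ceph/ceph:v17` is still qualified with docker.io, while a name such as `registry.local:5000/rhceph:5` (a made-up example) is passed through unchanged.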
Cephadm no longer infers configuration files from legacy daemons
Previously, Cephadm would infer config files from legacy daemons, regardless of whether the daemons were still present, based on the existence of a /var/lib/ceph/{mon|osd|mgr} directory. This caused certain tasks, such as refreshing the disks, to fail on nodes where these directories existed, as Cephadm would throw an error when it attempted to infer the non-existent configuration file.
With the current fix, Cephadm no longer infers configuration files from legacy daemons; instead it checks for existing configuration files before inferring. Cephadm no longer encounters issues when refreshing daemons or devices on a host, due to the existence of a legacy daemon directory.
.rgw.root pool is no longer created automatically
Previously, an additional check for Ceph Object Gateway for multi-site existed, which caused the automatic creation of the .rgw.root pool even when the user had deleted it.
Starting with Red Hat Ceph Storage 5.2, the multi-site check is removed and the .rgw.root pool is no longer automatically created, unless the user takes Ceph Object Gateway-related actions that result in its creation.
The Ceph Manager daemon is removed from a host that is no longer specified in the placement specification in cephadm
Previously, the current active manager daemon would not be removed from cephadm even if it no longer matched the placement specified in the manager service specification. Whenever users changed the service specification to exclude the host where the current active manager was, they would end up with an extra manager until they caused a failover.
With this fix, cephadm fails over the manager if a standby is available and the active manager is on a host that no longer matches the service specification. The Ceph Manager daemon is removed from a host that is no longer specified in the placement specification in cephadm even if the manager is the active one.
A 404 error due to a malformed URL no longer causes tracebacks in the logs
Previously, cephadm would incorrectly form the URL for the prometheus receiver, causing a traceback to be printed in the log due to a 404 error that would occur when trying to access the malformed URL.
With this fix, the URL formatting has been fixed and the 404 error is avoided. Tracebacks are no longer logged.
cephadm no longer removes osd_memory_target config settings at host level
Previously, if osd_memory_target_autotune was turned off globally, cephadm would remove the values that the user set for osd_memory_target at the host level. Additionally, for hosts with FQDN names, even though the CRUSH map uses a short name, cephadm would still set the config option using the FQDN. Due to this, users could not manually set osd_memory_target at the host level and osd_memory_target auto tuning would not work with FQDN hosts.
With this fix, the osd_memory_target config settings are not removed by cephadm at the host level if osd_memory_target_autotune is set to false. It also always uses a short name for hosts when setting host level osd_memory_target. If at the host level osd_memory_target_autotune is set to false, users can manually set the osd_memory_target and have the options not be removed by cephadm. Additionally, autotuning should now work with hosts added to cephadm with FQDN names.
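The short-name handling can be sketched as follows. The helpers are hypothetical, and the `osd/host:NAME` string is shown only as an assumed form of a host-level configuration target for illustration.

```python
def short_hostname(host: str) -> str:
    """Strip the domain part from an FQDN; short names pass through unchanged."""
    return host.split(".", 1)[0]


def host_level_target(host: str) -> str:
    """Build a host-level config target from a short host name (assumed format)."""
    return f"osd/host:{short_hostname(host)}"
```

Because the CRUSH map uses short names, deriving the target from the short name lets a host that was added with its FQDN resolve to the same name as its CRUSH entry.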
Cephadm uses the FQDN to build the alertmanager webhook URLs
Previously, Cephadm picked alertmanager webhook URLs based on the IP address it had stored for the hosts. This caused issues since these webhook URLs would not work for certain deployments.
With this fix, Cephadm uses FQDNs to build the alertmanager webhook URLs, enabling webhook URLs to work for some deployment situations which were previously broken.
6.2. Ceph Dashboard
Drain action on the Ceph dashboard ensures safe removal of host
Previously, whenever a user removed a host on the Ceph dashboard without moving out all the daemons, the host transitioned to an unusable state or a ghost state.
With this fix, users can use the drain action on the dashboard to move all the daemons out from the host. Upon successful completion of the drain action, the host can be safely removed.
Performance details graphs show the required data on the Ceph Dashboard
Previously, due to related metrics being outdated, performance details graphs for a daemon were showing no data even when put/get operations were being performed.
With this fix, the related metrics are up-to-date and performance details graphs show the required data.
Alertmanager shows the correct MTU mismatch alerts
Previously, Alertmanager was showing false MTU mismatch alerts for cards that were in down state as well.
With this fix, Alertmanager shows the correct MTU mismatch alerts.
(BZ#2057307)
PG status chart no longer displays unknown placement group status
Previously, snaptrim_wait placement group (PG) state was incorrectly parsed and split into 2 states, snaptrim and wait, which are not valid PG states. This caused the PG status chart to incorrectly show a few PGs in unknown states, even though all of them were in known states.
With this fix, snaptrim_wait and all states containing an underscore are correctly parsed and the unknown PG status is no longer displayed in the PG states chart.
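The parsing rule can be illustrated with a short sketch. It assumes combined PG states are joined with `+` (as in remapped+peering elsewhere in these notes) and treats underscores as part of the state name; `count_pg_states` is a hypothetical helper, not dashboard code.

```python
from collections import Counter
from typing import Iterable


def count_pg_states(pg_states: Iterable[str]) -> Counter:
    """Count individual PG states from combined state strings."""
    counts: Counter = Counter()
    for combined in pg_states:
        # Split only on "+"; an underscore belongs to the state name itself.
        counts.update(combined.split("+"))
    return counts
```

With this rule, `active+clean+snaptrim_wait` contributes one `snaptrim_wait` entry and never the invalid `snaptrim` or `wait` states.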
Ceph Dashboard improved user interface
Previously, the following issues were identified in the Ceph Dashboard user interface, causing it to be unusable when tested with multi-path storage clusters:
- In clusters with multi-path storage devices, if a disk was selected in the Physical Disks page, multiple disks would be selected and the selection count of the table would keep incrementing until the table stopped responding within a minute.
- The Device Health page showed errors while fetching the SMART data.
- Services column in the Hosts page showed a lot of entries, thereby reducing readability.
With this release, the following fixes are implemented, resulting in improved user interface:
- Fixed the disk selection issue in the Physical Disks page.
- An option to fetch the SMART data of SCSI devices is added.
- Services column is renamed as Service Instances and only the instance name and instance count of that service are displayed in a badge.
6.3. Ceph File System
Fetching ceph.dir.layout for any directory returns the closest inherited layout
Previously, the directory paths did not traverse to the root to find the closest inherited layout, causing the system to return a “No such attribute” message for directories that did not have a layout set specifically on them.
With this fix, the directory paths traverse to the root to find the closest inherited layout, and ceph.dir.layout is fetched for any directory from the directory hierarchy.
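The traversal toward the root can be sketched as a lookup that walks up parent directories until it finds a layout. The dictionary of explicitly set layouts and the `closest_layout` helper are hypothetical stand-ins used only to illustrate the idea.

```python
import posixpath
from typing import Dict, Optional


def closest_layout(path: str, layouts: Dict[str, str]) -> Optional[str]:
    """Return the layout set on `path` or on its nearest ancestor, if any."""
    current = posixpath.normpath(path)
    while True:
        if current in layouts:
            return layouts[current]
        parent = posixpath.dirname(current)
        if parent == current:
            # Reached the top without finding a layout.
            return None
        current = parent
```

A directory without its own layout therefore reports the layout of the closest ancestor that has one, instead of failing with “No such attribute”.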
The subvolumegroup ls API filters the internal trash directory _deleting
Previously, the subvolumegroup ls API would not filter the internal trash directory _deleting, causing it to be listed as a subvolumegroup.
With this fix, the subvolumegroup ls API filters the internal trash directory _deleting and no longer shows it in its output.
Race condition no longer causes confusion among MDS in a cluster
Previously, a race condition in MDS, during messenger setup, would result in confusion among other MDS in the cluster, causing other MDS to refuse communication.
With this fix, the race condition is rectified, establishing successful communication among the MDS.
MDS can now trigger stray reintegration with online scrub
Previously, stray reintegrations were triggered only on client requests, so clearing out stray inodes required expensive recursive directory listings by a client.
With this fix, MDS can now trigger stray reintegration with online scrub.
MDS reintegrates strays if target directories are full
Previously, MDS would not reintegrate strays if the target directory of the link was full, causing the stray directory to fill up in degenerate situations.
With this fix, MDS proceeds with stray reintegration even when target directories are full, as no change in size occurs.
Quota is enforced on the clone after the data is copied
Previously, the quota on the clone would be set prior to copying the data from the source snapshot and the quota would be enforced before copying the entire data from the source. This would cause the subvolume snapshot clone to fail if the quota on the source was exceeded. Since the quota is not strictly enforced at the byte range, this is a possibility.
With this fix, the quota is enforced on the clone after the data is copied. The snapshot clone always succeeds irrespective of the quota.
Disaster recovery automation and planning resumes after ceph-mgr restart
Previously, schedules would not start during ceph-mgr startup, which affected the disaster recovery plans of users who presumed that the snapshot schedule would resume at ceph-mgr restart time.
With this fix, schedules start on ceph-mgr restart and the disaster recovery automation and planning, such as snapshot replication, immediately resumes after ceph-mgr is restarted, without the need for manual intervention.
The mdlog is flushed immediately when opening a file for reading
Previously, when opening a file for reading, MDS would revoke the Fw capability from the other clients and when the Fw capability was released, the MDS could not flush the mdlog immediately and would block the Fr capability. This would cause the process that requested the file to be stuck for about 5 seconds until the mdlog was flushed by MDS, which happens periodically every 5 seconds.
With this release, the mdlog flush is triggered immediately if any capability is wanted when the Fw capability is released, so you can open the file for reading quickly.
Deleting a subvolume clone is no longer allowed for certain clone states
Previously, if you tried to remove a subvolume clone with the force option when the clone was not in a COMPLETED or CANCELLED state, the clone was not removed from the index tracking the ongoing clones. This caused the corresponding cloner thread to retry the cloning indefinitely, eventually resulting in an ENOENT failure. With the default number of cloner threads set to four, attempts to delete four clones resulted in all four threads entering a blocked state, allowing none of the pending clones to complete.
With this release, unless a clone is either in a COMPLETED or CANCELLED state, it is not removed. The cloner threads no longer block because the clones are deleted, along with their entry from the index tracking the ongoing clones. As a result, pending clones continue to complete as expected.
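The state check can be sketched as a simple guard. The two removable states come from the text above; the `can_remove_clone` helper and any other state name used with it (for example `IN_PROGRESS`) are hypothetical.

```python
# States in which a clone may be removed, as named in the text above.
REMOVABLE_STATES = frozenset({"COMPLETED", "CANCELLED"})


def can_remove_clone(state: str) -> bool:
    """Return True only if the clone is in a state that allows removal."""
    return state.upper() in REMOVABLE_STATES
```

Rejecting removal in every other state keeps a cloner thread from working on a clone that has been deleted underneath it.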
New clients are compatible with old Ceph cluster
Previously, new clients were incompatible with old Ceph clusters, causing the old clusters to trigger abort() and crash the MDS daemons when receiving unknown metrics.
With this fix, the client checks the feature bits and collects and sends only those metrics that are supported by the MDSs. New clients are compatible with old Ceph clusters.
Ceph Metadata Server no longer crashes during concurrent lookup and unlink operations
Previously, an assert in the code relied on an incorrect assumption and was hit on concurrent lookup and unlink operations from a Ceph client, causing the Ceph Metadata Server to crash.
The latest fix moves the assertion to the place where the assumption is valid during concurrent lookup and unlink operations, so the Ceph Metadata Server continues to serve Ceph client operations without crashing.
MDSs no longer crash when fetching unlinked directories
Previously, when fetching unlinked directories, the projected version would be incorrectly initialized, causing MDSs to crash when performing sanity checks.
With this fix, the projected version and the inode version are initialized when fetching an unlinked directory, allowing the MDSs to perform sanity checks without crashing.
6.4. Ceph Manager plugins
The missing pointer is added to the PriorityCache perf counters builder and perf output returns the prioritycache key name
Previously, the PriorityCache perf counters builder was missing a necessary pointer, causing the perf counter output, ceph tell DAEMON_TYPE.DAEMON_ID perf dump and ceph tell DAEMON_TYPE.DAEMON_ID perf schema, to return an empty string instead of the prioritycache key. This missing key caused a failure in the collectd-ceph plugin.
With this fix, the missing pointer is added to the PriorityCache perf counters builder. The perf output returns the prioritycache key name.
Vulnerability with OpenStack 16.x Manila with Native CephFS and external Red Hat Ceph Storage 5
Previously, customers who were running OpenStack 16.x (with Manila) and external Red Hat Ceph Storage 4, who upgraded to Red Hat Ceph Storage 5.0, 5.0.x, 5.1, or 5.1.x, were potentially impacted by a vulnerability. The vulnerability allowed an OpenStack Manila user/tenant (owner of a Ceph File System share) to maliciously obtain access (read/write) to any Manila share backed by CephFS, or even the entire CephFS file system. The vulnerability is due to a bug in the "volumes" plugin in Ceph Manager. This plugin is responsible for managing Ceph File System subvolumes which are used by OpenStack Manila services as a way to provide shares to Manila users.
With this release, this vulnerability is fixed. Customers running OpenStack 16.x (with Manila providing native CephFS access) who upgraded to external Red Hat Ceph Storage 5.0, 5.0.x, 5.1, or 5.1.x should upgrade to Red Hat Ceph Storage 5.2. Customers who only provided access via NFS are not impacted.
6.5. The Ceph Volume utility
Missing backport is added and OSDs can be activated
Previously, OSDs could not be activated due to a regression caused by a missing backport.
With this fix, the missing backport is added and OSDs can be activated.
6.6. Ceph Object Gateway
Lifecycle policy for a versioned bucket no longer fails in between reshards
Previously, due to an internal logic error, lifecycle processing on a bucket would be disabled during bucket resharding causing the lifecycle policies for an affected bucket to not be processed.
With this fix, the bug has been rectified and the lifecycle policy for a versioned bucket no longer fails in between reshards.
Deleted objects are no longer listed in the bucket index
Previously, objects would be listed in the bucket index if the delete object operations did not complete normally, causing the objects that should have been deleted to still be listed.
With this release, the internal "dir_suggest" that finalizes incomplete transactions is fixed and deleted objects are no longer listed.
Zone group of the Ceph Object Gateway is sent as the awsRegion value
Previously, the value of awsRegion was not populated with the zonegroup in the event record.
With this fix, the zone group of the Ceph Object Gateway is sent as the awsRegion value.
Ceph Object Gateway deletes all notification topics when an empty list of topics is provided
Previously, in Ceph Object Gateway, notification topics were deleted accurately by name, but would not follow AWS behavior to delete all topics when given an empty topic name, causing a few customer bucket notification workflows to be unusable with Ceph Object Gateway.
With this fix, explicit handling for empty topic lists has changed and Ceph Object Gateway deletes all the notification topics when an empty list of topics is provided.
Crashes in bucket listing, bucket stats, and similar operations are not seen for indexless buckets
Previously, several operations, including general bucket listing, would incorrectly attempt to access index information from indexless buckets causing a crash.
With this fix, new checks for indexless buckets are added, thereby crashes in bucket listing, bucket stats, and similar operations are not seen.
Internal table index is prevented from becoming negative
Previously, an index into an internal table was allowed to become negative after a period of continuous operation, which caused the Ceph Object Gateway to crash.
With this fix, the index is prevented from becoming negative and the Ceph Object Gateway no longer crashes.
Usage of MD5 in a FIPS-enabled environment is explicitly allowed and S3 multipart operations can be completed
Previously, in a FIPS-enabled environment, the usage of MD5 digest was not allowed by default, unless explicitly excluded for non-cryptographic purposes. Due to this, a segfault occurred during the S3 complete multipart upload operation.
With this fix, the usage of MD5 for non-cryptographic purposes in a FIPS-enabled environment for S3 complete multipart PUT operations is explicitly allowed and the S3 multipart operations can be completed.
Result code 2002 of radosgw-admin commands is explicitly translated to 2
Previously, a change in the S3 error translation of the internal NoSuchBucket result inadvertently changed the error code from the radosgw-admin bucket stats command, causing the programs checking the shell result code of those radosgw-admin commands to see a different result code.
With this fix, the result code 2002 is explicitly translated to 2 and users can see the original behavior.
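The translation can be sketched as a small mapping step. Only the values 2002 and 2 come from the text above; the constant and function names are hypothetical, and pairing 2 with `errno.ENOENT` is an assumption made for readability.

```python
import errno

# Internal result code named in the text above (the constant name is hypothetical).
ERR_NO_SUCH_BUCKET = 2002


def translate_result_code(code: int) -> int:
    """Map the internal NoSuchBucket result back to 2; leave other codes as-is."""
    if code == ERR_NO_SUCH_BUCKET:
        return errno.ENOENT  # assumption: 2 corresponds to ENOENT
    return code
```

Scripts that compare the command's result code against 2 for a missing bucket therefore keep working as they did before the change.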
6.7. Multi-site Ceph Object Gateway
radosgw-admin bi purge command works on deleted buckets
Previously, the radosgw-admin bi purge command required a bucket entrypoint object, which does not exist for deleted buckets, causing bi purge to be unable to clean up after deleted buckets.
With this fix, bi purge accepts --bucket-id to avoid the need for a bucket entry point and the command works on deleted buckets.
Null pointer access no longer causes multi-site data sync crash
Previously, a null pointer access would crash the multisite data sync.
With this fix, null pointer check is successfully implemented, preventing any possible crashes.
(BZ#1967901)
Metadata sync no longer gets stuck when encountering errors
Previously, some errors in metadata sync would not retry, causing sync to get stuck when some errors occurred in a Ceph Object Gateway multi-site configuration.
With this fix, retry behaviour is corrected and metadata sync does not get stuck when errors are encountered.
(BZ#2068039)
Special handling is added for rgw_data_notify_interval_msec=0 parameter
Previously, rgw_data_notify_interval_msec had no special handling for 0, resulting in the primary site flooding the secondary site with notifications.
With this fix, special handling for rgw_data_notify_interval_msec=0 is added and async data notification can now be disabled.
6.8. RADOS
Ceph cluster issues a health warning if the require-osd-release flag is not set to the appropriate release after a cluster upgrade
Previously, the logic in the code that detects the require-osd-release flag mismatch after an upgrade was inadvertently removed during a code refactoring effort. Since the warning was not raised in the ceph -s output after an upgrade, any change made to the cluster without setting the flag to the appropriate release resulted in issues such as placement groups (PGs) stuck in certain states, excessive Ceph process memory consumption, and slow requests, among many others.
With this fix, the Ceph cluster issues a health warning if the require-osd-release flag is not set to the appropriate release after a cluster upgrade.
PGs no longer get incorrectly stuck in remapped+peering state in stretch mode
Previously, due to a logical error, when operating a cluster in stretch mode, it was possible for some placement groups (PGs) to get permanently stuck in remapped+peering state under certain cluster conditions, causing the data to be unavailable until the OSDs were taken offline.
With this fix, PGs choose stable OSD sets and they no longer get incorrectly stuck in remapped+peering state in stretch mode.
OSD deployment tool successfully deploys all the OSDs while making changes to the cluster
The KVMonitor paxos service manages the keys being added, removed, or modified when performing changes to the cluster. Previously, while adding new OSDs using the OSD deployment tool, the keys would be added without verifying whether the service could write to it. Due to this, an assertion failure would occur in the paxos code, causing the monitor to crash.
The latest fix ensures that the KVMonitor service is able to write prior to adding new OSDs; otherwise, the command is pushed back into the relevant queue to be retried at a later point. The OSD deployment tool successfully deploys all the OSDs without any issues.
Corrupted dups entries of a PG Log can be removed by off-line and on-line trimming
Previously, trimming of PG log dups entries could be prevented during the low-level PG split operation, which is used by the PG autoscaler with far higher frequency than by a human operator. Stalling the trimming of dups resulted in significant memory growth of PG log, leading to OSD crashes as it ran out of memory. Restarting an OSD did not solve the problem as the PG log is stored on disk and reloaded to RAM on startup.
With this fix, both off-line (using the ceph-objectstore-tool command) and on-line (within OSD) trimming are able to remove corrupted dups entries of a PG Log that jammed the on-line trimming machinery and were responsible for the memory growth. A debug improvement is implemented that prints the number of dups entries to the OSD’s log to help future investigations.
6.9. RBD Mirroring
last_copied_object_number value is properly updated for all images
Previously, due to an implementation defect, last_copied_object_number value was properly updated only for fully allocated images. This caused the last_copied_object_number value to be incorrect for any sparse image and the image replication progress to be lost in case of abrupt rbd-mirror daemon restart.
With this fix, last_copied_object_number value is properly updated for all images and upon rbd-mirror daemon restart, image replication resumes from where it had previously stopped.
Existing schedules take effect when an image is promoted to primary
Previously, due to an ill-considered optimization, existing schedules would not take effect following an image’s promotion to primary, so the snapshot-based mirroring process did not start for a recently promoted image.
With this release, the optimization causing this issue is removed and the existing schedules now take effect when an image is promoted to primary and the snapshot-based mirroring process starts as expected.
Snapshot-based mirroring process no longer gets cancelled
Previously, as a result of an internal race condition, the rbd mirror snapshot schedule add command would be cancelled out. The snapshot-based mirroring process for the affected image would not start if no other existing schedules were applicable.
With this release, the race condition is fixed and the snapshot-based mirroring process starts as expected.
Replay or resync is no longer attempted if the remote image is not primary
Previously, due to an implementation defect, replay or resync would be attempted even if the remote image was not primary, that is, there is nowhere to replay or resync from. This caused the snapshot-based mirroring to run into a livelock and to continuously report "failed to unlink local peer from remote image" error.
With this fix, the implementation defect is fixed and replay or resync is not attempted if the remote image is not primary, thereby no errors are reported.
Mirror snapshots that are in use by rbd-mirror daemon on the secondary cluster are not removed
Previously, as a result of an internal race condition, the mirror snapshot that was in use by the rbd-mirror daemon on the secondary cluster would be removed, causing the snapshot-based mirroring process for the affected image to stop, reporting a "split-brain" error.
With this fix, the mirror snapshot queue is extended in length and the mirror snapshot cleanup procedure is amended accordingly. Mirror snapshots that are in use by the rbd-mirror daemon on the secondary cluster are no longer removed and the snapshot-based mirroring process does not stop.
Logic no longer causes RBD mirror to crash if owner is locked during schedule_request_lock()
Previously, during schedule_request_lock(), for an already locked owner, the block device mirror would crash and image syncing would stop.
With this fix, if the owner is already locked, schedule_request_lock() is gracefully aborted and the block device mirroring does not crash.
Image replication no longer stops with incomplete local non-primary snapshot error
Previously, due to an implementation defect, upon an abrupt rbd-mirror daemon restart, image replication would stop with incomplete local non-primary snapshot error.
With this fix, image replication no longer stops with incomplete local non-primary snapshot error and works as expected.
6.10. The Ceph Ansible utility
Correct value is set for autotune_memory_target_ratio when migrating to cephadm
Previously, when migrating to cephadm, nothing would set a proper value for autotune_memory_target_ratio depending on the kind of deployment, HCI or non_HCI. Due to this, no ratio was set and there would be no difference between the two deployments.
With this fix, the cephadm-adopt playbook sets the right ratio depending on the kind of deployment and the right value is set for autotune_memory_target_ratio parameter.