Chapter 6. Troubleshooting a multi-site Ceph Object Gateway
This chapter contains information on how to fix the most common errors related to multi-site Ceph Object Gateway configuration and operational conditions.
When the radosgw-admin bucket sync status command reports that the bucket is behind on shards even though the data is consistent across the sites, run additional writes to the bucket. The writes refresh the sync status, and the command then reports that the bucket is caught up with source.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph Object Gateway.
6.1. Error code definitions for the Ceph Object Gateway
The Ceph Object Gateway logs contain error and warning messages to assist in troubleshooting conditions in your environment. Some common ones are listed below with suggested resolutions.
Common error messages
data_sync: ERROR: a sync operation returned error
- This is the high-level data sync process complaining that a lower-level bucket sync process returned an error. This message is redundant; the bucket sync error appears above it in the log.
data sync: ERROR: failed to sync object: BUCKET_NAME:OBJECT_NAME
- Either the process failed to fetch the required object over HTTP from a remote gateway, or the process failed to write that object to RADOS. The operation will be tried again.
data sync: ERROR: failure in sync, backing out (sync_status=2)
- A low-level message reflecting one of the above conditions, specifically that the data was deleted before it could sync and thus showing a -2 ENOENT status.
data sync: ERROR: failure in sync, backing out (sync_status=-5)
- A low-level message reflecting one of the above conditions, specifically that the process failed to write that object to RADOS and thus showing a -5 EIO status.
ERROR: failed to fetch remote data log info: ret=11
- This is the EAGAIN generic error code from libcurl reflecting an error condition from another gateway. It will try again by default.
meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
- The shard of the mdlog was never created so there is nothing to sync.
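The numeric codes in the messages in this section are standard errno values. As a minimal sketch, assuming Linux errno numbering, the following Python snippet extracts the code from a log line and maps it to its symbolic name. The errno_name helper and its lookup table are illustrative only and are not part of Ceph.

```python
import re

# Linux errno values that appear in the gateway messages in this section.
# This table is an assumption for illustration; it is not shipped with Ceph.
ERRNO_NAMES = {
    2: "ENOENT",      # No such file or directory
    5: "EIO",         # Input/output error
    11: "EAGAIN",     # Resource temporarily unavailable
    13: "EACCES",     # Permission denied
    125: "ECANCELED", # Operation canceled
}

# Matches "sync_status=-5", "ret=11", or a parenthesized code such as "(125)".
_CODE = re.compile(r"(?:sync_status|ret)=(-?\d+)|\((\d+)\)")


def errno_name(log_line):
    """Return the errno name for the first code found in log_line, or None."""
    match = _CODE.search(log_line)
    if match is None:
        return None
    # Codes are logged with or without a leading minus sign; compare by magnitude.
    code = abs(int(match.group(1) or match.group(2)))
    return ERRNO_NAMES.get(code)


print(errno_name("ERROR: failed to fetch remote data log info: ret=11"))
```

A line that carries no numeric code, such as "failed to sync object", yields None, so the helper can be applied to every log line without filtering first.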
Syncing error messages
failed to sync object
- Either the process failed to fetch this object over HTTP from a remote gateway, or it failed to write that object to RADOS. The operation will be tried again.
failed to sync bucket instance: (11) Resource temporarily unavailable
- A connection issue between primary and secondary zones.
failed to sync bucket instance: (125) Operation canceled
- A race condition exists between writes to the same RADOS object.
ERROR: request failed: (13) Permission denied If the realm has been changed on the master zone, the master zone’s gateway may need to be restarted to recognize this user
- While configuring the secondary site, a rgw realm pull --url http://primary_endpoint --access-key <> --secret <> command sometimes fails with a permission denied error. In such cases, run the following commands on the primary site to verify that the system user credentials are the same:
radosgw-admin user info --uid SYNCHRONIZATION_USER
radosgw-admin zone get
Additional Resources
- Contact Red Hat Support for any additional assistance.
6.2. Syncing a multi-site Ceph Object Gateway
A multi-site sync reads the change log from other zones. To get a high-level view of the sync progress from the metadata and the data logs, you can use the following command:
Example
[ceph: root@host01 /]# radosgw-admin sync status
This command lists which log shards, if any, are behind their source zone.
Sometimes you might observe recovering shards when running the radosgw-admin sync status command. For data sync, there are 128 shards of replication logs that are each processed independently. If an action triggered by these replication log events results in an error from the network, storage, or elsewhere, the error is tracked so that the operation can be retried later. While a given shard has errors that need a retry, the radosgw-admin sync status command reports that shard as recovering. This recovery happens automatically, so the operator does not need to intervene.
Buckets within a multi-site object can also be monitored on the Ceph dashboard. For more information, see Monitoring buckets of a multi-site object within the Red Hat Ceph Storage Dashboard Guide.
If the sync status output reports that log shards are behind, run the following command, substituting the shard ID for X:
Syntax
radosgw-admin data sync status --shard-id=X --source-zone=ZONE_NAME
Example
[ceph: root@host01 /]# radosgw-admin data sync status --shard-id=27 --source-zone=us-east
{
  "shard_id": 27,
  "marker": {
    "status": "incremental-sync",
    "marker": "1_1534494893.816775_131867195.1",
    "next_step_marker": "",
    "total_entries": 1,
    "pos": 0,
    "timestamp": "0.000000"
  },
  "pending_buckets": [],
  "recovering_buckets": [
    "pro-registry:4ed07bb2-a80b-4c69-aa15-fdc17ae6f5f2.314303.1:26"
  ]
}
The output lists which buckets are next to sync and which buckets, if any, are going to be retried due to previous errors.
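When many shards need checking, the JSON output can be processed by a script. The following Python sketch parses the example output shown above and lists the pending and recovering buckets. It assumes that each entry starts with the bucket name followed by a colon, as the example suggests; the buckets_to_watch function is a hypothetical helper, not a Ceph API.

```python
import json

# Output of "radosgw-admin data sync status", copied from the example above.
SAMPLE_OUTPUT = """
{
  "shard_id": 27,
  "marker": {
    "status": "incremental-sync",
    "marker": "1_1534494893.816775_131867195.1",
    "next_step_marker": "",
    "total_entries": 1,
    "pos": 0,
    "timestamp": "0.000000"
  },
  "pending_buckets": [],
  "recovering_buckets": [
    "pro-registry:4ed07bb2-a80b-4c69-aa15-fdc17ae6f5f2.314303.1:26"
  ]
}
"""


def buckets_to_watch(raw_json):
    """Return bucket names that are next to sync or awaiting a retry."""
    status = json.loads(raw_json)

    def names(key):
        # Assumption: the bucket name is the text before the first colon.
        return [entry.split(":")[0] for entry in status.get(key, [])]

    return {
        "pending": names("pending_buckets"),
        "recovering": names("recovering_buckets"),
    }


print(buckets_to_watch(SAMPLE_OUTPUT))
```

In practice the raw_json argument would come from the captured standard output of the command rather than from an embedded sample.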
Inspect the status of individual buckets with the following command:
Syntax
radosgw-admin bucket sync status --bucket=X
Replace X with the ID number of the bucket.
The result shows which bucket index log shards are behind their source zone.
A common error in sync is EBUSY, which means the sync is already in progress, often on another gateway. Errors are written to the sync error log, which you can read with the following command:
radosgw-admin sync error list
The syncing process retries until it is successful. Some errors can still require intervention.
6.3. Performance counters for multi-site Ceph Object Gateway data sync
The following performance counters are available for multi-site configurations of the Ceph Object Gateway to measure data sync:
- poll_latency measures the latency of requests for remote replication logs.
- fetch_bytes measures the number of objects and bytes fetched by data sync.
Use the ceph --admin-daemon command to view the current metric data for the performance counters:
Syntax
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.RGW_ID.asok perf dump data-sync-from-ZONE_NAME
Example
[ceph: root@host01 /]# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.host02-rgw0.103.94309060818504.asok perf dump data-sync-from-us-west
{
  "data-sync-from-us-west": {
    "fetch bytes": {
      "avgcount": 54,
      "sum": 54526039885
    },
    "fetch not modified": 7,
    "fetch errors": 0,
    "poll latency": {
      "avgcount": 41,
      "sum": 2.533653367,
      "avgtime": 0.061796423
    },
    "poll errors": 0
  }
}
You must run the ceph --admin-daemon command from the node running the daemon.
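The sum and avgcount fields of a counter can be combined into an average. The following Python sketch does this for the example perf dump output shown above. It assumes that sum divided by avgcount gives the mean value per sample, which agrees with the avgtime field reported for poll latency; the summarize function is a hypothetical helper.

```python
import json

# Output of "perf dump data-sync-from-us-west", copied from the example above.
SAMPLE_PERF_DUMP = """
{
  "data-sync-from-us-west": {
    "fetch bytes": {"avgcount": 54, "sum": 54526039885},
    "fetch not modified": 7,
    "fetch errors": 0,
    "poll latency": {"avgcount": 41, "sum": 2.533653367, "avgtime": 0.061796423},
    "poll errors": 0
  }
}
"""


def _mean(counter):
    """Mean value per sample, or 0.0 when no samples were recorded."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0


def summarize(raw_json, section):
    """Reduce one data sync counter section to a few headline numbers."""
    counters = json.loads(raw_json)[section]
    return {
        "avg_bytes_per_fetch": _mean(counters["fetch bytes"]),
        "avg_poll_latency": _mean(counters["poll latency"]),
        "fetch_errors": counters["fetch errors"],
        "poll_errors": counters["poll errors"],
    }


summary = summarize(SAMPLE_PERF_DUMP, "data-sync-from-us-west")
print(summary)
```

Nonzero fetch_errors or poll_errors values, or a rising average poll latency between two dumps, are the values most worth comparing over time.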
Additional Resources
- See the Ceph performance counters chapter in the Red Hat Ceph Storage Administration Guide for more information about performance counters.
6.4. Synchronizing data in a multi-site Ceph Object Gateway configuration
In a multi-site Ceph Object Gateway configuration of a storage cluster, failover and failback cause data synchronization to stop. The radosgw-admin sync status command reports that the data sync is behind for an extended period of time.
You can run the radosgw-admin data sync init command to synchronize data between the sites and then restart the Ceph Object Gateway. This command does not touch any actual object data and initiates data sync for a specified source zone. It causes the zone to restart a full sync from the source zone.
Contact Red Hat support before running the data sync init command.
A full restart of sync consumes significant bandwidth when a large amount of data on the source zone needs to be synced, so plan accordingly.
If a user accidentally deletes a bucket on the secondary site, you can use the metadata sync init command on that site to synchronize data.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Ceph Object Gateway configured at a minimum of two sites.
Procedure
Check the sync status between the sites:
Example
[ceph: host04 /]# radosgw-admin sync status
          realm d713eec8-6ec4-4f71-9eaf-379be18e551b (india)
      zonegroup ccf9e0b2-df95-4e0a-8933-3b17b64c52b7 (shared)
           zone 04daab24-5bbd-4c17-9cf5-b1981fd7ff79 (primary)
   current time 2022-09-15T06:53:52Z
zonegroup features enabled: resharding
  metadata sync no sync (zone is master)
      data sync source: 596319d2-4ffe-4977-ace1-8dd1790db9fb (secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
Synchronize data from the secondary zone:
Example
[ceph: root@host04 /]# radosgw-admin data sync init --source-zone primary
Restart all the Ceph Object Gateway daemons at the site:
Example
[ceph: root@host04 /]# ceph orch restart rgw.myrgw
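The status check in the first step of this procedure can also be scripted, for example to confirm that sync has caught up after the restart. The following Python sketch extracts the shard counts from the example radosgw-admin sync status output. It assumes the text layout shown in that example and a single source zone; with several source zones, later entries would overwrite earlier ones. The data_sync_summary function is a hypothetical helper.

```python
import re

# Relevant part of the "radosgw-admin sync status" output from the example.
SAMPLE_STATUS = """
  metadata sync no sync (zone is master)
      data sync source: 596319d2-4ffe-4977-ace1-8dd1790db9fb (secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
"""

# Matches lines such as "full sync: 0/128 shards".
_SHARDS = re.compile(r"(full|incremental) sync: (\d+)/(\d+) shards")


def data_sync_summary(text):
    """Return shard counts per sync mode and whether data sync is caught up."""
    shards = {
        mode: (int(current), int(total))
        for mode, current, total in _SHARDS.findall(text)
    }
    return {
        "shards": shards,
        "caught_up": "data is caught up with source" in text,
    }


print(data_sync_summary(SAMPLE_STATUS))
```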
6.5. Troubleshooting radosgw-admin commands after upgrading a cluster
Troubleshoot using radosgw-admin commands inside the cephadm shell after upgrading a cluster.
The following is an example of errors that could be emitted after trying to run radosgw-admin commands inside the cephadm shell after upgrading a cluster.
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 ERROR: failed to decode obj from .rgw.root:periods.91d2a42c-735b-492a-bcf3-05235ce888aa.3
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 failed reading current period info: (5) Input/output error
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 ERROR: failed to start notify service ((5) Input/output error
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 ERROR: failed to init services (ret=(5) Input/output error)
couldn't init storage provider
Example
[ceph: root@host01 /]# date;radosgw-admin bucket list
Mon May 13 09:05:30 UTC 2024
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 ERROR: failed to decode obj from .rgw.root:periods.91d2a42c-735b-492a-bcf3-05235ce888aa.3
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 failed reading current period info: (5) Input/output error
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 ERROR: failed to start notify service ((5) Input/output error
2024-05-13T09:05:30.607+0000 7f4e7c4ea500 0 ERROR: failed to init services (ret=(5) Input/output error)
couldn't init storage provider
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the nodes.
Procedure
Repair the connection by running the command again with the -- radosgw-admin syntax.
Syntax
cephadm shell -- radosgw-admin COMMAND
Example
[root@host01 /]# cephadm shell -- radosgw-admin bucket list
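Scripts that call radosgw-admin can apply the same form. The following Python sketch only builds the argument list in the form shown in the example above and does not run anything; the cephadm_radosgw_admin function is a hypothetical helper.

```python
import shlex


def cephadm_radosgw_admin(*args):
    """Build the argument list for running radosgw-admin through cephadm shell."""
    # The "--" separator marks the end of cephadm options, so everything
    # after it is passed to the shell container as the command to run.
    return ["cephadm", "shell", "--", "radosgw-admin", *args]


cmd = cephadm_radosgw_admin("bucket", "list")
print(shlex.join(cmd))
```

On a host where cephadm is installed, the resulting list can be passed to subprocess.run(cmd, check=True) with root-level access.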