15.26. Troubleshooting Replication-Related Problems
This section lists some error messages, explains possible causes, and offers remedies.
It is possible to get more debugging information for replication by setting the error log level to 8192, which is replication debugging. See Section 21.3.7, “Configuring the Log Levels”.
To change the error log level to 8192:
# dsconf -D "cn=Directory Manager" ldap://server.example.com config replace nsslapd-errorlog-level=8192
Log levels are additive, and replication debugging writes a large number of messages to the error log, so enable it only while troubleshooting.
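Since log levels are additive, several levels can be combined into one value for nsslapd-errorlog-level. The following is a minimal Python sketch of that arithmetic. The value 8192 comes from this section; 16384 as the default error log level is an assumption here and should be checked against Section 21.3.7 for your version. The function names are illustrative only.

```python
# Sketch: error log levels treated as bit flags that can be summed.
REPLICATION_DEBUG = 8192   # replication debugging, from this section
DEFAULT_LEVEL = 16384      # assumption: default level; verify in Section 21.3.7


def combine_levels(*levels: int) -> int:
    """Return one value that enables every level passed in."""
    combined = 0
    for level in levels:
        combined |= level  # OR avoids double counting a repeated level
    return combined


def has_level(value: int, level: int) -> bool:
    """Report whether a combined value includes the given level."""
    return value & level == level


if __name__ == "__main__":
    print(combine_levels(DEFAULT_LEVEL, REPLICATION_DEBUG))  # 24576
```

Under these assumptions, writing 24576 keeps the default messages and adds replication debugging, while writing 8192 alone replaces whatever level was set before.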
15.26.1. Possible Replication-related Error Messages
The following sections describe many common replication problems.
agmt=%s (%s:%d) Replica has a different generation ID than the local data
- Reason: The consumer specified at the beginning of this message has not been (successfully) initialized yet, or it was initialized from a different root supplier.
- Impact: The local supplier will not replicate any data to the consumer.
- Remedy: Ignore this message if it occurs before the consumer is initialized. Otherwise, reinitialize the consumer if the message is persistent. In a multi-supplier environment, all the servers should be initialized only once from a root supplier, directly or indirectly. For example, M1 initializes M2 and M4, M2 then initializes M3, and so on. The important thing to note is that M2 must not start initializing M3 until M2's own initialization is done (check the total update status in M1's web console or in the error log of M1 or M2). Also, M2 should not initialize M1 back.
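The initialization rules in the remedy above can be checked on paper before running any total update. The following Python sketch is one way to do that. It assumes the plan is an ordered list of (source, target) steps and that each step finishes before the next one starts. The function name and the server names M1 through M4 are illustrative and are not part of Directory Server.

```python
def check_init_plan(root, steps):
    """Check an ordered list of (source, target) initialization steps.

    Returns a list of problems; an empty list means every server is
    initialized once, from a server that already holds the root's data.
    """
    initialized = {root}  # the root supplier already holds the data
    problems = []
    for source, target in steps:
        if source not in initialized:
            problems.append(
                f"{source} initializes {target} before {source} is initialized"
            )
        if target in initialized:
            problems.append(
                f"{target} already holds the data and must not be "
                f"initialized again (by {source})"
            )
        initialized.add(target)
    return problems


if __name__ == "__main__":
    plan = [("M1", "M2"), ("M1", "M4"), ("M2", "M3")]
    print(check_init_plan("M1", plan))  # []
```

A plan such as M1 to M2 followed by M2 to M1 is reported as a problem, which matches the rule that M2 should not initialize M1 back.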
Warning: data for replica's was reloaded, and it no longer matches the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers, in which case the consumers should be reinitialized.
- Reason: This message may appear only when a supplier is restarted. It indicates that the supplier was unable to write the changelog or did not flush out its RUV at its last shutdown. The former is usually because of a disk-space problem, and the latter because a server crashed or was ungracefully shut down.
- Impact: The server will not be able to send the changes to a consumer if the consumer's maxcsn no longer exists in the server's changelog.
- Remedy: Check the disk space and the possible core file (under the server's logs directory). If this is a single-supplier replication, reinitialize the consumers. Otherwise, if the server later complains that it cannot locate some CSN for a consumer, see if the consumer can get the CSN from other suppliers. If not, reinitialize the consumer.
agmt=%s(%s:%d): Can't locate CSN %s in the changelog (DB rc=%d). The consumer may need to be reinitialized.
- Reason: Most likely the changelog was recreated because the disk was full or the server was shut down ungracefully.
- Impact: The local server will not be able to send any more changes to that consumer until the consumer is reinitialized or gets the CSN from other suppliers.
- Remedy: If this is a single-supplier replication, reinitialize the consumers. Otherwise, see if the consumer can get the CSN from other suppliers. If not, reinitialize the consumer.
Too much time skew
- Reason: The system clocks on the host machines are extremely out of sync.
- Impact: The system clock is used to generate a part of the CSN. In order to reflect the change sequence among multiple suppliers, suppliers would forward-adjust their local clocks based on the remote clocks of the other suppliers. Because the adjustment is limited to a certain amount, any difference that exceeds the permitted limit will cause the replication session to be aborted.
- Remedy: Synchronize the system clocks on the Directory Server host machines. If applicable, run the network time protocol (ntp) daemon on those hosts.
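Because part of each CSN is generated from the system clock, decoding that part shows how far a remote supplier's clock runs ahead of the local one. The following Python sketch rests on an assumption that is not stated in this section: that a CSN is a 20-character hexadecimal string whose first 8 digits are the time in seconds since the epoch, followed by 4 digits each for the sequence number, replica ID, and sub-sequence number. The function names and sample CSNs are illustrative.

```python
import time


def csn_timestamp(csn):
    """Decode the time component (seconds since the epoch) of a CSN.

    Assumed layout: 8 hex digits of time, then 4 hex digits each for
    the sequence number, replica ID, and sub-sequence number.
    """
    if len(csn) != 20:
        raise ValueError(f"expected a 20-character CSN, got {csn!r}")
    return int(csn[:8], 16)


def seconds_ahead(remote_csn, local_now=None):
    """Return how far the CSN time is ahead of the local clock.

    A negative result means the CSN is older than the local time.
    """
    if local_now is None:
        local_now = time.time()
    return csn_timestamp(remote_csn) - int(local_now)


if __name__ == "__main__":
    # Hypothetical CSN taken from a remote supplier's RUV.
    print(seconds_ahead("5f3a2b1c000100020000"))
```

A large positive result for a recently generated CSN suggests that the remote clock is ahead of the local clock. A negative result is normal for older changes and says nothing about skew.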
agmt=%s(%s:%d): Warning: Unable to send endReplication extended operation (%s)
- Reason: The consumer is not responding.
- Impact: If the consumer recovers without being restarted, there is a chance that the replica on the consumer will be locked forever if it did not receive the release lock message from the supplier.
- Remedy: Check whether the consumer can receive new changes from any of its suppliers, or start the replication monitor and see whether all the suppliers of this consumer warn that the replica is busy. If the replica appears to be locked forever and no supplier can get in, restart the consumer.
Changelog is getting too big.
- Reason: Either changelog purge is turned off, which is the default setting, or changelog purge is turned on, but some consumers are far behind the supplier.
- Remedy: By default, changelog purge is turned off. To turn it on from the command line, run ldapmodify as follows:
ldapmodify -D "cn=Directory Manager" -W -p 389 -h server.example.com -x
dn: cn=changelog5,cn=config
changetype: modify
add: nsslapd-changelogmaxage
nsslapd-changelogmaxage: 1d
1d means 1 day. Other valid time units are s for seconds, m for minutes, h for hours, and w for weeks. A value of 0 turns off the purge.
With changelog purge turned on, a purge thread that wakes up every five minutes will remove a change if its age is greater than the value of nsslapd-changelogmaxage and if it has been replayed to all the direct consumers of this supplier (supplier or hub).
If it appears that the changelog is not purged when the purge threshold is reached, check the maximum time lag from the replication monitor among all the consumers. Irrespective of what the purge threshold is, no change will be purged before it is replayed by all the consumers.
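The purge rule described above combines two conditions: the change is older than nsslapd-changelogmaxage, and it has been replayed to all direct consumers. The following Python sketch models only that logic and is not the server's implementation. The function names are hypothetical, and it assumes a value is either 0 or a whole number followed by one of the units listed above.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def max_age_seconds(value):
    """Convert a value such as '1d' to seconds; 0 means purge is off."""
    value = value.strip()
    if value == "0":
        return 0
    if not value:
        raise ValueError("empty max age value")
    number, unit = value[:-1], value[-1]
    if unit not in UNIT_SECONDS or not number.isdigit():
        raise ValueError(f"unsupported max age value: {value!r}")
    return int(number) * UNIT_SECONDS[unit]


def can_purge(change_age, max_age, replayed_to_all_consumers):
    """Apply both purge conditions; ages are in seconds."""
    if max_age == 0:  # purge is turned off
        return False
    return change_age > max_age and replayed_to_all_consumers


if __name__ == "__main__":
    limit = max_age_seconds("1d")
    print(limit)                           # 86400
    print(can_purge(90000, limit, False))  # False: a consumer is behind
```

The second call shows why the changelog can keep growing after the threshold is reached: a change older than one day is still kept while any direct consumer has not received it.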
The Replication Monitor is not responding.
- Reason: The LDAPS port is specified in some replication agreement, but the certificate database is not specified or not accessible by the Replication Monitor. If there is no LDAPS port problem, one of the servers in the replication topology might hang.
- Remedy: Map the TLS port to a non-TLS port in the configuration file of the Replication Monitor. For example, if 636 is the TLS port and 389 is the non-TLS port, add the following line in the [connection] section:
*:636=389:*:password
In the Replication Monitor, some consumers show just the header of the table.
- Reason: No change has originated from the corresponding suppliers. In this case, the MaxCSN: in the header part should be "None".
- Remedy: Nothing is wrong if no change has originated from a supplier.