9.3.3. Total Cluster Failure
standalone
: not part of an HA cluster.
joining
: newly started backup, not yet joined to the cluster.
catch-up
: backup has connected to the primary and is downloading queues, messages etc.
ready
: backup is connected and actively replicating from the primary; it is ready to take over.
recovering
: newly promoted to primary, waiting for backups to catch up before serving clients. Only a single primary broker can be recovering at a time.
active
: serving clients. Only a single primary broker can be active at a time.
All brokers are in joining or catch-up mode. rgmanager tries to promote a new primary but cannot find any candidates and so gives up. clustat will show that the qpidd services are running but the qpidd-primary service has stopped, something like this:
Service Name | Owner (Last) | State |
---|---|---|
service:mrg33-qpidd-service | 20.0.10.33 | started |
service:mrg34-qpidd-service | 20.0.10.34 | started |
service:mrg35-qpidd-service | 20.0.10.35 | started |
service:qpidd-primary-service | (20.0.10.33) | stopped |
The state of each broker can be checked with qpid-ha status --all.
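The symptom described above (qpidd services started, qpidd-primary service stopped) can be checked from a script. The following is a minimal sketch, not part of the product: `primary_state` is a hypothetical helper, and it assumes clustat prints one row per service with the state in the last column, as in the sample table. The sample rows are copied from that table so the snippet runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: reads clustat-style output on stdin and prints the
# state (last column) of the named service. The default service name is an
# assumption taken from the sample table; pass another name as $1 if needed.
primary_state() {
    awk -v svc="service:${1:-qpidd-primary-service}" '$1 == svc { print $NF }'
}

# Sample rows copied from the table above.
# On a live cluster, use instead:  clustat | primary_state
sample='service:mrg33-qpidd-service    20.0.10.33      started
service:mrg34-qpidd-service    20.0.10.34      started
service:mrg35-qpidd-service    20.0.10.35      started
service:qpidd-primary-service  (20.0.10.33)    stopped'

state=$(printf '%s\n' "$sample" | primary_state)
if [ "$state" = "stopped" ]; then
    echo "primary service is stopped: possible total cluster failure"
fi
```

A stopped primary service alone does not prove a total cluster failure, so the broker states should still be confirmed with qpid-ha status --all.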
- In luci:<your-cluster>:Nodes click reboot to restart the entire cluster.
- or stop and restart the cluster with ccs --stopall; ccs --startall
- In luci:<your-cluster>:Service Groups:
  - select all the qpidd (not primary) services, click restart.
  - select the qpidd-primary service, click restart.
- or stop the primary and qpidd services with clusvcadm, then restart them (primary last).
A new primary is promoted and the cluster is functional. All non-persistent data from before the failure is lost.
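The clusvcadm alternative for restarting the services can be sketched as follows. This is an illustration under stated assumptions, not a verified procedure: the service names come from the sample clustat table, the `run` and `restart_ha_services` helpers are hypothetical, and stopping the primary service first is an assumption (the text only requires that the primary is restarted last). It uses `clusvcadm -d` to disable a service and `clusvcadm -e` to enable it. With the default DRY_RUN=1 the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: stop all HA services, then restart them with the primary last.
# Service names are assumptions taken from the sample clustat table.
DRY_RUN="${DRY_RUN:-1}"
QPIDD_SERVICES="mrg33-qpidd-service mrg34-qpidd-service mrg35-qpidd-service"
PRIMARY_SERVICE="qpidd-primary-service"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_ha_services() {
    # Stop the primary service, then each qpidd service.
    run clusvcadm -d "$PRIMARY_SERVICE"
    for svc in $QPIDD_SERVICES; do
        run clusvcadm -d "$svc"
    done
    # Start the qpidd services again, and the primary service last.
    for svc in $QPIDD_SERVICES; do
        run clusvcadm -e "$svc"
    done
    run clusvcadm -e "$PRIMARY_SERVICE"
}

restart_ha_services
```

Run with DRY_RUN=0 on a cluster node to execute the commands. Where each service starts depends on the failover domains configured for the cluster.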