Chapter 38. Restoring etcd quorum
If you lose etcd quorum, you can restore it.
- If you run etcd on a separate host, you must back up etcd, take down your etcd cluster, and form a new one. You can use one healthy etcd node to form a new cluster, but you must remove all other healthy nodes.
- If you run etcd as static pods on your master nodes, you stop the etcd pods, create a temporary cluster, and then restart the etcd pods.
During etcd quorum loss, applications that run on OpenShift Container Platform are unaffected. However, the platform functionality is limited to read-only operations. You cannot take actions such as scaling an application up or down, changing deployments, or running or modifying builds.
To confirm the loss of etcd quorum, run one of the following commands and confirm that the cluster is unhealthy:
If you use the etcd v2 API, run the following command:
If you use the v3 API, run the following command:
Note the member IDs and host names of the hosts. You use one of the nodes that can be reached to form a new cluster.
38.1. Restoring etcd quorum for separate services
38.1.1. Backing up etcd
When you back up etcd, you must back up both the etcd configuration files and the etcd data.
38.1.1.1. Backing up etcd configuration files
The etcd configuration files to be preserved are all stored in the /etc/etcd directory of the instances where etcd is running. This includes the etcd configuration file (/etc/etcd/etcd.conf) and the required certificates for cluster communication. All those files are generated at installation time by the Ansible installer.
Procedure
For each etcd member of the cluster, back up the etcd configuration.
$ ssh master-0
# mkdir -p /backup/etcd-config-$(date +%Y%m%d)/
# cp -R /etc/etcd/ /backup/etcd-config-$(date +%Y%m%d)/
Replace master-0 with the name of your etcd member.
The certificates and configuration files on each etcd cluster member are unique.
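The per-member backup above can be sketched as a small shell function. This is a minimal sketch, not part of the installer tooling: the function name backup_etcd_config and its two arguments are hypothetical. It computes the date stamp once, so the directory that is created and the directory that is copied into are the same even if the run crosses midnight.

```shell
#!/bin/sh
# Hypothetical helper: copy an etcd configuration directory into a
# date-stamped backup directory and print the directory that was created.
#   $1 = source directory (for example /etc/etcd)
#   $2 = backup root (for example /backup)
backup_etcd_config() {
    src=$1
    root=$2
    stamp=$(date +%Y%m%d)               # computed once, reused below
    dest="$root/etcd-config-$stamp"
    mkdir -p "$dest" || return 1
    cp -R "$src" "$dest/" || return 1   # keeps the source directory name
    printf '%s\n' "$dest"
}
```

On each etcd member, a call such as backup_etcd_config /etc/etcd /backup would then stand in for the two commands in the procedure.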
38.1.1.2. Backing up etcd data
Prerequisites
The OpenShift Container Platform installer creates two aliases so that you do not have to type all the flags: etcdctl2 for etcd v2 tasks and etcdctl3 for etcd v3 tasks.
However, the etcdctl3 alias does not provide the full endpoint list to the etcdctl command, so you must specify the --endpoints option and list all the endpoints.
Before backing up etcd:
- etcdctl binaries must be available or, in containerized installations, the rhel7/etcd container must be available.
- Ensure that the OpenShift Container Platform API service is running.
- Ensure connectivity with the etcd cluster (port 2379/tcp).
- Ensure that you have the proper certificates to connect to the etcd cluster.
Procedure
While the etcdctl backup command is used to perform the backup, etcd v3 has no concept of a backup. Instead, you either take a snapshot from a live member with the etcdctl snapshot save command or copy the member/snap/db file from an etcd data directory.
The etcdctl backup command rewrites some of the metadata contained in the backup, specifically the node ID and cluster ID, which means that in the backup, the node loses its former identity. To recreate a cluster from the backup, you create a new, single-node cluster, then add the rest of the nodes to the cluster. The metadata is rewritten to prevent the new node from joining an existing cluster.
Back up the etcd data:
Clusters upgraded from previous versions of OpenShift Container Platform might contain v2 data stores. Back up all etcd data stores.
Obtain the etcd endpoint IP address from the static pod manifest:
$ export ETCD_POD_MANIFEST="/etc/origin/node/pods/etcd.yaml"
$ export ETCD_EP=$(grep https ${ETCD_POD_MANIFEST} | cut -d '/' -f3)
Log in as an administrator:
$ oc login -u system:admin
Obtain the etcd pod name:
$ export ETCD_POD=$(oc get pods -n kube-system | grep -o -m 1 '^master-etcd\S*')
Change to the kube-system project:
$ oc project kube-system
Take a snapshot of the etcd data in the pod and store it locally:
You must write the snapshot to a directory under /var/lib/etcd/.
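To see what the endpoint extraction step in this procedure does, the following sketch runs the same grep and cut pipeline against a sample file. The manifest fragment is hypothetical and heavily shortened; a real /etc/origin/node/pods/etcd.yaml contains many more fields, and the pipeline assumes that exactly one line in the file contains https.

```shell
#!/bin/sh
# Hypothetical, shortened manifest fragment written to a temporary file.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
        command:
        - etcdctl
        - --endpoints
        - https://192.168.55.8:2379
EOF

# Same pipeline as in the procedure: keep the line that contains "https",
# split it on "/", and take the third field (the part after "https://").
ETCD_EP=$(grep https "$manifest" | cut -d '/' -f3)
echo "$ETCD_EP"
rm -f "$manifest"
```

With this sample the value includes the port as well as the IP address, because cut keeps everything between the second slash and the next one.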
38.1.2. Removing an etcd host
If an etcd host fails beyond restoration, remove it from the cluster. To recover from an etcd quorum loss, you must also remove all healthy etcd nodes but one from your cluster.
Steps to be performed on all master hosts
Procedure
Remove every other etcd host from the etcd cluster. Run the following command for each etcd node:
# etcdctl3 --endpoints=https://<surviving host IP>:2379 --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key member remove <failed member ID>
Remove the other etcd hosts from the /etc/origin/master/master-config.yaml master configuration file on every master:
Restart the master API and controller services on every master:
# master-restart api
# master-restart controllers
Steps to be performed in the current etcd cluster
Procedure
Remove the failed host from the cluster:
The remove command requires the etcd ID, not the hostname.
To ensure the etcd configuration does not use the failed host when the etcd service is restarted, modify the /etc/etcd/etcd.conf file on all remaining etcd hosts and remove the failed host in the value for the ETCD_INITIAL_CLUSTER variable:
# vi /etc/etcd/etcd.conf
For example:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380,master-2.example.com=https://192.168.55.13:2380
becomes:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380
Note: Restarting the etcd services is not required, because the failed host is removed using etcdctl.
Modify the Ansible inventory file to reflect the current status of the cluster and to avoid issues when re-running a playbook:
If you are using Flannel, modify the flanneld service configuration located at /etc/sysconfig/flanneld on every host and remove the etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379
Restart the flanneld service:
# systemctl restart flanneld.service
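The ETCD_INITIAL_CLUSTER edit in this section can also be expressed as a text transformation. This is a minimal sketch under stated assumptions: the function name remove_initial_cluster_member is hypothetical, it works on the value only and does not touch /etc/etcd/etcd.conf, and it matches a member by the name that precedes the first = sign.

```shell
#!/bin/sh
# Hypothetical helper: print an ETCD_INITIAL_CLUSTER value with one member removed.
#   $1 = current value, a comma-separated list of name=peerURL pairs
#   $2 = name of the member to drop
remove_initial_cluster_member() {
    printf '%s\n' "$1" | awk -v drop="$2" '
        BEGIN { FS = ","; OFS = "," }
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                # The member name is everything before the first "=".
                name = substr($i, 1, index($i, "=") - 1)
                if (name != drop)
                    out = (out == "" ? $i : out OFS $i)
            }
            print out
        }'
}

# Values taken from the example in this section.
current="master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380,master-2.example.com=https://192.168.55.13:2380"
remove_initial_cluster_member "$current" master-2.example.com
```

For the example values, the function prints the two-member list shown after "becomes" in the text. Review the result before writing it back to the configuration file by hand.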
38.1.3. Creating a single-node etcd cluster
To restore the full functionality of your OpenShift Container Platform instance, make a remaining etcd node a standalone etcd cluster.
Procedure
On the etcd node that you did not remove from the cluster, stop all etcd services by removing the etcd pod definition:
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/etcd.yaml /etc/origin/node/pods-stopped/
# systemctl stop atomic-openshift-node
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
Run the etcd service on the host, forcing a new cluster.
These commands create a custom file for the etcd service, which adds the --force-new-cluster option to the etcd start command:
List the etcd members and confirm that the member list contains only your single etcd host:
# etcdctl member list
165201190bf7f217: name=192.168.34.20 peerURLs=http://localhost:2380 clientURLs=https://master-0.example.com:2379 isLeader=true
After restoring the data and creating a new cluster, you must update the peerURLs parameter value to use the IP address where etcd listens for peer communication:
# etcdctl member update 165201190bf7f217 https://192.168.34.20:2380
165201190bf7f217 is the member ID shown in the output of the previous command, and https://192.168.34.20:2380 is its peer URL.
To verify, check that the IP is in the member list:
$ etcdctl2 member list
5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
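Because the member update and member remove commands take the member ID rather than the host name, it can help to pull the ID out of the member list output. The sketch below parses the v2-style output format shown in the examples in this section. The function name member_id_for is hypothetical, and the sample text is copied from the example output rather than produced by a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: read v2-style "member list" output on stdin and
# print the ID of the member whose name= field equals $1.
member_id_for() {
    awk -v want="name=$1" '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == want) {
                    id = $1
                    sub(/:$/, "", id)   # drop the trailing colon after the ID
                    print id
                }
            }
        }'
}

# Sample output taken from the example in this section.
sample='5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true'
printf '%s\n' "$sample" | member_id_for master-0.example.com
```

Against a live cluster, the same function could read the output of etcdctl2 member list through a pipe, assuming the output keeps this layout.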
38.1.4. Adding etcd nodes after restoring
After the first instance is running, you can add multiple etcd servers to your cluster.
Procedure
Get the etcd name for the instance in the ETCD_NAME variable:
# grep ETCD_NAME /etc/etcd/etcd.conf
Get the IP address where etcd listens for peer communication:
# grep ETCD_INITIAL_ADVERTISE_PEER_URLS /etc/etcd/etcd.conf
If the node was previously part of an etcd cluster, delete the previous etcd data:
# rm -Rf /var/lib/etcd/*
On the etcd host where etcd is properly running, add the new member:
# etcdctl3 member add <name> \
    --peer-urls="<advertise_peer_urls>"
The command outputs some variables. For example:
ETCD_NAME="master2"
ETCD_INITIAL_CLUSTER="master-0.example.com=https://192.168.55.8:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Add the values from the previous command to the /etc/etcd/etcd.conf file of the new host:
# vi /etc/etcd/etcd.conf
Start the etcd service in the node joining the cluster:
# systemctl start etcd.service
Check for error messages:
# master-logs etcd etcd
Once you add all the nodes, verify the cluster status and cluster health:
Add the remaining peers back into the cluster.
38.2. Restoring etcd quorum for static pods
If you lose etcd quorum on a cluster that uses static pods for etcd, take the following steps:
Procedure
Stop the etcd pod:
$ mv /etc/origin/node/pods/etcd.yaml .
Temporarily force a new cluster on the etcd host:
$ cp /etc/etcd/etcd.conf etcd.conf.bak
$ echo "ETCD_FORCE_NEW_CLUSTER=true" >> /etc/etcd/etcd.conf
Restart the etcd pod:
$ mv etcd.yaml /etc/origin/node/pods/.
Stop the etcd pod and remove the ETCD_FORCE_NEW_CLUSTER setting by restoring the original configuration file:
$ mv /etc/origin/node/pods/etcd.yaml .
$ rm /etc/etcd/etcd.conf
$ mv etcd.conf.bak /etc/etcd/etcd.conf
Restart the etcd pod:
$ mv etcd.yaml /etc/origin/node/pods/.
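The configuration half of this procedure (back up etcd.conf, append the setting, later restore the backup) can be sketched as two shell functions. This is a minimal sketch, not the documented method: the function names are hypothetical, the backup is written next to the configuration file instead of in the current directory, and moving the pod manifest in and out of /etc/origin/node/pods/ is left to the steps above.

```shell
#!/bin/sh
# Hypothetical helper: back up an etcd.conf and append the temporary setting.
#   $1 = path to the configuration file (for example /etc/etcd/etcd.conf)
enable_force_new_cluster() {
    cp -p "$1" "$1.bak" || return 1
    echo "ETCD_FORCE_NEW_CLUSTER=true" >> "$1"
}

# Hypothetical helper: put the backed-up configuration file back in place,
# which removes the temporary setting again.
restore_etcd_conf() {
    mv "$1.bak" "$1"
}
```

Keeping the backup beside the original file avoids depending on the directory the commands are run from. Restoring from the backup also discards any other edits made to the file in between.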