Chapter 2. Replacing a failed master host
This document describes the process to replace a single etcd member. This procedure assumes that there is still an etcd quorum in the cluster.
If you have lost the majority of your master hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure to recover from lost master hosts instead of this procedure.
If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to recover from expired control plane certificates instead of this procedure.
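Before you proceed, you can confirm from any healthy master host that quorum is intact. The following is a minimal sketch that reuses the crictl and etcdctl invocations shown in the verification steps later in this chapter; the host prompt is illustrative:
[core@healthy-master ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')
sh-4.3# etcdctl endpoint health --cluster
If a majority of the members report healthy, quorum is intact and you can continue with this procedure.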
To replace a single master host:
- Remove the member from the etcd cluster.
- If the etcd certificates for the master host are valid, then add the member back to the etcd cluster.
- If there are no etcd certificates for the master host or they are no longer valid, then generate etcd certificates and add the member to the etcd cluster.
2.1. Removing a failed master host from the etcd cluster
Follow these steps to remove a failed master host from the etcd cluster.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have SSH access to an active master host.
Procedure
View the list of Pods associated with etcd.
In a terminal that has access to the cluster, run the following command:
$ oc get pods -n openshift-etcd
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h
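If it is not obvious which member failed, look for a pod that is not in the Running state or that shows a high restart count. As a quick filter, this sketch prints only pods that are not Running (standard oc and grep usage; it prints nothing when all members are up):
$ oc get pods -n openshift-etcd --no-headers | grep -v Running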
- Access an active master host.
Run the etcd-member-remove.sh script and pass in the name of the etcd member to remove:
[core@ip-10-0-128-73 ~]$ sudo -E /usr/local/bin/etcd-member-remove.sh etcd-member-ip-10-0-147-172.us-east-2.compute.internal
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd client certs already backed up and available ./assets/backup/
Member 23e4736df4451b32 removed from cluster 6e25bab1bb556673
etcd member etcd-member-ip-10-0-147-172.us-east-2.compute.internal with 23e4736df4451b32 successfully removed..
Verify that the etcd member has been successfully removed from the cluster:
Connect to the running etcd container:
[core@ip-10-0-128-73 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
In the etcd container, export the variables needed for connecting to etcd:
sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')
In the etcd container, execute etcdctl member list and verify that the removed member is no longer listed:
sh-4.3# etcdctl member list -w table
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                          NAME                          |                        PEER ADDRS                        |        CLIENT ADDRS       |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
| 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal  | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379  |
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.171.108:2379 |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
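To check the same thing non-interactively, you can grep the member list for the removed name. A small sketch, using the member name from this example:
sh-4.3# etcdctl member list | grep etcd-member-ip-10-0-147-172.us-east-2.compute.internal || echo "member removed"
The message prints only when no entry matches, which confirms the removal.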
2.2. Adding the member back to the cluster
After you have removed the member from the etcd cluster, use one of the following procedures to add the member to the cluster:
- If the etcd certificates for the master host are valid, then add the member back to the etcd cluster.
- If there are no etcd certificates for the master host or they are no longer valid, then generate etcd certificates and add the member to the etcd cluster.
2.2.1. Adding a master host back to the etcd cluster
Follow these steps to add a master host back to the etcd cluster. This procedure assumes that you previously removed the master host from the cluster and that its etcd dependencies, such as TLS certificates and DNS, are valid.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have SSH access to the master host to add to the etcd cluster.
- You have the IP address of an existing active etcd member.
Procedure
Access the master host to add to the etcd cluster.
Important: You must run this procedure on the master host that is being added to the etcd cluster.
Run the etcd-member-add.sh script and pass in two parameters:
- the IP address of an existing etcd member
- the name of the etcd member to add
[core@ip-10-0-147-172 ~]$ sudo -E /usr/local/bin/etcd-member-add.sh \
    10.0.128.73 \
    etcd-member-ip-10-0-147-172.us-east-2.compute.internal
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
etcd.conf backup already exists ./assets/backup/etcd.conf
Stopping etcd..
Waiting for etcd-member to stop
etcd data-dir backup found ./assets/backup/etcd..
Updating etcd membership..
Removing etcd data_dir /var/lib/etcd..
ETCD_NAME="etcd-member-ip-10-0-147-172.us-east-2.compute.internal"
ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-147-172.us-east-2.compute.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-171-108.us-east-2.compute.internal=https://etcd-2.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-128-73.us-east-2.compute.internal=https://etcd-0.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Member 1e42c7070decd39 added to cluster 6e25bab1bb556673
Starting etcd..
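Before moving to the cluster-level checks, you can confirm that the etcd-member container has restarted on this host. This reuses the crictl invocation from the verification steps:
[core@ip-10-0-147-172 ~]$ sudo crictl ps --name etcd-member
A single running etcd-member container should be listed once etcd has started.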
Verify that the etcd member has been successfully added to the etcd cluster:
Connect to the running etcd container:
[core@ip-10-0-147-172 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
In the etcd container, export the variables needed for connecting to etcd:
sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')
In the etcd container, execute etcdctl member list and verify that the new member is listed:
sh-4.3# etcdctl member list -w table
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                          NAME                          |                        PEER ADDRS                        |        CLIENT ADDRS       |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
| 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal  | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379  |
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-147-172.us-east-2.compute.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.147.172:2379 |
| a752f80bcb0da3e8 | started | etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.171.108:2379 |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
It may take up to 10 minutes for the new member to start.
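If you prefer to wait programmatically rather than re-running the command, a small polling loop works. This is a sketch, using the member name from this example:
sh-4.3# until etcdctl member list | grep etcd-member-ip-10-0-147-172.us-east-2.compute.internal | grep -q started; do
>   sleep 30    # re-check every 30 seconds until the member reports "started"
> done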
In the etcd container, execute etcdctl endpoint health and verify that the new member is healthy:
sh-4.3# etcdctl endpoint health --cluster
https://10.0.128.73:2379 is healthy: successfully committed proposal: took = 4.5576ms
https://10.0.147.172:2379 is healthy: successfully committed proposal: took = 5.1521ms
https://10.0.171.108:2379 is healthy: successfully committed proposal: took = 4.2631ms
Verify that the new member is in the list of Pods associated with etcd and that its status is Running.
In a terminal that has access to the cluster, run the following command:
$ oc get pods -n openshift-etcd
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h
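Alternatively, you can block until the pod reports Ready instead of re-running oc get pods. A sketch, using the pod name from this example:
$ oc wait --for=condition=Ready pod/etcd-member-ip-10-0-147-172.us-east-2.compute.internal -n openshift-etcd --timeout=10m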
2.2.2. Generating etcd certificates and adding the member to the cluster
If the node is new or the etcd certificates on the node are no longer valid, you must generate the etcd certificates before you can add the member to the etcd cluster.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have SSH access to the new master host to add to the etcd cluster.
- You have SSH access to one of the healthy master hosts.
- You have the IP address of one of the healthy master hosts.
Procedure
Set up a temporary etcd certificate signer service on one of the healthy master nodes.
Access one of the healthy master nodes and log in to your cluster as a cluster-admin user using the following command.
[core@ip-10-0-143-125 ~]$ sudo oc login https://localhost:6443
Authentication required for https://localhost:6443 (openshift)
Username: kubeadmin
Password:
Login successful.
Obtain the pull specification for the kube-etcd-signer-server image.
[core@ip-10-0-143-125 ~]$ export KUBE_ETCD_SIGNER_SERVER=$(sudo oc adm release info --image-for kube-etcd-signer-server --registry-config=/var/lib/kubelet/config.json)
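As a quick sanity check, you can confirm that the variable resolved to an image pull specification; the exact value depends on your release image:
[core@ip-10-0-143-125 ~]$ echo $KUBE_ETCD_SIGNER_SERVER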
Run the tokenize-signer.sh script.
Be sure to pass in the -E flag to sudo so that environment variables are properly passed to the script.
[core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/tokenize-signer.sh ip-10-0-143-125
Populating template /usr/local/share/openshift-recovery/template/kube-etcd-cert-signer.yaml.template
Populating template ./assets/tmp/kube-etcd-cert-signer.yaml.stage1
Tokenized template now ready: ./assets/manifests/kube-etcd-cert-signer.yaml
The argument, ip-10-0-143-125, is the host name of the healthy master where the signer should be deployed.
Create the signer Pod using the file that was generated.
[core@ip-10-0-143-125 ~]$ sudo oc create -f assets/manifests/kube-etcd-cert-signer.yaml
pod/etcd-signer created
Verify that the signer is listening on this master node.
[core@ip-10-0-143-125 ~]$ ss -ltn | grep 9943
LISTEN   0   128   *:9943   *:*
Add the new master host to the etcd cluster.
Access the new master host to be added to the cluster, and log in to your cluster as a cluster-admin user using the following command.
[core@ip-10-0-156-255 ~]$ sudo oc login https://localhost:6443
Authentication required for https://localhost:6443 (openshift)
Username: kubeadmin
Password:
Login successful.
Export two environment variables that are required by the etcd-member-recover.sh script.
[core@ip-10-0-156-255 ~]$ export SETUP_ETCD_ENVIRONMENT=$(sudo oc adm release info --image-for machine-config-operator --registry-config=/var/lib/kubelet/config.json)
[core@ip-10-0-156-255 ~]$ export KUBE_CLIENT_AGENT=$(sudo oc adm release info --image-for kube-client-agent --registry-config=/var/lib/kubelet/config.json)
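As with the signer image, you can confirm that both variables resolved to image pull specifications before running the script:
[core@ip-10-0-156-255 ~]$ echo $SETUP_ETCD_ENVIRONMENT
[core@ip-10-0-156-255 ~]$ echo $KUBE_CLIENT_AGENT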
Run the etcd-member-recover.sh script.
Be sure to pass in the -E flag to sudo so that environment variables are properly passed to the script.
[core@ip-10-0-156-255 ~]$ sudo -E /usr/local/bin/etcd-member-recover.sh 10.0.143.125 etcd-member-ip-10-0-156-255.ec2.internal
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
etcd.conf backup already exists ./assets/backup/etcd.conf
Trying to backup etcd client certs..
etcd client certs already backed up and available ./assets/backup/
Stopping etcd..
Waiting for etcd-member to stop
etcd data-dir backup found ./assets/backup/etcd..
etcd TLS certificate backups found in ./assets/backup..
Removing etcd certs..
Populating template /usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template
Populating template ./assets/tmp/etcd-generate-certs.stage1
Populating template ./assets/tmp/etcd-generate-certs.stage2
Starting etcd client cert recovery agent..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
Stopping cert recover..
Waiting for generate-certs to stop
Patching etcd-member manifest..
Updating etcd membership..
Member 249a4b9a790b3719 added to cluster 807ae3bffc8d69ca
ETCD_NAME="etcd-member-ip-10-0-156-255.ec2.internal"
ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-143-125.ec2.internal=https://etcd-0.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-156-255.ec2.internal=https://etcd-1.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Starting etcd..
The arguments specify the IP address of the healthy master where the signer server is running, and the etcd name of the new member.
Verify that the new master host has been added to the etcd member list.
Access the healthy master and connect to the running etcd container.
[core@ip-10-0-143-125 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
In the etcd container, export variables needed for connecting to etcd.
sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')
In the etcd container, execute etcdctl member list and verify that the new member is listed.
sh-4.3# etcdctl member list -w table
+------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                   NAME                   |                        PEER ADDRS                        |        CLIENT ADDRS       |
+------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-156-255.ec2.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.156.255:2379 |
| 249a4b9a790b3719 | started | etcd-member-ip-10-0-143-125.ec2.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.143.125:2379 |
+------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
It may take up to 20 minutes for the new member to start.
After the new member is added, remove the signer Pod because it is no longer needed.
In a terminal that has access to the cluster, run the following command:
$ oc delete pod -n openshift-config etcd-signer
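You can confirm that the signer Pod was removed; the following filter should return no output:
$ oc get pods -n openshift-config | grep etcd-signer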