Chapter 3. Restarting the cluster gracefully
This document describes the process to restart your cluster after a graceful shutdown.
Even though the cluster is expected to be functional after the restart, it might not recover due to unexpected conditions, for example:
- etcd data corruption during shutdown
- Node failure due to hardware issues
- Network connectivity issues
If your cluster fails to recover, follow the steps to restore your cluster to a previous state.
3.1. Prerequisites
- You have gracefully shut down your cluster.
3.2. Restarting the cluster
You can restart your cluster after it has been shut down gracefully.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- This procedure assumes that you gracefully shut down the cluster.
Procedure
- Turn on the control plane nodes.
  - If you are using the admin.kubeconfig from the cluster installation and the API virtual IP address (VIP) is up, complete the following steps:
    - Set the KUBECONFIG environment variable to the admin.kubeconfig path.
    - For each control plane node in the cluster, run the following command:
      $ oc adm uncordon <node>
  - If you do not have access to your admin.kubeconfig credentials, complete the following steps:
    - Use SSH to connect to a control plane node.
    - Copy the localhost-recovery.kubeconfig file to the /root directory.
    - Use that file to run the following command for each control plane node in the cluster:
      $ oc adm uncordon <node>
  A combined shell sketch of the admin.kubeconfig path follows this step.
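If it helps to see the admin.kubeconfig path end to end, here is a minimal shell sketch; the /path/to/admin.kubeconfig location is a placeholder for wherever your installation stored the file:
  # Point oc at the installation-time admin kubeconfig (example path).
  $ export KUBECONFIG=/path/to/admin.kubeconfig
  # Uncordon every control plane node in one pass.
  $ for node in $(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[*].metadata.name}'); do oc adm uncordon ${node}; done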
- Power on any cluster dependencies, such as external storage or an LDAP server.
- Start all cluster machines.
  Use the appropriate method for your cloud environment to start the machines, for example, from your cloud provider’s web console.
  Wait approximately 10 minutes before continuing to check the status of the control plane nodes.
- Verify that all control plane nodes are ready:
  $ oc get nodes -l node-role.kubernetes.io/master
  The control plane nodes are ready if the status is Ready, as shown in the following output:
  NAME                           STATUS   ROLES    AGE   VERSION
  ip-10-0-168-251.ec2.internal   Ready    master   75m   v1.25.0
  ip-10-0-170-223.ec2.internal   Ready    master   75m   v1.25.0
  ip-10-0-211-16.ec2.internal    Ready    master   75m   v1.25.0
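Rather than polling manually, you can optionally block until the nodes report the Ready condition; a minimal sketch (the 10-minute timeout is an assumption, tune it for your environment):
  # Wait until every control plane node reports Ready, or time out.
  $ oc wait --for=condition=Ready node -l node-role.kubernetes.io/master --timeout=10m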
- If the control plane nodes are not ready, check whether there are any pending certificate signing requests (CSRs) that must be approved. You can also batch-approve pending CSRs; see the sketch after this step.
  - Get the list of current CSRs:
    $ oc get csr
  - Review the details of a CSR to verify that it is valid:
    $ oc describe csr <csr_name>
    where <csr_name> is the name of a CSR from the list of current CSRs.
  - Approve each valid CSR:
    $ oc adm certificate approve <csr_name>
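If many CSRs are pending, approving them one at a time is tedious. A sketch that approves every CSR still in the Pending state (review the list first; blanket approval assumes that all pending requests are expected):
  # Print the names of CSRs with no status (that is, still pending) and approve them.
  $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve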
- After the control plane nodes are ready, verify that all worker nodes are ready:
  $ oc get nodes -l node-role.kubernetes.io/worker
  The worker nodes are ready if the status is Ready, as shown in the following output:
  NAME                           STATUS   ROLES    AGE   VERSION
  ip-10-0-179-95.ec2.internal    Ready    worker   64m   v1.25.0
  ip-10-0-182-134.ec2.internal   Ready    worker   64m   v1.25.0
  ip-10-0-250-100.ec2.internal   Ready    worker   64m   v1.25.0
- If the worker nodes are not ready, check whether there are any pending certificate signing requests (CSRs) that must be approved. The batch-approval sketch shown earlier applies here as well.
  - Get the list of current CSRs:
    $ oc get csr
  - Review the details of a CSR to verify that it is valid:
    $ oc describe csr <csr_name>
    where <csr_name> is the name of a CSR from the list of current CSRs.
  - Approve each valid CSR:
    $ oc adm certificate approve <csr_name>
- Verify that the cluster started properly.
  - Check that there are no degraded cluster Operators:
    $ oc get clusteroperators
    Confirm that no cluster Operator has the DEGRADED condition set to True. A sketch that filters for the condition directly follows this step.
  - Check that all nodes are in the Ready state:
    $ oc get nodes
    Check that the status for all nodes is Ready.
  If the cluster did not start properly, you might need to restore your cluster by using an etcd backup.
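Scanning the table output by eye works, but you can also print each Operator alongside its Degraded condition; a minimal sketch using standard jsonpath filtering:
  # Print "<operator-name> <Degraded status>" for every cluster Operator;
  # any line ending in "True" needs investigation.
  $ oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}'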
- After the control plane and worker nodes are ready, mark all the nodes in the cluster as schedulable by running the following command:
  $ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do echo ${node}; oc adm uncordon ${node}; done
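As a final check, you can confirm that no nodes remain cordoned; assuming the spec.unschedulable field selector is available for Node objects, the following command should return no resources:
  # List any node that is still marked unschedulable (expect empty output).
  $ oc get nodes --field-selector spec.unschedulable=true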