Chapter 2. Shutting down the cluster gracefully


This document describes the process to gracefully shut down your cluster. You might need to temporarily shut down your cluster for maintenance reasons, or to save on resource costs.

2.1. Prerequisites

  • Take an etcd backup prior to shutting down the cluster.

    Important

    It is important to take an etcd backup before performing this procedure so that your cluster can be restored if you encounter any issues when restarting the cluster.

    For example, the following conditions can cause the restarted cluster to malfunction:

    • etcd data corruption during shutdown
    • Node failure due to hardware
    • Network connectivity issues

    If your cluster fails to recover, follow the steps to restore to a previous cluster state.

2.2. Shutting down the cluster

You can shut down your cluster in a graceful manner so that it can be restarted at a later date.

Note

You can shut down a cluster until a year from the installation date and expect it to restart gracefully. After a year from the installation date, the cluster certificates expire.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have taken an etcd backup.

Procedure

  1. If you are shutting the cluster down for an extended period, determine the date on which certificates expire and run the following command:

    $ oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'

    Example output

    2022-08-05T14:37:50Zuser@user:~ $ 1

    1
    To ensure that the cluster can restart gracefully, plan to restart it on or before the specified date. As the cluster restarts, the process might require you to manually approve the pending certificate signing requests (CSRs) to recover kubelet certificates.
  2. Mark all the nodes in the cluster as unschedulable. You can do this from your cloud provider’s web console, or by running the following loop:

    $ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do echo ${node} ; oc adm cordon ${node} ; done

    Example output

    ci-ln-mgdnf4b-72292-n547t-master-0
    node/ci-ln-mgdnf4b-72292-n547t-master-0 cordoned
    ci-ln-mgdnf4b-72292-n547t-master-1
    node/ci-ln-mgdnf4b-72292-n547t-master-1 cordoned
    ci-ln-mgdnf4b-72292-n547t-master-2
    node/ci-ln-mgdnf4b-72292-n547t-master-2 cordoned
    ci-ln-mgdnf4b-72292-n547t-worker-a-s7ntl
    node/ci-ln-mgdnf4b-72292-n547t-worker-a-s7ntl cordoned
    ci-ln-mgdnf4b-72292-n547t-worker-b-cmc9k
    node/ci-ln-mgdnf4b-72292-n547t-worker-b-cmc9k cordoned
    ci-ln-mgdnf4b-72292-n547t-worker-c-vcmtn
    node/ci-ln-mgdnf4b-72292-n547t-worker-c-vcmtn cordoned

  3. Evacuate the pods using the following method:

    $ for node in $(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}'); do echo ${node} ; oc adm drain ${node} --delete-emptydir-data --ignore-daemonsets=true --timeout=15s --force ; done
  4. Shut down all of the nodes in the cluster. You can do this from the web console for your cloud provider web console, or by running the following loop. Shutting down the nodes by using one of these methods allows pods to terminate gracefully, which reduces the chance for data corruption.

    Note

    Ensure that the control plane node with the API VIP assigned is the last node processed in the loop. Otherwise, the shutdown command fails.

    $ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do oc debug node/${node} -- chroot /host shutdown -h 1; done 1
    1
    -h 1 indicates how long, in minutes, this process lasts before the control plane nodes are shut down. For large-scale clusters with 10 nodes or more, set to -h 10 or longer to make sure all the compute nodes have time to shut down first.

    Example output

    Starting pod/ip-10-0-130-169us-east-2computeinternal-debug ...
    To use host binaries, run `chroot /host`
    Shutdown scheduled for Mon 2021-09-13 09:36:17 UTC, use 'shutdown -c' to cancel.
    Removing debug pod ...
    Starting pod/ip-10-0-150-116us-east-2computeinternal-debug ...
    To use host binaries, run `chroot /host`
    Shutdown scheduled for Mon 2021-09-13 09:36:29 UTC, use 'shutdown -c' to cancel.

    Note

    It is not necessary to drain control plane nodes of the standard pods that ship with OpenShift Container Platform prior to shutdown. Cluster administrators are responsible for ensuring a clean restart of their own workloads after the cluster is restarted. If you drained control plane nodes prior to shutdown because of custom workloads, you must mark the control plane nodes as schedulable before the cluster will be functional again after restart.

  5. Shut off any cluster dependencies that are no longer needed, such as external storage or an LDAP server. Be sure to consult your vendor’s documentation before doing so.

    Important

    If you deployed your cluster on a cloud-provider platform, do not shut down, suspend, or delete the associated cloud resources. If you delete the cloud resources of a suspended virtual machine, OpenShift Container Platform might not restore successfully.

2.3. Additional resources

Red Hat logoGithubRedditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

© 2024 Red Hat, Inc.