Chapter 17. Day 2 operations for telco core CNF clusters


17.1. Upgrading telco core CNF clusters

17.1.1. Upgrading a telco core CNF cluster

OpenShift Container Platform provides long-term support, or extended update support (EUS), on all even-numbered minor releases and provides update paths between EUS releases. You can update from one EUS version to the next EUS version. It is also possible to update between y-stream and z-stream versions.

17.1.1.1. Cluster updates for telco core CNF clusters

Updating your cluster is a critical task that ensures that bugs and potential security vulnerabilities are patched. Often, updates to cloud-native network functions (CNF) require additional functionality from the platform that comes when you update the cluster version. You also must update the cluster periodically to ensure that the cluster platform version is supported.

You can minimize the effort required to stay current with updates by keeping up-to-date with EUS releases and upgrading to select important z-stream releases only.

Note

The update path for the cluster can vary depending on the size and topology of the cluster. The update procedures described here are valid for most clusters, from 3-node clusters up to the largest clusters certified by the telco scale team. This includes some scenarios for mixed-workload clusters.

The following update scenarios are described:

  • Control Plane Only updates
  • Y-stream updates
  • Z-stream updates
Important

Control Plane Only updates were previously known as EUS-to-EUS updates. Control Plane Only updates are only viable between even-numbered minor versions of OpenShift Container Platform.

17.1.2. Verifying cluster API versions between update versions

APIs change over time as components are updated. It is important to verify that cloud-native network function (CNF) APIs are compatible with the updated cluster version.

17.1.2.1. OpenShift Container Platform API compatibility

When considering which z-stream release to update to as part of a new y-stream update, you must be sure that all the patches in the z-stream version you are moving from are also present in the new z-stream version. If the version you update to does not have all the required patches, the built-in compatibility of Kubernetes is broken.

For example, if the cluster version is 4.15.32, you must update to a 4.16 z-stream release that has all of the patches that are applied to 4.15.32.

17.1.2.1.1. About Kubernetes version skew

Each cluster Operator supports specific API versions. Kubernetes APIs evolve over time, and newer releases can deprecate or change existing APIs. This is referred to as "version skew". For every new release, you must review the API changes. The APIs might be compatible across several releases of an Operator, but compatibility is not guaranteed. To mitigate against problems that arise from version skew, follow a well-defined update strategy.
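Before each update, you can check whether any workloads still call APIs that are scheduled for removal by reviewing the APIRequestCount resources that the API server maintains. The following command is a sketch of that check; the jsonpath expression prints only the resources that report a removal release:

    $ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'

Any resource listed in the output should be migrated to a supported API version before you update.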

17.1.2.2. Determining the cluster version update path

Use the Red Hat OpenShift Container Platform Update Graph tool to determine if the path is valid for the z-stream release you want to update to. Verify the update with your Red Hat Technical Account Manager to ensure that the update path is valid for telco implementations.
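If you prefer the command line, you can query the same update graph data from the update service API. The following example is illustrative; the channel name is an assumption that you replace with the channel for your target release:

    $ curl -s -H 'Accept: application/json' \
        'https://api.openshift.com/api/upgrades_info/v1/graph?channel=eus-4.16' | jq -r '.nodes[].version'

The nodes in the returned graph are the available release versions, and the edges define the supported update paths between them.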

Important

The <4.y+1.z> or <4.y+2.z> version that you update to must have the same patch level as the <4.y.z> release you are updating from.

The OpenShift update process mandates that if a fix is present in a specific <4.y.z> release, then that fix must be present in the <4.y+1.z> release that you update to.

Figure 17.1. Bug fix backporting and the update graph

Important

OpenShift development has a strict backport policy that prevents regressions. For example, a bug must be fixed in 4.16.z before it is fixed in 4.15.z. This means that the update graph does not allow for updates to chronologically older releases even if the minor version is greater, for example, updating from 4.15.24 to 4.16.2.

17.1.2.3. Selecting the target release

Use the Red Hat OpenShift Container Platform Update Graph or the cincinnati-graph-data repository to determine what release to update to.

17.1.2.3.1. Determining what z-stream updates are available

Before you can update to a new z-stream release, you need to know what versions are available.

Note

You do not need to change the channel when performing a z-stream update.

Procedure

  1. Determine which z-stream releases are available. Run the following command:

    $ oc adm upgrade

    Example output

    Cluster version is 4.14.34
    
    Upstream is unset, so the cluster will use an appropriate default.
    Channel: stable-4.14 (available channels: candidate-4.14, candidate-4.15, eus-4.14, eus-4.16, fast-4.14, fast-4.15, stable-4.14, stable-4.15)
    
    Recommended updates:
    
      VERSION     IMAGE
      4.14.37     quay.io/openshift-release-dev/ocp-release@sha256:14e6ba3975e6c73b659fa55af25084b20ab38a543772ca70e184b903db73092b
      4.14.36     quay.io/openshift-release-dev/ocp-release@sha256:4bc4925e8028158e3f313aa83e59e181c94d88b4aa82a3b00202d6f354e8dfed
      4.14.35     quay.io/openshift-release-dev/ocp-release@sha256:883088e3e6efa7443b0ac28cd7682c2fdbda889b576edad626769bf956ac0858

17.1.2.3.2. Changing the channel for a Control Plane Only update

You must change the channel to the required version for a Control Plane Only update.

Note

You do not need to change the channel when performing a z-stream update.

Procedure

  1. Determine the currently configured update channel:

    $ oc get clusterversion -o=jsonpath='{.items[*].spec}' | jq

    Example output

    {
      "channel": "stable-4.14",
      "clusterID": "01eb9a57-2bfb-4f50-9d37-dc04bd5bac75"
    }

  2. Change the channel to point to the new channel you want to update to:

    $ oc adm upgrade channel eus-4.16
  3. Confirm the updated channel:

    $ oc get clusterversion -o=jsonpath='{.items[*].spec}' | jq

    Example output

    {
      "channel": "eus-4.16",
      "clusterID": "01eb9a57-2bfb-4f50-9d37-dc04bd5bac75"
    }

17.1.2.3.2.1. Changing the channel for an early EUS to EUS update

The update path to a brand new release of OpenShift Container Platform is not available in either the EUS channel or the stable channel until 45 to 90 days after the initial GA of a minor release.

To begin testing an update to a new release, you can use the fast channel.

Procedure

  1. Change the channel to fast-<y+1>. For example, run the following command:

    $ oc adm upgrade channel fast-4.16
  2. Check the update path from the new channel. Run the following command:

    $ oc adm upgrade

    Example output

    Cluster version is 4.15.33
    
    Upgradeable=False
    
      Reason: AdminAckRequired
      Message: Kubernetes 1.28 and therefore OpenShift 4.16 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.
    
    Upstream is unset, so the cluster will use an appropriate default.
    Channel: fast-4.16 (available channels: candidate-4.15, candidate-4.16, eus-4.15, eus-4.16, fast-4.15, fast-4.16, stable-4.15, stable-4.16)
    
    Recommended updates:
    
      VERSION     IMAGE
      4.16.14     quay.io/openshift-release-dev/ocp-release@sha256:6618dd3c0f5
      4.16.13     quay.io/openshift-release-dev/ocp-release@sha256:7a72abc3
      4.16.12     quay.io/openshift-release-dev/ocp-release@sha256:1c8359fc2
      4.16.11     quay.io/openshift-release-dev/ocp-release@sha256:bc9006febfe
      4.16.10     quay.io/openshift-release-dev/ocp-release@sha256:dece7b61b1
      4.15.36     quay.io/openshift-release-dev/ocp-release@sha256:c31a56d19
      4.15.35     quay.io/openshift-release-dev/ocp-release@sha256:f21253
      4.15.34     quay.io/openshift-release-dev/ocp-release@sha256:2dd69c5
  3. Follow the update procedure to get to version 4.16 (<y+1> from version 4.15).

    Note

    You can keep your worker nodes paused between EUS releases even if you are using the fast channel.

  4. When you get to the required <y+1> release, change the channel again, this time to fast-<y+2>.
  5. Follow the EUS update procedure to get to the required <y+2> release.
17.1.2.3.3. Changing the channel for a y-stream update

In a y-stream update, you change the channel to the next release channel.

Note

Use the stable or EUS release channels for production clusters.

Procedure

  1. Change the update channel:

    $ oc adm upgrade channel stable-4.15
  2. Check the update path from the new channel. Run the following command:

    $ oc adm upgrade

    Example output

    Cluster version is 4.14.34
    
    Upgradeable=False
    
      Reason: AdminAckRequired
      Message: Kubernetes 1.27 and therefore OpenShift 4.15 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.
    
    Upstream is unset, so the cluster will use an appropriate default.
    Channel: stable-4.15 (available channels: candidate-4.14, candidate-4.15, eus-4.14, eus-4.15, fast-4.14, fast-4.15, stable-4.14, stable-4.15)
    
    Recommended updates:
    
      VERSION     IMAGE
      4.15.33     quay.io/openshift-release-dev/ocp-release@sha256:7142dd4b560
      4.15.32     quay.io/openshift-release-dev/ocp-release@sha256:cda8ea5b13dc9
      4.15.31     quay.io/openshift-release-dev/ocp-release@sha256:07cf61e67d3eeee
      4.15.30     quay.io/openshift-release-dev/ocp-release@sha256:6618dd3c0f5
      4.15.29     quay.io/openshift-release-dev/ocp-release@sha256:7a72abc3
      4.15.28     quay.io/openshift-release-dev/ocp-release@sha256:1c8359fc2
      4.15.27     quay.io/openshift-release-dev/ocp-release@sha256:bc9006febfe
      4.15.26     quay.io/openshift-release-dev/ocp-release@sha256:dece7b61b1
      4.14.38     quay.io/openshift-release-dev/ocp-release@sha256:c93914c62d7
      4.14.37     quay.io/openshift-release-dev/ocp-release@sha256:c31a56d19
      4.14.36     quay.io/openshift-release-dev/ocp-release@sha256:f21253
      4.14.35     quay.io/openshift-release-dev/ocp-release@sha256:2dd69c5

17.1.3. Preparing the telco core cluster platform for update

Typically, telco clusters run on bare-metal hardware. Often you must update the firmware to apply important security fixes, take advantage of new functionality, or maintain compatibility with the new release of OpenShift Container Platform.

17.1.3.1. Ensuring the host firmware is compatible with the update

You are responsible for the firmware versions that you run in your clusters. Updating host firmware is not a part of the OpenShift Container Platform update process. It is not recommended to update firmware in conjunction with the OpenShift Container Platform version.

Important

Hardware vendors advise that it is best to apply the latest certified firmware version for the specific hardware that you are running. For telco use cases, always verify firmware updates in test environments before applying them in production. The high throughput nature of telco CNF workloads can be adversely affected by sub-optimal host firmware.

You should thoroughly test new firmware updates to ensure that they work as expected with the current version of OpenShift Container Platform. Ideally, you test the latest firmware version with the target OpenShift Container Platform update version.

17.1.3.2. Ensuring that layered products are compatible with the update

Verify that all layered products run on the version of OpenShift Container Platform that you are updating to before you begin the update. This generally includes all Operators.

Procedure

  1. Verify the currently installed Operators in the cluster. For example, run the following command:

    $ oc get csv -A

    Example output

    NAMESPACE                              NAME            DISPLAY          VERSION   REPLACES                             PHASE
    gitlab-operator-kubernetes.v0.17.2     GitLab                           0.17.2    gitlab-operator-kubernetes.v0.17.1   Succeeded
    openshift-operator-lifecycle-manager   packageserver   Package Server   0.19.0                                         Succeeded

  2. Check that Operators that you install with OLM are compatible with the update version.

    Operators that are installed with the Operator Lifecycle Manager (OLM) are not part of the standard cluster Operators set.

    Use the Operator Update Information Checker to understand if you must update an Operator after each y-stream update or if you can wait until you have fully updated to the next EUS release.

    Tip

    You can also use the Operator Update Information Checker to see what versions of OpenShift Container Platform are compatible with specific releases of an Operator.

  3. Check that Operators that you install outside of OLM are compatible with the update version.

    For all OLM-installed Operators that are not directly supported by Red Hat, contact the Operator vendor to ensure release compatibility.

    • Some Operators are compatible with several releases of OpenShift Container Platform. You might not need to update the Operators until after you complete the cluster update. See "Updating the worker nodes" for more information.
    • See "Updating all the OLM Operators" for information about updating an Operator after performing the first y-stream control plane update.

17.1.3.3. Applying MachineConfigPool labels to nodes before the update

Prepare MachineConfigPool (mcp) node labels to divide the worker nodes into groups of roughly 8 to 10 nodes. With mcp groups, you can reboot groups of nodes independently from the rest of the cluster.

You use the mcp node labels to pause and unpause the set of nodes during the update process so that you can do the update and reboot at a time of your choosing.

17.1.3.3.1. Staggering the cluster update

Sometimes there are problems during the update. Often the problem is related to hardware failure or nodes needing to be reset. Using mcp node labels, you can update nodes in stages by pausing the update at critical moments, tracking which nodes are paused and unpaused as you proceed. When a problem occurs, you use the nodes that are in an unpaused state to ensure that there are enough nodes running to keep all application pods running.
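For example, pausing and later unpausing an mcp group is a single patch to the spec.paused field. The mcp-1 group name below is an example:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":true}}'   # hold the group back
    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'  # let the group update and reboot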

17.1.3.3.2. Dividing worker nodes into MachineConfigPool groups

How you divide worker nodes into mcp groups can vary depending on how many nodes are in the cluster or how many nodes you assign to a node role. By default, the two roles in a cluster are control plane and worker.

In clusters that run telco workloads, you can further split the worker nodes between CNF control plane and CNF data plane roles. Add mcp role labels that split the worker nodes into each of these two groups.

Note

Larger clusters can have as many as 100 worker nodes in the CNF control plane role. No matter how many nodes there are in the cluster, keep each MachineConfigPool group to around 10 nodes. This allows you to control how many nodes are taken down at a time. With multiple MachineConfigPool groups, you can unpause several groups at a time to accelerate the update, or separate the update over 2 or more maintenance windows.

Example cluster with 15 worker nodes

Consider a cluster with 15 worker nodes:

  • 10 worker nodes are CNF control plane nodes.
  • 5 worker nodes are CNF data plane nodes.

Split the CNF control plane and data plane worker node roles into at least 2 mcp groups each. Having 2 mcp groups per role means that you can have one set of nodes that are not affected by the update.

Example cluster with 6 worker nodes

Consider a cluster with 6 worker nodes:

  • Split the worker nodes into 3 mcp groups of 2 nodes each.

Update one of the mcp groups. Allow the updated nodes to run for a day so that you can verify CNF compatibility before completing the update on the other 4 nodes.

Important

The process and pace at which you unpause the mcp groups is determined by your CNF applications and configuration.

If your CNF pod can handle being scheduled across nodes in a cluster, you can unpause several mcp groups at a time and set the maxUnavailable field in the mcp custom resource (CR) to as high as 50%. This allows up to half of the nodes in an mcp group to restart and get updated.
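For example, the following patch raises the limit for a single group; mcp-1 is an example group name and 50% is the upper value discussed above:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"maxUnavailable":"50%"}}'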

17.1.3.3.3. Reviewing configured cluster MachineConfigPool roles

Review the currently configured MachineConfigPool roles in the cluster.

Procedure

  1. Get the currently configured mcp groups in the cluster:

    $ oc get mcp

    Example output

    NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-bere83   True      False      False      3              3                   3                     0                      25d
    worker   rendered-worker-245c4f   True      False      False      2              2                   2                     0                      25d

  2. Compare the list of mcp roles to the list of nodes in the cluster:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE   VERSION
    ctrl-plane-0   Ready    control-plane,master   39d   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   39d   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   39d   v1.27.15+6147456
    worker-0       Ready    worker                 39d   v1.27.15+6147456
    worker-1       Ready    worker                 39d   v1.27.15+6147456

    Note

    When you apply an mcp group change, the node roles are updated.

    Determine how you want to separate the worker nodes into mcp groups.

17.1.3.3.4. Creating MachineConfigPool groups for the cluster

Creating mcp groups is a two-step process:

  1. Add an mcp label to the nodes in the cluster.
  2. Apply an mcp CR to the cluster that organizes the nodes based on their labels.

Procedure

  1. Label the nodes so that they can be put into mcp groups. Run the following commands:

    $ oc label node worker-0 node-role.kubernetes.io/mcp-1=
    $ oc label node worker-1 node-role.kubernetes.io/mcp-2=

    The mcp-1 and mcp-2 labels are applied to the nodes. For example:

    Example output

    NAME           STATUS   ROLES                  AGE   VERSION
    ctrl-plane-0   Ready    control-plane,master   39d   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   39d   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   39d   v1.27.15+6147456
    worker-0       Ready    mcp-1,worker           39d   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           39d   v1.27.15+6147456

  2. Create YAML custom resources (CRs) that apply the labels as mcp CRs in the cluster. Save the following YAML in the mcps.yaml file:

    ---
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: mcp-2
    spec:
      machineConfigSelector:
        matchExpressions:
          - {
             key: machineconfiguration.openshift.io/role,
             operator: In,
             values: [worker,mcp-2]
            }
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/mcp-2: ""
    ---
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: mcp-1
    spec:
      machineConfigSelector:
        matchExpressions:
          - {
             key: machineconfiguration.openshift.io/role,
             operator: In,
             values: [worker,mcp-1]
            }
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/mcp-1: ""
  3. Create the MachineConfigPool resources:

    $ oc apply -f mcps.yaml

    Example output

    machineconfigpool.machineconfiguration.openshift.io/mcp-2 created

Verification

Monitor the MachineConfigPool resources as they are applied in the cluster. After you apply the mcp resources, the nodes are added into the new machine config pools. This takes a few minutes.

Note

The nodes do not reboot while being added into the mcp groups. The original worker and master mcp groups remain unchanged.

  • Check the status of the new mcp resources:

    $ oc get mcp

    Example output

    NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-be3e83   True      False      False      3              3                 3                     0                      25d
    mcp-1    rendered-mcp-1-2f4c4f    False     True       True       1              0                 0                     0                      10s
    mcp-2    rendered-mcp-2-2r4s1f    False     True       True       1              0                 0                     0                      10s
    worker   rendered-worker-23fc4f   False     True       True       0              0                 0                     2                      25d

    Eventually, the resources are fully applied:

    NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-be3e83   True      False      False      3              3                 3                     0                      25d
    mcp-1    rendered-mcp-1-2f4c4f    True      False      False      1              1                 1                     0                      7m33s
    mcp-2    rendered-mcp-2-2r4s1f    True      False      False      1              1                 1                     0                      51s
    worker   rendered-worker-23fc4f   True      False      False      0              0                 0                     0                      25d
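    If you prefer not to poll the output manually, you can wait for the new pools to report the Updated condition. This is a sketch; the timeout value is an arbitrary example:

    $ oc wait mcp/mcp-1 mcp/mcp-2 --for=condition=Updated --timeout=10m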

17.1.3.4. Telco deployment environment considerations

In telco environments, most clusters are in disconnected networks. To update clusters in these environments, you must update your offline image repository.
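For example, if you maintain the offline repository with the oc-mirror plugin, the release payloads for the update are selected in an ImageSetConfiguration file. The following snippet is a sketch only; the registry URL, channel, and version range are assumptions that you replace with your own values:

    kind: ImageSetConfiguration
    apiVersion: mirror.openshift.io/v1alpha2
    storageConfig:
      registry:
        imageURL: registry.example.com:5000/mirror/oc-mirror-metadata
    mirror:
      platform:
        channels:
        - name: eus-4.16
          minVersion: 4.15.33
          maxVersion: 4.16.14

You then run the oc mirror command with this configuration to populate the mirror registry before starting the cluster update.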

17.1.3.5. Preparing the cluster platform for update

Before you update the cluster, perform some basic checks and verifications to make sure that the cluster is ready for the update.

Procedure

  1. Verify that there are no failed or in progress pods in the cluster by running the following command:

    $ oc get pods -A | grep -E -vi 'complete|running'
    Note

    You might have to run this command more than once if there are pods that are in a pending state.

  2. Verify that all nodes in the cluster are available:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE   VERSION
    ctrl-plane-0   Ready    control-plane,master   32d   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   32d   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   32d   v1.27.15+6147456
    worker-0       Ready    mcp-1,worker           32d   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           32d   v1.27.15+6147456

  3. Verify that all bare-metal nodes are provisioned and ready by running the following command:

    $ oc get bmh -n openshift-machine-api

    Example output

    NAME           STATE         CONSUMER                   ONLINE   ERROR   AGE
    ctrl-plane-0   unmanaged     cnf-58879-master-0         true             33d
    ctrl-plane-1   unmanaged     cnf-58879-master-1         true             33d
    ctrl-plane-2   unmanaged     cnf-58879-master-2         true             33d
    worker-0       unmanaged     cnf-58879-worker-0-45879   true             33d
    worker-1       progressing   cnf-58879-worker-0-dszsh   false            1d 1

    1
    An error occurred while provisioning the worker-1 node.

Verification

  • Verify that all cluster Operators are ready:

    $ oc get co

    Example output

    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.14.34   True        False         False      17h
    baremetal                                  4.14.34   True        False         False      32d
    
    ...
    
    service-ca                                 4.14.34   True        False         False      32d
    storage                                    4.14.34   True        False         False      32d

Additional resources

17.1.4. Configuring CNF pods before updating the telco core CNF cluster

Follow the guidance in Red Hat best practices for Kubernetes when developing cloud-native network functions (CNFs) to ensure that the cluster can schedule pods during an update.

Important

Always deploy pods in groups by using Deployment resources. Deployment resources spread the workload across all of the available pods, ensuring there is no single point of failure. When a pod that is managed by a Deployment resource is deleted, a new pod takes its place automatically.

17.1.4.1. Ensuring that CNF workloads run uninterrupted with pod disruption budgets

You can configure the minimum number of pods in a deployment to allow the CNF workload to run uninterrupted by setting a pod disruption budget in a PodDisruptionBudget custom resource (CR) that you apply. Be careful when setting this value; setting it improperly can cause an update to fail.

For example, if you have 4 pods in a deployment and you set the pod disruption budget to 4, the cluster scheduler keeps 4 pods running at all times, meaning that no pods can be taken down.

Instead, set the pod disruption budget to 2, which allows 2 of the 4 pods to be taken down. Then, the worker nodes where those pods are located can be rebooted.

Note

Setting the pod disruption budget to 2 does not mean that your deployment runs on only 2 pods for a period of time, for example, during an update. The cluster scheduler creates 2 new pods to replace the 2 older pods. However, there is a short period of time between the new pods coming online and the old pods being deleted.
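A minimal PodDisruptionBudget CR for the 4-pod example above might look like the following; the resource name, namespace, and app label are placeholders for your own workload:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: cnf-workload-pdb
      namespace: cnf-workload-ns
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: cnf-workload

With minAvailable set to 2, the eviction API keeps at least 2 of the 4 pods running, so up to 2 pods can be taken down at a time during node reboots.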

17.1.4.2. Ensuring that pods do not run on the same cluster node

High availability in Kubernetes requires duplicate processes to be running on separate nodes in the cluster. This ensures that the application continues to run even if one node becomes unavailable. In OpenShift Container Platform, processes can be automatically duplicated in separate pods in a deployment. You configure anti-affinity in the Pod spec to ensure that the pods in a deployment do not run on the same cluster node.

During an update, setting pod anti-affinity ensures that pods are distributed evenly across nodes in the cluster. This means that node reboots are easier during an update. For example, if there are 4 pods from a single deployment on a node, and the pod disruption budget is set to only allow 1 pod to be deleted at a time, then it will take 4 times as long for that node to reboot. Setting pod anti-affinity spreads pods across the cluster to prevent such occurrences.
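For example, the Pod template in a Deployment can declare required anti-affinity on the hostname topology key so that no two replicas are scheduled on the same node. The app label in this sketch is a placeholder:

    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: cnf-workload
            topologyKey: kubernetes.io/hostname

If strict spreading would leave pods unschedulable on a small cluster, you can use preferredDuringSchedulingIgnoredDuringExecution instead.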

Additional resources

17.1.4.3. Application liveness, readiness, and startup probes

You can use liveness, readiness, and startup probes to check the health of your live application containers before you schedule an update. These probes are especially useful with pods that must maintain state for their application containers.

Liveness health check
Determines if a container is running. If the liveness probe fails for a container, the pod responds based on the restart policy.
Readiness probe
Determines if a container is ready to accept service requests. If the readiness probe fails for a container, the kubelet removes the container from the list of available service endpoints.
Startup probe
A startup probe indicates whether the application within a container is started. All other probes are disabled until the startup succeeds. If the startup probe does not succeed, the kubelet kills the container, and the container is subject to the pod restartPolicy setting.
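The following container snippet sketches all three probe types together; the endpoint paths, port, and timing values are illustrative only:

    containers:
    - name: cnf-app
      image: registry.example.com/cnf-app:latest
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5

The startup probe allows up to 300 seconds for the application to initialize before the liveness and readiness probes begin reporting.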

Additional resources

17.1.5. Before you update the telco core CNF cluster

Before you start the cluster update, you must pause worker nodes, back up the etcd database, and do a final cluster health check before proceeding.

17.1.5.1. Pausing worker nodes before the update

You must pause the worker nodes before you proceed with the update. In the following example, there are 2 mcp groups, mcp-1 and mcp-2. You patch the spec.paused field to true for each of these MachineConfigPool groups.

Procedure

  1. Patch the mcp CRs to pause the nodes and drain and remove the pods from those nodes by running the following commands:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":true}}'
    $ oc patch mcp/mcp-2 --type merge --patch '{"spec":{"paused":true}}'
  2. Get the status of the paused mcp groups:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   true
    mcp-2   true

Note

The default control plane and worker mcp groups are not changed during an update.

17.1.5.2. Backing up the etcd database before you proceed with the update

You must back up the etcd database before you proceed with the update.

17.1.5.2.1. Backing up etcd data

Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static pods. This backup can be saved and used at a later time if you need to restore etcd.

Important

Only save a backup from a single control plane host. Do not take a backup from each control plane host in the cluster.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have checked whether the cluster-wide proxy is enabled.

    Tip

    You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.

Procedure

  1. Start a debug session as root for a control plane node:

    $ oc debug --as-root node/<node_name>
  2. Change your root directory to /host in the debug shell:

    sh-4.4# chroot /host
  3. If the cluster-wide proxy is enabled, export the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables by running the following commands:

    $ export HTTP_PROXY=http://<your_proxy.example.com>:8080
    $ export HTTPS_PROXY=https://<your_proxy.example.com>:8080
    $ export NO_PROXY=<example.com>
  4. Run the cluster-backup.sh script in the debug shell and pass in the location to save the backup to.

    Tip

    The cluster-backup.sh script is maintained as a component of the etcd Cluster Operator and is a wrapper around the etcdctl snapshot save command.

    sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup

    Example script output

    found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-6
    found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-7
    found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
    found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-3
    ede95fe6b88b87ba86a03c15e669fb4aa5bf0991c180d3c6895ce72eaade54a1
    etcdctl version: 3.4.14
    API version: 3.4
    {"level":"info","ts":1624647639.0188997,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/home/core/assets/backup/snapshot_2021-06-25_190035.db.part"}
    {"level":"info","ts":"2021-06-25T19:00:39.030Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
    {"level":"info","ts":1624647639.0301006,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://10.0.0.5:2379"}
    {"level":"info","ts":"2021-06-25T19:00:40.215Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
    {"level":"info","ts":1624647640.6032252,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://10.0.0.5:2379","size":"114 MB","took":1.584090459}
    {"level":"info","ts":1624647640.6047094,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/home/core/assets/backup/snapshot_2021-06-25_190035.db"}
    Snapshot saved at /home/core/assets/backup/snapshot_2021-06-25_190035.db
    {"hash":3866667823,"revision":31407,"totalKey":12828,"totalSize":114446336}
    snapshot db and kube resources are successfully saved to /home/core/assets/backup

    In this example, two files are created in the /home/core/assets/backup/ directory on the control plane host:

    • snapshot_<datetimestamp>.db: This file is the etcd snapshot. The cluster-backup.sh script confirms its validity.
    • static_kuberesources_<datetimestamp>.tar.gz: This file contains the resources for the static pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.

      Note

      If etcd encryption is enabled, it is recommended to store this second file separately from the etcd snapshot for security reasons. However, this file is required to restore from the etcd snapshot.

      Keep in mind that etcd encryption only encrypts values, not keys. This means that resource types, namespaces, and object names are unencrypted.

17.1.5.2.2. Creating a single etcd backup

Follow these steps to create a single etcd backup by creating and applying a custom resource (CR).

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have access to the OpenShift CLI (oc).

Procedure

  • If dynamically-provisioned storage is available, complete the following steps to create a single automated etcd backup:

    1. Create a persistent volume claim (PVC) named etcd-backup-pvc.yaml with contents such as the following example:

      kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: etcd-backup-pvc
        namespace: openshift-etcd
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi 1
        volumeMode: Filesystem
      1
      The amount of storage available to the PVC. Adjust this value for your requirements.
    2. Apply the PVC by running the following command:

      $ oc apply -f etcd-backup-pvc.yaml
    3. Verify the creation of the PVC by running the following command:

      $ oc get pvc

      Example output

      NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      etcd-backup-pvc   Bound                                                       51s

      Note

      Dynamic PVCs stay in the Pending state until they are mounted.

    4. Create a CR file named etcd-single-backup.yaml with contents such as the following example:

      apiVersion: operator.openshift.io/v1alpha1
      kind: EtcdBackup
      metadata:
        name: etcd-single-backup
        namespace: openshift-etcd
      spec:
        pvcName: etcd-backup-pvc 1
      1
      The name of the PVC to save the backup to. Adjust this value according to your environment.
    5. Apply the CR to start a single backup:

      $ oc apply -f etcd-single-backup.yaml
  • If dynamically-provisioned storage is not available, complete the following steps to create a single automated etcd backup:

    1. Create a StorageClass CR file named etcd-backup-local-storage.yaml with the following contents:

      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
        name: etcd-backup-local-storage
      provisioner: kubernetes.io/no-provisioner
      volumeBindingMode: Immediate
    2. Apply the StorageClass CR by running the following command:

      $ oc apply -f etcd-backup-local-storage.yaml
    3. Create a PV named etcd-backup-pv-fs.yaml with contents such as the following example:

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-backup-pv-fs
      spec:
        capacity:
          storage: 100Gi 1
        volumeMode: Filesystem
        accessModes:
        - ReadWriteOnce
        persistentVolumeReclaimPolicy: Retain
        storageClassName: etcd-backup-local-storage
        local:
          path: /mnt
        nodeAffinity:
          required:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - <example_master_node> 2
      1
      The amount of storage available to the PV. Adjust this value for your requirements.
      2
      Replace this value with the node to attach this PV to.
    4. Verify the creation of the PV by running the following command:

      $ oc get pv

      Example output

      NAME                    CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS                REASON   AGE
      etcd-backup-pv-fs       100Gi      RWO            Retain           Available           etcd-backup-local-storage            10s

    5. Create a PVC named etcd-backup-pvc.yaml with contents such as the following example:

      kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: etcd-backup-pvc
        namespace: openshift-etcd
      spec:
        accessModes:
        - ReadWriteOnce
        volumeMode: Filesystem
        resources:
          requests:
            storage: 10Gi 1
      1
      The amount of storage available to the PVC. Adjust this value for your requirements.
    6. Apply the PVC by running the following command:

      $ oc apply -f etcd-backup-pvc.yaml
    7. Create a CR file named etcd-single-backup.yaml with contents such as the following example:

      apiVersion: operator.openshift.io/v1alpha1
      kind: EtcdBackup
      metadata:
        name: etcd-single-backup
        namespace: openshift-etcd
      spec:
        pvcName: etcd-backup-pvc 1
      1
      The name of the persistent volume claim (PVC) to save the backup to. Adjust this value according to your environment.
    8. Apply the CR to start a single backup:

      $ oc apply -f etcd-single-backup.yaml

Additional resources

17.1.5.3. Checking the cluster health

You should check the cluster health often during the update. Check the node status, cluster Operator status, and failed pods.

Procedure

  1. Check the status of the cluster Operators by running the following command:

    $ oc get co

    Example output

    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.14.34   True        False         False      4d22h
    baremetal                                  4.14.34   True        False         False      4d22h
    cloud-controller-manager                   4.14.34   True        False         False      4d23h
    cloud-credential                           4.14.34   True        False         False      4d23h
    cluster-autoscaler                         4.14.34   True        False         False      4d22h
    config-operator                            4.14.34   True        False         False      4d22h
    console                                    4.14.34   True        False         False      4d22h
    ...
    service-ca                                 4.14.34   True        False         False      4d22h
    storage                                    4.14.34   True        False         False      4d22h

  2. Check the status of the cluster nodes:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE     VERSION
    ctrl-plane-0   Ready    control-plane,master   4d22h   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   4d22h   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   4d22h   v1.27.15+6147456
    worker-0       Ready    mcp-1,worker           4d22h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           4d22h   v1.27.15+6147456

  3. Check that there are no in-progress or failed pods. There should be no pods returned when you run the following command.

    $ oc get po -A | grep -E -iv 'running|complete'

17.1.6. Completing the Control Plane Only cluster update

Follow these steps to perform the Control Plane Only cluster update and monitor the update through to completion.

Important

Control Plane Only updates were previously known as EUS-to-EUS updates. Control Plane Only updates are only viable between even-numbered minor versions of OpenShift Container Platform.

17.1.6.1. Acknowledging the Control Plane Only or y-stream update

When you update to any version from 4.11 and later, you must manually acknowledge that the update can continue.

Important

Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OpenShift Container Platform 4.17, there are no API removals. See "Kubernetes API removals" for more information.

Procedure

  1. Run the following command:

    $ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-<update_version_from>-kube-<kube_api_version>-api-removals-in-<update_version_to>":"true"}}' --type=merge

    where:

    <update_version_from>
    Is the cluster version you are moving from, for example, 4.14.
    <kube_api_version>
    Is the kube API version, for example, 1.28.
    <update_version_to>
    Is the cluster version you are moving to, for example, 4.15.

Verification

  • Verify the update. Run the following command:

    $ oc get configmap admin-acks -n openshift-config -o json | jq .data

    Example output

    {
      "ack-4.14-kube-1.28-api-removals-in-4.15": "true",
      "ack-4.15-kube-1.29-api-removals-in-4.16": "true"
    }

    Note

    In this example, the cluster is updated from version 4.14 to 4.15, and then from 4.15 to 4.16 in a Control Plane Only update.

Additional resources

17.1.6.2. Starting the cluster update

When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.

Note

You can verify that you are updating to a viable release by running the oc adm upgrade command. The oc adm upgrade command lists the compatible update releases.

Procedure

  1. Start the update:

    $ oc adm upgrade --to=4.15.33
    Important
    • Control Plane Only update: Make sure that you point to the interim <y+1> release path.
    • Y-stream update: Make sure that you use the correct <y.z> release that follows the Kubernetes version skew policy.
    • Z-stream update: Verify that there are no problems moving to that specific release.

    Example output

    Requested update to 4.15.33 1

    1
    The Requested update value changes depending on your particular update.

Additional resources

17.1.6.3. Monitoring the cluster update

You should check the cluster health often during the update. Check the node status, cluster Operator status, and failed pods.

Procedure

  • Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:

    $ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.34   True        True          4m6s    Working towards 4.15.33: 111 of 873 done (12% complete), waiting on kube-apiserver
    
    NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                 4.14.34   True        False         False      4d22h
    baremetal                      4.14.34   True        False         False      4d23h
    cloud-controller-manager       4.14.34   True        False         False      4d23h
    cloud-credential               4.14.34   True        False         False      4d23h
    cluster-autoscaler             4.14.34   True        False         False      4d23h
    console                        4.14.34   True        False         False      4d22h
    
    ...
    
    storage                        4.14.34   True        False         False      4d23h
    config-operator                4.15.33   True        False         False      4d23h
    etcd                           4.15.33   True        False         False      4d23h
    
    NAME           STATUS   ROLES                  AGE     VERSION
    ctrl-plane-0   Ready    control-plane,master   4d23h   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   4d23h   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   4d23h   v1.27.15+6147456
    worker-0       Ready    mcp-1,worker           4d23h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           4d23h   v1.27.15+6147456
    
    NAMESPACE               NAME                       READY   STATUS              RESTARTS   AGE
    openshift-marketplace   redhat-marketplace-rf86t   0/1     ContainerCreating   0          0s

Verification

During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of the Operator update in the MESSAGE column.

When the cluster Operator update process is complete, each control plane node is rebooted, one at a time.

Note

During this part of the update, messages are reported that state that cluster Operators are being updated again or are in a degraded state. This is because the control plane node is temporarily offline while it reboots.

As soon as the last control plane node reboot is complete, the cluster version is displayed as updated.

When the control plane update is complete, a message such as the following is displayed. This example shows an update completed to the intermediate y-stream release.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.33   True        False         28m     Cluster version is 4.15.33

NAME                         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication               4.15.33   True        False         False	  5d
baremetal                    4.15.33   True        False         False	  5d
cloud-controller-manager     4.15.33   True        False         False	  5d1h
cloud-credential             4.15.33   True        False         False	  5d1h
cluster-autoscaler           4.15.33   True        False         False	  5d
config-operator              4.15.33   True        False         False	  5d
console                      4.15.33   True        False         False	  5d

...

service-ca                   4.15.33   True        False         False	  5d
storage                      4.15.33   True        False         False	  5d

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-1   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-2   Ready    control-plane,master   5d    v1.28.13+2ca1a23
worker-0       Ready    mcp-1,worker           5d    v1.28.13+2ca1a23
worker-1       Ready    mcp-2,worker           5d    v1.28.13+2ca1a23

17.1.6.4. Updating the OLM Operators

In telco environments, software needs to be vetted before it is loaded onto a production cluster. Production clusters are also configured in a disconnected network, which means that they are not always directly connected to the internet. Because the clusters are in a disconnected network, the OpenShift Operators are configured for manual update during installation so that new versions can be managed on a cluster-by-cluster basis. Perform the following procedure to move the Operators to the newer versions.

Procedure

  1. Check to see which Operators need to be updated:

    $ oc get installplan -A | grep -E 'APPROVED|false'

    Example output

    NAMESPACE           NAME            CSV                                               APPROVAL   APPROVED
    metallb-system      install-nwjnh   metallb-operator.v4.16.0-202409202304             Manual     false
    openshift-nmstate   install-5r7wr   kubernetes-nmstate-operator.4.16.0-202409251605   Manual     false

  2. Patch the InstallPlan resources for those Operators:

    $ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
    '{"spec":{"approved":true}}'

    Example output

    installplan.operators.coreos.com/install-nwjnh patched

  3. Monitor the namespace by running the following command:

    $ oc get all -n metallb-system

    Example output

    NAME                                                       READY   STATUS              RESTARTS   AGE
    pod/metallb-operator-controller-manager-69b5f884c-8bp22    0/1     ContainerCreating   0          4s
    pod/metallb-operator-controller-manager-77895bdb46-bqjdx   1/1     Running             0          4m1s
    pod/metallb-operator-webhook-server-5d9b968896-vnbhk       0/1     ContainerCreating   0          4s
    pod/metallb-operator-webhook-server-d76f9c6c8-57r4w        1/1     Running             0          4m1s
    
    ...
    
    NAME                                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         0       4s
    replicaset.apps/metallb-operator-controller-manager-77895bdb46   1         1         1       4m1s
    replicaset.apps/metallb-operator-controller-manager-99b76f88     0         0         0       4m40s
    replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         0       4s
    replicaset.apps/metallb-operator-webhook-server-6f7dbfdb88       0         0         0       4m40s
    replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        1         1         1       4m1s

    When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready:

    NAME                                                      READY   STATUS    RESTARTS   AGE
    pod/metallb-operator-controller-manager-69b5f884c-8bp22   1/1     Running   0          25s
    pod/metallb-operator-webhook-server-5d9b968896-vnbhk      1/1     Running   0          25s
    
    ...
    
    NAME                                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         1       25s
    replicaset.apps/metallb-operator-controller-manager-77895bdb46   0         0         0       4m22s
    replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         1       25s
    replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        0         0         0       4m22s

Verification

  • Verify that the Operators do not need to be updated for a second time:

    $ oc get installplan -A | grep -E 'APPROVED|false'

    There should be no output returned.

    Note

    Sometimes you have to approve an update twice because some Operators have interim z-stream release versions that need to be installed before the final version.

Additional resources

17.1.6.4.1. Performing the second y-stream update

After completing the first y-stream update, you must update the y-stream control plane version to the new EUS version.

Procedure

  1. Verify that the <4.y.z> release that you selected is still listed as a recommended update target:

    $ oc adm upgrade

    Example output

    Cluster version is 4.15.33
    
    Upgradeable=False
    
      Reason: AdminAckRequired
      Message: Kubernetes 1.29 and therefore OpenShift 4.16 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7031404 for details and instructions.
    
    Upstream is unset, so the cluster will use an appropriate default.
    Channel: eus-4.16 (available channels: candidate-4.15, candidate-4.16, eus-4.16, fast-4.15, fast-4.16, stable-4.15, stable-4.16)
    
    Recommended updates:
    
      VERSION     IMAGE
      4.16.14     quay.io/openshift-release-dev/ocp-release@sha256:0521a0f1acd2d1b77f76259cb9bae9c743c60c37d9903806a3372c1414253658
      4.16.13     quay.io/openshift-release-dev/ocp-release@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b
      4.15.34     quay.io/openshift-release-dev/ocp-release@sha256:f2e0c593f6ed81250c11d0bac94dbaf63656223477b7e8693a652f933056af6e

    Note

    If you update soon after the initial GA of a new y-stream release, you might not see new y-stream releases available when you run the oc adm upgrade command.

  2. Optional: View the potential update releases that are not recommended. Run the following command:

    $ oc adm upgrade --include-not-recommended

    Example output

    Cluster version is 4.15.33
    
    Upgradeable=False
    
      Reason: AdminAckRequired
      Message: Kubernetes 1.29 and therefore OpenShift 4.16 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7031404 for details and instructions.
    
    Upstream is unset, so the cluster will use an appropriate default.
    Channel: eus-4.16 (available channels: candidate-4.15, candidate-4.16, eus-4.16, fast-4.15, fast-4.16, stable-4.15, stable-4.16)
    
    Recommended updates:
    
      VERSION     IMAGE
      4.16.14     quay.io/openshift-release-dev/ocp-release@sha256:0521a0f1acd2d1b77f76259cb9bae9c743c60c37d9903806a3372c1414253658
      4.16.13     quay.io/openshift-release-dev/ocp-release@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b
      4.15.34     quay.io/openshift-release-dev/ocp-release@sha256:f2e0c593f6ed81250c11d0bac94dbaf63656223477b7e8693a652f933056af6e
    
    Supported but not recommended updates:
    
      Version: 4.16.15
      Image: quay.io/openshift-release-dev/ocp-release@sha256:671bc35e
      Recommended: Unknown
      Reason: EvaluationFailed
      Message: Exposure to AzureRegistryImagePreservation is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 0
      In Azure clusters, the in-cluster image registry may fail to preserve images on update. https://issues.redhat.com/browse/IR-461

    Note

    The example shows a potential error that can affect clusters hosted in Microsoft Azure. It does not show risks for bare-metal clusters.

17.1.6.4.2. Acknowledging the y-stream release update

When moving between y-stream releases, you must run a patch command to explicitly acknowledge the update. In the output of the oc adm upgrade command, a URL is provided that shows the specific command to run.

Important

Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OpenShift Container Platform 4.17, there are no API removals. See "Kubernetes API removals" for more information.

Procedure

  1. Acknowledge the y-stream release upgrade by patching the admin-acks config map in the openshift-config namespace. For example, run the following command:

    $ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.15-kube-1.29-api-removals-in-4.16":"true"}}' --type=merge

    Example output

    configmap/admin-acks patched

17.1.6.5. Starting the y-stream control plane update

After you have determined the full new release that you are moving to, you can run the oc adm upgrade --to=x.y.z command.

Procedure

  • Start the y-stream control plane update. For example, run the following command:

    $ oc adm upgrade --to=4.16.14

    Example output

    Requested update to 4.16.14

    You might move to a z-stream release that has potential issues with platforms other than the one you are running on. The following example shows a potential problem for cluster updates on Microsoft Azure:

    $ oc adm upgrade --to=4.16.15

    Example output

    error: the update 4.16.15 is not one of the recommended updates, but is available as a conditional update. To accept the Recommended=Unknown risk and to proceed with update use --allow-not-recommended.
      Reason: EvaluationFailed
      Message: Exposure to AzureRegistryImagePreservation is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 0
      In Azure clusters, the in-cluster image registry may fail to preserve images on update. https://issues.redhat.com/browse/IR-461

    Note

    The example shows a potential error that can affect clusters hosted in Microsoft Azure. It does not show risks for bare-metal clusters.

    $ oc adm upgrade --to=4.16.15 --allow-not-recommended

    Example output

    warning: with --allow-not-recommended you have accepted the risks with 4.14.11 and bypassed Recommended=Unknown EvaluationFailed: Exposure to AzureRegistryImagePreservation is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 0
    In Azure clusters, the in-cluster image registry may fail to preserve images on update. https://issues.redhat.com/browse/IR-461
    
    Requested update to 4.16.15

17.1.6.6. Monitoring the second part of a <y+1> cluster update

Monitor the second part of the cluster update to the <y+1> version.

Procedure

  • Monitor the progress of the second part of the <y+1> update. For example, to monitor the update from 4.15 to 4.16, run the following command:

    $ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.15; oc get co | grep 4.16; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.15.33   True        True          10m     Working towards 4.16.14: 132 of 903 done (14% complete), waiting on kube-controller-manager, kube-scheduler
    
    NAME                         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication               4.15.33   True        False         False      5d3h
    baremetal                    4.15.33   True        False         False      5d4h
    cloud-controller-manager     4.15.33   True        False         False      5d4h
    cloud-credential             4.15.33   True        False         False      5d4h
    cluster-autoscaler           4.15.33   True        False         False      5d4h
    console                      4.15.33   True        False         False      5d3h
    ...
    config-operator              4.16.14   True        False         False      5d4h
    etcd                         4.16.14   True        False         False      5d4h
    kube-apiserver               4.16.14   True        True          False      5d4h    NodeInstallerProgressing: 1 node is at revision 15; 2 nodes are at revision 17
    
    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d4h   v1.28.13+2ca1a23
    ctrl-plane-1   Ready    control-plane,master   5d4h   v1.28.13+2ca1a23
    ctrl-plane-2   Ready    control-plane,master   5d4h   v1.28.13+2ca1a23
    worker-0       Ready    mcp-1,worker           5d4h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           5d4h   v1.27.15+6147456
    
    NAMESPACE                                          NAME                                                              READY   STATUS      RESTARTS       AGE
    openshift-kube-apiserver                           kube-apiserver-ctrl-plane-0                                       0/5     Pending     0              <invalid>

    As soon as the update of the last control plane node is complete, the cluster version is updated to the new EUS release. For example:

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.16.14   True        False         123m    Cluster version is 4.16.14
    
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.16.14   True        False         False	  5d6h
    baremetal                                  4.16.14   True        False         False	  5d7h
    cloud-controller-manager                   4.16.14   True        False         False	  5d7h
    cloud-credential                           4.16.14   True        False         False	  5d7h
    cluster-autoscaler                         4.16.14   True        False         False	  5d7h
    config-operator                            4.16.14   True        False         False	  5d7h
    console                                    4.16.14   True        False         False	  5d6h
    #...
    operator-lifecycle-manager-packageserver   4.16.14   True        False         False	  5d7h
    service-ca                                 4.16.14   True        False         False	  5d7h
    storage                                    4.16.14   True        False         False	  5d7h
    
    NAME           STATUS   ROLES                  AGE     VERSION
    ctrl-plane-0   Ready    control-plane,master   5d7h    v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d7h    v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d7h    v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d7h    v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           5d7h    v1.27.15+6147456


17.1.6.7. Updating all the OLM Operators

In the second phase of a multi-version upgrade, you must approve all of the Operators and additionally add installation plans for any other Operators that you want to upgrade.

Follow the same procedure as outlined in "Updating the OLM Operators". Ensure that you also update any non-OLM Operators as required.

Procedure

  1. Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:

    $ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"
  2. Check to see which Operators need to be updated:

    $ oc get installplan -A | grep -E 'APPROVED|false'
  3. Patch the InstallPlan resources for those Operators:

    $ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
    '{"spec":{"approved":true}}'
  4. Monitor the namespace by running the following command:

    $ oc get all -n metallb-system

    When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready.
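
    If many Operators are pending at once, you might prefer to approve all unapproved install plans in a single pass instead of patching each one individually as in step 3. The following is a minimal sketch that assumes the jq tool is available on the workstation where you run oc; review the pending list first so that you only approve Operator versions that you have vetted:

    $ oc get installplan -A -o json | \
        jq -r '.items[] | select(.spec.approved == false) | "\(.metadata.namespace) \(.metadata.name)"' | \
        while read -r namespace installplan; do
          oc patch installplan -n "$namespace" "$installplan" --type merge --patch '{"spec":{"approved":true}}'
        done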

Verification

During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of the Operator update in the MESSAGE column.

When the cluster Operators update process is complete, each control plane node is rebooted, one at a time.

Note

During this part of the update, messages are reported that state cluster Operators are being updated again or are in a degraded state. This is because each control plane node is temporarily offline while it reboots.

17.1.6.8. Updating the worker nodes

After you have updated the control plane, you upgrade the worker nodes by unpausing the relevant mcp groups that you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each of the worker nodes in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version, as required.

In the case of Control Plane Only upgrades, note that an updated worker node requires only one reboot and jumps directly to the <y+2>-release version. This feature was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.

Important

This is a potential holding point. A cluster with its control plane updated to the new EUS release and its worker nodes still at the <y-2>-release is fully supported to run in production. This allows large clusters to be upgraded in steps across several maintenance windows.

  1. You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:

    $ oc get mcp

    Example output

    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-c9a52144456dbff9c9af9c5a37d1b614   True      False      False      3              3                   3                     0                      36d
    mcp-1    rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    mcp-2    rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    worker   rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0   True      False      False      0              0                   0                     0                      36d

    Note

    You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.

  2. Get the list of nodes in the cluster:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d8h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           5d8h   v1.27.15+6147456

  3. Confirm the MachineConfigPool groups that are paused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   true
    mcp-2   true

    Note

    Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, the other MCPs do not need to be unpaused immediately. The cluster is supported when running with some worker nodes still at the <y-2>-release version.

  4. Unpause the required mcp group to begin the upgrade:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'

    Example output

    machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched

  5. Confirm that the required mcp group is unpaused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   true

  6. As each mcp group is upgraded, continue to unpause and upgrade the remaining nodes.

    $ oc get nodes

    Example output

    NAME           STATUS                        ROLES                  AGE    VERSION
    ctrl-plane-0   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready                         mcp-1,worker           5d8h   v1.29.8+f10c92d
    worker-1       NotReady,SchedulingDisabled   mcp-2,worker           5d8h   v1.27.15+6147456
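
    To keep track of the rollout while you unpause the remaining mcp groups, you can reuse the same style of watch command that is used for the control plane update. This is a suggested example, not a required step:

    $ watch "oc get mcp; echo; oc get nodes"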

17.1.6.9. Verifying the health of the newly updated cluster

Run the following commands after updating the cluster to verify that the cluster is back up and running.

Procedure

  1. Check the cluster version by running the following command:

    $ oc get clusterversion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.16.14   True        False         4h38m   Cluster version is 4.16.14

    This should return the new cluster version and the PROGRESSING column should return False.

  2. Check that all nodes are ready:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d9h   v1.29.8+f10c92d
    worker-1       Ready    mcp-2,worker           5d9h   v1.29.8+f10c92d

    All nodes in the cluster should be in a Ready status and running the same version.

  3. Check that there are no paused mcp resources in the cluster:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   false

  4. Check that all cluster Operators are available:

    $ oc get co

    Example output

    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.16.14   True        False         False      5d9h
    baremetal                                  4.16.14   True        False         False      5d9h
    cloud-controller-manager                   4.16.14   True        False         False      5d10h
    cloud-credential                           4.16.14   True        False         False      5d10h
    cluster-autoscaler                         4.16.14   True        False         False      5d9h
    config-operator                            4.16.14   True        False         False      5d9h
    console                                    4.16.14   True        False         False      5d9h
    control-plane-machine-set                  4.16.14   True        False         False      5d9h
    csi-snapshot-controller                    4.16.14   True        False         False      5d9h
    dns                                        4.16.14   True        False         False      5d9h
    etcd                                       4.16.14   True        False         False      5d9h
    image-registry                             4.16.14   True        False         False      85m
    ingress                                    4.16.14   True        False         False      5d9h
    insights                                   4.16.14   True        False         False      5d9h
    kube-apiserver                             4.16.14   True        False         False      5d9h
    kube-controller-manager                    4.16.14   True        False         False      5d9h
    kube-scheduler                             4.16.14   True        False         False      5d9h
    kube-storage-version-migrator              4.16.14   True        False         False      4h48m
    machine-api                                4.16.14   True        False         False      5d9h
    machine-approver                           4.16.14   True        False         False      5d9h
    machine-config                             4.16.14   True        False         False      5d9h
    marketplace                                4.16.14   True        False         False      5d9h
    monitoring                                 4.16.14   True        False         False      5d9h
    network                                    4.16.14   True        False         False      5d9h
    node-tuning                                4.16.14   True        False         False      5d7h
    openshift-apiserver                        4.16.14   True        False         False      5d9h
    openshift-controller-manager               4.16.14   True        False         False      5d9h
    openshift-samples                          4.16.14   True        False         False      5h24m
    operator-lifecycle-manager                 4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-catalog         4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-packageserver   4.16.14   True        False         False      5d9h
    service-ca                                 4.16.14   True        False         False      5d9h
    storage                                    4.16.14   True        False         False      5d9h

    All cluster Operators should report True in the AVAILABLE column.

  5. Check that all pods are healthy:

    $ oc get po -A | grep -E -iv 'complete|running'

    This should not return any pods.

    Note

    You might see a few pods still moving after the update. Watch this for a while to make sure all pods are cleared.

17.1.7. Completing the y-stream cluster update

Follow these steps to perform the y-stream cluster update and monitor the update through to completion. Completing a y-stream update is more straightforward than a Control Plane Only update.

17.1.7.1. Acknowledging the Control Plane Only or y-stream update

When you update to any version from 4.11 onward, you must manually acknowledge that the update can continue.

Important

Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OpenShift Container Platform 4.17, there are no API removals. See "Kubernetes API removals" for more information.

Procedure

  1. Run the following command:

    $ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-<update_version_from>-kube-<kube_api_version>-api-removals-in-<update_version_to>":"true"}}' --type=merge

    where:

    <update_version_from>
    Is the cluster version you are moving from, for example, 4.14.
    <kube_api_version>
    Is the Kubernetes API version, for example, 1.28.
    <update_version_to>
    Is the cluster version you are moving to, for example, 4.15.
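
    For example, for the update from 4.14 to 4.15 that is shown in the verification output below, the command is:

    $ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.14-kube-1.28-api-removals-in-4.15":"true"}}' --type=merge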

Verification

  • Verify the update. Run the following command:

    $ oc get configmap admin-acks -n openshift-config -o json | jq .data

    Example output

    {
      "ack-4.14-kube-1.28-api-removals-in-4.15": "true",
      "ack-4.15-kube-1.29-api-removals-in-4.16": "true"
    }

    Note

    In this example, the cluster is updated from version 4.14 to 4.15, and then from 4.15 to 4.16 in a Control Plane Only update.


17.1.7.2. Starting the cluster update

When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.

Note

You can verify that you are updating to a viable release by running the oc adm upgrade command. The oc adm upgrade command lists the compatible update releases.
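
For example, running the command with no arguments lists the current version and the recommended update targets:

$ oc adm upgrade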

Procedure

  1. Start the update:

    $ oc adm upgrade --to=4.15.33

    Important

    • Control Plane Only update: Make sure you point to the interim <y+1> release path.
    • Y-stream update: Make sure you use the correct <y.z> release that follows the Kubernetes version skew policy.
    • Z-stream update: Verify that there are no problems moving to that specific release.

    Example output

    Requested update to 4.15.33 1

    1
    The Requested update value changes depending on your particular update.


17.1.7.3. Monitoring the cluster update

You should check the cluster health often during the update. Check the node status, the cluster Operator status, and for any failed pods.

Procedure

  • Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:

    $ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.34   True        True          4m6s    Working towards 4.15.33: 111 of 873 done (12% complete), waiting on kube-apiserver
    
    NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                 4.14.34   True        False         False      4d22h
    baremetal                      4.14.34   True        False         False      4d23h
    cloud-controller-manager       4.14.34   True        False         False      4d23h
    cloud-credential               4.14.34   True        False         False      4d23h
    cluster-autoscaler             4.14.34   True        False         False      4d23h
    console                        4.14.34   True        False         False      4d22h
    
    ...
    
    storage                        4.14.34   True        False         False      4d23h
    config-operator                4.15.33   True        False         False      4d23h
    etcd                           4.15.33   True        False         False      4d23h
    
    NAME           STATUS   ROLES                  AGE     VERSION
    ctrl-plane-0   Ready    control-plane,master   4d23h   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   4d23h   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   4d23h   v1.27.15+6147456
    worker-0       Ready    mcp-1,worker           4d23h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           4d23h   v1.27.15+6147456
    
    NAMESPACE               NAME                       READY   STATUS              RESTARTS   AGE
    openshift-marketplace   redhat-marketplace-rf86t   0/1     ContainerCreating   0          0s

Verification

During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of the Operator update in the MESSAGE column.

When the cluster Operators update process is complete, each control plane node is rebooted, one at a time.

Note

During this part of the update, messages are reported that state cluster Operators are being updated again or are in a degraded state. This is because each control plane node is temporarily offline while it reboots.

As soon as the last control plane node reboot is complete, the cluster version is displayed as updated.

When the control plane update is complete, a message such as the following is displayed. This example shows an update completed to the intermediate y-stream release.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.33   True        False         28m     Cluster version is 4.15.33

NAME                         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication               4.15.33   True        False         False	  5d
baremetal                    4.15.33   True        False         False	  5d
cloud-controller-manager     4.15.33   True        False         False	  5d1h
cloud-credential             4.15.33   True        False         False	  5d1h
cluster-autoscaler           4.15.33   True        False         False	  5d
config-operator              4.15.33   True        False         False	  5d
console                      4.15.33   True        False         False	  5d

...

service-ca                   4.15.33   True        False         False	  5d
storage                      4.15.33   True        False         False	  5d

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-1   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-2   Ready    control-plane,master   5d    v1.28.13+2ca1a23
worker-0       Ready    mcp-1,worker           5d    v1.28.13+2ca1a23
worker-1       Ready    mcp-2,worker           5d    v1.28.13+2ca1a23

17.1.7.4. Updating the OLM Operators

In telco environments, software needs to be vetted before it is loaded onto a production cluster. Production clusters are also configured in a disconnected network, which means that they are not always directly connected to the internet. Because the clusters are in a disconnected network, the OpenShift Operators are configured for manual update during installation so that new versions can be managed on a cluster-by-cluster basis. Perform the following procedure to move the Operators to the newer versions.
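
You can confirm that an Operator subscription is set for manual approval by checking the installPlanApproval field of the Subscription resource. This is a minimal sketch that assumes the MetalLB Operator subscription used later in this procedure is in the metallb-system namespace:

$ oc get subscription -n metallb-system -o jsonpath='{.items[*].spec.installPlanApproval}{"\n"}'

The command should return Manual for Operators that are configured for manual updates.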

Procedure

  1. Check to see which Operators need to be updated:

    $ oc get installplan -A | grep -E 'APPROVED|false'

    Example output

    NAMESPACE           NAME            CSV                                               APPROVAL   APPROVED
    metallb-system      install-nwjnh   metallb-operator.v4.16.0-202409202304             Manual     false
    openshift-nmstate   install-5r7wr   kubernetes-nmstate-operator.4.16.0-202409251605   Manual     false

  2. Patch the InstallPlan resources for those Operators:

    $ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
    '{"spec":{"approved":true}}'

    Example output

    installplan.operators.coreos.com/install-nwjnh patched

  3. Monitor the namespace by running the following command:

    $ oc get all -n metallb-system

    Example output

    NAME                                                       READY   STATUS              RESTARTS   AGE
    pod/metallb-operator-controller-manager-69b5f884c-8bp22    0/1     ContainerCreating   0          4s
    pod/metallb-operator-controller-manager-77895bdb46-bqjdx   1/1     Running             0          4m1s
    pod/metallb-operator-webhook-server-5d9b968896-vnbhk       0/1     ContainerCreating   0          4s
    pod/metallb-operator-webhook-server-d76f9c6c8-57r4w        1/1     Running             0          4m1s
    
    ...
    
    NAME                                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         0       4s
    replicaset.apps/metallb-operator-controller-manager-77895bdb46   1         1         1       4m1s
    replicaset.apps/metallb-operator-controller-manager-99b76f88     0         0         0       4m40s
    replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         0       4s
    replicaset.apps/metallb-operator-webhook-server-6f7dbfdb88       0         0         0       4m40s
    replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        1         1         1       4m1s

    When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready:

    NAME                                                      READY   STATUS    RESTARTS   AGE
    pod/metallb-operator-controller-manager-69b5f884c-8bp22   1/1     Running   0          25s
    pod/metallb-operator-webhook-server-5d9b968896-vnbhk      1/1     Running   0          25s
    
    ...
    
    NAME                                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         1       25s
    replicaset.apps/metallb-operator-controller-manager-77895bdb46   0         0         0       4m22s
    replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         1       25s
    replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        0         0         0       4m22s

Verification

  • Verify that the Operators do not need to be updated for a second time:

    $ oc get installplan -A | grep -E 'APPROVED|false'

    There should be no output returned.

    Note

    Sometimes you have to approve an update twice because some Operators have interim z-stream release versions that need to be installed before the final version.
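
    You can also confirm the Operator version that is now installed by listing the ClusterServiceVersion resources in the Operator namespace. For example, for the MetalLB Operator used in this procedure, the PHASE column should report Succeeded for the new version:

    $ oc get csv -n metallb-system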


17.1.7.5. Updating the worker nodes

After you have updated the control plane, you upgrade the worker nodes by unpausing the relevant mcp groups that you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each of the worker nodes in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version, as required.

In the case of Control Plane Only upgrades, note that an updated worker node requires only one reboot and jumps directly to the <y+2>-release version. This feature was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.

Important

This is a potential holding point. A cluster with its control plane updated to the new EUS release and its worker nodes still at the <y-2>-release is fully supported to run in production. This allows large clusters to be upgraded in steps across several maintenance windows.

  1. You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:

    $ oc get mcp

    Example output

    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-c9a52144456dbff9c9af9c5a37d1b614   True      False      False      3              3                   3                     0                      36d
    mcp-1    rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    mcp-2    rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    worker   rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0   True      False      False      0              0                   0                     0                      36d

    Note

    You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.

  2. Get the list of nodes in the cluster:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d8h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           5d8h   v1.27.15+6147456

  3. Confirm the MachineConfigPool groups that are paused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   true
    mcp-2   true

    Note

    Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, the other MCPs do not need to be unpaused immediately. The cluster is supported when running with some worker nodes still at the <y-2>-release version.

  4. Unpause the required mcp group to begin the upgrade:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'

    Example output

    machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched

  5. Confirm that the required mcp group is unpaused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   true

  6. As each mcp group is upgraded, continue to unpause and upgrade the remaining nodes.

    $ oc get nodes

    Example output

    NAME           STATUS                        ROLES                  AGE    VERSION
    ctrl-plane-0   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready                         mcp-1,worker           5d8h   v1.29.8+f10c92d
    worker-1       NotReady,SchedulingDisabled   mcp-2,worker           5d8h   v1.27.15+6147456

17.1.7.6. Verifying the health of the newly updated cluster

Run the following commands after updating the cluster to verify that the cluster is back up and running.

Procedure

  1. Check the cluster version by running the following command:

    $ oc get clusterversion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.16.14   True        False         4h38m   Cluster version is 4.16.14

    This should return the new cluster version and the PROGRESSING column should return False.

  2. Check that all nodes are ready:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d9h   v1.29.8+f10c92d
    worker-1       Ready    mcp-2,worker           5d9h   v1.29.8+f10c92d

    All nodes in the cluster should be in a Ready status and running the same version.

  3. Check that there are no paused mcp resources in the cluster:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   false

  4. Check that all cluster Operators are available:

    $ oc get co

    Example output

    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.16.14   True        False         False      5d9h
    baremetal                                  4.16.14   True        False         False      5d9h
    cloud-controller-manager                   4.16.14   True        False         False      5d10h
    cloud-credential                           4.16.14   True        False         False      5d10h
    cluster-autoscaler                         4.16.14   True        False         False      5d9h
    config-operator                            4.16.14   True        False         False      5d9h
    console                                    4.16.14   True        False         False      5d9h
    control-plane-machine-set                  4.16.14   True        False         False      5d9h
    csi-snapshot-controller                    4.16.14   True        False         False      5d9h
    dns                                        4.16.14   True        False         False      5d9h
    etcd                                       4.16.14   True        False         False      5d9h
    image-registry                             4.16.14   True        False         False      85m
    ingress                                    4.16.14   True        False         False      5d9h
    insights                                   4.16.14   True        False         False      5d9h
    kube-apiserver                             4.16.14   True        False         False      5d9h
    kube-controller-manager                    4.16.14   True        False         False      5d9h
    kube-scheduler                             4.16.14   True        False         False      5d9h
    kube-storage-version-migrator              4.16.14   True        False         False      4h48m
    machine-api                                4.16.14   True        False         False      5d9h
    machine-approver                           4.16.14   True        False         False      5d9h
    machine-config                             4.16.14   True        False         False      5d9h
    marketplace                                4.16.14   True        False         False      5d9h
    monitoring                                 4.16.14   True        False         False      5d9h
    network                                    4.16.14   True        False         False      5d9h
    node-tuning                                4.16.14   True        False         False      5d7h
    openshift-apiserver                        4.16.14   True        False         False      5d9h
    openshift-controller-manager               4.16.14   True        False         False      5d9h
    openshift-samples                          4.16.14   True        False         False      5h24m
    operator-lifecycle-manager                 4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-catalog         4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-packageserver   4.16.14   True        False         False      5d9h
    service-ca                                 4.16.14   True        False         False      5d9h
    storage                                    4.16.14   True        False         False      5d9h

    All cluster Operators should report True in the AVAILABLE column.

  5. Check that all pods are healthy:

    $ oc get po -A | grep -E -iv 'complete|running'

    This should not return any pods.

    Note

    You might see a few pods still moving after the update. Watch this for a while to make sure all pods are cleared.

17.1.8. Completing the z-stream cluster update

Follow these steps to perform the z-stream cluster update and monitor the update through to completion. Completing a z-stream update is more straightforward than a Control Plane Only or y-stream update.

17.1.8.1. Starting the cluster update

When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.

Note

You can verify that you are updating to a viable release by running the oc adm upgrade command. The oc adm upgrade command lists the compatible update releases.

Procedure

  1. Start the update:

    $ oc adm upgrade --to=4.15.33

    Important

    • Control Plane Only update: Make sure you point to the interim <y+1> release path.
    • Y-stream update: Make sure you use the correct <y.z> release that follows the Kubernetes version skew policy.
    • Z-stream update: Verify that there are no problems moving to that specific release.

    Example output

    Requested update to 4.15.33 1

    1
    The Requested update value changes depending on your particular update.


17.1.8.2. Updating the worker nodes

After you have updated the control plane, you upgrade the worker nodes by unpausing the relevant mcp groups that you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each of the worker nodes in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version, as required.

In the case of Control Plane Only upgrades, note that an updated worker node requires only one reboot and jumps directly to the <y+2>-release version. This feature was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.

Important

This is a potential holding point. A cluster with its control plane updated to the new EUS release and its worker nodes still at the <y-2>-release is fully supported to run in production. This allows large clusters to be upgraded in steps across several maintenance windows.

  1. You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:

    $ oc get mcp

    Example output

    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-c9a52144456dbff9c9af9c5a37d1b614   True      False      False      3              3                   3                     0                      36d
    mcp-1    rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    mcp-2    rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    worker   rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0   True      False      False      0              0                   0                     0                      36d

    Note

    You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.

  2. Get the list of nodes in the cluster:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d8h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           5d8h   v1.27.15+6147456

  3. Confirm the MachineConfigPool groups that are paused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   true
    mcp-2   true

    Note

    Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, the other MCPs do not need to be unpaused immediately. The cluster is supported when running with some worker nodes still at the <y-2>-release version.

  4. Unpause the required mcp group to begin the upgrade:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'

    Example output

    machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched

  5. Confirm that the required mcp group is unpaused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   true

  6. As each mcp group is upgraded, continue to unpause and upgrade the remaining nodes.

    $ oc get nodes

    Example output

    NAME           STATUS                        ROLES                  AGE    VERSION
    ctrl-plane-0   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready                         mcp-1,worker           5d8h   v1.29.8+f10c92d
    worker-1       NotReady,SchedulingDisabled   mcp-2,worker           5d8h   v1.27.15+6147456

17.1.8.3. Verifying the health of the newly updated cluster

Run the following commands after updating the cluster to verify that the cluster is back up and running.

Procedure

  1. Check the cluster version by running the following command:

    $ oc get clusterversion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.16.14   True        False         4h38m   Cluster version is 4.16.14

    This should return the new cluster version and the PROGRESSING column should return False.

  2. Check that all nodes are ready:

    $ oc get nodes

    Example output

    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d9h   v1.29.8+f10c92d
    worker-1       Ready    mcp-2,worker           5d9h   v1.29.8+f10c92d

    All nodes in the cluster should be in a Ready status and running the same version.

  3. Check that there are no paused mcp resources in the cluster:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

    Example output

    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   false

  4. Check that all cluster Operators are available:

    $ oc get co

    Example output

    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.16.14   True        False         False      5d9h
    baremetal                                  4.16.14   True        False         False      5d9h
    cloud-controller-manager                   4.16.14   True        False         False      5d10h
    cloud-credential                           4.16.14   True        False         False      5d10h
    cluster-autoscaler                         4.16.14   True        False         False      5d9h
    config-operator                            4.16.14   True        False         False      5d9h
    console                                    4.16.14   True        False         False      5d9h
    control-plane-machine-set                  4.16.14   True        False         False      5d9h
    csi-snapshot-controller                    4.16.14   True        False         False      5d9h
    dns                                        4.16.14   True        False         False      5d9h
    etcd                                       4.16.14   True        False         False      5d9h
    image-registry                             4.16.14   True        False         False      85m
    ingress                                    4.16.14   True        False         False      5d9h
    insights                                   4.16.14   True        False         False      5d9h
    kube-apiserver                             4.16.14   True        False         False      5d9h
    kube-controller-manager                    4.16.14   True        False         False      5d9h
    kube-scheduler                             4.16.14   True        False         False      5d9h
    kube-storage-version-migrator              4.16.14   True        False         False      4h48m
    machine-api                                4.16.14   True        False         False      5d9h
    machine-approver                           4.16.14   True        False         False      5d9h
    machine-config                             4.16.14   True        False         False      5d9h
    marketplace                                4.16.14   True        False         False      5d9h
    monitoring                                 4.16.14   True        False         False      5d9h
    network                                    4.16.14   True        False         False      5d9h
    node-tuning                                4.16.14   True        False         False      5d7h
    openshift-apiserver                        4.16.14   True        False         False      5d9h
    openshift-controller-manager               4.16.14   True        False         False      5d9h
    openshift-samples                          4.16.14   True        False         False      5h24m
    operator-lifecycle-manager                 4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-catalog         4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-packageserver   4.16.14   True        False         False      5d9h
    service-ca                                 4.16.14   True        False         False      5d9h
    storage                                    4.16.14   True        False         False      5d9h

    All cluster Operators should report True in the AVAILABLE column.

  5. Check that all pods are healthy:

    $ oc get po -A | grep -E -iv 'complete|running'

    This should not return any pods.

    Note

    You might see a few pods still moving after the update. Watch this for a while to make sure all pods are cleared.

17.2. Troubleshooting and maintaining telco core CNF clusters

17.2.1. Troubleshooting and maintaining telco core CNF clusters

Troubleshooting and maintenance are weekly tasks that can be a challenge if you do not have the tools to reach your goal, whether you want to update a component or investigate an issue. Part of the challenge is knowing where and how to search for tools and answers.

To maintain and troubleshoot a bare-metal environment where high-bandwidth network throughput is required, see the following procedures.

Important

This troubleshooting information is not a reference for configuring OpenShift Container Platform or developing Cloud-native Network Function (CNF) applications.

For information about developing CNF applications for telco, see Red Hat Best Practices for Kubernetes.

17.2.1.1. Cloud-native Network Functions

If you are starting to use OpenShift Container Platform for telecommunications Cloud-native Network Function (CNF) applications, learning about CNFs can help you understand the issues that you might encounter.

To learn more about CNFs and their evolution, see VNF and CNF, what’s the difference?.

17.2.1.2. Getting Support

If you experience difficulty with a procedure, visit the Red Hat Customer Portal. From the Customer Portal, you can find help in various ways:

  • Search or browse through the Red Hat Knowledgebase of articles and solutions about Red Hat products.
  • Submit a support case to Red Hat Support.
  • Access other product documentation.

To identify issues with your deployment, you can use the debugging tool or check the health endpoint of your deployment. After you have debugged or obtained health information about your deployment, you can search the Red Hat Knowledgebase for a solution or file a support ticket.

17.2.1.2.1. About the Red Hat Knowledgebase

The Red Hat Knowledgebase provides rich content aimed at helping you make the most of Red Hat’s products and technologies. The Red Hat Knowledgebase consists of articles, product documentation, and videos outlining best practices on installing, configuring, and using Red Hat products. In addition, you can search for solutions to known issues, each providing concise root cause descriptions and remedial steps.

17.2.1.2.2. Searching the Red Hat Knowledgebase

In the event of an OpenShift Container Platform issue, you can perform an initial search to determine if a solution already exists within the Red Hat Knowledgebase.

Prerequisites

  • You have a Red Hat Customer Portal account.

Procedure

  1. Log in to the Red Hat Customer Portal.
  2. Click Search.
  3. In the search field, input keywords and strings relating to the problem, including:

    • OpenShift Container Platform components (such as etcd)
    • Related procedure (such as installation)
    • Warnings, error messages, and other outputs related to explicit failures
  4. Press the Enter key.
  5. Optional: Select the OpenShift Container Platform product filter.
  6. Optional: Select the Documentation content type filter.

17.2.1.2.3. Submitting a support case

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the OpenShift CLI (oc).
  • You have a Red Hat Customer Portal account.
  • You have a Red Hat Standard or Premium subscription.

Procedure

  1. Log in to the Customer Support page of the Red Hat Customer Portal.
  2. Click Get support.
  3. On the Cases tab of the Customer Support page:

    1. Optional: Change the pre-filled account and owner details if needed.
    2. Select the appropriate category for your issue, such as Bug or Defect, and click Continue.
  4. Enter the following information:

    1. In the Summary field, enter a concise but descriptive problem summary and further details about the symptoms being experienced, as well as your expectations.
    2. Select OpenShift Container Platform from the Product drop-down menu.
    3. Select 4.17 from the Version drop-down.
  5. Review the list of suggested Red Hat Knowledgebase solutions for a potential match against the problem that is being reported. If the suggested articles do not address the issue, click Continue.
  6. Review the updated list of suggested Red Hat Knowledgebase solutions for a potential match against the problem that is being reported. The list is refined as you provide more information during the case creation process. If the suggested articles do not address the issue, click Continue.
  7. Ensure that the account information presented is as expected, and if not, amend accordingly.
  8. Check that the autofilled OpenShift Container Platform Cluster ID is correct. If it is not, manually obtain your cluster ID.

    • To manually obtain your cluster ID using the OpenShift Container Platform web console:

      1. Navigate to Home → Overview.
      2. Find the value in the Cluster ID field of the Details section.
    • Alternatively, it is possible to open a new support case through the OpenShift Container Platform web console and have your cluster ID autofilled.

      1. From the toolbar, navigate to (?) Help → Open Support Case.
      2. The Cluster ID value is autofilled.
    • To obtain your cluster ID using the OpenShift CLI (oc), run the following command:

      $ oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}'
  9. Complete the following questions where prompted and then click Continue:

    • What are you experiencing? What are you expecting to happen?
    • Define the value or impact to you or the business.
    • Where are you experiencing this behavior? What environment?
    • When does this behavior occur? Frequency? Repeatedly? At certain times?
  10. Upload relevant diagnostic data files and click Continue. It is recommended to include data gathered using the oc adm must-gather command as a starting point, plus any issue-specific data that is not collected by that command. See the example command after this procedure.
  11. Input relevant case management details and click Continue.
  12. Preview the case details and click Submit.
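
For example, to collect the default diagnostic data referenced in step 10 before opening the case, you can run the oc adm must-gather command. The --dest-dir option shown here is optional and only controls where the collected archive is written:

$ oc adm must-gather --dest-dir=/tmp/must-gather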

17.2.2. General troubleshooting

When you encounter a problem, the first step is to find the specific area where the issue is happening. To narrow down the potentially problematic areas, complete one or more of the following tasks:

  • Query your cluster
  • Check your pod logs
  • Debug a pod
  • Review events

17.2.2.1. Querying your cluster

Get information about your cluster so that you can more accurately find potential problems.

Procedure

  1. Switch into a project by running the following command:

    $ oc project <project_name>
  2. Query the cluster version, cluster Operators, and nodes by running the following command:

    $ oc get clusterversion,clusteroperator,node

    Example output

    NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    clusterversion.config.openshift.io/version   4.16.11   True        False         62d     Cluster version is 4.16.11
    
    NAME                                                                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    clusteroperator.config.openshift.io/authentication                             4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/baremetal                                  4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/cloud-controller-manager                   4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/cloud-credential                           4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/cluster-autoscaler                         4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/config-operator                            4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/console                                    4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/control-plane-machine-set                  4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/csi-snapshot-controller                    4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/dns                                        4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/etcd                                       4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/image-registry                             4.16.11   True        False         False      55d
    clusteroperator.config.openshift.io/ingress                                    4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/insights                                   4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/kube-apiserver                             4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/kube-controller-manager                    4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/kube-scheduler                             4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/kube-storage-version-migrator              4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/machine-api                                4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/machine-approver                           4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/machine-config                             4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/marketplace                                4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/monitoring                                 4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/network                                    4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/node-tuning                                4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/openshift-apiserver                        4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/openshift-controller-manager               4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/openshift-samples                          4.16.11   True        False         False      35d
    clusteroperator.config.openshift.io/operator-lifecycle-manager                 4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/operator-lifecycle-manager-catalog         4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/operator-lifecycle-manager-packageserver   4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/service-ca                                 4.16.11   True        False         False      62d
    clusteroperator.config.openshift.io/storage                                    4.16.11   True        False         False      62d
    
    NAME                STATUS   ROLES                         AGE   VERSION
    node/ctrl-plane-0   Ready    control-plane,master,worker   62d   v1.29.7
    node/ctrl-plane-1   Ready    control-plane,master,worker   62d   v1.29.7
    node/ctrl-plane-2   Ready    control-plane,master,worker   62d   v1.29.7

For more information, see "oc get" and "Reviewing pod status".


17.2.2.2. Checking pod logs

Get logs from the pod so that you can review the logs for issues.

Procedure

  1. List the pods by running the following command:

    $ oc get pod

    Example output

    NAME        READY   STATUS    RESTARTS          AGE
    busybox-1   1/1     Running   168 (34m ago)     7d
    busybox-2   1/1     Running   119 (9m20s ago)   4d23h
    busybox-3   1/1     Running   168 (43m ago)     7d
    busybox-4   1/1     Running   168 (43m ago)     7d

  2. Check pod log files by running the following command:

    $ oc logs -n <namespace> busybox-1

For more information, see "oc logs", "Logging", and "Inspecting pod and container logs".

17.2.2.3. Describing a pod

Describing a pod gives you information about that pod to help with troubleshooting. The Events section provides detailed information about the pod and the containers inside of it.

Procedure

  • Describe a pod by running the following command:

    $ oc describe pod -n <namespace> busybox-1

    Example output

    Name:             busybox-1
    Namespace:        busy
    Priority:         0
    Service Account:  default
    Node:             worker-3/192.168.0.0
    Start Time:       Mon, 27 Nov 2023 14:41:25 -0500
    Labels:           app=busybox
                      pod-template-hash=<hash>
    Annotations:      k8s.ovn.org/pod-networks:
    …
    Events:
      Type    Reason   Age                   From     Message
      ----    ------   ----                  ----     -------
      Normal  Pulled   41m (x170 over 7d1h)  kubelet  Container image "quay.io/quay/busybox:latest" already present on machine
      Normal  Created  41m (x170 over 7d1h)  kubelet  Created container busybox
      Normal  Started  41m (x170 over 7d1h)  kubelet  Started container busybox

For more information, see "oc describe".


17.2.2.4. Reviewing events

You can review the events in a given namespace to find potential issues.

Procedure

  1. Check for events in your namespace by running the following command:

    $ oc get events -n <namespace> --sort-by=".metadata.creationTimestamp" 1
    1
    Adding the --sort-by=".metadata.creationTimestamp" flag places the most recent events at the end of the output.
  2. Optional: If the events within your specified namespace do not provide enough information, expand your query to all namespaces by running the following command:

    $ oc get events -A --sort-by=".metadata.creationTimestamp" 1
    1
    The --sort-by=".metadata.creationTimestamp" flag places the most recent events at the end of the output.

    To filter the results of all events from a cluster, you can use the grep command. For example, if you are looking for errors, the errors can appear in two different sections of the output: the TYPE or MESSAGE sections. With the grep command, you can search for keywords, such as error or failed.

  3. For example, search for a message that contains warning or error by running the following command:

    $ oc get events -A | grep -Ei "warning|error"

    Example output

    NAMESPACE    LAST SEEN   TYPE      REASON          OBJECT              MESSAGE
    openshift    59s         Warning   FailedMount     pod/openshift-1     MountVolume.SetUp failed for volume "v4-0-config-user-idp-0-file-data" : references non-existent secret key: test

  4. Optional: To clean up the events and see only recurring events, you can delete the events in the relevant namespace by running the following command:

    $ oc delete events -n <namespace> --all

For more information, see "Watching cluster events".


17.2.2.5. Connecting to a pod

You can directly connect to a currently running pod with the oc rsh command, which provides you with a shell on that pod.

Warning

In pods that run a low-latency application, latency issues can occur when you run the oc rsh command. Use the oc rsh command only if you cannot connect to the pod by using the oc debug command.

Procedure

  • Connect to your pod by running the following command:

    $ oc rsh -n <namespace> busybox-1

For more information, see "oc rsh" and "Accessing running pods".


17.2.2.6. Debugging a pod

In certain cases, you do not want to directly interact with your pod that is in production.

To avoid interfering with running traffic, you can use a secondary pod that is a copy of your original pod. The secondary pod uses the same components as that of the original pod but does not have running traffic.

Procedure

  1. List the pods by running the following command:

    $ oc get pod

    Example output

    NAME        READY   STATUS    RESTARTS          AGE
    busybox-1   1/1     Running   168 (34m ago)     7d
    busybox-2   1/1     Running   119 (9m20s ago)   4d23h
    busybox-3   1/1     Running   168 (43m ago)     7d
    busybox-4   1/1     Running   168 (43m ago)     7d

  2. Debug a pod by running the following command:

    $ oc debug -n <namespace> busybox-1

    Example output

    Starting pod/busybox-1-debug, command was: sleep 3600
    Pod IP: 10.133.2.11

    If you do not see a shell prompt, press Enter.

For more information, see "oc debug" and "Starting debug pods with root access".

17.2.2.7. Running a command on a pod

If you want to run a command or set of commands on a pod without directly logging into it, you can use the oc exec -it command. You can interact with the pod quickly to get process or output information from the pod. A common use case is to run the oc exec -it command inside a script to run the same command on multiple pods in a replica set or deployment.

Warning

In pods that run a low-latency application, the oc exec command can cause latency issues.

Procedure

  • To run a command on a pod without logging into it, run the following command:

    $ oc exec -it <pod> -- <command>

For more information, see "oc exec" and "Executing remote commands in containers".

17.2.3. Cluster maintenance

In telco networks, you must pay more attention to certain configurations due to the nature of bare-metal deployments. You can troubleshoot more effectively by completing these tasks:

  • Monitor for failed or failing hardware components
  • Periodically check the status of the cluster Operators
Note

For hardware monitoring, contact your hardware vendor to find the appropriate logging tool for your specific hardware.

17.2.3.1. Checking cluster Operators

Periodically check the status of your cluster Operators to find issues early.

Procedure

  • Check the status of the cluster Operators by running the following command:

    $ oc get co

17.2.3.2. Watching for failed pods

To reduce troubleshooting time, regularly monitor for failed pods in your cluster.

Procedure

  • To watch for failed pods, run the following command:

    $ oc get po -A | grep -Eiv 'complete|running'

17.2.4. Security

Implementing a robust cluster security profile is important for building resilient telco networks.

17.2.4.1. Authentication

Determine which identity providers are in your cluster. For more information about supported identity providers, see "Supported identity providers" in Authentication and authorization.

After you know which providers are configured, you can inspect the openshift-authentication namespace to determine if there are potential issues.

Procedure

  1. Check the events in the openshift-authentication namespace by running the following command:

    $ oc get events -n openshift-authentication --sort-by='.metadata.creationTimestamp'
  2. Check the pods in the openshift-authentication namespace by running the following command:

    $ oc get pod -n openshift-authentication
  3. Optional: If you need more information, check the logs of one of the running pods by running the following command:

    $ oc logs -n openshift-authentication <pod_name>


17.2.5. Certificate maintenance

Certificate maintenance is required for continuous cluster authentication. As a cluster administrator, you must manually renew certain certificates, while others are automatically renewed by the cluster.

Learn about certificates in OpenShift Container Platform and how to maintain them by using the following resources:

17.2.5.1. Certificates manually managed by the administrator

The following certificates must be renewed by a cluster administrator:

  • Proxy certificates
  • User-provisioned certificates for the API server
17.2.5.1.1. Managing proxy certificates

Proxy certificates allow users to specify one or more custom certificate authority (CA) certificates that are used by platform components when making egress connections.

Note

Certain CAs set expiration dates and you might need to renew these certificates every two years.

If you do not have access to the certificates that were originally requested, you can determine the certificate expiration date in several ways. Most cloud-native network functions (CNFs) use certificates that are not specifically designed for browser-based connectivity. Therefore, you need to pull the certificate from the ConfigMap object of your deployment.

Procedure

  • To get the expiration date, run the following command against the certificate file:

    $ openssl x509 -enddate -noout -in <cert_file_name>.pem
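
    If the certificate is stored in a ConfigMap object rather than in a local file, you can extract it and check the expiration date in one step. The following is a minimal sketch; the ConfigMap name, namespace, and the ca-bundle.crt key are assumptions that depend on your deployment:

    $ oc get configmap <configmap_name> -n <namespace> -o jsonpath='{.data.ca-bundle\.crt}' | openssl x509 -enddate -noout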

For more information about determining how and when to renew your proxy certificates, see "Proxy certificates" in Security and compliance.


17.2.5.1.2. User-provisioned API server certificates

The API server is accessible by clients that are external to the cluster at api.<cluster_name>.<base_domain>. You might want clients to access the API server at a different hostname or without the need to distribute the cluster-managed certificate authority (CA) certificates to the clients. You must set a custom default certificate to be used by the API server when serving content.

For more information, see "User-provided certificates for the API server" in Security and compliance

17.2.5.2. Certificates managed by the cluster

You only need to check cluster-managed certificates if you detect an issue in the logs. The following certificates are automatically managed by the cluster:

  • Service CA certificates
  • Node certificates
  • Bootstrap certificates
  • etcd certificates
  • OLM certificates
  • Machine Config Operator certificates
  • Monitoring and cluster logging Operator component certificates
  • Control plane certificates
  • Ingress certificates
17.2.5.2.1. Certificates managed by etcd

The etcd certificates are used for encrypted communication between etcd member peers and for encrypted client traffic. The certificates are renewed automatically within the cluster, provided that communication between all nodes and all services is current. If your cluster might lose communication between components during the period close to the end of the etcd certificate lifetime, renew the certificates in advance. For example, communication can be lost during an upgrade because nodes reboot at different times.

  • You can manually check the expiration dates of the etcd certificates by running the following command:

    $ for each in $(oc get secret -n openshift-etcd | grep "kubernetes.io/tls" | grep -e \
    "etcd-peer\|etcd-serving" | awk '{print $1}'); do oc get secret $each -n openshift-etcd -o \
    jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -enddate; done

For more information about updating etcd certificates, see Checking etcd certificate expiry in OpenShift 4. For more information about etcd certificates, see "etcd certificates" in Security and compliance.


17.2.5.2.2. Node certificates

Node certificates are self-signed certificates, which means that they are signed by the cluster and they originate from an internal certificate authority (CA) that is generated by the bootstrap process.

After the cluster is installed, the cluster automatically renews the node certificates.

For more information, see "Node certificates" in Security and compliance.


17.2.5.2.3. Service CA certificates

The service-ca is an Operator that creates a self-signed certificate authority (CA) when an OpenShift Container Platform cluster is deployed. This allows users to add certificates to their deployments without manually creating them. Service CA certificates are self-signed certificates.

For more information, see "Service CA certificates" in Security and compliance.


17.2.6. Machine Config Operator

The Machine Config Operator provides useful information to cluster administrators and controls what is running directly on the bare-metal host.

The Machine Config Operator differentiates between groups of nodes in the cluster, allowing control plane nodes and worker nodes to run with different configurations. These groups of nodes, which run worker or application pods, are called MachineConfigPool (MCP) groups. The same machine config is applied to all nodes, or only to one MCP in the cluster.

For more information about how and why to apply MCPs in a telco core cluster, see Applying MachineConfigPool labels to nodes before the update.

For more information about the Machine Config Operator, see Machine Config Operator.

17.2.6.1. Purpose of the Machine Config Operator

The Machine Config Operator (MCO) manages and applies configuration and updates of Red Hat Enterprise Linux CoreOS (RHCOS) and container runtime, including everything between the kernel and kubelet. Managing RHCOS is important since most telecommunications companies run on bare-metal hardware and use some sort of hardware accelerator or kernel modification. Applying machine configuration to RHCOS manually can cause problems because the MCO monitors each node and what is applied to it.

You must consider these minor components and how the MCO can help you manage your clusters effectively.

Important

You must use the MCO to perform all changes on worker or control plane nodes. Do not manually make changes to RHCOS or node files.

17.2.6.2. Applying several machine config files at the same time

When you need to change the machine config for a group of nodes in the cluster, also known as a machine config pool (MCP), sometimes you must apply the changes with several different machine config files. The nodes need to restart for a machine config file to be applied. After each machine config file is applied to the cluster, all nodes that are affected by that machine config file restart.

To prevent the nodes from restarting for each machine config file, you can apply all of the changes at the same time by pausing each MCP that is updated by the new machine config file.

Procedure

  1. Pause the affected MCP by running the following command:

    $ oc patch mcp/<mcp_name> --type merge --patch '{"spec":{"paused":true}}'
  2. After you apply all machine config changes to the cluster, unpause the MCP by running the following command:

    $ oc patch mcp/<mcp_name> --type merge --patch '{"spec":{"paused":false}}'

This allows the nodes in your MCP to reboot into the new configurations.
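
You can watch the nodes reboot and pick up the new configuration by checking the status of the MCP, for example:

$ oc get mcp <mcp_name> --watch

The MCP reports UPDATED as True when all of the nodes in the pool are running the new rendered configuration.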

17.2.7. Bare-metal node maintenance

You can connect to a node for general troubleshooting. However, in some cases, you need to perform troubleshooting or maintenance tasks on certain hardware components. This section discusses topics that help you perform that hardware maintenance.

17.2.7.1. Connecting to a bare-metal node in your cluster

You can connect to bare-metal cluster nodes for general maintenance tasks.

Note

Configuring the cluster node from the host operating system is not recommended or supported.

To troubleshoot your nodes, you can do the following tasks:

  • Retrieve logs from node
  • Use debugging
  • Use SSH to connect to the node
Important

Use SSH only if you cannot connect to the node with the oc debug command.

Procedure

  1. Retrieve the logs from a node by running the following command:

    $ oc adm node-logs <node_name> -u crio
  2. Use debugging by running the following command:

    $ oc debug node/<node_name>
  3. Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:

    # chroot /host

    Output

    You are now logged in as root on the node

  4. Optional: Use SSH to connect to the node by running the following command:

    $ ssh core@<node_name>

17.2.7.2. Moving applications to pods within the cluster

For scheduled hardware maintenance, you need to consider how to move your application pods to other nodes within the cluster without affecting the pod workload.

Procedure

  • Mark the node as unschedulable by running the following command:

    $ oc adm cordon <node_name>

When the node is unschedulable, no pods can be scheduled on the node. For more information, see "Working with nodes".
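
After the node is cordoned, you can evict the running pods so that they are rescheduled onto other nodes. The following is a typical invocation; review the flags against your workload requirements before using them:

$ oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data

When the maintenance is complete, make the node schedulable again by running the oc adm uncordon <node_name> command.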

Note

When moving CNF applications, you might need to verify ahead of time that there are enough additional worker nodes in the cluster due to anti-affinity and pod disruption budget.


17.2.7.3. DIMM memory replacement

Dual in-line memory module (DIMM) problems sometimes only appear after a server reboots. You can check the log files for these problems.

When you perform a standard reboot and the server does not start, you can see a message in the console that there is a faulty DIMM memory. In that case, you can acknowledge the faulty DIMM and continue rebooting if the remaining memory is sufficient. Then, you can schedule a maintenance window to replace the faulty DIMM.

Sometimes, a message in the event logs indicates a bad memory module. In these cases, you can schedule the memory replacement before the server is rebooted.

17.2.7.4. Disk replacement

If you do not have disk redundancy configured on your node through hardware or software redundant array of independent disks (RAID), you need to check the following:

  • Does the disk contain running pod images?
  • Does the disk contain persistent data for pods?

For more information, see "OpenShift Container Platform storage overview" in Storage.

17.2.7.5. Cluster network card replacement

When you replace a network card, the MAC address changes. The MAC address can be part of the DHCP or SR-IOV Operator configuration, router configuration, firewall rules, or application cloud-native network function (CNF) configuration. Before you bring a node back online after replacing a network card, you must verify that these configurations are up-to-date.

Important

If you do not have specific procedures for MAC address changes within the network, contact your network administrator or network hardware vendor.

17.3. Observability

17.3.1. Observability in OpenShift Container Platform

OpenShift Container Platform generates a large amount of data, such as performance metrics and logs from both the platform and the workloads running on it. As an administrator, you can use various tools to collect and analyze all the data available. What follows is an outline of best practices for system engineers, architects, and administrators configuring the observability stack.

Unless explicitly stated, the material in this document refers to both Edge and Core deployments.

17.3.1.1. Understanding the monitoring stack

The monitoring stack uses the following components:

  • Prometheus collects and analyzes metrics from OpenShift Container Platform components and from workloads, if configured to do so.
  • Alertmanager is a component of Prometheus that handles routing, grouping, and silencing of alerts.
  • Thanos handles long term storage of metrics.

Figure 17.2. OpenShift Container Platform monitoring architecture

OpenShift Container Platform monitoring architecture
Note

For a single-node OpenShift cluster, you should disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.

17.3.1.2. Key performance metrics

Depending on your system, there can be hundreds of available measurements.

Here are some key metrics that you should pay attention to:

  • etcd response times
  • API response times
  • Pod restarts and scheduling
  • Resource usage
  • OVN health
  • Overall cluster operator health

A good rule to follow is that if you decide that a metric is important, there should be an alert for it.

Note

You can check the available metrics by running the following command:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'
17.3.1.2.1. Example queries in PromQL

The following tables show some queries that you can explore in the metrics query browser using the OpenShift Container Platform console.

Note

The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser. You can get the OpenShift Console FQDN by running the following command:

$ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].host}'
Table 17.1. Node memory & CPU usage

  • CPU % requests by node:

    sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100

  • Overall cluster CPU % utilization:

    sum by (managed_cluster) (sum_over_time(kube_pod_container_resource_requests{resource="cpu"}[60m]))/sum by (managed_cluster) (sum_over_time(kube_node_status_allocatable{resource="cpu"}[60m])) *100

  • Memory % requests by node:

    sum by (node) (sum_over_time(kube_pod_container_resource_requests{resource="memory"}[60m]))/sum by (node) (sum_over_time(kube_node_status_allocatable{resource="memory"}[60m])) *100

  • Overall cluster memory % utilization:

    (1-(sum by (managed_cluster)(avg_over_time(node_memory_MemAvailable_bytes[60m])))/sum by (managed_cluster)(avg_over_time(kube_node_status_allocatable{resource="memory"}[60m])))*100

Table 17.2. API latency by verb

  • GET:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver|openshift-apiserver", verb="GET"}[60m])))

  • PATCH:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver|openshift-apiserver", verb="PATCH"}[60m])))

  • POST:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver|openshift-apiserver", verb="POST"}[60m])))

  • LIST:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver|openshift-apiserver", verb="LIST"}[60m])))

  • PUT:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver|openshift-apiserver", verb="PUT"}[60m])))

  • DELETE:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"kube-apiserver|openshift-apiserver", verb="DELETE"}[60m])))

  • Combined:

    histogram_quantile(0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver=~"(openshift-apiserver|kube-apiserver)", verb!="WATCH"}[60m])))

Table 17.3. etcd

  • fsync 99th percentile latency (per instance):

    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))

  • fsync 99th percentile latency (per cluster):

    sum by (managed_cluster) (histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[60m])))

  • Leader elections:

    sum(rate(etcd_server_leader_changes_seen_total[1440m]))

  • Network latency:

    histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

Table 17.4. Operator health

  • Degraded operators:

    sum by (managed_cluster, name) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))

  • Total degraded operators per cluster:

    sum by (managed_cluster) (avg_over_time(cluster_operator_conditions{condition="Degraded", name!="version"}[60m]))
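
You can also run these queries from the command line instead of the console. The following is a sketch that sends one of the queries above to the in-cluster Prometheus API through the prometheus-k8s-0 pod used earlier in this section; substitute any query you want to run:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -qsk --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))' \
    http://localhost:9090/api/v1/query | jq '.data.result'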

17.3.1.2.2. Recommendations for storage of metrics

By default, Prometheus does not back its saved metrics with persistent storage. If the Prometheus pods restart, all metrics data is lost. You should configure the monitoring stack to use the back-end storage that is available on the platform. To meet the high I/O demands of Prometheus, you should use local storage.

For telco core clusters, you can use the Local Storage Operator to provide persistent storage for Prometheus.

Red Hat OpenShift Data Foundation (ODF), which deploys a Ceph cluster for block, file, and object storage, is also a suitable candidate for a telco core cluster.
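
For example, a minimal sketch of enabling persistent storage for Prometheus on a telco core cluster through the cluster-monitoring-config ConfigMap might look like the following. The storage class name, retention period, and storage size are placeholder values that depend on your local storage setup:

$ cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: <local_storage_class_name>
          resources:
            requests:
              storage: 100Gi
EOF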

To keep system resource requirements low on a RAN single-node OpenShift or far edge cluster, you should not provision backend storage for the monitoring stack. Such clusters forward all metrics to the hub cluster where you can provision a third party monitoring platform.

17.3.1.3. Monitoring the edge

Single-node OpenShift at the edge keeps the footprint of the platform components to a minimum. The following procedure is an example of how you can configure a single-node OpenShift node with a small monitoring footprint.

Prerequisites

  • For environments that use Red Hat Advanced Cluster Management (RHACM), you have enabled the Observability service.
  • The hub cluster is running Red Hat OpenShift Data Foundation (ODF).

Procedure

  1. Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        alertmanagerMain:
          enabled: false
        telemeterClient:
          enabled: false
        prometheusK8s:
          retention: 24h
  2. On the single-node OpenShift, apply the ConfigMap CR by running the following command:

    $ oc apply -f monitoringConfigMap.yaml
  3. Create a NameSpace CR, and save it as monitoringNamespace.yaml, as in the following example:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: open-cluster-management-observability
  4. On the hub cluster, apply the Namespace CR by running the following command:

    $ oc apply -f monitoringNamespace.yaml
  5. Create an ObjectBucketClaim CR, and save it as monitoringObjectBucketClaim.yaml, as in the following example:

    apiVersion: objectbucket.io/v1alpha1
    kind: ObjectBucketClaim
    metadata:
      name: multi-cloud-observability
      namespace: open-cluster-management-observability
    spec:
      storageClassName: openshift-storage.noobaa.io
      generateBucketName: acm-multi
  6. On the hub cluster, apply the ObjectBucketClaim CR by running the following command:

    $ oc apply -f monitoringObjectBucketClaim.yaml
  7. Create a Secret CR, and save it as monitoringSecret.yaml, as in the following example:

    apiVersion: v1
    kind: Secret
    metadata:
      name: multiclusterhub-operator-pull-secret
      namespace: open-cluster-management-observability
    stringData:
      .dockerconfigjson: 'PULL_SECRET'
  8. On the hub cluster, apply the Secret CR by running the following command:

    $ oc apply -f monitoringSecret.yaml
  9. Get the keys for the NooBaa service and the backend bucket name from the hub cluster by running the following commands:

    $ NOOBAA_ACCESS_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
    $ NOOBAA_SECRET_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')
    $ OBJECT_BUCKET=$(oc get objectbucketclaim -n open-cluster-management-observability multi-cloud-observability -o json | jq -r .spec.bucketName)
  10. Create a Secret CR for bucket storage and save it as monitoringBucketSecret.yaml, as in the following example:

    apiVersion: v1
    kind: Secret
    metadata:
      name: thanos-object-storage
      namespace: open-cluster-management-observability
    type: Opaque
    stringData:
      thanos.yaml: |
        type: s3
        config:
          bucket: ${OBJECT_BUCKET}
          endpoint: s3.openshift-storage.svc
          insecure: true
          access_key: ${NOOBAA_ACCESS_KEY}
          secret_key: ${NOOBAA_SECRET_KEY}
  11. On the hub cluster, apply the Secret CR by running the following command:

    $ oc apply -f monitoringBucketSecret.yaml
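
    The thanos.yaml content in this Secret references the shell variables that you set in a previous step (${OBJECT_BUCKET}, ${NOOBAA_ACCESS_KEY}, and ${NOOBAA_SECRET_KEY}), and the oc apply command does not expand them. One way to expand them before applying, assuming the envsubst utility is available on your workstation, is to run the following command instead:

    $ envsubst < monitoringBucketSecret.yaml | oc apply -f -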
  12. Create the MultiClusterObservability CR and save it as monitoringMultiClusterObservability.yaml, as in the following example:

    apiVersion: observability.open-cluster-management.io/v1beta2
    kind: MultiClusterObservability
    metadata:
      name: observability
    spec:
      advanced:
        retentionConfig:
          blockDuration: 2h
          deleteDelay: 48h
          retentionInLocal: 24h
          retentionResolutionRaw: 3d
      enableDownsampling: false
      observabilityAddonSpec:
        enableMetrics: true
        interval: 300
      storageConfig:
        alertmanagerStorageSize: 10Gi
        compactStorageSize: 100Gi
        metricObjectStorage:
          key: thanos.yaml
          name: thanos-object-storage
        receiveStorageSize: 25Gi
        ruleStorageSize: 10Gi
        storeStorageSize: 25Gi
  13. On the hub cluster, apply the MultiClusterObservability CR by running the following command:

    $ oc apply -f monitoringMultiClusterObservability.yaml

Verification

  1. Check the routes and pods in the namespace to validate that the services have deployed on the hub cluster by running the following command:

    $ oc get routes,pods -n open-cluster-management-observability

    Example output

    NAME                                         HOST/PORT                                                                                        PATH      SERVICES                          PORT          TERMINATION          WILDCARD
    route.route.openshift.io/alertmanager        alertmanager-open-cluster-management-observability.cloud.example.com        /api/v2   alertmanager                      oauth-proxy   reencrypt/Redirect   None
    route.route.openshift.io/grafana             grafana-open-cluster-management-observability.cloud.example.com                       grafana                           oauth-proxy   reencrypt/Redirect   None 1
    route.route.openshift.io/observatorium-api   observatorium-api-open-cluster-management-observability.cloud.example.com             observability-observatorium-api   public        passthrough/None     None
    route.route.openshift.io/rbac-query-proxy    rbac-query-proxy-open-cluster-management-observability.cloud.example.com              rbac-query-proxy                  https         reencrypt/Redirect   None
    
    NAME                                                           READY   STATUS    RESTARTS   AGE
    pod/observability-alertmanager-0                               3/3     Running   0          1d
    pod/observability-alertmanager-1                               3/3     Running   0          1d
    pod/observability-alertmanager-2                               3/3     Running   0          1d
    pod/observability-grafana-685b47bb47-dq4cw                     3/3     Running   0          1d
    <...snip…>
    pod/observability-thanos-store-shard-0-0                       1/1     Running   0          1d
    pod/observability-thanos-store-shard-1-0                       1/1     Running   0          1d
    pod/observability-thanos-store-shard-2-0                       1/1     Running   0          1d

    1
    A dashboard is accessible at the grafana route listed. You can use this to view metrics across all managed clusters.

For more information on observability in Red Hat Advanced Cluster Management, see Observability.

17.3.1.4. Alerting

OpenShift Container Platform includes a large number of alert rules, which can change from release to release.

17.3.1.4.1. Viewing default alerts

Use the following procedure to review all of the alert rules in a cluster.

Procedure

  • To review all the alert rules in a cluster, you can run the following command:

    $ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml

    Rules can include a description and provide a link to additional information and mitigation steps. For example, this is the rule for etcdHighFsyncDurations:

          - alert: etcdHighFsyncDurations
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
                are {{ $value }}s on etcd instance {{ $labels.instance }}.'
              runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
              summary: etcd cluster 99th percentile fsync durations are too high.
            expr: |
              histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
              > 1
            for: 10m
            labels:
              severity: critical
17.3.1.4.2. Alert notifications

You can view alerts in the OpenShift Container Platform console; however, an administrator should configure an external receiver to forward the alerts to. OpenShift Container Platform supports the following receiver types:

  • PagerDuty: a third-party incident response platform
  • Webhook: an arbitrary API endpoint that receives an alert via a POST request and can take any necessary action
  • Email: sends an email to a designated address
  • Slack: sends a notification to either a Slack channel or an individual user


17.3.1.5. Workload monitoring

By default, OpenShift Container Platform does not collect metrics for application workloads. You can configure a cluster to collect workload metrics.

Prerequisites

  • You have defined endpoints to gather workload metrics on the cluster.

Procedure

  1. Create a ConfigMap CR and save it as monitoringConfigMap.yaml, as in the following example:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true 1
    1
    Set to true to enable workload monitoring.
  2. Apply the ConfigMap CR by running the following command:

    $ oc apply -f monitoringConfigMap.yaml
  3. Create a ServiceMonitor CR, and save it as monitoringServiceMonitor.yaml, as in the following example:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        app: ui
      name: myapp
      namespace: myns
    spec:
      endpoints: 1
      - interval: 30s
        port: ui-http
        scheme: http
        path: /healthz 2
      selector:
        matchLabels:
          app: ui
    1
    Use endpoints to define workload metrics.
    2
    Prometheus scrapes the path /metrics by default. You can define a custom path here.
  4. Apply the ServiceMonitor CR by running the following command:

    $ oc apply -f monitoringServiceMonitor.yaml

Prometheus scrapes the path /metrics by default; however, you can define a custom path. It is up to the vendor of the application to expose this endpoint for scraping, with the metrics that they deem relevant.
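
To verify that the user workload monitoring components are running after you enable them, you can check the pods in the openshift-user-workload-monitoring namespace, for example:

$ oc get pods -n openshift-user-workload-monitoring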

17.3.1.5.1. Creating a workload alert

You can enable alerts for user workloads on a cluster.

Procedure

  1. Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true 1
    # ...
    1
    Set to true to enable workload monitoring.
  2. Apply the ConfigMap CR by running the following command:

    $ oc apply -f monitoringConfigMap.yaml
  3. Create a YAML file for alerting rules, monitoringAlertRule.yaml, as in the following example:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: myapp-alert
      namespace: myns
    spec:
      groups:
      - name: example
        rules:
        - alert: InternalErrorsAlert
          expr: flask_http_request_total{status="500"} > 0
    # ...
  4. Apply the alert rule by running the following command:

    $ oc apply -f monitoringAlertRule.yaml
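
You can confirm that the cluster accepted the rule by listing the PrometheusRule resources in the application namespace, for example:

$ oc get prometheusrules -n myns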

17.4. Security

17.4.1. Security basics

Security is a critical component of telecommunications deployments on OpenShift Container Platform, particularly when running cloud-native network functions (CNFs).

You can enhance security for high-bandwidth network deployments in telecommunications (telco) environments by following key security considerations. By implementing these standards and best practices, you can strengthen security in telco-specific use cases.

17.4.1.1. RBAC overview

Role-based access control (RBAC) objects determine whether a user is allowed to perform a given action within a project.

Cluster administrators can use the cluster roles and bindings to control who has various access levels to OpenShift Container Platform itself and to all projects.

Developers can use local roles and bindings to control who has access to their projects. Note that authorization is a separate step from authentication, which is more about determining the identity of who is taking the action.

Authorization is managed using the following authorization objects:

Rules
Are sets of permitted actions on specific objects. For example, a rule can determine whether a user or service account can create pods. Each rule specifies an API resource, the resource within that API, and the allowed action.
Roles

Are collections of rules that define what actions users or groups can perform. You can associate or bind rules to multiple users or groups. A role file can contain one or more rules that specify the actions and resources allowed for that role.

Roles are categorized into the following types:

  • Cluster roles: You can define cluster roles at the cluster level. They are not tied to a single namespace. They can apply across all namespaces or specific namespaces when you bind them to users, groups, or service accounts.
  • Project roles: You can create project roles within a specific namespace, and they only apply to that namespace. You can assign permissions to specific users to create roles and role bindings within their namespace, ensuring they do not affect other namespaces.
Bindings

Are associations between users and/or groups with a role. You can create a role binding to connect the rules in a role to a specific user ID or group. This brings together the role and the user or group, defining what actions they can perform.

Note

You can bind more than one role to a user or group.

For more information on RBAC, see "Using RBAC to define and apply permissions".

Operational RBAC considerations

To reduce operational overhead, it is important to manage access through groups rather than handling individual user IDs across multiple clusters. By managing groups at an organizational level, you can streamline access control and simplify administration across your organization.
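
As an illustration of group-based access management, the following is a minimal sketch that creates a group, adds users to it, and grants it cluster-wide view access. The group and user names are placeholders:

$ oc adm groups new <ops_group>
$ oc adm groups add-users <ops_group> <user_1> <user_2>
$ oc adm policy add-cluster-role-to-group view <ops_group>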

17.4.1.2. Security accounts overview

A service account is an OpenShift Container Platform account that allows a component to directly access the API. Service accounts are API objects that exist within each project. Service accounts provide a flexible way to control API access without sharing a regular user’s credentials.

You can use service accounts to apply role-based access control (RBAC) to pods. By assigning service accounts to workloads, such as pods and deployments, you can grant additional permissions, such as pulling from different registries. This also allows you to assign lower privileges to service accounts, reducing the security footprint of the pods that run under them.
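
For example, a minimal sketch of creating a dedicated service account, linking it to a registry pull secret, and assigning it to a deployment might look like the following. All of the names are placeholders:

$ oc create serviceaccount <sa_name> -n <namespace>
$ oc secrets link <sa_name> <registry_pull_secret> --for=pull -n <namespace>
$ oc set serviceaccount deployment <deployment_name> <sa_name> -n <namespace>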

For more information about service accounts, see "Understanding and creating service accounts".

17.4.1.3. Identity provider configuration

Configuring an identity provider is the first step in setting up users on the cluster. You can manage groups at the organizational level by using an identity provider.

The identity provider can pull in specific user groups that are maintained at the organizational level, rather than the cluster level. This allows you to add and remove users from groups that follow your organization’s established practices.

Note

You must set up a cron job to run frequently to pull any changes into the cluster.
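
For example, if you manage groups through LDAP, the periodic job mentioned in this note might run a group synchronization command such as the following. The sync configuration file name is an assumption:

$ oc adm groups sync --sync-config=ldap-sync-config.yaml --confirm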

You can use an identity provider to manage access levels for specific groups within your organization. For example, you can perform the following actions to manage access levels:

  • Assign the cluster-admin role to teams that require cluster-level privileges.
  • Grant application administrators specific privileges to manage only their respective projects.
  • Provide operational teams with view access across the cluster to enable monitoring without allowing modifications.

For information about configuring an identity provider, see "Understanding identity provider configuration".

17.4.1.4. Replacing the kubeadmin user with a cluster-admin user

The kubeadmin user with the cluster-admin privileges is created on every cluster by default. To enhance the cluster security, you can replace the kubeadmin user with a cluster-admin user and then disable or remove the kubeadmin user.

Prerequisites

  • You have created a user with cluster-admin privileges.
  • You have installed the OpenShift CLI (oc).
  • You have administrative access to a virtual vault for secure storage.

Procedure

  1. Create an emergency cluster-admin user by using the htpasswd identity provider. For more information, see "About htpasswd authentication".
  2. Assign the cluster-admin privileges to the new user by running the following command:

    $ oc adm policy add-cluster-role-to-user cluster-admin <emergency_user>
  3. Verify the emergency user access:

    1. Log in to the cluster using the new emergency user.
    2. Confirm that the user has cluster-admin privileges by running the following command:

      $ oc whoami

      Ensure the output shows the emergency user’s ID.

  4. Store the password or authentication key for the emergency user securely in a virtual vault.

    Note

    Follow the best practices of your organization for securing sensitive credentials.

  5. Disable or remove the kubeadmin user to reduce security risks by running the following command:

    $ oc delete secrets kubeadmin -n kube-system


17.4.1.5. Security considerations for telco CNFs

Telco workloads handle vast amounts of sensitive data and demand high reliability. A single security vulnerability can lead to broader cluster-wide compromises. With numerous components running on a single-node OpenShift cluster, each component must be secured to prevent any breach from escalating. Ensuring security across the entire infrastructure, including all components, is essential to maintaining the integrity of the telco network and avoiding vulnerabilities.

The following key security features are essential for telco:

  • Security Context Constraints (SCCs): Provide granular control over pod security in the OpenShift clusters.
  • Pod Security Admission (PSA): Kubernetes-native pod security controls.
  • Encryption: Ensures data confidentiality in high-throughput network environments.

17.4.1.6. Advancement of pod security in Kubernetes and OpenShift Container Platform

Kubernetes initially had limited pod security. When OpenShift Container Platform integrated Kubernetes, Red Hat added pod security through Security Context Constraints (SCCs). In Kubernetes version 1.3, PodSecurityPolicy (PSP) was introduced as a similar feature. However, Pod Security Admission (PSA) was introduced in Kubernetes version 1.21, which resulted in the deprecation of PSP in Kubernetes version 1.25.

PSA also became available in OpenShift Container Platform version 4.11. While PSA improves pod security, it lacks some features provided by SCCs that are still necessary for telco use cases. Therefore, OpenShift Container Platform continues to support both PSA and SCCs.

17.4.1.7. Key areas for CNF deployment

The cloud-native network function (CNF) deployment contains the following key areas:

Core
The first deployments of CNFs occurred in the core of the wireless network. Deploying CNFs in the core typically means racks of servers placed in central offices or data centers. These servers are connected to both the internet and the Radio Access Network (RAN), but they are often behind multiple security firewalls or sometimes disconnected from the internet altogether. This type of setup is called an offline or disconnected cluster.
RAN
After CNFs were successfully tested in the core network and found to be effective, they were considered for deployment in the Radio Access Network (RAN). Deploying CNFs in RAN requires a large number of servers (up to 100,000 in a large deployment). These servers are located near cellular towers and typically run as single-node OpenShift clusters, with the need for high scalability.

17.4.1.8. Telco-specific infrastructure

Hardware requirements
In telco networks, clusters are primarily built on bare-metal hardware. This means that the operating system is installed directly on the physical machines, without using virtual machines. This reduces network connectivity complexity, minimizes latency, and optimizes CPU usage for applications.
Network requirements
Telco networks require much higher bandwidth compared to standard IT networks. Telco networks commonly use dual-port 25 GB connections or 100 GB Network Interface Cards (NICs) to handle massive data throughput. Security is critical, requiring encrypted connections and secure endpoints to protect sensitive personal data.

17.4.1.9. Lifecycle management

Upgrades are critical for security. When a vulnerability is discovered, it is patched in the latest z-stream release. This fix is then backported to each lower supported y-stream release until all supported versions are patched. Releases that are no longer supported do not receive patches. Therefore, it is important to upgrade OpenShift Container Platform clusters regularly to stay within a supported release and ensure that they remain protected against vulnerabilities.

For more information about lifecycle management and upgrades, see "Upgrading a telco core CNF cluster".

17.4.1.10. Evolution of Network Functions to CNFs

Network Functions (NFs) began as Physical Network Functions (PNFs), which were purpose-built hardware devices operating independently. Over time, PNFs evolved into Virtual Network Functions (VNFs), which virtualized their capabilities while controlling resources like CPU, memory, storage, and network.

As technology advanced further, VNFs transitioned to cloud-native network functions (CNFs). CNFs run in lightweight, secure, and scalable containers. They enforce stringent restrictions, including non-root execution and minimal host interference, to enhance security and performance.

PNFs had unrestricted root access to operate independently without interference. With the shift to VNFs, resource usage was controlled, but processes could still run as root within their virtual machines. In contrast, CNFs restrict root access and limit container capabilities to prevent interference with other containers or the host operating system.

The main challenges in migrating to CNFs are as follows:

  • Breaking down monolithic network functions into smaller, containerized processes.
  • Adhering to cloud-native principles, such as non-root execution and isolation, while maintaining telco-grade performance and reliability.

17.4.2. Host security

17.4.2.1. Red Hat Enterprise Linux CoreOS (RHCOS)

Red Hat Enterprise Linux CoreOS (RHCOS) is different from Red Hat Enterprise Linux (RHEL) in key areas. For more information, see "About RHCOS".

From a telco perspective, a major distinction is the control of rpm-ostree, which is updated through the Machine Config Operator.

RHCOS follows the same immutable design used for pods in OpenShift Container Platform. This ensures that the operating system remains consistent across the cluster. For information about RHCOS architecture, see "Red Hat Enterprise Linux CoreOS (RHCOS)".

To manage hosts effectively while maintaining security, avoid direct access whenever possible. Instead, you can use the following methods for host management:

  • Debug pod
  • Direct SSH
  • Console access

Review the following RHCOS security mechanisms that are integral to maintaining host security:

Linux namespaces
Provide isolation for processes and resources. Each container keeps its processes and files within its own namespace. If a user escapes from the container namespace, they could gain access to the host operating system, potentially compromising security.
Security-Enhanced Linux (SELinux)

Enforces mandatory access controls to restrict access to files and directories by processes. It adds an extra layer of security by preventing unauthorized access to files if a process tries to break its confinement.

SELinux follows the security policy of denying everything unless explicitly allowed. If a process attempts to modify or access a file without permission, SELinux denies access. For more information, see Introduction to SELinux.

Linux capabilities
Assign specific privileges to processes at a granular level, minimizing the need for full root permissions. For more information, see "Linux capabilities".
Control groups (cgroups)
Allocate and manage system resources, such as CPU and memory for processes and containers, ensuring efficient usage. As of OpenShift Container Platform 4.16, there are two versions of cgroups. cgroup v2 is now configured by default.
CRI-O
Serves as a lightweight container runtime that enforces security boundaries and manages container workloads.

17.4.2.2. Command-line host access

Direct access to a host must be restricted to avoid modifying the host or accessing pods that should not be accessed. For users who need direct access to a host, it is recommended to use an external authenticator, like SSSD with LDAP, to manage access. This helps maintain consistency across the cluster through the Machine Config Operator.

Important

Do not configure direct access to the root ID on any OpenShift Container Platform cluster server.

You can connect to a node in the cluster using the following methods:

Using debug pod

This is the recommended method to access a node. To debug or connect to a node, run the following command:

$ oc debug node/<worker_node_name>

After connecting to the node, run the following command to get access to the root file system:

# chroot /host

This gives you root access within a debug pod on the node. For more information, see "Starting debug pods with root access".

Direct SSH

Avoid using the root user. Instead, use the core user ID (or your own ID). To connect to the node using SSH, run the following command:

$ ssh core@<worker_node_name>
Important

The core user ID is initially given sudo privileges within the cluster.

If you cannot connect to a node using SSH, see How to connect to OpenShift Container Platform 4.x Cluster nodes using SSH bastion pod to add your SSH key to the core user.

After connecting to the node using SSH, run the following command to get access to the root shell:

$ sudo -i
Console Access

Ensure that consoles are secure. Do not allow direct login with the root ID; instead, use individual IDs.

Note

Follow the best practices of your organization for securing console access.

17.4.2.3. Linux capabilities

Linux capabilities define the actions a process can perform on the host system. By default, pods are granted several capabilities unless security measures are applied. These default capabilities are as follows:

  • CHOWN
  • DAC_OVERRIDE
  • FSETID
  • FOWNER
  • SETGID
  • SETUID
  • SETPCAP
  • NET_BIND_SERVICE
  • KILL

You can modify which capabilities a pod can receive by configuring Security Context Constraints (SCCs).

Important

You must not assign the following capabilities to a pod:

  • SYS_ADMIN: A powerful capability that grants elevated privileges. Allowing this capability can break security boundaries and pose a significant security risk.
  • NET_ADMIN: Allows control over networking, like SR-IOV ports, but can be replaced with alternative solutions in modern setups.

For more information about Linux capabilities, see Linux capabilities man page.

17.4.3. Security context constraints

Similar to the way that RBAC resources control user access, administrators can use security context constraints (SCCs) to control permissions for pods. These permissions determine the actions that a pod can perform and what resources it can access. You can use SCCs to define the conditions that a pod must run with.

Security context constraints allow an administrator to control the following security constraints:

  • Whether a pod can run privileged containers with the allowPrivilegedContainer flag
  • Whether a pod is constrained with the allowPrivilegeEscalation flag
  • The capabilities that a container can request
  • The use of host directories as volumes
  • The SELinux context of the container
  • The container user ID
  • The use of host namespaces and networking
  • The allocation of an FSGroup that owns the pod volumes
  • The configuration of allowable supplemental groups
  • Whether a container requires write access to its root file system
  • The usage of volume types
  • The configuration of allowable seccomp profiles

Default SCCs are created during installation and when you install some Operators or other components. As a cluster administrator, you can also create your own SCCs by using the OpenShift CLI (oc).

For information about default security context constraints, see Default security context constraints.

Important

Do not modify the default SCCs. Customizing the default SCCs can lead to issues when some of the platform pods deploy or OpenShift Container Platform is upgraded. Additionally, the default SCC values are reset to the defaults during some cluster upgrades, which discards all customizations to those SCCs.

Instead of modifying the default SCCs, create and modify your own SCCs as needed. For detailed steps, see Creating security context constraints.
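
For example, rather than editing restricted-v2, you can create a separate SCC that is modeled on it and grant it only to the workloads that need it. The following manifest is a sketch; the name and field values are illustrative and must be adjusted to your own requirements:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: cnf-restricted-custom        # illustrative name
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
allowedCapabilities:
- NET_BIND_SERVICE
requiredDropCapabilities:
- ALL
readOnlyRootFilesystem: false
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
seccompProfiles:
- runtime/default
volumes:
- configMap
- downwardAPI
- emptyDir
- ephemeral
- persistentVolumeClaim
- projected
- secret

After you create the SCC, for example with oc apply -f <file_name>, you can grant it to a specific service account by running a command such as oc adm policy add-scc-to-user cnf-restricted-custom -z <service_account> -n <namespace>.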

You can use the following basic SCCs:

  • restricted
  • restricted-v2

The restricted-v2 SCC is the most restrictive SCC provided by a new installation and is used by default for authenticated users. It aligns with Pod Security Admission (PSA) restrictions and improves security over the original restricted SCC, which is less restrictive. It also helps the transition from the original SCCs to the v2 SCCs across multiple releases. Because the original SCCs will eventually be deprecated, it is recommended to use the restricted-v2 SCC.

You can examine the restricted-v2 SCC by running the following command:

$ oc describe scc restricted-v2

Example output

Name:                                           restricted-v2
Priority:                                       <none>
Access:
  Users:                                        <none>
  Groups:                                       <none>
Settings:
  Allow Privileged:                             false
  Allow Privilege Escalation:                   false
  Default Add Capabilities:                     <none>
  Required Drop Capabilities:                   ALL
  Allowed Capabilities:                         NET_BIND_SERVICE
  Allowed Seccomp Profiles:                     runtime/default
  Allowed Volume Types:                         configMap,downwardAPI,emptyDir,ephemeral,persistentVolumeClaim,projected,secret
  Allowed Flexvolumes:                          <all>
  Allowed Unsafe Sysctls:                       <none>
  Forbidden Sysctls:                            <none>
  Allow Host Network:                           false
  Allow Host Ports:                             false
  Allow Host PID:                               false
  Allow Host IPC:                               false
  Read Only Root Filesystem:                    false
  Run As User Strategy: MustRunAsRange
    UID:                                        <none>
    UID Range Min:                              <none>
    UID Range Max:                              <none>
  SELinux Context Strategy: MustRunAs
    User:                                       <none>
    Role:                                       <none>
    Type:                                       <none>
    Level:                                      <none>
  FSGroup Strategy: MustRunAs
    Ranges:                                     <none>
  Supplemental Groups Strategy: RunAsAny
    Ranges:                                     <none>

The restricted-v2 SCC denies everything except what it explicitly allows. The following settings define the allowed capabilities and security restrictions:

  • Default add capabilities: Set to <none>. This means that no capabilities are added to a pod by default.
  • Required drop capabilities: Set to ALL. This drops all the default Linux capabilities of a pod.
  • Allowed capabilities: NET_BIND_SERVICE. A pod can request this capability, but it is not added by default.
  • Allowed seccomp profiles: runtime/default.

For more information, see Managing security context constraints.
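
To confirm which SCC admitted a running pod, you can check the openshift.io/scc annotation on the pod. For example, with a placeholder pod name:

$ oc get pod <pod_name> -o yaml | grep 'openshift.io/scc'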
