Chapter 17. Day 2 operations for telco core CNF clusters
17.1. Upgrading telco core CNF clusters
17.1.1. Upgrading a telco core CNF cluster
OpenShift Container Platform provides long-term support, known as Extended Update Support (EUS), on all even-numbered minor releases, and supports update paths between EUS releases. You can update from one EUS version to the next EUS version. It is also possible to update between y-stream and z-stream versions.
17.1.1.1. Cluster updates for telco core CNF clusters
Updating your cluster is a critical task that ensures that bugs and potential security vulnerabilities are patched. Often, updates to cloud-native network functions (CNF) require additional functionality from the platform that comes when you update the cluster version. You also must update the cluster periodically to ensure that the cluster platform version is supported.
You can minimize the effort required to stay current with updates by keeping up-to-date with EUS releases and upgrading to select important z-stream releases only.
The update path for the cluster can vary depending on the size and topology of the cluster. The update procedures described here are valid for most clusters from 3-node clusters up to the largest size clusters certified by the telco scale team. This includes some scenarios for mixed-workload clusters.
The following update scenarios are described:
- Control Plane Only updates
- Y-stream updates
- Z-stream updates
Control Plane Only updates were previously known as EUS-to-EUS updates. Control Plane Only updates are only viable between even-numbered minor versions of OpenShift Container Platform.
17.1.2. Verifying cluster API versions between update versions
APIs change over time as components are updated. It is important to verify that cloud-native network function (CNF) APIs are compatible with the updated cluster version.
17.1.2.1. OpenShift Container Platform API compatibility
When considering what z-stream release to update to as part of a new y-stream update, you must be sure that all the patches that are in the z-stream version you are moving from are in the new z-stream version. If the version you update to does not have all the required patches, the built-in compatibility of Kubernetes is broken.
For example, if the cluster version is 4.15.32, you must update to a 4.16 z-stream release that includes all of the patches that are applied to 4.15.32.
17.1.2.1.1. About Kubernetes version skew
Each cluster Operator supports specific API versions. Kubernetes APIs evolve over time, and newer versions can be deprecated or change existing APIs. This is referred to as "version skew". For every new release, you must review the API changes. The APIs might be compatible across several releases of an Operator, but compatibility is not guaranteed. To mitigate against problems that arise from version skew, follow a well-defined update strategy.
17.1.2.2. Determining the cluster version update path
Use the Red Hat OpenShift Container Platform Update Graph tool to determine if the path is valid for the z-stream release you want to update to. Verify the update with your Red Hat Technical Account Manager to ensure that the update path is valid for telco implementations.
The <4.y+1.z> or <4.y+2.z> version that you update to must have the same patch level as the <4.y.z> release you are updating from.
The OpenShift update process mandates that if a fix is present in a specific <4.y.z> release, then that fix must be present in the <4.y+1.z> release that you update to.
Figure 17.1. Bug fix backporting and the update graph

OpenShift development has a strict backport policy that prevents regressions. For example, a bug must be fixed in 4.16.z before it is fixed in 4.15.z. This means that the update graph does not allow for updates to chronologically older releases even if the minor version is greater, for example, updating from 4.15.24 to 4.16.2.
17.1.2.3. Selecting the target release
Use the Red Hat OpenShift Container Platform Update Graph or the cincinnati-graph-data repository to determine what release to update to.
17.1.2.3.1. Determining what z-stream updates are available
Before you can update to a new z-stream release, you need to know what versions are available.
You do not need to change the channel when performing a z-stream update.
Procedure
Determine which z-stream releases are available. Run the following command:
$ oc adm upgrade
Example output
Cluster version is 4.14.34

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.14 (available channels: candidate-4.14, candidate-4.15, eus-4.14, eus-4.16, fast-4.14, fast-4.15, stable-4.14, stable-4.15)

Recommended updates:

  VERSION     IMAGE
  4.14.37     quay.io/openshift-release-dev/ocp-release@sha256:14e6ba3975e6c73b659fa55af25084b20ab38a543772ca70e184b903db73092b
  4.14.36     quay.io/openshift-release-dev/ocp-release@sha256:4bc4925e8028158e3f313aa83e59e181c94d88b4aa82a3b00202d6f354e8dfed
  4.14.35     quay.io/openshift-release-dev/ocp-release@sha256:883088e3e6efa7443b0ac28cd7682c2fdbda889b576edad626769bf956ac0858
17.1.2.3.2. Changing the channel for a Control Plane Only update
You must change the channel to the required version for a Control Plane Only update.
You do not need to change the channel when performing a z-stream update.
Procedure
Determine the currently configured update channel:
$ oc get clusterversion -o=jsonpath='{.items[*].spec}' | jq
Example output
{ "channel": "stable-4.14", "clusterID": "01eb9a57-2bfb-4f50-9d37-dc04bd5bac75" }
Change the channel to point to the new channel you want to update to:
$ oc adm upgrade channel eus-4.16
Confirm the updated channel:
$ oc get clusterversion -o=jsonpath='{.items[*].spec}' | jq
Example output
{ "channel": "eus-4.16", "clusterID": "01eb9a57-2bfb-4f50-9d37-dc04bd5bac75" }
17.1.2.3.2.1. Changing the channel for an early EUS to EUS update
The update path to a brand new release of OpenShift Container Platform is not available in either the EUS channel or the stable channel until 45 to 90 days after the initial GA of a minor release.
To begin testing an update to a new release, you can use the fast channel.
Procedure
Change the channel to fast-<y+1>. For example, run the following command:

$ oc adm upgrade channel fast-4.16
Check the update path from the new channel. Run the following command:
$ oc adm upgrade
Cluster version is 4.15.33

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.29 and therefore OpenShift 4.16 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.

Upstream is unset, so the cluster will use an appropriate default.
Channel: fast-4.16 (available channels: candidate-4.15, candidate-4.16, eus-4.15, eus-4.16, fast-4.15, fast-4.16, stable-4.15, stable-4.16)

Recommended updates:

  VERSION     IMAGE
  4.16.14     quay.io/openshift-release-dev/ocp-release@sha256:6618dd3c0f5
  4.16.13     quay.io/openshift-release-dev/ocp-release@sha256:7a72abc3
  4.16.12     quay.io/openshift-release-dev/ocp-release@sha256:1c8359fc2
  4.16.11     quay.io/openshift-release-dev/ocp-release@sha256:bc9006febfe
  4.16.10     quay.io/openshift-release-dev/ocp-release@sha256:dece7b61b1
  4.15.36     quay.io/openshift-release-dev/ocp-release@sha256:c31a56d19
  4.15.35     quay.io/openshift-release-dev/ocp-release@sha256:f21253
  4.15.34     quay.io/openshift-release-dev/ocp-release@sha256:2dd69c5
Follow the update procedure to get to version 4.16 (<y+1> from version 4.15).

Note: You can keep your worker nodes paused between EUS releases even if you are using the fast channel.

When you get to the required <y+1> release, change the channel again, this time to fast-<y+2>.

Follow the EUS update procedure to get to the required <y+2> release.
17.1.2.3.3. Changing the channel for a y-stream update
In a y-stream update you change the channel to the next release channel.
Use the stable or EUS release channels for production clusters.
Procedure
Change the update channel:
$ oc adm upgrade channel stable-4.15
Check the update path from the new channel. Run the following command:
$ oc adm upgrade
Example output
Cluster version is 4.14.34

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.28 and therefore OpenShift 4.15 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.15 (available channels: candidate-4.14, candidate-4.15, eus-4.14, eus-4.15, fast-4.14, fast-4.15, stable-4.14, stable-4.15)

Recommended updates:

  VERSION     IMAGE
  4.15.33     quay.io/openshift-release-dev/ocp-release@sha256:7142dd4b560
  4.15.32     quay.io/openshift-release-dev/ocp-release@sha256:cda8ea5b13dc9
  4.15.31     quay.io/openshift-release-dev/ocp-release@sha256:07cf61e67d3eeee
  4.15.30     quay.io/openshift-release-dev/ocp-release@sha256:6618dd3c0f5
  4.15.29     quay.io/openshift-release-dev/ocp-release@sha256:7a72abc3
  4.15.28     quay.io/openshift-release-dev/ocp-release@sha256:1c8359fc2
  4.15.27     quay.io/openshift-release-dev/ocp-release@sha256:bc9006febfe
  4.15.26     quay.io/openshift-release-dev/ocp-release@sha256:dece7b61b1
  4.14.38     quay.io/openshift-release-dev/ocp-release@sha256:c93914c62d7
  4.14.37     quay.io/openshift-release-dev/ocp-release@sha256:c31a56d19
  4.14.36     quay.io/openshift-release-dev/ocp-release@sha256:f21253
  4.14.35     quay.io/openshift-release-dev/ocp-release@sha256:2dd69c5
17.1.3. Preparing the telco core cluster platform for update
Typically, telco clusters run on bare-metal hardware. Often you must update the host firmware to apply important security fixes, gain new functionality, or maintain compatibility with the new release of OpenShift Container Platform.
17.1.3.1. Ensuring the host firmware is compatible with the update
You are responsible for the firmware versions that you run in your clusters. Updating host firmware is not a part of the OpenShift Container Platform update process. It is not recommended to update firmware in conjunction with the OpenShift Container Platform version.
Hardware vendors advise that it is best to apply the latest certified firmware version for the specific hardware that you are running. For telco use cases, always verify firmware updates in test environments before applying them in production. The high throughput nature of telco CNF workloads can be adversely affected by sub-optimal host firmware.
You should thoroughly test new firmware updates to ensure that they work as expected with the current version of OpenShift Container Platform. Ideally, you test the latest firmware version with the target OpenShift Container Platform update version.
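As a sketch only, one way to record the BIOS firmware versions reported by bare-metal host inspection is to query the BareMetalHost resources. This assumes that hardware inspection data is populated in the status of each host; the field path can vary between versions, so treat the query as an example rather than an official check:

$ oc get bmh -n openshift-machine-api \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.hardware.firmware.bios.version}{"\n"}{end}'

Compare the reported versions against the firmware baseline that you validated in your test environment.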
17.1.3.2. Ensuring that layered products are compatible with the update
Verify that all layered products run on the version of OpenShift Container Platform that you are updating to before you begin the update. This generally includes all Operators.
Procedure
Verify the currently installed Operators in the cluster. For example, run the following command:
$ oc get csv -A
Example output
NAMESPACE                              NAME                                  DISPLAY          VERSION   REPLACES                             PHASE
                                       gitlab-operator-kubernetes.v0.17.2    GitLab           0.17.2    gitlab-operator-kubernetes.v0.17.1   Succeeded
openshift-operator-lifecycle-manager   packageserver                         Package Server   0.19.0                                         Succeeded
Check that Operators that you install with OLM are compatible with the update version.
Operators that are installed with the Operator Lifecycle Manager (OLM) are not part of the standard cluster Operators set.
Use the Operator Update Information Checker to understand if you must update an Operator after each y-stream update or if you can wait until you have fully updated to the next EUS release.
TipYou can also use the Operator Update Information Checker to see what versions of OpenShift Container Platform are compatible with specific releases of an Operator.
Check that Operators that you install outside of OLM are compatible with the update version.
For all OLM-installed Operators that are not directly supported by Red Hat, contact the Operator vendor to ensure release compatibility.
- Some Operators are compatible with several releases of OpenShift Container Platform. You might not need to update the Operators until after you complete the cluster update. See "Updating the worker nodes" for more information.
- See "Updating all the OLM Operators" for information about updating an Operator after performing the first y-stream control plane update.
17.1.3.3. Applying MachineConfigPool labels to nodes before the update
Prepare MachineConfigPool (mcp) node labels to group nodes together in groups of roughly 8 to 10 nodes. With mcp groups, you can reboot groups of nodes independently from the rest of the cluster.

You use the mcp node labels to pause and unpause the set of nodes during the update process so that you can do the update and reboot at a time of your choosing.
17.1.3.3.1. Staggering the cluster update
Sometimes there are problems during the update. Often the problem is related to hardware failure or nodes needing to be reset. Using mcp node labels, you can update nodes in stages by pausing the update at critical moments, tracking paused and unpaused nodes as you proceed. When a problem occurs, you use the nodes that are in an unpaused state to ensure that there are enough nodes running to keep all application pods running.
17.1.3.3.2. Dividing worker nodes into MachineConfigPool groups
How you divide worker nodes into mcp groups can vary depending on how many nodes are in the cluster or how many nodes you assign to a node role. By default, the two roles in a cluster are control plane and worker.

In clusters that run telco workloads, you can further split the worker nodes between CNF control plane and CNF data plane roles. Add mcp role labels that split the worker nodes into each of these two groups.

Larger clusters can have as many as 100 worker nodes in the CNF control plane role. No matter how many nodes there are in the cluster, keep each MachineConfigPool group to around 10 nodes. This allows you to control how many nodes are taken down at a time. With multiple MachineConfigPool groups, you can unpause several groups at a time to accelerate the update, or separate the update over 2 or more maintenance windows.
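For example, the following is a minimal sketch of dividing a large worker pool into mcp groups of 10 nodes each; the worker-<n> node names and the mcp-<n> label names are hypothetical and must match your own naming scheme:

# Hypothetical example: assign worker-0 ... worker-99 to mcp-1 ... mcp-10 in groups of 10 nodes
for i in $(seq 0 99); do
  group=$(( i / 10 + 1 ))
  oc label node "worker-${i}" "node-role.kubernetes.io/mcp-${group}=" --overwrite
done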
- Example cluster with 15 worker nodes
Consider a cluster with 15 worker nodes:
- 10 worker nodes are CNF control plane nodes.
- 5 worker nodes are CNF data plane nodes.
Split the CNF control plane and data plane worker node roles into at least 2 mcp groups each. Having 2 mcp groups per role means that you can have one set of nodes that are not affected by the update.

- Example cluster with 6 worker nodes
Consider a cluster with 6 worker nodes:
- Split the worker nodes into 3 mcp groups of 2 nodes each.
- Upgrade one of the mcp groups. Allow the updated nodes to soak for a day to allow for verification of CNF compatibility before completing the update on the other 4 nodes.

The process and pace at which you unpause the mcp groups is determined by your CNF applications and configuration.

If your CNF pod can handle being scheduled across nodes in a cluster, you can unpause several mcp groups at a time and set the maxUnavailable field in the mcp custom resource (CR) to as high as 50%. This allows up to half of the nodes in an mcp group to restart and get updated.
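The following is a minimal sketch of an mcp CR with maxUnavailable set; the pool name mcp-1 and the 50% value are examples that you adjust to match how many nodes your CNF workloads can tolerate losing at once:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-1
spec:
  maxUnavailable: 50%        # allow up to half of the nodes in this pool to drain and reboot at a time
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,mcp-1] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-1: ""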
17.1.3.3.3. Reviewing configured cluster MachineConfigPool roles
Review the currently configured MachineConfigPool roles in the cluster.

Procedure

Get the currently configured mcp groups in the cluster:

$ oc get mcp

Example output

NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-bere83   True      False      False      3              3                   3                     0                      25d
worker   rendered-worker-245c4f   True      False      False      2              2                   2                     0                      25d

Compare the list of mcp roles to the list of nodes in the cluster:

$ oc get nodes

Example output

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   39d   v1.27.15+6147456
worker-0       Ready    worker                 39d   v1.27.15+6147456
worker-1       Ready    worker                 39d   v1.27.15+6147456

Note: When you apply an mcp group change, the node roles are updated.

Determine how you want to separate the worker nodes into mcp groups.
17.1.3.3.4. Creating MachineConfigPool groups for the cluster
Creating mcp groups is a 2-step process:

- Add an mcp label to the nodes in the cluster.
- Apply an mcp CR to the cluster that organizes the nodes based on their labels.
Procedure
Label the nodes so that they can be put into mcp groups. Run the following commands:

$ oc label node worker-0 node-role.kubernetes.io/mcp-1=
$ oc label node worker-1 node-role.kubernetes.io/mcp-2=

The mcp-1 and mcp-2 labels are applied to the nodes. For example:

Example output

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   39d   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           39d   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           39d   v1.27.15+6147456

Create YAML custom resources (CRs) that apply the labels as mcp CRs in the cluster. Save the following YAML in the mcps.yaml file:

---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-2
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,mcp-2] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-2: ""
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-1
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,mcp-1] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-1: ""
Create the MachineConfigPool resources:

$ oc apply -f mcps.yaml
Example output
machineconfigpool.machineconfiguration.openshift.io/mcp-2 created
Verification
Monitor the MachineConfigPool resources as they are applied in the cluster. After you apply the mcp resources, the nodes are added into the new machine config pools. This takes a few minutes.

The nodes do not reboot while being added into the mcp groups. The original worker and master mcp groups remain unchanged.
Check the status of the new mcp resources:

$ oc get mcp

Example output

NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be3e83   True      False      False      3              3                   3                     0                      25d
mcp-1    rendered-mcp-1-2f4c4f    False     True       True       1              0                   0                     0                      10s
mcp-2    rendered-mcp-2-2r4s1f    False     True       True       1              0                   0                     0                      10s
worker   rendered-worker-23fc4f   False     True       True       0              0                   0                     2                      25d

Eventually, the resources are fully applied:

NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be3e83   True      False      False      3              3                   3                     0                      25d
mcp-1    rendered-mcp-1-2f4c4f    True      False      False      1              1                   1                     0                      7m33s
mcp-2    rendered-mcp-2-2r4s1f    True      False      False      1              1                   1                     0                      51s
worker   rendered-worker-23fc4f   True      False      False      0              0                   0                     0                      25d
17.1.3.4. Telco deployment environment considerations
In telco environments, most clusters are in disconnected networks. To update clusters in these environments, you must update your offline image repository.
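As an illustration only, mirroring the target release into the offline registry typically happens before you change channels. The following ImageSetConfiguration sketch assumes that you use the oc-mirror plugin; the registry URL, channel, and version values are placeholders that you replace for your environment:

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  registry:
    imageURL: registry.example.com:5000/mirror/oc-mirror-metadata   # placeholder offline registry
mirror:
  platform:
    channels:
      - name: eus-4.16        # channel that contains the target release
        minVersion: 4.16.14   # example target version
        maxVersion: 4.16.14

After mirroring, apply the generated image content source resources to the cluster and verify that the release images resolve from the offline registry before starting the update; consult the disconnected mirroring documentation for the exact workflow for your version.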
17.1.3.5. Preparing the cluster platform for update
Before you update the cluster, perform some basic checks and verifications to make sure that the cluster is ready for the update.
Procedure
Verify that there are no failed or in progress pods in the cluster by running the following command:
$ oc get pods -A | grep -E -vi 'complete|running'
Note: You might have to run this command more than once if there are pods that are in a pending state.
Verify that all nodes in the cluster are available:
$ oc get nodes
Example output
NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   32d   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   32d   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   32d   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           32d   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           32d   v1.27.15+6147456
Verify that all bare-metal nodes are provisioned and ready.
$ oc get bmh -n openshift-machine-api
Example output
NAME           STATE         CONSUMER                   ONLINE   ERROR   AGE
ctrl-plane-0   unmanaged     cnf-58879-master-0         true             33d
ctrl-plane-1   unmanaged     cnf-58879-master-1         true             33d
ctrl-plane-2   unmanaged     cnf-58879-master-2         true             33d
worker-0       unmanaged     cnf-58879-worker-0-45879   true             33d
worker-1       progressing   cnf-58879-worker-0-dszsh   false            1d 1

1: An error occurred while provisioning the worker-1 node.
Verification
Verify that all cluster Operators are ready:
$ oc get co
Example output
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.14.34   True        False         False      17h
baremetal        4.14.34   True        False         False      32d
...
service-ca       4.14.34   True        False         False      32d
storage          4.14.34   True        False         False      32d
17.1.4. Configuring CNF pods before updating the telco core CNF cluster
Follow the guidance in Red Hat best practices for Kubernetes when developing cloud-native network functions (CNFs) to ensure that the cluster can schedule pods during an update.
Always deploy pods in groups by using Deployment resources. Deployment resources spread the workload across all of the available pods, ensuring there is no single point of failure. When a pod that is managed by a Deployment resource is deleted, a new pod takes its place automatically.
17.1.4.1. Ensuring that CNF workloads run uninterrupted with pod disruption budgets
You can configure the minimum number of pods in a deployment to allow the CNF workload to run uninterrupted by setting a pod disruption budget in a PodDisruptionBudget custom resource (CR) that you apply. Be careful when setting this value; setting it improperly can cause an update to fail.

For example, if you have 4 pods in a deployment and you set the pod disruption budget to 4, the cluster scheduler keeps 4 pods running at all times - no pods can be scaled down.

Instead, set the pod disruption budget to 2, which allows 2 of the 4 pods to be taken down. Then, the worker nodes where those pods are located can be rebooted.

Setting the pod disruption budget to 2 does not mean that your deployment runs on only 2 pods for a period of time, for example, during an update. The cluster scheduler creates 2 new pods to replace the 2 older pods. However, there is a short period of time between the new pods coming online and the old pods being deleted.
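A minimal sketch of such a PodDisruptionBudget CR is shown below; the name, namespace, label selector, and minAvailable value are examples that you adapt to your deployment:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cnf-workload-pdb       # example name
  namespace: cnf-workload      # example namespace
spec:
  minAvailable: 2              # keep at least 2 of the 4 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: cnf-workload        # must match the labels on the pods in the Deployment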
17.1.4.2. Ensuring that pods do not run on the same cluster node
High availability in Kubernetes requires duplicate processes to be running on separate nodes in the cluster. This ensures that the application continues to run even if one node becomes unavailable. In OpenShift Container Platform, processes can be automatically duplicated in separate pods in a deployment. You configure anti-affinity in the Pod spec to ensure that the pods in a deployment do not run on the same cluster node.
During an update, setting pod anti-affinity ensures that pods are distributed evenly across nodes in the cluster. This means that node reboots are easier during an update. For example, if there are 4 pods from a single deployment on a node, and the pod disruption budget is set to only allow 1 pod to be deleted at a time, then it will take 4 times as long for that node to reboot. Setting pod anti-affinity spreads pods across the cluster to prevent such occurrences.
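The following is a minimal sketch of a Deployment that uses pod anti-affinity to keep its replicas on separate nodes; the names, labels, and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cnf-workload           # example name
  namespace: cnf-workload      # example namespace
spec:
  replicas: 4
  selector:
    matchLabels:
      app: cnf-workload
  template:
    metadata:
      labels:
        app: cnf-workload
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: cnf-workload
              topologyKey: kubernetes.io/hostname    # no two pods of this Deployment share a node
      containers:
        - name: cnf-workload
          image: registry.example.com/cnf-workload:latest   # placeholder image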
17.1.4.3. Application liveness, readiness, and startup probes
You can use liveness, readiness, and startup probes to check the health of your live application containers before you schedule an update. These probes are especially useful for pods that must maintain state for their application containers.
- Liveness health check: Determines if a container is running. If the liveness probe fails for a container, the pod responds based on the restart policy.
- Readiness probe: Determines if a container is ready to accept service requests. If the readiness probe fails for a container, the kubelet removes the container from the list of available service endpoints.
- Startup probe: Indicates whether the application within a container is started. All other probes are disabled until the startup succeeds. If the startup probe does not succeed, the kubelet kills the container, and the container is subject to the pod restartPolicy setting.
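As a sketch, the probes are configured per container in the pod template; the endpoints, ports, and timing values below are illustrative only and must match what your application actually exposes:

containers:
  - name: cnf-workload                     # example container
    image: registry.example.com/cnf-workload:latest   # placeholder image
    startupProbe:                          # gates the other probes until the application has started
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:                         # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:                        # removes the pod from service endpoints when not ready
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5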
17.1.5. Before you update the telco core CNF cluster
Before you start the cluster update, you must pause worker nodes, back up the etcd database, and do a final cluster health check before proceeding.
17.1.5.1. Pausing worker nodes before the update
You must pause the worker nodes before you proceed with the update. In the following example, there are 2 mcp groups, mcp-1 and mcp-2. You patch the spec.paused field to true for each of these MachineConfigPool groups.
Procedure
Patch the mcp CRs to pause the nodes and drain and remove the pods from those nodes by running the following commands:

$ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":true}}'
$ oc patch mcp/mcp-2 --type merge --patch '{"spec":{"paused":true}}'

Get the status of the paused mcp groups:

$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP      Paused
---      ------
master   false
mcp-1    true
mcp-2    true
The default control plane and worker mcp groups are not changed during an update.
17.1.5.2. Back up the etcd database before you proceed with the update
You must back up the etcd database before you proceed with the update.
17.1.5.2.1. Backing up etcd data
Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static pods. This backup can be saved and used at a later time if you need to restore etcd.
Only save a backup from a single control plane host. Do not take a backup from each control plane host in the cluster.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have checked whether the cluster-wide proxy is enabled.

Tip: You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.
Procedure
Start a debug session as root for a control plane node:
$ oc debug --as-root node/<node_name>
Change your root directory to /host in the debug shell:

sh-4.4# chroot /host

If the cluster-wide proxy is enabled, export the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables by running the following commands:

$ export HTTP_PROXY=http://<your_proxy.example.com>:8080
$ export HTTPS_PROXY=https://<your_proxy.example.com>:8080
$ export NO_PROXY=<example.com>

Run the cluster-backup.sh script in the debug shell and pass in the location to save the backup to.

Tip: The cluster-backup.sh script is maintained as a component of the etcd Cluster Operator and is a wrapper around the etcdctl snapshot save command.

sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
Example script output
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-6
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-7
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-3
ede95fe6b88b87ba86a03c15e669fb4aa5bf0991c180d3c6895ce72eaade54a1
etcdctl version: 3.4.14
API version: 3.4
{"level":"info","ts":1624647639.0188997,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/home/core/assets/backup/snapshot_2021-06-25_190035.db.part"}
{"level":"info","ts":"2021-06-25T19:00:39.030Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1624647639.0301006,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://10.0.0.5:2379"}
{"level":"info","ts":"2021-06-25T19:00:40.215Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1624647640.6032252,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://10.0.0.5:2379","size":"114 MB","took":1.584090459}
{"level":"info","ts":1624647640.6047094,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/home/core/assets/backup/snapshot_2021-06-25_190035.db"}
Snapshot saved at /home/core/assets/backup/snapshot_2021-06-25_190035.db
{"hash":3866667823,"revision":31407,"totalKey":12828,"totalSize":114446336}
snapshot db and kube resources are successfully saved to /home/core/assets/backup
In this example, two files are created in the /home/core/assets/backup/ directory on the control plane host:

- snapshot_<datetimestamp>.db: This file is the etcd snapshot. The cluster-backup.sh script confirms its validity.
- static_kuberesources_<datetimestamp>.tar.gz: This file contains the resources for the static pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.

Note: If etcd encryption is enabled, it is recommended to store this second file separately from the etcd snapshot for security reasons. However, this file is required to restore from the etcd snapshot.

Keep in mind that etcd encryption only encrypts values, not keys. This means that resource types, namespaces, and object names are unencrypted.
17.1.5.2.2. Creating a single etcd backup
Follow these steps to create a single etcd backup by creating and applying a custom resource (CR).
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have access to the OpenShift CLI (oc).
Procedure
If dynamically-provisioned storage is available, complete the following steps to create a single automated etcd backup:
Create a persistent volume claim (PVC) named etcd-backup-pvc.yaml with contents such as the following example:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: etcd-backup-pvc
  namespace: openshift-etcd
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi 1
  volumeMode: Filesystem
1: The amount of storage available to the PVC. Adjust this value for your requirements.
Apply the PVC by running the following command:
$ oc apply -f etcd-backup-pvc.yaml
Verify the creation of the PVC by running the following command:
$ oc get pvc
Example output
NAME              STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
etcd-backup-pvc   Bound                                                      51s
Note: Dynamic PVCs stay in the Pending state until they are mounted.

Create a CR file named etcd-single-backup.yaml with contents such as the following example:

apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: etcd-single-backup
  namespace: openshift-etcd
spec:
  pvcName: etcd-backup-pvc 1
1: The name of the PVC to save the backup to. Adjust this value according to your environment.
Apply the CR to start a single backup:
$ oc apply -f etcd-single-backup.yaml
If dynamically-provisioned storage is not available, complete the following steps to create a single automated etcd backup:
Create a StorageClass CR file named etcd-backup-local-storage.yaml with the following contents:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: etcd-backup-local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate
Apply the StorageClass CR by running the following command:

$ oc apply -f etcd-backup-local-storage.yaml
Create a PV named etcd-backup-pv-fs.yaml with contents such as the following example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: etcd-backup-pv-fs
spec:
  capacity:
    storage: 100Gi 1
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: etcd-backup-local-storage
  local:
    path: /mnt
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - <example_master_node> 2
Verify the creation of the PV by running the following command:
$ oc get pv
Example output
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS                REASON   AGE
etcd-backup-pv-fs   100Gi      RWO            Retain           Available           etcd-backup-local-storage            10s
Create a PVC named etcd-backup-pvc.yaml with contents such as the following example:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: etcd-backup-pvc
  namespace: openshift-etcd
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi 1
1: The amount of storage available to the PVC. Adjust this value for your requirements.
Apply the PVC by running the following command:
$ oc apply -f etcd-backup-pvc.yaml
Create a CR file named etcd-single-backup.yaml with contents such as the following example:

apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: etcd-single-backup
  namespace: openshift-etcd
spec:
  pvcName: etcd-backup-pvc 1
1: The name of the persistent volume claim (PVC) to save the backup to. Adjust this value according to your environment.
Apply the CR to start a single backup:
$ oc apply -f etcd-single-backup.yaml
17.1.5.3. Checking the cluster health
You should check the cluster health often during the update. Check the node status, the cluster Operator status, and failed pods.
Procedure
Check the status of the cluster Operators by running the following command:
$ oc get co
Example output
NAME                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication             4.14.34   True        False         False      4d22h
baremetal                  4.14.34   True        False         False      4d22h
cloud-controller-manager   4.14.34   True        False         False      4d23h
cloud-credential           4.14.34   True        False         False      4d23h
cluster-autoscaler         4.14.34   True        False         False      4d22h
config-operator            4.14.34   True        False         False      4d22h
console                    4.14.34   True        False         False      4d22h
...
service-ca                 4.14.34   True        False         False      4d22h
storage                    4.14.34   True        False         False      4d22h
Check the status of the cluster nodes:
$ oc get nodes
Example output
NAME           STATUS   ROLES                  AGE     VERSION
ctrl-plane-0   Ready    control-plane,master   4d22h   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   4d22h   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   4d22h   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           4d22h   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           4d22h   v1.27.15+6147456
Check that there are no in-progress or failed pods. There should be no pods returned when you run the following command.
$ oc get po -A | grep -E -iv 'running|complete'
17.1.6. Completing the Control Plane Only cluster update
Follow these steps to perform the Control Plane Only cluster update and monitor the update through to completion.
Control Plane Only updates were previously known as EUS-to-EUS updates. Control Plane Only updates are only viable between even-numbered minor versions of OpenShift Container Platform.
17.1.6.1. Acknowledging the Control Plane Only or y-stream update
When you update to any version from 4.11 onward, you must manually acknowledge that the update can continue.
Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OpenShift Container Platform 4.17, there are no API removals. See "Kubernetes API removals" for more information.
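As a sketch, one way to check for in-use APIs that are flagged for removal, assuming the APIRequestCount API is available in your cluster version, is to run the following query and review any resources that it lists:

$ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'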
Procedure
Run the following command:
$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-<update_version_from>-kube-<kube_api_version>-api-removals-in-<update_version_to>":"true"}}' --type=merge
where:
- <update_version_from>: The cluster version you are moving from, for example, 4.14.
- <kube_api_version>: The kube API version, for example, 1.28.
- <update_version_to>: The cluster version you are moving to, for example, 4.15.
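For example, when moving from 4.14 to 4.15 (Kubernetes 1.28), the patch might look like the following; confirm the exact acknowledgment key from the oc adm upgrade output for your cluster before running it:

$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.14-kube-1.28-api-removals-in-4.15":"true"}}' --type=merge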
Verification
Verify the update. Run the following command:
$ oc get configmap admin-acks -n openshift-config -o json | jq .data
Example output
{ "ack-4.14-kube-1.28-api-removals-in-4.15": "true", "ack-4.15-kube-1.29-api-removals-in-4.16": "true" }
Note: In this example, the cluster is updated from version 4.14 to 4.15, and then from 4.15 to 4.16 in a Control Plane Only update.
17.1.6.2. Starting the cluster update
When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.
You can verify that you are updating to a viable release by running the oc adm upgrade command. The oc adm upgrade command lists the compatible update releases.
Procedure
Start the update:
$ oc adm upgrade --to=4.15.33
Important:
- Control Plane Only update: Make sure you point to the interim <y+1> release path.
- Y-stream update: Make sure you use the correct <y.z> release that follows the Kubernetes version skew policy.
- Z-stream update: Verify that there are no problems moving to that specific release.
Example output
Requested update to 4.15.33 1

1: The Requested update value changes depending on your particular update.
17.1.6.3. Monitoring the cluster update
You should check the cluster health often during the update. Check the node status, the cluster Operator status, and failed pods.
Procedure
Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:
$ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"
Example output
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.34   True        True          4m6s    Working towards 4.15.33: 111 of 873 done (12% complete), waiting on kube-apiserver

NAME                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication             4.14.34   True        False         False      4d22h
baremetal                  4.14.34   True        False         False      4d23h
cloud-controller-manager   4.14.34   True        False         False      4d23h
cloud-credential           4.14.34   True        False         False      4d23h
cluster-autoscaler         4.14.34   True        False         False      4d23h
console                    4.14.34   True        False         False      4d22h
...
storage                    4.14.34   True        False         False      4d23h
config-operator            4.15.33   True        False         False      4d23h
etcd                       4.15.33   True        False         False      4d23h

NAME           STATUS   ROLES                  AGE     VERSION
ctrl-plane-0   Ready    control-plane,master   4d23h   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   4d23h   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   4d23h   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           4d23h   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           4d23h   v1.27.15+6147456

NAMESPACE               NAME                       READY   STATUS              RESTARTS   AGE
openshift-marketplace   redhat-marketplace-rf86t   0/1     ContainerCreating   0          0s
Verification
During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of the Operator update in the MESSAGE column.

When the cluster Operator update process is complete, each control plane node is rebooted, one at a time.

During this part of the update, messages are reported that state that cluster Operators are being updated again or are in a degraded state. This is because the control plane node is offline while it reboots.

As soon as the last control plane node reboot is complete, the cluster version is displayed as updated.
When the control plane update is complete, a message such as the following is displayed. This example shows an update completed to the intermediate y-stream release.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.33   True        False         28m     Cluster version is 4.15.33

NAME                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication             4.15.33   True        False         False      5d
baremetal                  4.15.33   True        False         False      5d
cloud-controller-manager   4.15.33   True        False         False      5d1h
cloud-credential           4.15.33   True        False         False      5d1h
cluster-autoscaler         4.15.33   True        False         False      5d
config-operator            4.15.33   True        False         False      5d
console                    4.15.33   True        False         False      5d
...
service-ca                 4.15.33   True        False         False      5d
storage                    4.15.33   True        False         False      5d

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-1   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-2   Ready    control-plane,master   5d    v1.28.13+2ca1a23
worker-0       Ready    mcp-1,worker           5d    v1.28.13+2ca1a23
worker-1       Ready    mcp-2,worker           5d    v1.28.13+2ca1a23
17.1.6.4. Updating the OLM Operators
In telco environments, software needs to be vetted before it is loaded onto a production cluster. Production clusters are also configured in a disconnected network, which means that they are not always directly connected to the internet. Because the clusters are in a disconnected network, the OpenShift Operators are configured for manual update during installation so that new versions can be managed on a cluster-by-cluster basis. Perform the following procedure to move the Operators to the newer versions.
Procedure
Check to see which Operators need to be updated:
$ oc get installplan -A | grep -E 'APPROVED|false'
Example output
NAMESPACE           NAME            CSV                                                APPROVAL   APPROVED
metallb-system      install-nwjnh   metallb-operator.v4.16.0-202409202304              Manual     false
openshift-nmstate   install-5r7wr   kubernetes-nmstate-operator.4.16.0-202409251605    Manual     false
Patch the InstallPlan resources for those Operators:

$ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
  '{"spec":{"approved":true}}'
Example output
installplan.operators.coreos.com/install-nwjnh patched
Monitor the namespace by running the following command:
$ oc get all -n metallb-system
Example output
NAME                                                        READY   STATUS              RESTARTS   AGE
pod/metallb-operator-controller-manager-69b5f884c-8bp22     0/1     ContainerCreating   0          4s
pod/metallb-operator-controller-manager-77895bdb46-bqjdx    1/1     Running             0          4m1s
pod/metallb-operator-webhook-server-5d9b968896-vnbhk        0/1     ContainerCreating   0          4s
pod/metallb-operator-webhook-server-d76f9c6c8-57r4w         1/1     Running             0          4m1s
...
NAME                                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         0       4s
replicaset.apps/metallb-operator-controller-manager-77895bdb46   1         1         1       4m1s
replicaset.apps/metallb-operator-controller-manager-99b76f88     0         0         0       4m40s
replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         0       4s
replicaset.apps/metallb-operator-webhook-server-6f7dbfdb88       0         0         0       4m40s
replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        1         1         1       4m1s
When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready:

NAME                                                       READY   STATUS    RESTARTS   AGE
pod/metallb-operator-controller-manager-69b5f884c-8bp22    1/1     Running   0          25s
pod/metallb-operator-webhook-server-5d9b968896-vnbhk       1/1     Running   0          25s
...
NAME                                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         1       25s
replicaset.apps/metallb-operator-controller-manager-77895bdb46   0         0         0       4m22s
replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         1       25s
replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        0         0         0       4m22s
Verification
Verify that the Operators do not need to be updated for a second time:
$ oc get installplan -A | grep -E 'APPROVED|false'
There should be no output returned.
Note: Sometimes you have to approve an update twice because some Operators have interim z-stream release versions that need to be installed before the final version.
17.1.6.4.1. Performing the second y-stream update
After completing the first y-stream update, you must update the y-stream control plane version to the new EUS version.
Procedure
Verify that the <4.y.z> release that you selected is still listed as a good channel to move to:
$ oc adm upgrade
Example output
Cluster version is 4.15.33

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.29 and therefore OpenShift 4.16 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7031404 for details and instructions.

Upstream is unset, so the cluster will use an appropriate default.
Channel: eus-4.16 (available channels: candidate-4.15, candidate-4.16, eus-4.16, fast-4.15, fast-4.16, stable-4.15, stable-4.16)

Recommended updates:

  VERSION     IMAGE
  4.16.14     quay.io/openshift-release-dev/ocp-release@sha256:0521a0f1acd2d1b77f76259cb9bae9c743c60c37d9903806a3372c1414253658
  4.16.13     quay.io/openshift-release-dev/ocp-release@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b
  4.15.34     quay.io/openshift-release-dev/ocp-release@sha256:f2e0c593f6ed81250c11d0bac94dbaf63656223477b7e8693a652f933056af6e
Note: If you update soon after the initial GA of a new y-stream release, you might not see new y-stream releases available when you run the oc adm upgrade command.

Optional: View the potential update releases that are not recommended. Run the following command:
$ oc adm upgrade --include-not-recommended
Example output
Cluster version is 4.15.33

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.29 and therefore OpenShift 4.16 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7031404 for details and instructions.

Upstream is unset, so the cluster will use an appropriate default.
Channel: eus-4.16 (available channels: candidate-4.15, candidate-4.16, eus-4.16, fast-4.15, fast-4.16, stable-4.15, stable-4.16)

Recommended updates:

  VERSION     IMAGE
  4.16.14     quay.io/openshift-release-dev/ocp-release@sha256:0521a0f1acd2d1b77f76259cb9bae9c743c60c37d9903806a3372c1414253658
  4.16.13     quay.io/openshift-release-dev/ocp-release@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b
  4.15.34     quay.io/openshift-release-dev/ocp-release@sha256:f2e0c593f6ed81250c11d0bac94dbaf63656223477b7e8693a652f933056af6e

Supported but not recommended updates:

  Version: 4.16.15
  Image: quay.io/openshift-release-dev/ocp-release@sha256:671bc35e
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to AzureRegistryImagePreservation is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 0
    In Azure clusters, the in-cluster image registry may fail to preserve images on update. https://issues.redhat.com/browse/IR-461
Note: The example shows a potential error that can affect clusters hosted in Microsoft Azure. It does not show risks for bare-metal clusters.
17.1.6.4.2. Acknowledging the y-stream release update
When moving between y-stream releases, you must run a patch command to explicitly acknowledge the update. In the output of the oc adm upgrade command, a URL is provided that shows the specific command to run.
Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OpenShift Container Platform 4.17, there are no API removals. See "Kubernetes API removals" for more information.
Procedure
Acknowledge the y-stream release upgrade by patching the admin-acks config map in the openshift-config namespace. For example, run the following command:

$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.15-kube-1.29-api-removals-in-4.16":"true"}}' --type=merge
Example output
configmap/admin-acks patched
17.1.6.5. Starting the y-stream control plane update
After you have determined the full new release that you are moving to, you can run the oc adm upgrade --to=x.y.z command.
Procedure
Start the y-stream control plane update. For example, run the following command:
$ oc adm upgrade --to=4.16.14
Example output
Requested update to 4.16.14
You might move to a z-stream release that has potential issues with platforms other than the one you are running on. The following example shows a potential problem for cluster updates on Microsoft Azure:
$ oc adm upgrade --to=4.16.15
Example output
error: the update 4.16.15 is not one of the recommended updates, but is available as a conditional update. To accept the Recommended=Unknown risk and to proceed with update use --allow-not-recommended.
  Reason: EvaluationFailed
  Message: Exposure to AzureRegistryImagePreservation is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 0
  In Azure clusters, the in-cluster image registry may fail to preserve images on update. https://issues.redhat.com/browse/IR-461
Note: The example shows a potential error that can affect clusters hosted in Microsoft Azure. It does not show risks for bare-metal clusters.

If you accept the stated risk, proceed with the update by adding the --allow-not-recommended option:
$ oc adm upgrade --to=4.16.15 --allow-not-recommended
Example output
warning: with --allow-not-recommended you have accepted the risks with 4.14.11 and bypassed Recommended=Unknown EvaluationFailed: Exposure to AzureRegistryImagePreservation is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 0
  In Azure clusters, the in-cluster image registry may fail to preserve images on update. https://issues.redhat.com/browse/IR-461
Requested update to 4.16.15
17.1.6.6. Monitoring the second part of a <y+1> cluster update
Monitor the second part of the cluster update to the <y+1> version.
Procedure
Monitor the progress of the second part of the <y+1> update. For example, to monitor the update from 4.15 to 4.16, run the following command:
$ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.15; oc get co | grep 4.16; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"
Example output
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.33   True        True          10m     Working towards 4.16.14: 132 of 903 done (14% complete), waiting on kube-controller-manager, kube-scheduler

NAME                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication             4.15.33   True        False         False      5d3h
baremetal                  4.15.33   True        False         False      5d4h
cloud-controller-manager   4.15.33   True        False         False      5d4h
cloud-credential           4.15.33   True        False         False      5d4h
cluster-autoscaler         4.15.33   True        False         False      5d4h
console                    4.15.33   True        False         False      5d3h
...
config-operator            4.16.14   True        False         False      5d4h
etcd                       4.16.14   True        False         False      5d4h
kube-apiserver             4.16.14   True        True          False      5d4h    NodeInstallerProgressing: 1 node is at revision 15; 2 nodes are at revision 17

NAME           STATUS   ROLES                  AGE    VERSION
ctrl-plane-0   Ready    control-plane,master   5d4h   v1.28.13+2ca1a23
ctrl-plane-1   Ready    control-plane,master   5d4h   v1.28.13+2ca1a23
ctrl-plane-2   Ready    control-plane,master   5d4h   v1.28.13+2ca1a23
worker-0       Ready    mcp-1,worker           5d4h   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           5d4h   v1.27.15+6147456

NAMESPACE                  NAME                          READY   STATUS    RESTARTS   AGE
openshift-kube-apiserver   kube-apiserver-ctrl-plane-0   0/5     Pending   0          <invalid>
As soon as the last control plane node update is complete, the cluster version is updated to the new EUS release. For example:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.14   True        False         123m    Cluster version is 4.16.14

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.16.14   True        False         False      5d6h
baremetal                                  4.16.14   True        False         False      5d7h
cloud-controller-manager                   4.16.14   True        False         False      5d7h
cloud-credential                           4.16.14   True        False         False      5d7h
cluster-autoscaler                         4.16.14   True        False         False      5d7h
config-operator                            4.16.14   True        False         False      5d7h
console                                    4.16.14   True        False         False      5d6h
#...
operator-lifecycle-manager-packageserver   4.16.14   True        False         False      5d7h
service-ca                                 4.16.14   True        False         False      5d7h
storage                                    4.16.14   True        False         False      5d7h

NAME           STATUS   ROLES                  AGE    VERSION
ctrl-plane-0   Ready    control-plane,master   5d7h   v1.29.8+f10c92d
ctrl-plane-1   Ready    control-plane,master   5d7h   v1.29.8+f10c92d
ctrl-plane-2   Ready    control-plane,master   5d7h   v1.29.8+f10c92d
worker-0       Ready    mcp-1,worker           5d7h   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           5d7h   v1.27.15+6147456
17.1.6.7. Updating all the OLM Operators
In the second phase of a multi-version upgrade, you must approve all of the Operators and additionally approve installation plans for any other Operators that you want to upgrade.
Follow the same procedure as outlined in "Updating the OLM Operators". Ensure that you also update any non-OLM Operators as required.
Procedure
Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:
$ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"
Check to see which Operators need to be updated:
$ oc get installplan -A | grep -E 'APPROVED|false'
Patch the InstallPlan resources for those Operators:

$ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
  '{"spec":{"approved":true}}'
Monitor the namespace by running the following command:
$ oc get all -n metallb-system
When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready.
Verification
During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of the Operator update in the MESSAGE column.

When the cluster Operator update process is complete, each control plane node is rebooted, one at a time.

During this part of the update, messages are reported that state that cluster Operators are being updated again or are in a degraded state. This is because the control plane node is offline while it reboots.
17.1.6.8. Updating the worker nodes
You upgrade the worker nodes after you have updated the control plane by unpausing the relevant mcp groups you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each of the worker nodes in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version as required.
In the case of Control Plane Only upgrades, note that when a worker node is updated, it requires only one reboot and jumps <y+2>-release versions. This is a feature that was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.
This is a potential holding point. You can have a cluster version that is fully supported to run in production with the control plane that is updated to a new EUS release while the worker nodes are at a <y-2>-release. This allows large clusters to upgrade in steps across several maintenance windows.
You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:

$ oc get mcp

Example output

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c9a52144456dbff9c9af9c5a37d1b614   True      False      False      3              3                   3                     0                      36d
mcp-1    rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
mcp-2    rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
worker   rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0   True      False      False      0              0                   0                     0                      36d

Note: You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.

Get the list of nodes in the cluster:
$ oc get nodes
Example output
NAME           STATUS   ROLES                  AGE    VERSION
ctrl-plane-0   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
ctrl-plane-1   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
ctrl-plane-2   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
worker-0       Ready    mcp-1,worker           5d8h   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           5d8h   v1.27.15+6147456
Confirm the MachineConfigPool groups that are paused:

$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP      Paused
---      ------
master   false
mcp-1    true
mcp-2    true
Note: Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, other MCPs do not need to be unpaused immediately. The cluster is supported to run with some worker nodes still at the <y-2>-release version.

Unpause the required mcp group to begin the upgrade:

$ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'
Example output
machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched
Confirm that the required mcp group is unpaused:

$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP      Paused
---      ------
master   false
mcp-1    false
mcp-2    true
As each mcp group is upgraded, continue to unpause and upgrade the remaining nodes. You can monitor the progress by running the following command:

$ oc get nodes
Example output
NAME           STATUS                        ROLES                  AGE    VERSION
ctrl-plane-0   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
ctrl-plane-1   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
ctrl-plane-2   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
worker-0       Ready                         mcp-1,worker           5d8h   v1.29.8+f10c92d
worker-1       NotReady,SchedulingDisabled   mcp-2,worker           5d8h   v1.27.15+6147456
17.1.6.9. Verifying the health of the newly updated cluster
Run the following commands after updating the cluster to verify that the cluster is back up and running.
Procedure
Check the cluster version by running the following command:
$ oc get clusterversion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.14 True False 4h38m Cluster version is 4.16.14
This should return the new cluster version and the PROGRESSING column should return False.
Check that all nodes are ready:
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d9h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d9h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d9h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d9h v1.29.8+f10c92d worker-1 Ready mcp-2,worker 5d9h v1.29.8+f10c92d
All nodes in the cluster should be in a Ready status and running the same version.
Check that there are no paused mcp resources in the cluster:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 false mcp-2 false
Check that all cluster Operators are available:
$ oc get co
Example output
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.16.14 True False False 5d9h baremetal 4.16.14 True False False 5d9h cloud-controller-manager 4.16.14 True False False 5d10h cloud-credential 4.16.14 True False False 5d10h cluster-autoscaler 4.16.14 True False False 5d9h config-operator 4.16.14 True False False 5d9h console 4.16.14 True False False 5d9h control-plane-machine-set 4.16.14 True False False 5d9h csi-snapshot-controller 4.16.14 True False False 5d9h dns 4.16.14 True False False 5d9h etcd 4.16.14 True False False 5d9h image-registry 4.16.14 True False False 85m ingress 4.16.14 True False False 5d9h insights 4.16.14 True False False 5d9h kube-apiserver 4.16.14 True False False 5d9h kube-controller-manager 4.16.14 True False False 5d9h kube-scheduler 4.16.14 True False False 5d9h kube-storage-version-migrator 4.16.14 True False False 4h48m machine-api 4.16.14 True False False 5d9h machine-approver 4.16.14 True False False 5d9h machine-config 4.16.14 True False False 5d9h marketplace 4.16.14 True False False 5d9h monitoring 4.16.14 True False False 5d9h network 4.16.14 True False False 5d9h node-tuning 4.16.14 True False False 5d7h openshift-apiserver 4.16.14 True False False 5d9h openshift-controller-manager 4.16.14 True False False 5d9h openshift-samples 4.16.14 True False False 5h24m operator-lifecycle-manager 4.16.14 True False False 5d9h operator-lifecycle-manager-catalog 4.16.14 True False False 5d9h operator-lifecycle-manager-packageserver 4.16.14 True False False 5d9h service-ca 4.16.14 True False False 5d9h storage 4.16.14 True False False 5d9h
All cluster Operators should report True in the AVAILABLE column.
Check that all pods are healthy:
$ oc get po -A | grep -E -iv 'complete|running'
This should not return any pods.
Note: You might see a few pods still moving after the update. Watch this for a while to make sure all pods are cleared.
17.1.7. Completing the y-stream cluster update
Follow these steps to perform the y-stream cluster update and monitor the update through to completion. Completing a y-stream update is more straightforward than a Control Plane Only update.
17.1.7.1. Acknowledging the Control Plane Only or y-stream update
For all updates to version 4.11 and later, you must manually acknowledge that the update can continue.
Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OpenShift Container Platform 4.17, there are no API removals. See "Kubernetes API removals" for more information.
Procedure
Run the following command:
$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-<update_version_from>-kube-<kube_api_version>-api-removals-in-<update_version_to>":"true"}}' --type=merge
where:
- <update_version_from> is the cluster version you are moving from, for example, 4.14.
- <kube_api_version> is the kube API version, for example, 1.28.
- <update_version_to> is the cluster version you are moving to, for example, 4.15.
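For example, to acknowledge the first hop of the Control Plane Only update shown in the verification output that follows, where the cluster moves from 4.14 to 4.15 and the kube API version is 1.28, the command resolves to:
$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.14-kube-1.28-api-removals-in-4.15":"true"}}' --type=merge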
Verification
Verify the update. Run the following command:
$ oc get configmap admin-acks -n openshift-config -o json | jq .data
Example output
{ "ack-4.14-kube-1.28-api-removals-in-4.15": "true", "ack-4.15-kube-1.29-api-removals-in-4.16": "true" }
Note: In this example, the cluster is updated from version 4.14 to 4.15, and then from 4.15 to 4.16 in a Control Plane Only update.
Additional resources
17.1.7.2. Starting the cluster update
When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.
You can verify that you are updating to a viable release by running the oc adm upgrade
command. The oc adm upgrade
command lists the compatible update releases.
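The following is a minimal sketch of what that listing can look like part-way through this scenario; the version, channel, and image digest are illustrative placeholders and vary by cluster:
$ oc adm upgrade
Example output
Cluster version is 4.14.34
Channel: eus-4.16 (available channels: ...)
Recommended updates:
  VERSION     IMAGE
  4.15.33     quay.io/openshift-release-dev/ocp-release@sha256:<digest>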
Procedure
Start the update:
$ oc adm upgrade --to=4.15.33
Important:
- Control Plane Only update: Make sure that you point to the interim <y+1> release path.
- Y-stream update: Make sure that you use the correct <y.z> release that follows the Kubernetes version skew policy.
- Z-stream update: Verify that there are no problems moving to that specific release.
Example output
Requested update to 4.15.33 1
1 The Requested update value changes depending on your particular update.
Additional resources
17.1.7.3. Monitoring the cluster update
You should check the cluster health often during the update. Check the node status, the cluster Operator status, and whether any pods have failed.
Procedure
Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:
$ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.34 True True 4m6s Working towards 4.15.33: 111 of 873 done (12% complete), waiting on kube-apiserver NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.14.34 True False False 4d22h baremetal 4.14.34 True False False 4d23h cloud-controller-manager 4.14.34 True False False 4d23h cloud-credential 4.14.34 True False False 4d23h cluster-autoscaler 4.14.34 True False False 4d23h console 4.14.34 True False False 4d22h ... storage 4.14.34 True False False 4d23h config-operator 4.15.33 True False False 4d23h etcd 4.15.33 True False False 4d23h NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 4d23h v1.27.15+6147456 ctrl-plane-1 Ready control-plane,master 4d23h v1.27.15+6147456 ctrl-plane-2 Ready control-plane,master 4d23h v1.27.15+6147456 worker-0 Ready mcp-1,worker 4d23h v1.27.15+6147456 worker-1 Ready mcp-2,worker 4d23h v1.27.15+6147456 NAMESPACE NAME READY STATUS RESTARTS AGE openshift-marketplace redhat-marketplace-rf86t 0/1 ContainerCreating 0 0s
Verification
During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of the Operator update in the MESSAGE column.
When the cluster Operator update process is complete, each control plane node is rebooted, one at a time.
During this part of the update, messages are reported that state that cluster Operators are being updated again or are in a degraded state. This is because the control plane node is offline while it reboots.
As soon as the last control plane node reboot is complete, the cluster version is displayed as updated.
When the control plane update is complete, a message such as the following is displayed. This example shows an update completed to the intermediate y-stream release.
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.33 True False 28m Cluster version is 4.15.33 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.33 True False False 5d baremetal 4.15.33 True False False 5d cloud-controller-manager 4.15.33 True False False 5d1h cloud-credential 4.15.33 True False False 5d1h cluster-autoscaler 4.15.33 True False False 5d config-operator 4.15.33 True False False 5d console 4.15.33 True False False 5d ... service-ca 4.15.33 True False False 5d storage 4.15.33 True False False 5d NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d v1.28.13+2ca1a23 ctrl-plane-1 Ready control-plane,master 5d v1.28.13+2ca1a23 ctrl-plane-2 Ready control-plane,master 5d v1.28.13+2ca1a23 worker-0 Ready mcp-1,worker 5d v1.28.13+2ca1a23 worker-1 Ready mcp-2,worker 5d v1.28.13+2ca1a23
17.1.7.4. Updating the OLM Operators
In telco environments, software needs to be vetted before it is loaded onto a production cluster. Production clusters are also configured in a disconnected network, which means that they are not always directly connected to the internet. Because the clusters are in a disconnected network, the OpenShift Operators are configured for manual update during installation so that new versions can be managed on a cluster-by-cluster basis. Perform the following procedure to move the Operators to the newer versions.
Procedure
Check to see which Operators need to be updated:
$ oc get installplan -A | grep -E 'APPROVED|false'
Example output
NAMESPACE NAME CSV APPROVAL APPROVED metallb-system install-nwjnh metallb-operator.v4.16.0-202409202304 Manual false openshift-nmstate install-5r7wr kubernetes-nmstate-operator.4.16.0-202409251605 Manual false
Patch the InstallPlan resources for those Operators:
$ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
  '{"spec":{"approved":true}}'
Example output
installplan.operators.coreos.com/install-nwjnh patched
Monitor the namespace by running the following command:
$ oc get all -n metallb-system
Example output
NAME READY STATUS RESTARTS AGE pod/metallb-operator-controller-manager-69b5f884c-8bp22 0/1 ContainerCreating 0 4s pod/metallb-operator-controller-manager-77895bdb46-bqjdx 1/1 Running 0 4m1s pod/metallb-operator-webhook-server-5d9b968896-vnbhk 0/1 ContainerCreating 0 4s pod/metallb-operator-webhook-server-d76f9c6c8-57r4w 1/1 Running 0 4m1s ... NAME DESIRED CURRENT READY AGE replicaset.apps/metallb-operator-controller-manager-69b5f884c 1 1 0 4s replicaset.apps/metallb-operator-controller-manager-77895bdb46 1 1 1 4m1s replicaset.apps/metallb-operator-controller-manager-99b76f88 0 0 0 4m40s replicaset.apps/metallb-operator-webhook-server-5d9b968896 1 1 0 4s replicaset.apps/metallb-operator-webhook-server-6f7dbfdb88 0 0 0 4m40s replicaset.apps/metallb-operator-webhook-server-d76f9c6c8 1 1 1 4m1s
When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready:
NAME READY STATUS RESTARTS AGE pod/metallb-operator-controller-manager-69b5f884c-8bp22 1/1 Running 0 25s pod/metallb-operator-webhook-server-5d9b968896-vnbhk 1/1 Running 0 25s ... NAME DESIRED CURRENT READY AGE replicaset.apps/metallb-operator-controller-manager-69b5f884c 1 1 1 25s replicaset.apps/metallb-operator-controller-manager-77895bdb46 0 0 0 4m22s replicaset.apps/metallb-operator-webhook-server-5d9b968896 1 1 1 25s replicaset.apps/metallb-operator-webhook-server-d76f9c6c8 0 0 0 4m22s
Verification
Verify that the Operators do not need to be updated for a second time:
$ oc get installplan -A | grep -E 'APPROVED|false'
There should be no output returned.
Note: Sometimes you have to approve an update twice because some Operators have interim z-stream release versions that need to be installed before the final version.
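If several Operators are pending at the same time, you can approve every unapproved InstallPlan in one pass. The following loop is a sketch rather than part of the product tooling; review the output of the previous command first, because approving an InstallPlan starts the Operator update immediately:
$ oc get installplan -A -o json | \
  jq -r '.items[] | select(.spec.approved == false) | "\(.metadata.namespace) \(.metadata.name)"' | \
  while read -r namespace name; do
    oc patch installplan -n "${namespace}" "${name}" --type merge --patch '{"spec":{"approved":true}}'
  done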
Additional resources
17.1.7.5. Updating the worker nodes
You upgrade the worker nodes after you have updated the control plane by unpausing the relevant mcp groups that you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each of the worker nodes in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version as required.
In the case of Control Plane Only upgrades, note that when a worker node is updated, it only requires one reboot and jumps <y+2>-release versions. This feature was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.
This is a potential holding point. You can have a cluster version that is fully supported to run in production, with the control plane updated to a new EUS release while the worker nodes are at a <y-2>-release. This allows large clusters to upgrade in steps across several maintenance windows.
You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:
$ oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-c9a52144456dbff9c9af9c5a37d1b614 True False False 3 3 3 0 36d mcp-1 rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e False False False 1 0 0 0 47h mcp-2 rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e False False False 1 0 0 0 47h worker rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0 True False False 0 0 0 0 36d
Note: You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.
Get the list of nodes in the cluster:
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d8h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d8h v1.27.15+6147456 worker-1 Ready mcp-2,worker 5d8h v1.27.15+6147456
Confirm the MachineConfigPool groups that are paused:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 true mcp-2 true
Note: Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, other MCPs do not need to be unpaused immediately. The cluster is supported to run with some worker nodes still at the <y-2>-release version.
Unpause the required mcp group to begin the upgrade:
$ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'
Example output
machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched
Confirm that the required mcp group is unpaused:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 false mcp-2 true
As each mcp group is upgraded, continue to unpause and upgrade the remaining nodes.
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d8h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d8h v1.29.8+f10c92d worker-1 NotReady,SchedulingDisabled mcp-2,worker 5d8h v1.27.15+6147456
17.1.7.6. Verifying the health of the newly updated cluster
Run the following commands after updating the cluster to verify that the cluster is back up and running.
Procedure
Check the cluster version by running the following command:
$ oc get clusterversion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.14 True False 4h38m Cluster version is 4.16.14
This should return the new cluster version and the PROGRESSING column should return False.
Check that all nodes are ready:
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d9h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d9h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d9h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d9h v1.29.8+f10c92d worker-1 Ready mcp-2,worker 5d9h v1.29.8+f10c92d
All nodes in the cluster should be in a Ready status and running the same version.
Check that there are no paused mcp resources in the cluster:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 false mcp-2 false
Check that all cluster Operators are available:
$ oc get co
Example output
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.16.14 True False False 5d9h baremetal 4.16.14 True False False 5d9h cloud-controller-manager 4.16.14 True False False 5d10h cloud-credential 4.16.14 True False False 5d10h cluster-autoscaler 4.16.14 True False False 5d9h config-operator 4.16.14 True False False 5d9h console 4.16.14 True False False 5d9h control-plane-machine-set 4.16.14 True False False 5d9h csi-snapshot-controller 4.16.14 True False False 5d9h dns 4.16.14 True False False 5d9h etcd 4.16.14 True False False 5d9h image-registry 4.16.14 True False False 85m ingress 4.16.14 True False False 5d9h insights 4.16.14 True False False 5d9h kube-apiserver 4.16.14 True False False 5d9h kube-controller-manager 4.16.14 True False False 5d9h kube-scheduler 4.16.14 True False False 5d9h kube-storage-version-migrator 4.16.14 True False False 4h48m machine-api 4.16.14 True False False 5d9h machine-approver 4.16.14 True False False 5d9h machine-config 4.16.14 True False False 5d9h marketplace 4.16.14 True False False 5d9h monitoring 4.16.14 True False False 5d9h network 4.16.14 True False False 5d9h node-tuning 4.16.14 True False False 5d7h openshift-apiserver 4.16.14 True False False 5d9h openshift-controller-manager 4.16.14 True False False 5d9h openshift-samples 4.16.14 True False False 5h24m operator-lifecycle-manager 4.16.14 True False False 5d9h operator-lifecycle-manager-catalog 4.16.14 True False False 5d9h operator-lifecycle-manager-packageserver 4.16.14 True False False 5d9h service-ca 4.16.14 True False False 5d9h storage 4.16.14 True False False 5d9h
All cluster Operators should report True in the AVAILABLE column.
Check that all pods are healthy:
$ oc get po -A | grep -E -iv 'complete|running'
This should not return any pods.
Note: You might see a few pods still moving after the update. Watch this for a while to make sure all pods are cleared.
17.1.8. Completing the z-stream cluster update
Follow these steps to perform the z-stream cluster update and monitor the update through to completion. Completing a z-stream update is more straightforward than a Control Plane Only or y-stream update.
17.1.8.1. Starting the cluster update
When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.
You can verify that you are updating to a viable release by running the oc adm upgrade
command. The oc adm upgrade
command lists the compatible update releases.
Procedure
Start the update:
$ oc adm upgrade --to=4.15.33
Important:
- Control Plane Only update: Make sure that you point to the interim <y+1> release path.
- Y-stream update: Make sure that you use the correct <y.z> release that follows the Kubernetes version skew policy.
- Z-stream update: Verify that there are no problems moving to that specific release.
Example output
Requested update to 4.15.33 1
1 The Requested update value changes depending on your particular update.
Additional resources
17.1.8.2. Updating the worker nodes
You upgrade the worker nodes after you have updated the control plane by unpausing the relevant mcp groups that you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each of the worker nodes in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version as required.
In the case of Control Plane Only upgrades, note that when a worker node is updated, it only requires one reboot and jumps <y+2>-release versions. This feature was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.
This is a potential holding point. You can have a cluster version that is fully supported to run in production, with the control plane updated to a new EUS release while the worker nodes are at a <y-2>-release. This allows large clusters to upgrade in steps across several maintenance windows.
You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:
$ oc get mcp
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-c9a52144456dbff9c9af9c5a37d1b614 True False False 3 3 3 0 36d mcp-1 rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e False False False 1 0 0 0 47h mcp-2 rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e False False False 1 0 0 0 47h worker rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0 True False False 0 0 0 0 36d
Note: You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.
Get the list of nodes in the cluster:
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d8h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d8h v1.27.15+6147456 worker-1 Ready mcp-2,worker 5d8h v1.27.15+6147456
Confirm the MachineConfigPool groups that are paused:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 true mcp-2 true
Note: Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, other MCPs do not need to be unpaused immediately. The cluster is supported to run with some worker nodes still at the <y-2>-release version.
Unpause the required mcp group to begin the upgrade:
$ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'
Example output
machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched
Confirm that the required mcp group is unpaused:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 false mcp-2 true
As each mcp group is upgraded, continue to unpause and upgrade the remaining nodes.
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d8h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d8h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d8h v1.29.8+f10c92d worker-1 NotReady,SchedulingDisabled mcp-2,worker 5d8h v1.27.15+6147456
17.1.8.3. Verifying the health of the newly updated cluster
Run the following commands after updating the cluster to verify that the cluster is back up and running.
Procedure
Check the cluster version by running the following command:
$ oc get clusterversion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.14 True False 4h38m Cluster version is 4.16.14
This should return the new cluster version and the PROGRESSING column should return False.
Check that all nodes are ready:
$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ctrl-plane-0 Ready control-plane,master 5d9h v1.29.8+f10c92d ctrl-plane-1 Ready control-plane,master 5d9h v1.29.8+f10c92d ctrl-plane-2 Ready control-plane,master 5d9h v1.29.8+f10c92d worker-0 Ready mcp-1,worker 5d9h v1.29.8+f10c92d worker-1 Ready mcp-2,worker 5d9h v1.29.8+f10c92d
All nodes in the cluster should be in a Ready status and running the same version.
Check that there are no paused mcp resources in the cluster:
$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
Example output
MCP Paused --- ------ master false mcp-1 false mcp-2 false
Check that all cluster Operators are available:
$ oc get co
Example output
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.16.14 True False False 5d9h baremetal 4.16.14 True False False 5d9h cloud-controller-manager 4.16.14 True False False 5d10h cloud-credential 4.16.14 True False False 5d10h cluster-autoscaler 4.16.14 True False False 5d9h config-operator 4.16.14 True False False 5d9h console 4.16.14 True False False 5d9h control-plane-machine-set 4.16.14 True False False 5d9h csi-snapshot-controller 4.16.14 True False False 5d9h dns 4.16.14 True False False 5d9h etcd 4.16.14 True False False 5d9h image-registry 4.16.14 True False False 85m ingress 4.16.14 True False False 5d9h insights 4.16.14 True False False 5d9h kube-apiserver 4.16.14 True False False 5d9h kube-controller-manager 4.16.14 True False False 5d9h kube-scheduler 4.16.14 True False False 5d9h kube-storage-version-migrator 4.16.14 True False False 4h48m machine-api 4.16.14 True False False 5d9h machine-approver 4.16.14 True False False 5d9h machine-config 4.16.14 True False False 5d9h marketplace 4.16.14 True False False 5d9h monitoring 4.16.14 True False False 5d9h network 4.16.14 True False False 5d9h node-tuning 4.16.14 True False False 5d7h openshift-apiserver 4.16.14 True False False 5d9h openshift-controller-manager 4.16.14 True False False 5d9h openshift-samples 4.16.14 True False False 5h24m operator-lifecycle-manager 4.16.14 True False False 5d9h operator-lifecycle-manager-catalog 4.16.14 True False False 5d9h operator-lifecycle-manager-packageserver 4.16.14 True False False 5d9h service-ca 4.16.14 True False False 5d9h storage 4.16.14 True False False 5d9h
All cluster Operators should report True in the AVAILABLE column.
Check that all pods are healthy:
$ oc get po -A | grep -E -iv 'complete|running'
This should not return any pods.
Note: You might see a few pods still moving after the update. Watch this for a while to make sure all pods are cleared.
17.2. Troubleshooting and maintaining telco core CNF clusters
17.2.1. Troubleshooting and maintaining telco core CNF clusters
Troubleshooting and maintenance are weekly tasks that can be a challenge if you do not have the tools to reach your goal, whether you want to update a component or investigate an issue. Part of the challenge is knowing where and how to search for tools and answers.
To maintain and troubleshoot a bare-metal environment where high-bandwidth network throughput is required, see the following procedures.
This troubleshooting information is not a reference for configuring OpenShift Container Platform or developing Cloud-native Network Function (CNF) applications.
For information about developing CNF applications for telco, see Red Hat Best Practices for Kubernetes.
17.2.1.1. Cloud-native Network Functions
If you are starting to use OpenShift Container Platform for telecommunications Cloud-native Network Function (CNF) applications, learning about CNFs can help you understand the issues that you might encounter.
To learn more about CNFs and their evolution, see VNF and CNF, what’s the difference?.
17.2.1.2. Getting Support
If you experience difficulty with a procedure, visit the Red Hat Customer Portal. From the Customer Portal, you can find help in various ways:
- Search or browse through the Red Hat Knowledgebase of articles and solutions about Red Hat products.
- Submit a support case to Red Hat Support.
- Access other product documentation.
To identify issues with your deployment, you can use the debugging tool or check the health endpoint of your deployment. After you have debugged or obtained health information about your deployment, you can search the Red Hat Knowledgebase for a solution or file a support ticket.
17.2.1.2.1. About the Red Hat Knowledgebase
The Red Hat Knowledgebase provides rich content aimed at helping you make the most of Red Hat’s products and technologies. The Red Hat Knowledgebase consists of articles, product documentation, and videos outlining best practices on installing, configuring, and using Red Hat products. In addition, you can search for solutions to known issues, each providing concise root cause descriptions and remedial steps.
17.2.1.2.2. Searching the Red Hat Knowledgebase
In the event of an OpenShift Container Platform issue, you can perform an initial search to determine if a solution already exists within the Red Hat Knowledgebase.
Prerequisites
- You have a Red Hat Customer Portal account.
Procedure
- Log in to the Red Hat Customer Portal.
- Click Search.
In the search field, input keywords and strings relating to the problem, including:
- OpenShift Container Platform components (such as etcd)
- Related procedure (such as installation)
- Warnings, error messages, and other outputs related to explicit failures
- Press the Enter key.
- Optional: Select the OpenShift Container Platform product filter.
- Optional: Select the Documentation content type filter.
17.2.1.2.3. Submitting a support case
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
- You have a Red Hat Customer Portal account.
- You have a Red Hat Standard or Premium subscription.
Procedure
- Log in to the Customer Support page of the Red Hat Customer Portal.
- Click Get support.
On the Cases tab of the Customer Support page:
- Optional: Change the pre-filled account and owner details if needed.
- Select the appropriate category for your issue, such as Bug or Defect, and click Continue.
Enter the following information:
- In the Summary field, enter a concise but descriptive problem summary and further details about the symptoms being experienced, as well as your expectations.
- Select OpenShift Container Platform from the Product drop-down menu.
- Select 4.17 from the Version drop-down.
- Review the list of suggested Red Hat Knowledgebase solutions for a potential match against the problem that is being reported. If the suggested articles do not address the issue, click Continue.
- Review the updated list of suggested Red Hat Knowledgebase solutions for a potential match against the problem that is being reported. The list is refined as you provide more information during the case creation process. If the suggested articles do not address the issue, click Continue.
- Ensure that the account information presented is as expected, and if not, amend accordingly.
Check that the autofilled OpenShift Container Platform Cluster ID is correct. If it is not, manually obtain your cluster ID.
To manually obtain your cluster ID using the OpenShift Container Platform web console:
- Navigate to Home → Overview.
- Find the value in the Cluster ID field of the Details section.
Alternatively, it is possible to open a new support case through the OpenShift Container Platform web console and have your cluster ID autofilled.
- From the toolbar, navigate to (?) Help → Open Support Case.
- The Cluster ID value is autofilled.
To obtain your cluster ID using the OpenShift CLI (oc), run the following command:
$ oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}'
Complete the following questions where prompted and then click Continue:
- What are you experiencing? What are you expecting to happen?
- Define the value or impact to you or the business.
- Where are you experiencing this behavior? What environment?
- When does this behavior occur? Frequency? Repeatedly? At certain times?
- Upload relevant diagnostic data files and click Continue. It is recommended to include data gathered using the oc adm must-gather command as a starting point, plus any issue-specific data that is not collected by that command.
- Input relevant case management details and click Continue.
- Preview the case details and click Submit.
17.2.2. General troubleshooting
When you encounter a problem, the first step is to find the specific area where the issue is happening. To narrow down the potential problematic areas, complete one or more tasks:
- Query your cluster
- Check your pod logs
- Debug a pod
- Review events
17.2.2.1. Querying your cluster
Get information about your cluster so that you can more accurately find potential problems.
Procedure
Switch into a project by running the following command:
$ oc project <project_name>
Query your cluster version, cluster Operators, and nodes by running the following command:
$ oc get clusterversion,clusteroperator,node
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS clusterversion.config.openshift.io/version 4.16.11 True False 62d Cluster version is 4.16.11 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE clusteroperator.config.openshift.io/authentication 4.16.11 True False False 62d clusteroperator.config.openshift.io/baremetal 4.16.11 True False False 62d clusteroperator.config.openshift.io/cloud-controller-manager 4.16.11 True False False 62d clusteroperator.config.openshift.io/cloud-credential 4.16.11 True False False 62d clusteroperator.config.openshift.io/cluster-autoscaler 4.16.11 True False False 62d clusteroperator.config.openshift.io/config-operator 4.16.11 True False False 62d clusteroperator.config.openshift.io/console 4.16.11 True False False 62d clusteroperator.config.openshift.io/control-plane-machine-set 4.16.11 True False False 62d clusteroperator.config.openshift.io/csi-snapshot-controller 4.16.11 True False False 62d clusteroperator.config.openshift.io/dns 4.16.11 True False False 62d clusteroperator.config.openshift.io/etcd 4.16.11 True False False 62d clusteroperator.config.openshift.io/image-registry 4.16.11 True False False 55d clusteroperator.config.openshift.io/ingress 4.16.11 True False False 62d clusteroperator.config.openshift.io/insights 4.16.11 True False False 62d clusteroperator.config.openshift.io/kube-apiserver 4.16.11 True False False 62d clusteroperator.config.openshift.io/kube-controller-manager 4.16.11 True False False 62d clusteroperator.config.openshift.io/kube-scheduler 4.16.11 True False False 62d clusteroperator.config.openshift.io/kube-storage-version-migrator 4.16.11 True False False 62d clusteroperator.config.openshift.io/machine-api 4.16.11 True False False 62d clusteroperator.config.openshift.io/machine-approver 4.16.11 True False False 62d clusteroperator.config.openshift.io/machine-config 4.16.11 True False False 62d clusteroperator.config.openshift.io/marketplace 4.16.11 True False False 62d clusteroperator.config.openshift.io/monitoring 4.16.11 True False False 62d clusteroperator.config.openshift.io/network 4.16.11 True False False 62d clusteroperator.config.openshift.io/node-tuning 4.16.11 True False False 62d clusteroperator.config.openshift.io/openshift-apiserver 4.16.11 True False False 62d clusteroperator.config.openshift.io/openshift-controller-manager 4.16.11 True False False 62d clusteroperator.config.openshift.io/openshift-samples 4.16.11 True False False 35d clusteroperator.config.openshift.io/operator-lifecycle-manager 4.16.11 True False False 62d clusteroperator.config.openshift.io/operator-lifecycle-manager-catalog 4.16.11 True False False 62d clusteroperator.config.openshift.io/operator-lifecycle-manager-packageserver 4.16.11 True False False 62d clusteroperator.config.openshift.io/service-ca 4.16.11 True False False 62d clusteroperator.config.openshift.io/storage 4.16.11 True False False 62d NAME STATUS ROLES AGE VERSION node/ctrl-plane-0 Ready control-plane,master,worker 62d v1.29.7 node/ctrl-plane-1 Ready control-plane,master,worker 62d v1.29.7 node/ctrl-plane-2 Ready control-plane,master,worker 62d v1.29.7
For more information, see "oc get" and "Reviewing pod status".
Additional resources
17.2.2.2. Checking pod logs
Get logs from the pod so that you can review the logs for issues.
Procedure
List the pods by running the following command:
$ oc get pod
Example output
NAME READY STATUS RESTARTS AGE busybox-1 1/1 Running 168 (34m ago) 7d busybox-2 1/1 Running 119 (9m20s ago) 4d23h busybox-3 1/1 Running 168 (43m ago) 7d busybox-4 1/1 Running 168 (43m ago) 7d
Check pod log files by running the following command:
$ oc logs -n <namespace> busybox-1
For more information, see "oc logs", "Logging", and "Inspecting pod and container logs".
Additional resources
17.2.2.3. Describing a pod
Describing a pod gives you information about that pod to help with troubleshooting. The Events
section provides detailed information about the pod and the containers inside of it.
Procedure
Describe a pod by running the following command:
$ oc describe pod -n <namespace> busybox-1
Example output
Name: busybox-1 Namespace: busy Priority: 0 Service Account: default Node: worker-3/192.168.0.0 Start Time: Mon, 27 Nov 2023 14:41:25 -0500 Labels: app=busybox pod-template-hash=<hash> Annotations: k8s.ovn.org/pod-networks: … Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Pulled 41m (x170 over 7d1h) kubelet Container image "quay.io/quay/busybox:latest" already present on machine Normal Created 41m (x170 over 7d1h) kubelet Created container busybox Normal Started 41m (x170 over 7d1h) kubelet Started container busybox
For more information, see "oc describe".
Additional resources
17.2.2.4. Reviewing events
You can review the events in a given namespace to find potential issues.
Procedure
Check for events in your namespace by running the following command:
$ oc get events -n <namespace> --sort-by=".metadata.creationTimestamp" 1
1 Adding the --sort-by=".metadata.creationTimestamp" flag places the most recent events at the end of the output.
Optional: If the events within your specified namespace do not provide enough information, expand your query to all namespaces by running the following command:
$ oc get events -A --sort-by=".metadata.creationTimestamp" 1
1 The --sort-by=".metadata.creationTimestamp" flag places the most recent events at the end of the output.
To filter the results of all events from a cluster, you can use the grep command. For example, if you are looking for errors, the errors can appear in two different sections of the output: the TYPE or MESSAGE sections. With the grep command, you can search for keywords, such as error or failed.
For example, search for a message that contains warning or error by running the following command:
$ oc get events -A | grep -Ei "warning|error"
Example output
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE openshift 59s Warning FailedMount pod/openshift-1 MountVolume.SetUp failed for volume "v4-0-config-user-idp-0-file-data" : references non-existent secret key: test
Optional: To clean up the events and see only recurring events, you can delete the events in the relevant namespace by running the following command:
$ oc delete events -n <namespace> --all
For more information, see "Watching cluster events".
Additional resources
17.2.2.5. Connecting to a pod
You can directly connect to a currently running pod with the oc rsh
command, which provides you with a shell on that pod.
In pods that run a low-latency application, latency issues can occur when you run the oc rsh
command. Use the oc rsh
command only if you cannot connect to the node by using the oc debug
command.
Procedure
Connect to your pod by running the following command:
$ oc rsh -n <namespace> busybox-1
For more information, see "oc rsh" and "Accessing running pods".
Additional resources
17.2.2.6. Debugging a pod
In certain cases, you do not want to directly interact with your pod that is in production.
To avoid interfering with running traffic, you can use a secondary pod that is a copy of your original pod. The secondary pod uses the same components as that of the original pod but does not have running traffic.
Procedure
List the pods by running the following command:
$ oc get pod
Example output
NAME READY STATUS RESTARTS AGE busybox-1 1/1 Running 168 (34m ago) 7d busybox-2 1/1 Running 119 (9m20s ago) 4d23h busybox-3 1/1 Running 168 (43m ago) 7d busybox-4 1/1 Running 168 (43m ago) 7d
Debug a pod by running the following command:
$ oc debug -n <namespace> busybox-1
Example output
Starting pod/busybox-1-debug, command was: sleep 3600 Pod IP: 10.133.2.11
If you do not see a shell prompt, press Enter.
For more information, see "oc debug" and "Starting debug pods with root access".
Additional resources
17.2.2.7. Running a command on a pod
If you want to run a command or set of commands on a pod without directly logging into it, you can use the oc exec -it
command. You can interact with the pod quickly to get process or output information from the pod. A common use case is to run the oc exec -it
command inside a script to run the same command on multiple pods in a replica set or deployment.
In pods that run a low-latency application, the oc exec
command can cause latency issues.
Procedure
To run a command on a pod without logging into it, run the following command:
$ oc exec -it <pod> -- <command>
For more information, see "oc exec" and "Executing remote commands in containers".
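As a sketch of the scripted use case, the following loop runs one command on every pod that matches a label selector; the app=busybox label and the uptime command are examples only, and the -it flags are dropped because no interactive terminal is needed inside a script:
$ for pod in $(oc get pods -n <namespace> -l app=busybox -o name); do echo "== ${pod} =="; oc exec -n <namespace> "${pod}" -- uptime; done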
Additional resources
17.2.3. Cluster maintenance
In telco networks, you must pay more attention to certain configurations due to the nature of bare-metal deployments. You can troubleshoot more effectively by completing these tasks:
- Monitor for failed or failing hardware components
- Periodically check the status of the cluster Operators
For hardware monitoring, contact your hardware vendor to find the appropriate logging tool for your specific hardware.
17.2.3.1. Checking cluster Operators
Periodically check the status of your cluster Operators to find issues early.
Procedure
Check the status of the cluster Operators by running the following command:
$ oc get co
17.2.3.2. Watching for failed pods
To reduce troubleshooting time, regularly monitor for failed pods in your cluster.
Procedure
To watch for failed pods, run the following command:
$ oc get po -A | grep -Eiv 'complete|running'
17.2.4. Security
Implementing a robust cluster security profile is important for building resilient telco networks.
17.2.4.1. Authentication
Determine which identity providers are in your cluster. For more information about supported identity providers, see "Supported identity providers" in Authentication and authorization.
After you know which providers are configured, you can inspect the openshift-authentication
namespace to determine if there are potential issues.
Procedure
Check the events in the openshift-authentication namespace by running the following command:
$ oc get events -n openshift-authentication --sort-by='.metadata.creationTimestamp'
Check the pods in the openshift-authentication namespace by running the following command:
$ oc get pod -n openshift-authentication
Optional: If you need more information, check the logs of one of the running pods by running the following command:
$ oc logs -n openshift-authentication <pod_name>
Additional resources
17.2.5. Certificate maintenance
Certificate maintenance is required for continuous cluster authentication. As a cluster administrator, you must manually renew certain certificates, while others are automatically renewed by the cluster.
Learn about certificates in OpenShift Container Platform and how to maintain them by using the following resources:
17.2.5.1. Certificates manually managed by the administrator
The following certificates must be renewed by a cluster administrator:
- Proxy certificates
- User-provisioned certificates for the API server
17.2.5.1.1. Managing proxy certificates
Proxy certificates allow users to specify one or more custom certificate authority (CA) certificates that are used by platform components when making egress connections.
Certain CAs set expiration dates and you might need to renew these certificates every two years.
If you did not originally set the requested certificates, you can determine the certificate expiration in several ways. Most Cloud-native Network Functions (CNFs) use certificates that are not specifically designed for browser-based connectivity. Therefore, you need to pull the certificate from the ConfigMap
object of your deployment.
Procedure
To get the expiration date, run the following command against the certificate file:
$ openssl x509 -enddate -noout -in <cert_file_name>.pem
For more information about determining how and when to renew your proxy certificates, see "Proxy certificates" in Security and compliance.
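If the certificate is stored in a ConfigMap object rather than in a local file, you can pipe it directly into openssl. The following is a sketch; the ConfigMap name trusted-ca and the key ca-bundle.crt are examples only and depend on how your deployment stores the certificate. If the key contains a bundle with several certificates, openssl x509 reports only the first one:
$ oc get configmap trusted-ca -n <namespace> -o jsonpath='{.data.ca-bundle\.crt}' | openssl x509 -enddate -noout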
Additional resources
17.2.5.1.2. User-provisioned API server certificates
The API server is accessible by clients that are external to the cluster at api.<cluster_name>.<base_domain>
. You might want clients to access the API server at a different hostname or without the need to distribute the cluster-managed certificate authority (CA) certificates to the clients. You must set a custom default certificate to be used by the API server when serving content.
For more information, see "User-provided certificates for the API server" in Security and compliance
Additional resources
17.2.5.2. Certificates managed by the cluster
You only need to check cluster-managed certificates if you detect an issue in the logs. The following certificates are automatically managed by the cluster:
- Service CA certificates
- Node certificates
- Bootstrap certificates
- etcd certificates
- OLM certificates
- Machine Config Operator certificates
- Monitoring and cluster logging Operator component certificates
- Control plane certificates
- Ingress certificates
Additional resources
17.2.5.2.1. Certificates managed by etcd
The etcd certificates are used for encrypted communication between etcd member peers and for encrypted client traffic. The certificates are renewed automatically within the cluster provided that communication between all nodes and all services is current. Therefore, if your cluster might lose communication between components during a period of time that is close to the end of the etcd certificate lifetime, it is recommended to renew the certificates in advance. For example, communication can be lost during an upgrade because nodes reboot at different times.
You can manually renew etcd certificates by running the following command:
$ for each in $(oc get secret -n openshift-etcd | grep "kubernetes.io/tls" | grep -e \
  "etcd-peer\|etcd-serving" | awk '{print $1}'); do oc get secret $each -n openshift-etcd -o \
  jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -enddate; done
For more information about updating etcd certificates, see Checking etcd certificate expiry in OpenShift 4. For more information about etcd certificates, see "etcd certificates" in Security and compliance.
Additional resources
17.2.5.2.2. Node certificates
Node certificates are self-signed certificates, which means that they are signed by the cluster and they originate from an internal certificate authority (CA) that is generated by the bootstrap process.
After the cluster is installed, the cluster automatically renews the node certificates.
For more information, see "Node certificates" in Security and compliance.
Additional resources
17.2.5.2.3. Service CA certificates
The service-ca is an Operator that creates a self-signed certificate authority (CA) when an OpenShift Container Platform cluster is deployed. This allows users to add certificates to their deployments without manually creating them. Service CA certificates are self-signed certificates.
For more information, see "Service CA certificates" in Security and compliance.
Additional resources
17.2.6. Machine Config Operator
The Machine Config Operator provides useful information to cluster administrators and controls what is running directly on the bare-metal host.
The Machine Config Operator differentiates between groups of nodes in the cluster, allowing control plane nodes and worker nodes to run with different configurations. These groups of nodes, which run worker or application pods, are called MachineConfigPool (mcp) groups. The same machine config is applied to all nodes, or only to the nodes in one MCP in the cluster.
For more information about how and why to apply MCPs in a telco core cluster, see Applying MachineConfigPool labels to nodes before the update.
For more information about the Machine Config Operator, see Machine Config Operator.
17.2.6.1. Purpose of the Machine Config Operator
The Machine Config Operator (MCO) manages and applies configuration and updates of Red Hat Enterprise Linux CoreOS (RHCOS) and the container runtime, including everything between the kernel and the kubelet. Managing RHCOS is important because most telecommunications companies run on bare-metal hardware and use some sort of hardware accelerator or kernel modification. Applying machine configuration to RHCOS manually can cause problems because the MCO monitors each node and what is applied to it.
You must consider these minor components and how the MCO can help you manage your clusters effectively.
You must use the MCO to perform all changes on worker or control plane nodes. Do not manually make changes to RHCOS or node files.
17.2.6.2. Applying several machine config files at the same time
When you need to change the machine config for a group of nodes in the cluster, also known as a machine config pool (MCP), sometimes the changes must be applied with several different machine config files. The nodes need to restart for a machine config file to be applied. After each machine config file is applied to the cluster, all nodes that are affected by that machine config file restart.
To prevent the nodes from restarting for each machine config file, you can apply all of the changes at the same time by pausing each MCP that is updated by the new machine config file.
Procedure
Pause the affected MCP by running the following command:
$ oc patch mcp/<mcp_name> --type merge --patch '{"spec":{"paused":true}}'
After you apply all machine config changes to the cluster, run the following command:
$ oc patch mcp/<mcp_name> --type merge --patch '{"spec":{"paused":false}}'
This allows the nodes in your MCP to reboot into the new configurations.
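For reference, the following is a minimal sketch of one such machine config file. The pool name mcp-1, the machine config name, and the sysctl file contents are illustrative only, and the role label must match what the machineConfigSelector of your MCP expects:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-mcp-1-example-sysctl
  labels:
    machineconfiguration.openshift.io/role: mcp-1
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/sysctl.d/99-example.conf
          mode: 0644
          overwrite: true
          contents:
            source: data:,net.ipv4.ip_forward%3D1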
17.2.7. Bare-metal node maintenance
You can connect to a node for general troubleshooting. However, in some cases, you need to perform troubleshooting or maintenance tasks on certain hardware components. This section discusses topics that you need for that hardware maintenance.
17.2.7.1. Connecting to a bare-metal node in your cluster
You can connect to bare-metal cluster nodes for general maintenance tasks.
Configuring the cluster node from the host operating system is not recommended or supported.
To troubleshoot your nodes, you can do the following tasks:
- Retrieve logs from node
- Use debugging
- Use SSH to connect to the node
Use SSH only if you cannot connect to the node with the oc debug
command.
Procedure
Retrieve the logs from a node by running the following command:
$ oc adm node-logs <node_name> -u crio
Use debugging by running the following command:
$ oc debug node/<node_name>
Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:
# chroot /host
Output
You are now logged in as root on the node
Optional: Use SSH to connect to the node by running the following command:
$ ssh core@<node_name>
17.2.7.2. Moving applications to pods within the cluster
For scheduled hardware maintenance, you need to consider how to move your application pods to other nodes within the cluster without affecting the pod workload.
Procedure
Mark the node as unschedulable by running the following command:
$ oc adm cordon <node_name>
When the node is unschedulable, no pods can be scheduled on the node. For more information, see "Working with nodes".
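Cordoning only prevents new pods from being scheduled on the node. To move the running application pods off the node before the maintenance, you can also drain it. This is a sketch; the flags shown are typical, but adjust them to your workload. The evictions respect pod disruption budgets, so the drain waits or fails if removing a pod would violate a budget:
$ oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data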
When moving CNF applications, you might need to verify ahead of time that there are enough additional worker nodes in the cluster due to anti-affinity and pod disruption budget.
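As a reference for that check, the following is a minimal sketch of a PodDisruptionBudget that keeps at least two replicas of a hypothetical CNF deployment available during drains; the name, namespace, and app: example-cnf selector are examples only:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-cnf-pdb
  namespace: <namespace>
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example-cnf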
Additional resources
17.2.7.3. DIMM memory replacement
Dual in-line memory module (DIMM) problems sometimes only appear after a server reboots. You can check the log files for these problems.
When you perform a standard reboot and the server does not start, you can see a message in the console that there is a faulty DIMM memory. In that case, you can acknowledge the faulty DIMM and continue rebooting if the remaining memory is sufficient. Then, you can schedule a maintenance window to replace the faulty DIMM.
Sometimes, a message in the event logs indicates a bad memory module. In these cases, you can schedule the memory replacement before the server is rebooted.
Additional resources
17.2.7.4. Disk replacement
If you do not have disk redundancy configured on your node through hardware or software redundant array of independent disks (RAID), you need to check the following:
- Does the disk contain running pod images?
- Does the disk contain persistent data for pods?
For more information, see "OpenShift Container Platform storage overview" in Storage.
17.2.7.5. Cluster network card replacement
When you replace a network card, the MAC address changes. The MAC address can be part of the DHCP or SR-IOV Operator configuration, router configuration, firewall rules, or application Cloud-native Network Function (CNF) configuration. Before you bring a node back online after replacing a network card, you must verify that these configurations are up-to-date.
If you do not have specific procedures for MAC address changes within the network, contact your network administrator or network hardware vendor.
17.3. Observability
17.3.1. Observability in OpenShift Container Platform
OpenShift Container Platform generates a large amount of data, such as performance metrics and logs from both the platform and the workloads running on it. As an administrator, you can use various tools to collect and analyze all the data available. What follows is an outline of best practices for system engineers, architects, and administrators configuring the observability stack.
Unless explicitly stated, the material in this document refers to both Edge and Core deployments.
17.3.1.1. Understanding the monitoring stack
The monitoring stack uses the following components:
- Prometheus collects and analyzes metrics from OpenShift Container Platform components and from workloads, if configured to do so.
- Alertmanager is a component of Prometheus that handles routing, grouping, and silencing of alerts.
- Thanos handles long term storage of metrics.
Figure 17.2. OpenShift Container Platform monitoring architecture

For a single-node OpenShift cluster, you should disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.
Additional resources
17.3.1.2. Key performance metrics
Depending on your system, there can be hundreds of available measurements.
Here are some key metrics that you should pay attention to:
- etcd response times
- Pod restarts and scheduling
- Resource usage
- OVN health
- Overall cluster operator health
A good rule to follow is that if you decide that a metric is important, there should be an alert for it.
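For example, the following PrometheusRule is a sketch of an alert that fires when any cluster Operator reports the Degraded condition; the rule name, namespace, duration, and severity are illustrative, and a similar alert might already ship with your platform monitoring configuration:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-cluster-operator-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: example-cluster-operator.rules
      rules:
        - alert: ClusterOperatorDegradedExample
          expr: cluster_operator_conditions{condition="Degraded"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Cluster Operator {{ $labels.name }} has been Degraded for 10 minutes."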
You can check the available metrics by running the following command:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'
17.3.1.2.1. Example queries in PromQL
The following tables show some queries that you can explore in the metrics query browser using the OpenShift Container Platform console.
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser. You can get the OpenShift Console FQDN by running the following command:
$ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].host}'
[Table: example PromQL queries (Metric | Query), including a Combined query.]
[Table: example PromQL queries (Metric | Query), including Leader elections and Network latency.]
[Table: example PromQL queries for cluster Operators (Metric | Query), including Degraded operators and Total degraded operators per cluster.]
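The following expressions are illustrative examples of the kinds of queries that you can run in the query browser; they are not taken from the tables above, and you should adjust label selectors and time windows for your environment:
# 99th percentile etcd WAL fsync latency per instance
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le, instance))
# 99th percentile API server request latency, excluding long-running WATCH requests
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le, verb))
# Container restarts over the last hour, per namespace
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace)
# Cluster Operators currently reporting the Degraded condition
count(cluster_operator_conditions{condition="Degraded"} == 1)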
17.3.1.2.2. Recommendations for storage of metrics
Out of the box, Prometheus does not back up saved metrics with persistent storage. If you restart the Prometheus pods, all metrics data are lost. You should configure the monitoring stack to use the back-end storage that is available on the platform. To meet the high IO demands of Prometheus you should use local storage.
For Telco core clusters, you can use the Local Storage Operator for persistent storage for Prometheus.
Red Hat OpenShift Data Foundation (ODF), which deploys a Ceph cluster for block, file, and object storage, is also a suitable candidate for a Telco core cluster.
To keep system resource requirements low on a RAN single-node OpenShift or far edge cluster, you should not provision backend storage for the monitoring stack. Such clusters forward all metrics to the hub cluster where you can provision a third party monitoring platform.
17.3.1.3. Monitoring the edge
Single-node OpenShift at the edge keeps the footprint of the platform components to a minimum. The following procedure is an example of how you can configure a single-node OpenShift node with a small monitoring footprint.
Prerequisites
- For environments that use Red Hat Advanced Cluster Management (RHACM), you have enabled the Observability service.
- The hub cluster is running Red Hat OpenShift Data Foundation (ODF).
Procedure
Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false
    telemeterClient:
      enabled: false
    prometheusK8s:
      retention: 24h
On the single-node OpenShift, apply the ConfigMap CR by running the following command:

$ oc apply -f monitoringConfigMap.yaml
Create a Namespace CR, and save it as monitoringNamespace.yaml, as in the following example:

apiVersion: v1
kind: Namespace
metadata:
  name: open-cluster-management-observability
On the hub cluster, apply the Namespace CR by running the following command:

$ oc apply -f monitoringNamespace.yaml
Create an ObjectBucketClaim CR, and save it as monitoringObjectBucketClaim.yaml, as in the following example:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: multi-cloud-observability
  namespace: open-cluster-management-observability
spec:
  storageClassName: openshift-storage.noobaa.io
  generateBucketName: acm-multi
On the hub cluster, apply the ObjectBucketClaim CR by running the following command:

$ oc apply -f monitoringObjectBucketClaim.yaml
Create a Secret CR, and save it as monitoringSecret.yaml, as in the following example:

apiVersion: v1
kind: Secret
metadata:
  name: multiclusterhub-operator-pull-secret
  namespace: open-cluster-management-observability
stringData:
  .dockerconfigjson: 'PULL_SECRET'
On the hub cluster, apply the Secret CR by running the following command:

$ oc apply -f monitoringSecret.yaml
Get the keys for the NooBaa service and the backend bucket name from the hub cluster by running the following commands:
$ NOOBAA_ACCESS_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
$ NOOBAA_SECRET_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')
$ OBJECT_BUCKET=$(oc get objectbucketclaim -n open-cluster-management-observability multi-cloud-observability -o json | jq -r .spec.bucketName)
Create a Secret CR for bucket storage and save it as monitoringBucketSecret.yaml, as in the following example:

apiVersion: v1
kind: Secret
metadata:
  name: thanos-object-storage
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: ${OBJECT_BUCKET}
      endpoint: s3.openshift-storage.svc
      insecure: true
      access_key: ${NOOBAA_ACCESS_KEY}
      secret_key: ${NOOBAA_SECRET_KEY}
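The stringData section references the shell variables that you exported in the previous step. If your workflow does not expand those variables automatically, one option is to render the manifest with envsubst before applying it in the next step. This is a sketch, and the rendered file name is illustrative:

$ envsubst < monitoringBucketSecret.yaml > monitoringBucketSecret-rendered.yaml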
On the hub cluster, apply the Secret CR by running the following command:

$ oc apply -f monitoringBucketSecret.yaml
Create the MultiClusterObservability CR and save it as monitoringMultiClusterObservability.yaml, as in the following example:

apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  advanced:
    retentionConfig:
      blockDuration: 2h
      deleteDelay: 48h
      retentionInLocal: 24h
      retentionResolutionRaw: 3d
  enableDownsampling: false
  observabilityAddonSpec:
    enableMetrics: true
    interval: 300
  storageConfig:
    alertmanagerStorageSize: 10Gi
    compactStorageSize: 100Gi
    metricObjectStorage:
      key: thanos.yaml
      name: thanos-object-storage
    receiveStorageSize: 25Gi
    ruleStorageSize: 10Gi
    storeStorageSize: 25Gi
On the hub cluster, apply the MultiClusterObservability CR by running the following command:

$ oc apply -f monitoringMultiClusterObservability.yaml
Verification
Check the routes and pods in the namespace to validate that the services have deployed on the hub cluster by running the following command:
$ oc get routes,pods -n open-cluster-management-observability
Example output
NAME                                         HOST/PORT                                                                    PATH      SERVICES                          PORT          TERMINATION          WILDCARD
route.route.openshift.io/alertmanager        alertmanager-open-cluster-management-observability.cloud.example.com        /api/v2   alertmanager                      oauth-proxy   reencrypt/Redirect   None
route.route.openshift.io/grafana             grafana-open-cluster-management-observability.cloud.example.com                       grafana                           oauth-proxy   reencrypt/Redirect   None 1
route.route.openshift.io/observatorium-api   observatorium-api-open-cluster-management-observability.cloud.example.com             observability-observatorium-api   public        passthrough/None     None
route.route.openshift.io/rbac-query-proxy    rbac-query-proxy-open-cluster-management-observability.cloud.example.com              rbac-query-proxy                  https         reencrypt/Redirect   None

NAME                                       READY   STATUS    RESTARTS   AGE
pod/observability-alertmanager-0               3/3     Running   0          1d
pod/observability-alertmanager-1               3/3     Running   0          1d
pod/observability-alertmanager-2               3/3     Running   0          1d
pod/observability-grafana-685b47bb47-dq4cw     3/3     Running   0          1d
<...snip…>
pod/observability-thanos-store-shard-0-0       1/1     Running   0          1d
pod/observability-thanos-store-shard-1-0       1/1     Running   0          1d
pod/observability-thanos-store-shard-2-0       1/1     Running   0          1d

1 - A dashboard is accessible at the Grafana route listed. You can use this to view metrics across all managed clusters.
For more information on observability in Red Hat Advanced Cluster Management, see Observability.
17.3.1.4. Alerting
OpenShift Container Platform includes a large number of alert rules, which can change from release to release.
17.3.1.4.1. Viewing default alerts
Use the following procedure to review all of the alert rules in a cluster.
Procedure
To review all the alert rules in a cluster, you can run the following command:
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml
Rules can include a description and provide a link to additional information and mitigation steps. For example, this is the rule for etcdHighFsyncDurations:

- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  expr: |
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
    > 1
  for: 10m
  labels:
    severity: critical
17.3.1.4.2. Alert notifications
You can view alerts in the OpenShift Container Platform console; however, an administrator should configure an external receiver to forward alerts to. OpenShift Container Platform supports the following receiver types (a sketch of a webhook receiver configuration follows the list):
- PagerDuty: a third-party incident response platform
- Webhook: an arbitrary API endpoint that receives an alert via a POST request and can take any necessary action
- Email: sends an email to a designated address
- Slack: sends a notification to either a Slack channel or an individual user
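In OpenShift Container Platform, the Alertmanager configuration is stored as alertmanager.yaml in the alertmanager-main secret in the openshift-monitoring namespace. The following is a minimal sketch of a configuration that routes critical alerts to a webhook receiver; the receiver name and URL are assumptions for illustration only:

global:
  resolve_timeout: 5m
route:
  receiver: default
  routes:
  - matchers:
    - severity="critical"
    receiver: telco-noc-webhook     # hypothetical receiver name
receivers:
- name: default
- name: telco-noc-webhook
  webhook_configs:
  - url: https://noc.example.com/alerts   # hypothetical webhook endpoint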
Additional resources
17.3.1.5. Workload monitoring
By default, OpenShift Container Platform does not collect metrics for application workloads. You can configure a cluster to collect workload metrics.
Prerequisites
- You have defined endpoints to gather workload metrics on the cluster.
Procedure
Create a ConfigMap CR and save it as monitoringConfigMap.yaml, as in the following example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true 1
1 - Set to true to enable workload monitoring.
Apply the ConfigMap CR by running the following command:

$ oc apply -f monitoringConfigMap.yaml
Create a ServiceMonitor CR, and save it as monitoringServiceMonitor.yaml, as in the following example:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: ui
  name: myapp
  namespace: myns
spec:
  endpoints: 1
  - interval: 30s
    port: ui-http
    scheme: http
    path: /healthz 2
  selector:
    matchLabels:
      app: ui
Apply the ServiceMonitor CR by running the following command:

$ oc apply -f monitoringServiceMonitor.yaml
Prometheus scrapes the /metrics path by default; however, you can define a custom path, as shown by the path: /healthz setting in the example. It is up to the application vendor to expose this endpoint for scraping, with the metrics that they deem relevant.
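The ServiceMonitor in the example selects a Service by label and named port. The following Service manifest is a hypothetical companion that would satisfy that selector; the app: ui label, ui-http port name, myns namespace, and port numbers mirror the example above and are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myns
  labels:
    app: ui
spec:
  selector:
    app: ui
  ports:
  - name: ui-http        # must match the port name referenced by the ServiceMonitor
    port: 8080
    targetPort: 8080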
17.3.1.5.1. Creating a workload alert
You can enable alerts for user workloads on a cluster.
Procedure
Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true 1
# ...
1 - Set to true to enable workload monitoring.
Apply the ConfigMap CR by running the following command:

$ oc apply -f monitoringConfigMap.yaml
Create a YAML file for alerting rules, monitoringAlertRule.yaml, as in the following example:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alert
  namespace: myns
spec:
  groups:
  - name: example
    rules:
    - alert: InternalErrorsAlert
      expr: flask_http_request_total{status="500"} > 0
# ...
Apply the alert rule by running the following command:
$ oc apply -f monitoringAlertRule.yaml
17.4. Security
17.4.1. Security basics
Security is a critical component of telecommunications deployments on OpenShift Container Platform, particularly when running cloud-native network functions (CNFs).
You can enhance security for high-bandwidth network deployments in telecommunications (telco) environments by following key security considerations. By implementing these standards and best practices, you can strengthen security in telco-specific use cases.
17.4.1.1. RBAC overview
Role-based access control (RBAC) objects determine whether a user is allowed to perform a given action within a project.
Cluster administrators can use the cluster roles and bindings to control who has various access levels to OpenShift Container Platform itself and all projects.
Developers can use local roles and bindings to control who has access to their projects. Note that authorization is a separate step from authentication, which is more about determining the identity of who is taking the action.
Authorization is managed using the following authorization objects:
- Rules
- Are sets of permitted actions on specific objects. For example, a rule can determine whether a user or service account can create pods. Each rule specifies an API resource, the resource within that API, and the allowed action.
- Roles
- Are collections of rules that define what actions users or groups can perform. You can associate or bind rules to multiple users or groups. A role can contain one or more rules that specify the actions and resources allowed for that role.
Roles are categorized into the following types:
- Cluster roles: You can define cluster roles at the cluster level. They are not tied to a single namespace. They can apply across all namespaces or specific namespaces when you bind them to users, groups, or service accounts.
- Project roles: You can create project roles within a specific namespace, and they only apply to that namespace. You can assign permissions to specific users to create roles and role bindings within their namespace, ensuring they do not affect other namespaces.
- Bindings
- Are associations between users or groups and a role. You can create a role binding to connect the rules in a role to a specific user ID or group. This brings together the role and the user or group, defining what actions they can perform.
Note: You can bind more than one role to a user or group.
For more information on RBAC, see "Using RBAC to define and apply permissions".
Operational RBAC considerations
To reduce operational overhead, it is important to manage access through groups rather than handling individual user IDs across multiple clusters. By managing groups at an organizational level, you can streamline access control and simplify administration across your organization.
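As a minimal sketch of group-based access management with the oc CLI, the following commands create a group, add users, and bind a cluster role to the group; the group name and user IDs are illustrative:

$ oc adm groups new telco-platform-admins
$ oc adm groups add-users telco-platform-admins user1 user2
$ oc adm policy add-cluster-role-to-group cluster-admin telco-platform-admins

Managed this way, removing a user from the group revokes the access granted through the group binding, without touching individual role bindings on each cluster.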
Additional resources
17.4.1.2. Service accounts overview
A service account is an OpenShift Container Platform account that allows a component to directly access the API. Service accounts are API objects that exist within each project. Service accounts provide a flexible way to control API access without sharing a regular user’s credentials.
You can use service accounts to apply role-based access control (RBAC) to pods. By assigning service accounts to workloads, such as pods and deployments, you can grant additional permissions, such as pulling from different registries. This also allows you to assign lower privileges to service accounts, reducing the security footprint of the pods that run under them.
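As a minimal sketch, the following commands create a service account and grant it a limited role; the names and namespace are illustrative:

$ oc create serviceaccount cnf-workload -n myns
$ oc adm policy add-role-to-user view -z cnf-workload -n myns

A pod or deployment then runs under that identity by setting serviceAccountName: cnf-workload in its spec.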
For more information about service accounts, see "Understanding and creating service accounts".
Additional resources
17.4.1.3. Identity provider configuration
Configuring an identity provider is the first step in setting up users on the cluster. You can manage groups at the organizational level by using an identity provider.
The identity provider can pull in specific user groups that are maintained at the organizational level, rather than the cluster level. This allows you to add and remove users from groups that follow your organization’s established practices.
You must set up a cron job that runs frequently to pull any group changes into the cluster; see the group synchronization sketch at the end of this section.
You can use an identity provider to manage access levels for specific groups within your organization. For example, you can perform the following actions to manage access levels:
- Assign the cluster-admin role to teams that require cluster-level privileges.
- Grant application administrators specific privileges to manage only their respective projects.
- Provide operational teams with view access across the cluster to enable monitoring without allowing modifications.
For information about configuring an identity provider, see "Understanding identity provider configuration".
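As noted earlier, group membership that is maintained in the identity provider can be pulled into the cluster on a schedule. A minimal sketch using LDAP group synchronization follows; the sync configuration file name is illustrative, and the command is typically run from a cron job:

$ oc adm groups sync --sync-config=ldap-sync-config.yaml --confirm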
Additional resources
17.4.1.4. Replacing the kubeadmin user with a cluster-admin user
The kubeadmin user with cluster-admin privileges is created on every cluster by default. To enhance cluster security, you can replace the kubeadmin user with a cluster-admin user and then disable or remove the kubeadmin user.
Prerequisites
- You have created a user with cluster-admin privileges.
- You have installed the OpenShift CLI (oc).
- You have administrative access to a virtual vault for secure storage.
Procedure
- Create an emergency cluster-admin user by using the htpasswd identity provider. For more information, see "About htpasswd authentication". A sketch of the htpasswd flow is shown after these steps.
- Assign cluster-admin privileges to the new user by running the following command:

$ oc adm policy add-cluster-role-to-user cluster-admin <emergency_user>
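A minimal sketch of the htpasswd flow referenced in the first step follows; the file name, secret name, user, and password are illustrative, and the OAuth custom resource must also reference the secret as described in "About htpasswd authentication":

$ htpasswd -c -B -b users.htpasswd <emergency_user> <password>
$ oc create secret generic htpass-secret --from-file=htpasswd=users.htpasswd -n openshift-config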
Verify the emergency user access:
- Log in to the cluster using the new emergency user.
Confirm that you are logged in as the emergency user by running the following command:

$ oc whoami
Ensure the output shows the emergency user’s ID.
Store the password or authentication key for the emergency user securely in a virtual vault.
Note: Follow the best practices of your organization for securing sensitive credentials.
Disable or remove the kubeadmin user to reduce security risks by running the following command:

$ oc delete secrets kubeadmin -n kube-system
Additional resources
17.4.1.5. Security considerations for telco CNFs
Telco workloads handle vast amounts of sensitive data and demand high reliability. A single security vulnerability can lead to broader cluster-wide compromises. With numerous components running on a single-node OpenShift cluster, each component must be secured to prevent any breach from escalating. Ensuring security across the entire infrastructure, including all components, is essential to maintaining the integrity of the telco network and avoiding vulnerabilities.
The following key security features are essential for telco:
- Security Context Constraints (SCCs): Provide granular control over pod security in the OpenShift clusters.
- Pod Security Admission (PSA): Kubernetes-native pod security controls.
- Encryption: Ensures data confidentiality in high-throughput network environments.
17.4.1.6. Advancement of pod security in Kubernetes and OpenShift Container Platform
Kubernetes initially had limited pod security. When OpenShift Container Platform integrated Kubernetes, Red Hat added pod security through Security Context Constraints (SCCs). In Kubernetes version 1.3, PodSecurityPolicy (PSP) was introduced as a similar feature. Pod Security Admission (PSA) was later introduced as the replacement for PSP, which was deprecated in Kubernetes version 1.21 and removed in version 1.25.
PSA also became available in OpenShift Container Platform version 4.11. While PSA improves pod security, it lacks some features provided by SCCs that are still necessary for telco use cases. Therefore, OpenShift Container Platform continues to support both PSA and SCCs.
17.4.1.7. Key areas for CNF deployment
The cloud-native network function (CNF) deployment contains the following key areas:
- Core
- The first deployments of CNFs occurred in the core of the wireless network. Deploying CNFs in the core typically means racks of servers placed in central offices or data centers. These servers are connected to both the internet and the Radio Access Network (RAN), but they are often behind multiple security firewalls or sometimes disconnected from the internet altogether. This type of setup is called an offline or disconnected cluster.
- RAN
- After CNFs were successfully tested in the core network and found to be effective, they were considered for deployment in the Radio Access Network (RAN). Deploying CNFs in RAN requires a large number of servers (up to 100,000 in a large deployment). These servers are located near cellular towers and typically run as single-node OpenShift clusters, with the need for high scalability.
17.4.1.8. Telco-specific infrastructure
- Hardware requirements
- In telco networks, clusters are primarily built on bare-metal hardware. This means that the operating system, Red Hat Enterprise Linux CoreOS (RHCOS), is installed directly on the physical machines, without using virtual machines. This reduces network connectivity complexity, minimizes latency, and optimizes CPU usage for applications.
- Network requirements
- Telco networks require much higher bandwidth compared to standard IT networks. Telco networks commonly use dual-port 25 Gbps connections or 100 Gbps Network Interface Cards (NICs) to handle massive data throughput. Security is critical, requiring encrypted connections and secure endpoints to protect sensitive personal data.
17.4.1.9. Lifecycle management
Upgrades are critical for security. When a vulnerability is discovered, it is patched in the latest z-stream release. This fix is then backported to each lower supported y-stream release until all supported versions are patched. Releases that are no longer supported do not receive patches. Therefore, it is important to upgrade OpenShift Container Platform clusters regularly so that they stay within a supported release and remain protected against vulnerabilities.
For more information about lifecycle management and upgrades, see "Upgrading a telco core CNF cluster".
Additional resources
17.4.1.10. Evolution of Network Functions to CNFs
Network Functions (NFs) began as Physical Network Functions (PNFs), which were purpose-built hardware devices operating independently. Over time, PNFs evolved into Virtual Network Functions (VNFs), which virtualized their capabilities while controlling resources like CPU, memory, storage, and network.
As technology advanced further, VNFs transitioned to cloud-native network functions (CNFs). CNFs run in lightweight, secure, and scalable containers. They enforce stringent restrictions, including non-root execution and minimal host interference, to enhance security and performance.
PNFs had unrestricted root access to operate independently without interference. With the shift to VNFs, resource usage was controlled, but processes could still run as root within their virtual machines. In contrast, CNFs restrict root access and limit container capabilities to prevent interference with other containers or the host operating system.
The main challenges in migrating to CNFs are as follows:
- Breaking down monolithic network functions into smaller, containerized processes.
- Adhering to cloud-native principles, such as non-root execution and isolation, while maintaining telco-grade performance and reliability.
17.4.2. Host security
17.4.2.1. Red Hat Enterprise Linux CoreOS (RHCOS)
Red Hat Enterprise Linux CoreOS (RHCOS) is different from Red Hat Enterprise Linux (RHEL) in key areas. For more information, see "About RHCOS".
From a telco perspective, a major distinction is the control of rpm-ostree, which is updated through the Machine Config Operator.
RHCOS follows the same immutable design used for pods in OpenShift Container Platform. This ensures that the operating system remains consistent across the cluster. For information about RHCOS architecture, see "Red Hat Enterprise Linux CoreOS (RHCOS)".
To manage hosts effectively while maintaining security, avoid direct access whenever possible. Instead, you can use the following methods for host management:
- Debug pod
- Direct SSH
- Console access
Review the following RHCOS security mechanisms that are integral to maintaining host security:
- Linux namespaces
- Provide isolation for processes and resources. Each container keeps its processes and files within its own namespace. If a user escapes from the container namespace, they could gain access to the host operating system, potentially compromising security.
- Security-Enhanced Linux (SELinux)
- Enforces mandatory access controls to restrict access to files and directories by processes. It adds an extra layer of security by preventing unauthorized access to files if a process tries to break its confinement.
SELinux follows the security policy of denying everything unless explicitly allowed. If a process attempts to modify or access a file without permission, SELinux denies access. For more information, see Introduction to SELinux.
- Linux capabilities
- Assign specific privileges to processes at a granular level, minimizing the need for full root permissions. For more information, see "Linux capabilities".
- Control groups (cgroups)
- Allocate and manage system resources, such as CPU and memory, for processes and containers, ensuring efficient usage. There are two versions of cgroups; as of OpenShift Container Platform 4.16, cgroup v2 is configured by default. A quick way to check the active cgroup version on a node is shown after this list.
- CRI-O
- Serves as a lightweight container runtime that enforces security boundaries and manages container workloads.
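For example, you can confirm which cgroup version a node is running from a debug pod; the node name is a placeholder, and output of cgroup2fs indicates cgroup v2, while tmpfs indicates cgroup v1:

$ oc debug node/<node_name> -- chroot /host stat -fc %T /sys/fs/cgroup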
Additional resources
17.4.2.2. Command-line host access
Direct access to a host must be restricted to avoid modifying the host or accessing pods that should not be accessed. For users who need direct access to a host, it is recommended to use an external authenticator, like SSSD with LDAP, to manage access. This helps maintain consistency across the cluster through the Machine Config Operator.
Do not configure direct access to the root ID on any OpenShift Container Platform cluster server.
You can connect to a node in the cluster using the following methods:
- Using a debug pod
This is the recommended method to access a node. To debug or connect to a node, run the following command:
$ oc debug node/<worker_node_name>
After connecting to the node, run the following command to get access to the root file system:
# chroot /host
This gives you root access within a debug pod on the node. For more information, see "Starting debug pods with root access".
- Direct SSH
Avoid using the root user. Instead, use the core user ID (or your own ID). To connect to the node using SSH, run the following command:
$ ssh core@<worker_node_name>
Important: The core user ID is initially given sudo privileges within the cluster.
If you cannot connect to a node using SSH, see How to connect to OpenShift Container Platform 4.x Cluster nodes using SSH bastion pod to add your SSH key to the core user.
After connecting to the node using SSH, run the following command to get access to the root shell:
$ sudo -i
- Console Access
Ensure that consoles are secure. Do not allow direct login with the root ID; instead, use individual IDs.
Note: Follow the best practices of your organization for securing console access.
Additional resources
17.4.2.3. Linux capabilities
Linux capabilities define the actions a process can perform on the host system. By default, pods are granted several capabilities unless security measures are applied. These default capabilities are as follows:
- CHOWN
- DAC_OVERRIDE
- FSETID
- FOWNER
- SETGID
- SETUID
- SETPCAP
- NET_BIND_SERVICE
- KILL
You can modify the capabilities that a pod can receive by configuring Security Context Constraints (SCCs).
You must not assign the following capabilities to a pod:
- SYS_ADMIN: A powerful capability that grants elevated privileges. Allowing this capability can break security boundaries and pose a significant security risk.
- NET_ADMIN: Allows control over networking, like SR-IOV ports, but can be replaced with alternative solutions in modern setups.
For more information about Linux capabilities, see Linux capabilities man page.
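A minimal sketch of a container security context that follows this guidance is shown below; the pod name, namespace, and image are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: cnf-example
  namespace: myns
spec:
  containers:
  - name: app
    image: registry.example.com/cnf/app:latest   # illustrative image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE   # request only the specific capability that is needed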
17.4.3. Security context constraints
Similar to the way that RBAC resources control user access, administrators can use security context constraints (SCCs) to control permissions for pods. These permissions determine the actions that a pod can perform and what resources it can access. You can use SCCs to define a set of conditions that a pod must run with in order to be accepted into the system.
Security context constraints allow an administrator to control the following security constraints:
- Whether a pod can run privileged containers with the allowPrivilegedContainer flag
- Whether a pod is constrained with the allowPrivilegeEscalation flag
- The capabilities that a container can request
- The use of host directories as volumes
- The SELinux context of the container
- The container user ID
- The use of host namespaces and networking
- The allocation of an FSGroup that owns the pod volumes
- The configuration of allowable supplemental groups
- Whether a container requires write access to its root file system
- The usage of volume types
- The configuration of allowable seccomp profiles
Default SCCs are created during installation and when you install some Operators or other components. As a cluster administrator, you can also create your own SCCs by using the OpenShift CLI (oc).
For information about default security context constraints, see Default security context constraints.
Do not modify the default SCCs. Customizing the default SCCs can lead to issues when some of the platform pods deploy or OpenShift Container Platform is upgraded. Additionally, the default SCC values are reset to the defaults during some cluster upgrades, which discards all customizations to those SCCs.
Instead of modifying the default SCCs, create and modify your own SCCs as needed. For detailed steps, see Creating security context constraints.
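A minimal sketch of a custom SCC is shown below; the SCC name is illustrative, and the settings roughly mirror the restrictive defaults while allowing only the NET_BIND_SERVICE capability. Adjust the fields to the needs of your workload rather than treating this as a definitive policy:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: cnf-scc                      # hypothetical custom SCC name
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
allowedCapabilities:
- NET_BIND_SERVICE
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
readOnlyRootFilesystem: false
seccompProfiles:
- runtime/default
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
users: []
groups: []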
You can use the following basic SCCs:
- restricted
- restricted-v2
The restricted-v2 SCC is the most restrictive SCC provided by a new installation and is used by default for authenticated users. It aligns with Pod Security Admission (PSA) restrictions and improves security, as the original restricted SCC is less restrictive. It also helps transition from the original SCCs to v2 across multiple releases. Eventually, the original SCCs get deprecated. Therefore, it is recommended to use the restricted-v2 SCC.
You can examine the restricted-v2 SCC by running the following command:
$ oc describe scc restricted-v2
Example output
Name:                           restricted-v2
Priority:                       <none>
Access:
  Users:                        <none>
  Groups:                       <none>
Settings:
  Allow Privileged:             false
  Allow Privilege Escalation:   false
  Default Add Capabilities:     <none>
  Required Drop Capabilities:   ALL
  Allowed Capabilities:         NET_BIND_SERVICE
  Allowed Seccomp Profiles:     runtime/default
  Allowed Volume Types:         configMap,downwardAPI,emptyDir,ephemeral,persistentVolumeClaim,projected,secret
  Allowed Flexvolumes:          <all>
  Allowed Unsafe Sysctls:       <none>
  Forbidden Sysctls:            <none>
  Allow Host Network:           false
  Allow Host Ports:             false
  Allow Host PID:               false
  Allow Host IPC:               false
  Read Only Root Filesystem:    false
  Run As User Strategy: MustRunAsRange
    UID:                        <none>
    UID Range Min:              <none>
    UID Range Max:              <none>
  SELinux Context Strategy: MustRunAs
    User:                       <none>
    Role:                       <none>
    Type:                       <none>
    Level:                      <none>
  FSGroup Strategy: MustRunAs
    Ranges:                     <none>
  Supplemental Groups Strategy: RunAsAny
    Ranges:                     <none>
The restricted-v2 SCC explicitly denies everything except what it explicitly allows. The following settings define the allowed capabilities and security restrictions:
- Default add capabilities: Set to <none>. It means that no capabilities are added to a pod by default.
- Required drop capabilities: Set to ALL. This drops all the default Linux capabilities of a pod.
- Allowed capabilities: NET_BIND_SERVICE. A pod can request this capability, but it is not added by default.
- Allowed seccomp profiles: runtime/default.
For more information, see Managing security context constraints.
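In practice, you can check which SCC admitted a running pod by reading the openshift.io/scc annotation that the admission controller sets; the namespace and pod name are placeholders:

$ oc -n <namespace> get pod <pod_name> -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'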