Nodes
Configuring and managing nodes in OpenShift Container Platform
Abstract
Chapter 1. Overview of nodes
1.1. About nodes
A node is a virtual or bare-metal machine in a Kubernetes cluster. Worker nodes host your application containers, grouped as pods. The control plane nodes run services that are required to control the Kubernetes cluster. In OpenShift Container Platform, the control plane nodes contain more than just the Kubernetes services for managing the OpenShift Container Platform cluster.
Having stable and healthy nodes in a cluster is fundamental to the smooth functioning of your hosted application. In OpenShift Container Platform, you can access, manage, and monitor a node through the Node object representing the node. Using the OpenShift CLI (oc) or the web console, you can perform the following operations on a node.
The following components of a node are responsible for maintaining the running of pods and providing the Kubernetes runtime environment.
- Container runtime
- The container runtime is responsible for running containers. Kubernetes offers several runtimes such as containerd, CRI-O, rktlet, and Docker.
- Kubelet
- Kubelet runs on nodes and reads the container manifests. It ensures that the defined containers have started and are running. The kubelet process maintains the state of work and the node server. Kubelet manages network rules and port forwarding. The kubelet manages containers that are created by Kubernetes only.
- Kube-proxy
- Kube-proxy runs on every node in the cluster and maintains the network traffic between the Kubernetes resources. Kube-proxy ensures that the networking environment is isolated and accessible.
- DNS
- Cluster DNS is a DNS server which serves DNS records for Kubernetes services. Containers started by Kubernetes automatically include this DNS server in their DNS searches.
Read operations
The read operations allow an administrator or a developer to get information about nodes in an OpenShift Container Platform cluster.
- List all the nodes in a cluster.
- Get information about a node, such as memory and CPU usage, health, status, and age.
- List pods running on a node.
Management operations
As an administrator, you can easily manage a node in an OpenShift Container Platform cluster through several tasks:
- Add or update node labels. A label is a key-value pair applied to a Node object. You can control the scheduling of pods using labels.
- Change node configuration using a custom resource definition (CRD), or the kubeletConfig object.
- Configure nodes to allow or disallow the scheduling of pods. Healthy worker nodes with a Ready status allow pod placement by default while the control plane nodes do not; you can change this default behavior by configuring the worker nodes to be unschedulable and the control plane nodes to be schedulable.
- Allocate resources for nodes using the system-reserved setting. You can allow OpenShift Container Platform to automatically determine the optimal system-reserved CPU and memory resources for your nodes, or you can manually determine and set the best resources for your nodes.
- Configure the number of pods that can run on a node based on the number of processor cores on the node, a hard limit, or both.
- Reboot a node gracefully using pod anti-affinity.
- Delete a node from a cluster by scaling down the cluster using a compute machine set. To delete a node from a bare-metal cluster, you must first drain all pods on the node and then manually delete the node.
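The pod-count item in the list above combines a per-core limit with a hard limit. The following Python sketch illustrates one way to reason about that combination. It assumes that when both limits are set, the lower one applies; the function and parameter names (effective_pod_limit, pods_per_core, max_pods) are hypothetical and are not taken from this document.

```python
from typing import Optional


def effective_pod_limit(cores: int,
                        pods_per_core: Optional[int] = None,
                        max_pods: Optional[int] = None) -> Optional[int]:
    """Return the pod limit for a node, or None when no limit is configured.

    Assumption: when both limits are set, the lower of the two applies.
    """
    limits = []
    if pods_per_core:  # None or 0 means "no per-core limit" in this sketch
        limits.append(pods_per_core * cores)
    if max_pods:  # None or 0 means "no hard limit" in this sketch
        limits.append(max_pods)
    return min(limits) if limits else None


if __name__ == "__main__":
    # A 4-core node with 10 pods per core and a hard limit of 250 pods.
    print(effective_pod_limit(cores=4, pods_per_core=10, max_pods=250))
```

In this model, the per-core limit dominates on small nodes and the hard limit dominates on large ones.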
Enhancement operations
OpenShift Container Platform allows you to do more than just access and manage nodes. As an administrator, you can perform the following tasks on nodes to make the cluster more efficient and application-friendly, and to provide a better environment for your developers.
- Manage node-level tuning for high-performance applications that require some level of kernel tuning by using the Node Tuning Operator.
- Enable TLS security profiles on the node to protect communication between the kubelet and the Kubernetes API server.
- Run background tasks on nodes automatically with daemon sets. You can create and use daemon sets to create shared storage, run a logging pod on every node, or deploy a monitoring agent on all nodes.
- Free node resources using garbage collection. You can ensure that your nodes are running efficiently by removing terminated containers and the images not referenced by any running pods.
- Add kernel arguments to a set of nodes.
- Configure an OpenShift Container Platform cluster to have worker nodes at the network edge (remote worker nodes). For information on the challenges of having remote worker nodes in an OpenShift Container Platform cluster and some recommended approaches for managing pods on a remote worker node, see Using remote worker nodes at the network edge.
1.2. About pods
A pod is one or more containers deployed together on a node. As a cluster administrator, you can define a pod, assign it to run on a healthy node that is ready for scheduling, and manage it. A pod runs as long as the containers are running. You cannot change a pod once it is defined and is running. Some operations you can perform when working with pods are:
Read operations
As an administrator, you can get information about pods in a project through the following tasks:
- List pods associated with a project, including information such as the number of replicas and restarts, current status, and age.
- View pod usage statistics such as CPU, memory, and storage consumption.
Management operations
The following list of tasks provides an overview of how an administrator can manage pods in an OpenShift Container Platform cluster.
- Control scheduling of pods using the advanced scheduling features available in OpenShift Container Platform:
  - Node-to-pod binding rules such as pod affinity, node affinity, and anti-affinity.
  - Node labels and selectors.
  - Taints and tolerations.
  - Pod topology spread constraints.
  - Secondary scheduling.
- Configure the descheduler to evict pods based on specific strategies so that the scheduler reschedules the pods to more appropriate nodes.
- Configure how pods behave after a restart using pod controllers and restart policies.
- Limit both egress and ingress traffic on a pod.
- Add and remove volumes to and from any object that has a pod template. A volume is a mounted file system available to all the containers in a pod. Container storage is ephemeral; you can use volumes to persist container data.
Enhancement operations
You can work with pods more easily and efficiently with the help of various tools and features available in OpenShift Container Platform. The following operations involve using those tools and features to better manage pods.
Operation | User | More information |
---|---|---|
Create and use a horizontal pod autoscaler. | Developer | You can use a horizontal pod autoscaler to specify the minimum and the maximum number of pods you want to run, as well as the CPU utilization or memory utilization your pods should target. Using a horizontal pod autoscaler, you can automatically scale pods. |
Use a vertical pod autoscaler. | Administrator and developer | As an administrator, use a vertical pod autoscaler to better use cluster resources by monitoring the resources and the resource requirements of workloads. As a developer, use a vertical pod autoscaler to ensure your pods stay up during periods of high demand by scheduling pods to nodes that have enough resources for each pod. |
Provide access to external resources using device plugins. | Administrator | A device plugin is a gRPC service running on nodes (external to the kubelet), which manages specific hardware resources. You can deploy a device plugin to provide a consistent and portable solution to consume hardware devices across clusters. |
Provide sensitive data to pods using the | Administrator | Some applications need sensitive information, such as passwords and usernames. You can use the |
1.3. About containers
A container is the basic unit of an OpenShift Container Platform application, which comprises the application code packaged along with its dependencies, libraries, and binaries. Containers provide consistency across environments and multiple deployment targets: physical servers, virtual machines (VMs), and private or public cloud.
Linux container technologies are lightweight mechanisms for isolating running processes and limiting access to only designated resources. As an administrator, you can perform various tasks on a Linux container.
OpenShift Container Platform provides specialized containers called Init containers. Init containers run before application containers and can contain utilities or setup scripts not present in an application image. You can use an Init container to perform tasks before the rest of a pod is deployed.
Apart from performing specific tasks on nodes, pods, and containers, you can work with the overall OpenShift Container Platform cluster to keep the cluster efficient and the application pods highly available.
1.4. About autoscaling pods on a node
OpenShift Container Platform offers three tools that you can use to automatically scale the number of pods on your nodes and the resources allocated to pods.
- Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) can automatically increase or decrease the scale of a replication controller or deployment configuration, based on metrics collected from the pods that belong to that replication controller or deployment configuration.
For more information, see Automatically scaling pods with the horizontal pod autoscaler.
- Custom Metrics Autoscaler
The Custom Metrics Autoscaler can automatically increase or decrease the number of pods for a deployment, stateful set, custom resource, or job based on custom metrics that are not based only on CPU or memory.
For more information, see Custom Metrics Autoscaler Operator overview.
- Vertical Pod Autoscaler
The Vertical Pod Autoscaler (VPA) can automatically review the historic and current CPU and memory resources for containers in pods and can update the resource limits and requests based on the usage values it learns.
For more information, see Automatically adjust pod resource levels with the vertical pod autoscaler.
1.5. Glossary of common terms for OpenShift Container Platform nodes
This glossary defines common terms that are used in the node content.
- Container
- A lightweight and executable image that comprises software and all its dependencies. Because containers virtualize the operating system, you can run containers anywhere, from a data center to a public or private cloud to a developer’s laptop.
- Daemon set
- Ensures that a replica of the pod runs on eligible nodes in an OpenShift Container Platform cluster.
- egress
- The process of data sharing externally through a network’s outbound traffic from a pod.
- garbage collection
- The process of cleaning up cluster resources, such as terminated containers and images that are not referenced by any running pods.
- Horizontal Pod Autoscaler (HPA)
- Implemented as a Kubernetes API resource and a controller. You can use the HPA to specify the minimum and maximum number of pods that you want to run. You can also specify the CPU or memory utilization that your pods should target. The HPA scales out and scales in pods when a given CPU or memory threshold is crossed.
- Ingress
- Incoming traffic to a pod.
- Job
- A process that runs to completion. A job creates one or more pod objects and ensures that the specified pods are successfully completed.
- Labels
- You can use labels, which are key-value pairs, to organize and select subsets of objects, such as a pod.
- Node
- A worker machine in the OpenShift Container Platform cluster. A node can be either a virtual machine (VM) or a physical machine.
- Node Tuning Operator
- You can use the Node Tuning Operator to manage node-level tuning by using the TuneD daemon. It ensures custom tuning specifications are passed to all containerized TuneD daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
- Self Node Remediation Operator
- The Operator runs on the cluster nodes and identifies and reboots nodes that are unhealthy.
- Pod
- One or more containers with shared resources, such as volume and IP addresses, running in your OpenShift Container Platform cluster. A pod is the smallest compute unit defined, deployed, and managed.
- Toleration
- Indicates that the pod is allowed (but not required) to be scheduled on nodes or node groups with matching taints. You can use tolerations to enable the scheduler to schedule pods onto nodes with matching taints.
- Taint
- A core object that comprises a key, value, and effect. Taints and tolerations work together to ensure that pods are not scheduled on irrelevant nodes.
Chapter 2. Working with pods
2.1. Using pods
A pod is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
2.1.1. Understanding pods
Pods are the rough equivalent of a machine instance (physical or virtual) to a container. Each pod is allocated its own internal IP address, therefore owning its entire port space, and containers within pods can share their local storage and networking.
Pods have a lifecycle; they are defined, then they are assigned to run on a node, then they run until their container(s) exit or they are removed for some other reason. Pods, depending on policy and exit code, might be removed after exiting, or can be retained to enable access to the logs of their containers.
OpenShift Container Platform treats pods as largely immutable; changes cannot be made to a pod definition while it is running. OpenShift Container Platform implements changes by terminating an existing pod and recreating it with modified configuration, base image(s), or both. Pods are also treated as expendable, and do not maintain state when recreated. Therefore pods should usually be managed by higher-level controllers, rather than directly by users.
For the maximum number of pods per OpenShift Container Platform node host, see the Cluster Limits.
Bare pods that are not managed by a replication controller are not rescheduled upon node disruption.
2.1.2. Example pod configurations
OpenShift Container Platform leverages the Kubernetes concept of a pod, which is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
The following is an example definition of a pod. It demonstrates many features of pods, most of which are discussed in other topics and thus only briefly mentioned here:
Pod object definition (YAML)
kind: Pod
apiVersion: v1
metadata:
  name: example
  labels:
    environment: production
    app: abc # 1
spec:
  restartPolicy: Always # 2
  securityContext: # 3
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers: # 4
    - name: abc
      args:
        - sleep
        - "1000000"
      volumeMounts: # 5
        - name: cache-volume
          mountPath: /cache # 6
      image: registry.access.redhat.com/ubi7/ubi-init:latest # 7
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          memory: "100Mi"
          cpu: "1"
        requests:
          memory: "100Mi"
          cpu: "1"
  volumes: # 8
    - name: cache-volume
      emptyDir:
        sizeLimit: 500Mi
- 1
- Pods can be "tagged" with one or more labels, which can then be used to select and manage groups of pods in a single operation. The labels are stored in key/value format in the metadata hash.
- 2
- The pod restart policy with possible values Always, OnFailure, and Never. The default value is Always.
- 3
- OpenShift Container Platform defines a security context for containers which specifies whether they are allowed to run as privileged containers, run as a user of their choice, and more. The default context is very restrictive but administrators can modify this as needed.
- 4
- containers specifies an array of one or more container definitions.
- 5
- The container specifies where external storage volumes are mounted within the container.
- 6
- Specify the volumes to provide for the pod. Volumes mount at the specified path. Do not mount to the container root, /, or any path that is the same in the host and the container. This can corrupt your host system if the container is sufficiently privileged, such as the host /dev/pts files. It is safe to mount the host by using /host.
- 7
- Each container in the pod is instantiated from its own container image.
- 8
- The pod defines storage volumes that are available to its container(s) to use.
If you attach persistent volumes that have high file counts to pods, those pods can fail or can take a long time to start. For more information, see When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve "Ready" state?.
This pod definition does not include attributes that are filled by OpenShift Container Platform automatically after the pod is created and its lifecycle begins. The Kubernetes pod documentation has details about the functionality and purpose of pods.
2.1.3. Additional resources
- For more information on pods and storage see Understanding persistent storage and Understanding ephemeral storage.
2.2. Viewing pods
As an administrator, you can view the pods in your cluster and determine the health of those pods and the cluster as a whole.
2.2.1. About pods
OpenShift Container Platform leverages the Kubernetes concept of a pod, which is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed. Pods are the rough equivalent of a machine instance (physical or virtual) to a container.
You can view a list of pods associated with a specific project or view usage statistics about pods.
2.2.2. Viewing pods in a project
You can view a list of pods associated with the current project, including the number of replicas, the current status, the number of restarts, and the age of the pod.
Procedure
To view the pods in a project:
Change to the project:
$ oc project <project-name>
Run the following command:
$ oc get pods
For example:
$ oc get pods
Example output
NAME                       READY   STATUS    RESTARTS   AGE
console-698d866b78-bnshf   1/1     Running   2          165m
console-698d866b78-m87pm   1/1     Running   2          165m
Add the -o wide flag to view the pod IP address and the node where the pod is located.
$ oc get pods -o wide
Example output
NAME                       READY   STATUS    RESTARTS   AGE    IP            NODE                           NOMINATED NODE
console-698d866b78-bnshf   1/1     Running   2          166m   10.128.0.24   ip-10-0-152-71.ec2.internal    <none>
console-698d866b78-m87pm   1/1     Running   2          166m   10.129.0.23   ip-10-0-173-237.ec2.internal   <none>
2.2.3. Viewing pod usage statistics
You can display usage statistics about pods, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.
Prerequisites
- You must have cluster-reader permission to view the usage statistics.
- Metrics must be installed to view the usage statistics.
Procedure
To view the usage statistics:
Run the following command:
$ oc adm top pods
For example:
$ oc adm top pods -n openshift-console
Example output
NAME                         CPU(cores)   MEMORY(bytes)
console-7f58c69899-q8c8k     0m           22Mi
console-7f58c69899-xhbgg     0m           25Mi
downloads-594fcccf94-bcxk8   3m           18Mi
downloads-594fcccf94-kv4p6   2m           15Mi
Run the following command to view the usage statistics for pods with labels:
$ oc adm top pod --selector=''
You must choose the selector (label query) to filter on. Supports =, ==, and !=.
For example:
$ oc adm top pod --selector='name=my-pod'
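To make the =, ==, and != operators concrete, the following Python sketch models how an equality-based label query might be evaluated against a pod's labels. It is an illustration only, not the implementation used by oc: the function names are hypothetical, comma-separated terms are assumed to be combined with a logical AND, and != is assumed to also match pods that lack the key, as in upstream Kubernetes.

```python
import re
from typing import Dict, List, Tuple

# One selector term: key, operator (=, ==, or !=), value.
_TERM = re.compile(r"^\s*([^=!\s]+)\s*(==|!=|=)\s*([^=!\s]*)\s*$")


def parse_selector(selector: str) -> List[Tuple[str, str, str]]:
    """Split a comma-separated selector into (key, operator, value) terms."""
    terms = []
    for part in selector.split(","):
        if not part.strip():
            continue  # an empty selector has no terms and matches everything
        match = _TERM.match(part)
        if match is None:
            raise ValueError(f"unsupported selector term: {part!r}")
        key, op, value = match.groups()
        # "=" and "==" are synonyms, so normalize to "=".
        terms.append((key, "=" if op == "==" else op, value))
    return terms


def matches(labels: Dict[str, str], selector: str) -> bool:
    """Return True when the labels satisfy every term of the selector."""
    for key, op, value in parse_selector(selector):
        if op == "=" and labels.get(key) != value:
            return False
        if op == "!=" and labels.get(key) == value:
            return False
    return True


if __name__ == "__main__":
    print(matches({"name": "my-pod"}, "name=my-pod"))
```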
2.2.4. Viewing resource logs
You can view the log for various resources in the OpenShift CLI (oc) and web console. Logs read from the tail, or end, of the log.
Prerequisites
- Access to the OpenShift CLI (oc).
Procedure (UI)
In the OpenShift Container Platform console, navigate to Workloads → Pods or navigate to the pod through the resource you want to investigate.
Note: Some resources, such as builds, do not have pods to query directly. In such instances, you can locate the Logs link on the Details page for the resource.
- Select a project from the drop-down menu.
- Click the name of the pod you want to investigate.
- Click Logs.
Procedure (CLI)
View the log for a specific pod:
$ oc logs -f <pod_name> -c <container_name>
where:
-f
- Optional: Specifies that the output follows what is being written into the logs.
<pod_name>
- Specifies the name of the pod.
<container_name>
- Optional: Specifies the name of a container. When a pod has more than one container, you must specify the container name.
For example:
$ oc logs ruby-58cd97df55-mww7r
$ oc logs -f ruby-57f7f4855b-znl92 -c ruby
The contents of log files are printed out.
View the log for a specific resource:
$ oc logs <object_type>/<resource_name> 1
- 1
- Specifies the resource type and name.
For example:
$ oc logs deployment/ruby
The contents of log files are printed out.
2.3. Configuring an OpenShift Container Platform cluster for pods
As an administrator, you can create and maintain an efficient cluster for pods.
By keeping your cluster efficient, you can provide a better environment for your developers. The available tools let you control what a pod does when it exits, ensure that the required number of pods is always running, decide when to restart pods designed to run only once, limit the bandwidth available to pods, and keep pods running during disruptions.
2.3.1. Configuring how pods behave after restart
A pod restart policy determines how OpenShift Container Platform responds when Containers in that pod exit. The policy applies to all Containers in that pod.
The possible values are:
- Always - Tries restarting a successfully exited Container on the pod continuously, with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes. The default is Always.
- OnFailure - Tries restarting a failed Container on the pod with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes.
- Never - Does not try to restart exited or failed Containers on the pod. Pods immediately fail and exit.
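The back-off schedule described for Always and OnFailure (10s, 20s, 40s, capped at 5 minutes) can be sketched as a simple generator. This is an illustrative model, not kubelet code; the function name and its parameters are hypothetical, and it assumes the delay doubles after every restart until it reaches the cap.

```python
from itertools import islice
from typing import Iterator


def restart_delays(initial: int = 10, factor: int = 2, cap: int = 300) -> Iterator[int]:
    """Yield successive restart delays in seconds, doubling up to the cap."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


if __name__ == "__main__":
    # First seven delays: the sequence grows until it reaches the 5-minute cap.
    print(list(islice(restart_delays(), 7)))
```

Once the cap is reached, every further restart waits the same 300 seconds in this model.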
After the pod is bound to a node, the pod will never be bound to another node. This means that a controller is necessary in order for a pod to survive node failure:
Condition | Controller Type | Restart Policy |
---|---|---|
Pods that are expected to terminate (such as batch computations) | Job | |
Pods that are expected to not terminate (such as web servers) | Replication controller | |
Pods that must run one-per-machine | Daemon set | Any |
If a Container on a pod fails and the restart policy is set to OnFailure, the pod stays on the node and the Container is restarted. If you do not want the Container to restart, use a restart policy of Never.
If an entire pod fails, OpenShift Container Platform starts a new pod. Developers must address the possibility that applications might be restarted in a new pod. In particular, applications must handle temporary files, locks, incomplete output, and so forth caused by previous runs.
Kubernetes architecture expects reliable endpoints from cloud providers. When a cloud provider is down, the kubelet prevents OpenShift Container Platform from restarting.
If the underlying cloud provider endpoints are not reliable, do not install a cluster using cloud provider integration. Install the cluster as if it was in a no-cloud environment. It is not recommended to toggle cloud provider integration on or off in an installed cluster.
For details on how OpenShift Container Platform uses restart policy with failed Containers, see the Example States in the Kubernetes documentation.
2.3.2. Limiting the bandwidth available to pods
You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. The limits you place on a pod do not affect the bandwidth of other pods.
Procedure
To limit the bandwidth on a pod:
Write an object definition JSON file, and specify the data traffic speed using kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations. For example, to limit both pod egress and ingress bandwidth to 10M/s:
Limited Pod object definition
{
    "kind": "Pod",
    "spec": {
        "containers": [
            {
                "image": "openshift/hello-openshift",
                "name": "hello-openshift"
            }
        ]
    },
    "apiVersion": "v1",
    "metadata": {
        "name": "iperf-slow",
        "annotations": {
            "kubernetes.io/ingress-bandwidth": "10M",
            "kubernetes.io/egress-bandwidth": "10M"
        }
    }
}
Create the pod using the object definition:
$ oc create -f <file_or_dir_path>
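If you generate such definitions programmatically, a small helper can keep the two annotations consistent. The following Python sketch builds the same structure as the example above; the helper name bandwidth_limited_pod and its parameters are hypothetical, and the output is only a JSON document that you would still create with oc create -f.

```python
import json
from typing import Any, Dict


def bandwidth_limited_pod(name: str, image: str, container_name: str,
                          ingress: str, egress: str) -> Dict[str, Any]:
    """Build a Pod definition that carries the two bandwidth annotations."""
    return {
        "kind": "Pod",
        "apiVersion": "v1",
        "metadata": {
            "name": name,
            "annotations": {
                "kubernetes.io/ingress-bandwidth": ingress,
                "kubernetes.io/egress-bandwidth": egress,
            },
        },
        "spec": {
            "containers": [{"image": image, "name": container_name}],
        },
    }


if __name__ == "__main__":
    pod = bandwidth_limited_pod("iperf-slow", "openshift/hello-openshift",
                                "hello-openshift", "10M", "10M")
    # Print the definition; redirect it to a file to use with oc create -f.
    print(json.dumps(pod, indent=4))
```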
2.3.3. Understanding how to use pod disruption budgets to specify the number of pods that must be up
A pod disruption budget allows the specification of safety constraints on pods during operations, such as draining a node for maintenance.
PodDisruptionBudget is an API object that specifies the minimum number or percentage of replicas that must be up at a time. Setting these in projects can be helpful during node maintenance (such as scaling a cluster down or a cluster upgrade) and is only honored on voluntary evictions (not on node failures).
A PodDisruptionBudget object’s configuration consists of the following key parts:
- A label selector, which is a label query over a set of pods.
- An availability level, which specifies the minimum number of pods that must be available simultaneously, either:
  - minAvailable is the number of pods that must always be available, even during a disruption.
  - maxUnavailable is the number of pods that can be unavailable during a disruption.
Available refers to the number of pods that have the condition Ready=True. Ready=True refers to a pod that is able to serve requests and should be added to the load balancing pools of all matching services.
A maxUnavailable of 0% or 0 or a minAvailable of 100% or equal to the number of replicas is permitted but can block nodes from being drained.
The default setting for maxUnavailable is 1 for all the machine config pools in OpenShift Container Platform. It is recommended to not change this value and update one control plane node at a time. Do not change this value to 3 for the control plane pool.
You can check for pod disruption budgets across all projects with the following:
$ oc get poddisruptionbudget --all-namespaces
The following example contains some values that are specific to OpenShift Container Platform on AWS.
Example output
NAMESPACE                             NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
openshift-apiserver                   openshift-apiserver-pdb             N/A             1                 1                     121m
openshift-cloud-controller-manager    aws-cloud-controller-manager        1               N/A               1                     125m
openshift-cloud-credential-operator   pod-identity-webhook                1               N/A               1                     117m
openshift-cluster-csi-drivers         aws-ebs-csi-driver-controller-pdb   N/A             1                 1                     121m
openshift-cluster-storage-operator    csi-snapshot-controller-pdb         N/A             1                 1                     122m
openshift-cluster-storage-operator    csi-snapshot-webhook-pdb            N/A             1                 1                     122m
openshift-console                     console                             N/A             1                 1                     116m
#...
The PodDisruptionBudget is considered healthy when there are at least minAvailable pods running in the system. Every pod above that limit can be evicted.
Depending on your pod priority and preemption settings, lower-priority pods might be removed despite their pod disruption budget requirements.
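The relationship between minAvailable, maxUnavailable, and the ALLOWED DISRUPTIONS column can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the controller's implementation: the function names are hypothetical, exactly one of the two settings is expected, and percentages are assumed to round up to the next whole pod, as in upstream Kubernetes.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def _scaled(value: IntOrPercent, total: int) -> int:
    """Resolve an integer or a percentage string such as "25%" against total."""
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError(f"expected a percentage such as '25%', got {value!r}")
        percent = int(value[:-1])
        return -(-(percent * total) // 100)  # integer ceiling (assumed rounding)
    return value


def allowed_disruptions(expected_pods: int, healthy_pods: int,
                        min_available: Optional[IntOrPercent] = None,
                        max_unavailable: Optional[IntOrPercent] = None) -> int:
    """Return how many healthy pods could be evicted without breaking the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _scaled(min_available, expected_pods)
    else:
        desired_healthy = max(0, expected_pods - _scaled(max_unavailable, expected_pods))
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    # Three replicas, all healthy, and minAvailable: 2.
    print(allowed_disruptions(3, 3, min_available=2))
```

In this model, a maxUnavailable of 0 or a minAvailable of 100% always yields zero allowed disruptions, which is why such budgets can block a node drain.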
2.3.3.1. Specifying the number of pods that must be up with pod disruption budgets
You can use a PodDisruptionBudget object to specify the minimum number or percentage of replicas that must be up at a time.
Procedure
To configure a pod disruption budget:
Create a YAML file with an object definition similar to the following:
apiVersion: policy/v1 # 1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 2 # 2
  selector: # 3
    matchLabels:
      name: my-pod
- 1
- PodDisruptionBudget is part of the policy/v1 API group.
- 2
- The minimum number of pods that must be available simultaneously. This can be either an integer or a string specifying a percentage, for example, 20%.
- 3
- A label query over a set of resources. The results of matchLabels and matchExpressions are logically conjoined. Leave this parameter blank, for example selector {}, to select all pods in the project.
Or:
apiVersion: policy/v1 # 1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  maxUnavailable: 25% # 2
  selector: # 3
    matchLabels:
      name: my-pod
- 1
- PodDisruptionBudget is part of the policy/v1 API group.
- 2
- The maximum number of pods that can be unavailable simultaneously. This can be either an integer or a string specifying a percentage, for example, 20%.
- 3
- A label query over a set of resources. The results of matchLabels and matchExpressions are logically conjoined. Leave this parameter blank, for example selector {}, to select all pods in the project.
Run the following command to add the object to the project:
$ oc create -f </path/to/file> -n <project_name>
2.3.3.2. Specifying the eviction policy for unhealthy pods
When you use pod disruption budgets (PDBs) to specify how many pods must be available simultaneously, you can also define the criteria for how unhealthy pods are considered for eviction.
You can choose one of the following policies:
- IfHealthyBudget
- Running pods that are not yet healthy can be evicted only if the guarded application is not disrupted.
- AlwaysAllow
- Running pods that are not yet healthy can be evicted regardless of whether the criteria in the pod disruption budget are met. This policy can help evict malfunctioning applications, such as ones with pods stuck in the CrashLoopBackOff state or failing to report the Ready status.
Note: It is recommended to set the unhealthyPodEvictionPolicy field to AlwaysAllow in the PodDisruptionBudget object to support the eviction of misbehaving applications during a node drain. The default behavior is to wait for the application pods to become healthy before the drain can proceed.
Procedure
Create a YAML file that defines a PodDisruptionBudget object and specify the unhealthy pod eviction policy:
Example pod-disruption-budget.yaml file
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      name: my-pod
  unhealthyPodEvictionPolicy: AlwaysAllow # 1
- 1
- Choose either IfHealthyBudget or AlwaysAllow as the unhealthy pod eviction policy. The default is IfHealthyBudget when the unhealthyPodEvictionPolicy field is empty.
Create the PodDisruptionBudget object by running the following command:
$ oc create -f pod-disruption-budget.yaml
With a PDB that has the AlwaysAllow unhealthy pod eviction policy set, you can now drain nodes and evict the pods for a malfunctioning application guarded by this PDB.
Additional resources
- Enabling features using feature gates
- Unhealthy Pod Eviction Policy in the Kubernetes documentation
2.3.4. Preventing pod removal using critical pods
There are a number of core components that are critical to a fully functional cluster but run on a regular cluster node rather than on a control plane node. A cluster might stop working properly if a critical add-on is evicted.
Pods marked as critical are not allowed to be evicted.
Procedure
To make a pod critical:
Create a Pod spec or edit existing pods to include the system-cluster-critical priority class:
apiVersion: v1
kind: Pod
metadata:
  name: my-pdb
spec:
  template:
    metadata:
      name: critical-pod
    priorityClassName: system-cluster-critical # 1
# ...
- 1
- Default priority class for pods that should never be evicted from a node.
Alternatively, you can specify system-node-critical for pods that are important to the cluster but can be removed if necessary.
Create the pod:
$ oc create -f <file-name>.yaml
2.3.5. Reducing pod timeouts when using persistent volumes with high file counts
If a storage volume contains many files (~1,000,000 or greater), you might experience pod timeouts.
This can occur because, when volumes are mounted, OpenShift Container Platform recursively changes the ownership and permissions of the contents of each volume in order to match the fsGroup specified in a pod’s securityContext. For large volumes, checking and changing the ownership and permissions can be time consuming, resulting in a very slow pod startup.
You can reduce this delay by applying one of the following workarounds:
- Use a security context constraint (SCC) to skip the SELinux relabeling for a volume.
-
Use the
fsGroupChangePolicy
field inside an SCC to control the way that OpenShift Container Platform checks and manages ownership and permissions for a volume. - Use the Cluster Resource Override Operator to automatically apply an SCC to skip the SELinux relabeling.
- Use a runtime class to skip the SELinux relabeling for a volume.
For information, see When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve "Ready" state?.
2.4. Automatically scaling pods with the horizontal pod autoscaler
As a developer, you can use a horizontal pod autoscaler (HPA) to specify how OpenShift Container Platform should automatically increase or decrease the scale of a replication controller or deployment configuration, based on metrics collected from the pods that belong to that replication controller or deployment configuration. You can create an HPA for any deployment, deployment config, replica set, replication controller, or stateful set.
For information on scaling pods based on custom metrics, see Automatically scaling pods based on custom metrics.
It is recommended to use a Deployment object or ReplicaSet object unless you need a specific feature or behavior provided by other objects. For more information on these objects, see Understanding deployments.
2.4.1. Understanding horizontal pod autoscalers
You can create a horizontal pod autoscaler to specify the minimum and maximum number of pods you want to run, as well as the CPU utilization or memory utilization your pods should target.
After you create a horizontal pod autoscaler, OpenShift Container Platform begins to query the CPU and/or memory resource metrics on the pods. When these metrics are available, the horizontal pod autoscaler computes the ratio of the current metric utilization to the desired metric utilization, and scales up or down accordingly. The query and scaling occur at a regular interval, but it can take one to two minutes before metrics become available.
For replication controllers, this scaling corresponds directly to the replicas of the replication controller. For deployment configurations, scaling corresponds directly to the replica count of the deployment configuration. Note that autoscaling applies only to the latest deployment in the Complete phase.
OpenShift Container Platform automatically accounts for resources and prevents unnecessary autoscaling during resource spikes, such as during start up. Pods in the unready state have 0 CPU usage when scaling up and the autoscaler ignores the pods when scaling down. Pods without known metrics have 0% CPU usage when scaling up and 100% CPU usage when scaling down. This allows for more stability during the HPA decision. To use this feature, you must configure readiness checks to determine if a new pod is ready for use.
To use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics.
2.4.1.1. Supported metrics
The following metrics are supported by horizontal pod autoscalers:
Metric | Description | API version |
---|---|---|
CPU utilization | Number of CPU cores used. Can be used to calculate a percentage of the pod’s requested CPU. | autoscaling/v1, autoscaling/v2 |
Memory utilization | Amount of memory used. Can be used to calculate a percentage of the pod’s requested memory. | autoscaling/v2 |
For memory-based autoscaling, memory usage must increase and decrease proportionally to the replica count. On average:
- An increase in replica count must lead to an overall decrease in memory (working set) usage per-pod.
- A decrease in replica count must lead to an overall increase in per-pod memory usage.
Use the OpenShift Container Platform web console to check the memory behavior of your application and ensure that your application meets these requirements before using memory-based autoscaling.
The following example shows autoscaling for the image-registry Deployment object. The initial deployment requires 3 pods. The HPA object increases the minimum to 5. If CPU usage on the pods reaches 75%, the pods increase to 7:
$ oc autoscale deployment/image-registry --min=5 --max=7 --cpu-percent=75
Example output
horizontalpodautoscaler.autoscaling/image-registry autoscaled
Sample HPA for the image-registry Deployment object with minReplicas set to 3
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: image-registry
  namespace: default
spec:
  maxReplicas: 7
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: image-registry
  targetCPUUtilizationPercentage: 75
status:
  currentReplicas: 5
  desiredReplicas: 0
View the new state of the deployment:
$ oc get deployment image-registry
There are now 5 pods in the deployment:
Example output
NAME             REVISION   DESIRED   CURRENT   TRIGGERED BY
image-registry   1          5         5         config
2.4.2. How does the HPA work?
The horizontal pod autoscaler (HPA) extends the concept of pod auto-scaling. The HPA lets you create and manage a group of load-balanced nodes. The HPA automatically increases or decreases the number of pods when a given CPU or memory threshold is crossed.
Figure 2.1. High level workflow of the HPA
The HPA is an API resource in the Kubernetes autoscaling API group. The autoscaler works as a control loop with a default of 15 seconds for the sync period. During this period, the controller manager queries the CPU, memory utilization, or both, against what is defined in the YAML file for the HPA. The controller manager obtains the utilization metrics from the resource metrics API for per-pod resource metrics like CPU or memory, for each pod that is targeted by the HPA.
If a utilization value target is set, the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each pod. The controller then takes the average of utilization across all targeted pods and produces a ratio that is used to scale the number of desired replicas. The HPA is configured to fetch metrics from metrics.k8s.io, which is provided by the metrics server. Because of the dynamic nature of metrics evaluation, the number of replicas can fluctuate during scaling for a group of replicas.
To implement the HPA, all targeted pods must have a resource request set on their containers.
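As a rough illustration of the ratio-based calculation described in this section, the following Python sketch derives a desired replica count from the average utilization and the target. The function name, the clamping arguments, and the 10% tolerance band (the upstream Kubernetes default) are assumptions made for this example and are not part of the OpenShift Container Platform API.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=None, tolerance=0.1):
    """Return ceil(current_replicas * current / target), clamped to the bounds.

    tolerance is an assumed dead band: no scaling happens while the
    current/target ratio stays within +/- tolerance of 1.0.
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    desired = max(desired, min_replicas)
    if max_replicas is not None:
        desired = min(desired, max_replicas)
    return desired


# Five pods averaging 90% of their CPU request against a 75% target,
# with the HPA bounded between 5 and 7 replicas.
print(desired_replicas(5, 90, 75, min_replicas=5, max_replicas=7))  # prints 6
```

The sketch ignores unready pods and pods with missing metrics, which the real controller treats specially.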
2.4.3. About requests and limits
The scheduler uses the resource request that you specify for containers in a pod, to decide which node to place the pod on. The kubelet enforces the resource limit that you specify for a container to ensure that the container is not allowed to use more than the specified limit. The kubelet also reserves the request amount of that system resource specifically for that container to use.
How to use resource metrics?
In the pod specifications, you must specify the resource requests, such as CPU and memory. The HPA uses this specification to determine the resource utilization and then scales the target up or down.
For example, the HPA object uses the following metric source:
type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60
In this example, the HPA keeps the average utilization of the pods in the scaling target at 60%. Utilization is the ratio of the current resource usage to the requested resource of the pod.
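To make the utilization arithmetic concrete, the following Python sketch computes per-pod utilization as a percentage of the request and then averages it across pods, as the text describes. The function name and the sample values are hypothetical, and the actual controller might aggregate usage and requests differently.

```python
def average_utilization(usage_millicores, request_millicores):
    """Average, across pods, of usage expressed as a percent of the request."""
    if not usage_millicores or len(usage_millicores) != len(request_millicores):
        raise ValueError("expected one request per pod and at least one pod")
    per_pod = [100 * used / requested
               for used, requested in zip(usage_millicores, request_millicores)]
    return sum(per_pod) / len(per_pod)


# Three pods, each requesting 500m CPU (hypothetical sample values).
usage = [300, 450, 150]
requests = [500, 500, 500]
print(average_utilization(usage, requests))  # prints 60.0
```

With a target of averageUtilization: 60, these sample pods are exactly on target, so no scaling would be expected.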
2.4.4. Best practices
All pods must have resource requests configured
The HPA makes a scaling decision based on the observed CPU or memory utilization values of pods in an OpenShift Container Platform cluster. Utilization values are calculated as a percentage of the resource requests of each pod. Missing resource request values can affect the optimal performance of the HPA.
Configure the cool down period
During horizontal pod autoscaling, scaling events might occur in rapid succession without a time gap. Configure the cool down period to prevent frequent replica fluctuations. You can specify a cool down period by configuring the stabilizationWindowSeconds field. The stabilization window restricts the fluctuation of the replica count when the metrics used for scaling keep fluctuating. The autoscaling algorithm uses this window to infer a previous desired state and avoid unwanted changes to workload scale.
For example, a stabilization window is specified for the scaleDown field:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
In the above example, all desired states for the past 5 minutes are considered. This approximates a rolling maximum, and avoids having the scaling algorithm frequently remove pods only to trigger recreating an equivalent pod just moments later.
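The rolling maximum can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name, the history format (timestamp in seconds paired with a desired replica count), and the sample data are hypothetical.

```python
def stabilized_scale_down(history, current_recommendation, now, window_seconds=300):
    """Pick the highest desired replica count seen inside the window.

    history: list of (timestamp_seconds, desired_replicas) tuples.
    """
    recent = [replicas for ts, replicas in history
              if 0 <= now - ts <= window_seconds]
    return max(recent + [current_recommendation])


# Hypothetical recommendations recorded over the last five minutes.
history = [(0, 10), (100, 6), (200, 8), (290, 4)]
print(stabilized_scale_down(history, current_recommendation=4, now=300))  # prints 10
```

Because the largest recent recommendation wins, a brief dip in load does not remove pods that would be needed again moments later.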
2.4.4.1. Scaling policies
The autoscaling/v2 API allows you to add scaling policies to a horizontal pod autoscaler. A scaling policy controls how the OpenShift Container Platform horizontal pod autoscaler (HPA) scales pods. Scaling policies allow you to restrict the rate that HPAs scale pods up or down by setting a specific number or specific percentage to scale in a specified period of time. You can also define a stabilization window, which uses previously computed desired states to control scaling if the metrics are fluctuating. You can create multiple policies for the same scaling direction, and determine which policy is used, based on the amount of change. You can also restrict the scaling by timed iterations. The HPA scales pods during an iteration, then performs scaling, as needed, in further iterations.
Sample HPA object with a scaling policy
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-memory
  namespace: default
spec:
  behavior:
    scaleDown: 1
      policies: 2
      - type: Pods 3
        value: 4 4
        periodSeconds: 60 5
      - type: Percent
        value: 10 6
        periodSeconds: 60
      selectPolicy: Min 7
      stabilizationWindowSeconds: 300 8
    scaleUp: 9
      policies:
      - type: Pods
        value: 5 10
        periodSeconds: 70
      - type: Percent
        value: 12 11
        periodSeconds: 80
      selectPolicy: Max
      stabilizationWindowSeconds: 0
...
- 1
- Specifies the direction for the scaling policy, either scaleDown or scaleUp. This example creates a policy for scaling down.
- 2
- Defines the scaling policy.
- 3
- Determines if the policy scales by a specific number of pods or a percentage of pods during each iteration. The default value is pods.
- 4
- Limits the amount of scaling, either the number of pods or percentage of pods, during each iteration. There is no default value for scaling down by number of pods.
- 5
- Determines the length of a scaling iteration. The default value is 15 seconds.
- 6
- The default value for scaling down by percentage is 100%.
- 7
- Determines which policy to use first, if multiple policies are defined. Specify Max to use the policy that allows the highest amount of change, Min to use the policy that allows the lowest amount of change, or Disabled to prevent the HPA from scaling in that policy direction. The default value is Max.
- 8
- Determines the time period the HPA should look back at desired states. The default value is 0.
- 9
- This example creates a policy for scaling up.
- 10
- Limits the amount of scaling up by the number of pods. The default value for scaling up the number of pods is 4.
- 11
- Limits the amount of scaling up by the percentage of pods. The default value for scaling up by percentage is 100%.
Example policy for scaling down
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-memory
  namespace: default
spec:
...
  minReplicas: 20
...
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 4
        periodSeconds: 30
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Max
    scaleUp:
      selectPolicy: Disabled
In this example, when the number of pods is greater than 40, the percent-based policy is used for scaling down, as that policy results in a larger change, as required by the selectPolicy.
If there are 80 pod replicas, in the first iteration the HPA reduces the pods by 8, which is 10% of the 80 pods (based on the type: Percent and value: 10 parameters), over one minute (periodSeconds: 60). For the next iteration, the number of pods is 72. The HPA calculates that 10% of the remaining pods is 7.2, which it rounds up to 8 and scales down 8 pods. On each subsequent iteration, the number of pods to be scaled is re-calculated based on the number of remaining pods. When the number of pods falls below 40, the pods-based policy is applied, because the pod-based number is greater than the percent-based number. The HPA reduces 4 pods at a time (type: Pods and value: 4), over 30 seconds (periodSeconds: 30), until there are 20 replicas remaining (minReplicas).
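The walk-through above can be reproduced with a short Python simulation. This is a simplified sketch: the function name is hypothetical, each loop pass stands for one iteration, and the differing periodSeconds values and the stabilization window are deliberately ignored.

```python
def scale_down_steps(start, min_replicas, pods_per_step=4, percent_per_step=10):
    """Replica counts per iteration under selectPolicy: Max for scaling down."""
    replicas = start
    trajectory = [replicas]
    while replicas > min_replicas:
        # Percent policy: round the number of pods to remove up (ceiling division).
        by_percent = -(-replicas * percent_per_step // 100)
        # selectPolicy: Max picks whichever policy allows the larger change.
        allowed = max(pods_per_step, by_percent)
        replicas = max(min_replicas, replicas - allowed)
        trajectory.append(replicas)
    return trajectory


print(scale_down_steps(80, 20))
# prints [80, 72, 64, 57, 51, 45, 40, 36, 32, 28, 24, 20]
```

The percent-based policy dominates until 40 replicas remain, after which the fixed step of 4 pods applies until the count reaches minReplicas.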
The selectPolicy: Disabled parameter prevents the HPA from scaling up the pods. You can manually scale up by adjusting the number of replicas in the replica set or deployment set, if needed.
If set, you can view the scaling policy by using the oc edit command:
$ oc edit hpa hpa-resource-metrics-memory
Example output
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/behavior:\
'{"ScaleUp":{"StabilizationWindowSeconds":0,"SelectPolicy":"Max","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":15},{"Type":"Percent","Value":100,"PeriodSeconds":15}]},\
"ScaleDown":{"StabilizationWindowSeconds":300,"SelectPolicy":"Min","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":60},{"Type":"Percent","Value":10,"PeriodSeconds":60}]}}'
...
2.4.5. Creating a horizontal pod autoscaler by using the web console
From the web console, you can create a horizontal pod autoscaler (HPA) that specifies the minimum and maximum number of pods you want to run on a Deployment or DeploymentConfig object. You can also define the amount of CPU or memory usage that your pods should target.
An HPA cannot be added to deployments that are part of an Operator-backed service, Knative service, or Helm chart.
Procedure
To create an HPA in the web console:
- In the Topology view, click the node to reveal the side pane.
- From the Actions drop-down list, select Add HorizontalPodAutoscaler to open the Add HorizontalPodAutoscaler form.
Figure 2.2. Add HorizontalPodAutoscaler
- From the Add HorizontalPodAutoscaler form, define the name, minimum and maximum pod limits, the CPU and memory usage, and click Save.
Note: If any of the values for CPU and memory usage are missing, a warning is displayed.
To edit an HPA in the web console:
- In the Topology view, click the node to reveal the side pane.
- From the Actions drop-down list, select Edit HorizontalPodAutoscaler to open the Edit Horizontal Pod Autoscaler form.
- From the Edit Horizontal Pod Autoscaler form, edit the minimum and maximum pod limits and the CPU and memory usage, and click Save.
While creating or editing the horizontal pod autoscaler in the web console, you can switch from Form view to YAML view.
To remove an HPA in the web console:
- In the Topology view, click the node to reveal the side pane.
- From the Actions drop-down list, select Remove HorizontalPodAutoscaler.
- In the confirmation pop-up window, click Remove to remove the HPA.
2.4.6. Creating a horizontal pod autoscaler for CPU utilization by using the CLI
Using the OpenShift Container Platform CLI, you can create a horizontal pod autoscaler (HPA) to automatically scale an existing Deployment, DeploymentConfig, ReplicaSet, ReplicationController, or StatefulSet object. The HPA scales the pods associated with that object to maintain the CPU usage you specify.
It is recommended to use a Deployment object or ReplicaSet object unless you need a specific feature or behavior provided by other objects.
The HPA increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified CPU utilization across all pods.
When autoscaling for CPU utilization, you can use the oc autoscale command and specify the minimum and maximum number of pods you want to run at any given time and the average CPU utilization your pods should target. If you do not specify a minimum, the pods are given default values from the OpenShift Container Platform server.
To autoscale for a specific CPU value, create a HorizontalPodAutoscaler object with the target CPU and pod limits.
Prerequisites
To use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name> command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu and Memory displayed under Usage.
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Example output
Name: openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace: openshift-kube-scheduler
Labels: <none>
Annotations: <none>
API Version: metrics.k8s.io/v1beta1
Containers:
  Name: wait-for-host-port
  Usage:
    Memory: 0
  Name: scheduler
  Usage:
    Cpu: 8m
    Memory: 45440Ki
Kind: PodMetrics
Metadata:
  Creation Timestamp: 2019-05-23T18:47:56Z
  Self Link: /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp: 2019-05-23T18:47:56Z
Window: 1m0s
Events: <none>
Procedure
To create a horizontal pod autoscaler for CPU utilization:
Perform one of the following:
To scale based on the percent of CPU utilization, create a HorizontalPodAutoscaler object for an existing object:
$ oc autoscale <object_type>/<name> \ 1
  --min <number> \ 2
  --max <number> \ 3
  --cpu-percent=<percent> 4
- 1
- Specify the type and name of the object to autoscale. The object must exist and be a Deployment, DeploymentConfig/dc, ReplicaSet/rs, ReplicationController/rc, or StatefulSet.
- 2
- Optionally, specify the minimum number of replicas when scaling down.
- 3
- Specify the maximum number of replicas when scaling up.
- 4
- Specify the target average CPU utilization over all the pods, represented as a percent of requested CPU. If not specified or negative, a default autoscaling policy is used.
For example, the following command shows autoscaling for the image-registry Deployment object. The initial deployment requires 3 pods. The HPA object increases the minimum to 5. If CPU usage on the pods reaches 75%, the pods will increase to 7:
$ oc autoscale deployment/image-registry --min=5 --max=7 --cpu-percent=75
To scale for a specific CPU value, create a YAML file similar to the following for an existing object:
apiVersion: autoscaling/v2 1
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-autoscale 2
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1 3
    kind: Deployment 4
    name: example 5
  minReplicas: 1 6
  maxReplicas: 10 7
  metrics: 8
  - type: Resource
    resource:
      name: cpu 9
      target:
        type: AverageValue 10
        averageValue: 500m 11
- 1
- Use the autoscaling/v2 API.
- 2
- Specify a name for this horizontal pod autoscaler object.
- 3
- Specify the API version of the object to scale:
  - For a Deployment, ReplicaSet, or StatefulSet object, use apps/v1.
  - For a ReplicationController, use v1.
  - For a DeploymentConfig, use apps.openshift.io/v1.
- 4
- Specify the type of object. The object must be a Deployment, DeploymentConfig/dc, ReplicaSet/rs, ReplicationController/rc, or StatefulSet.
- 5
- Specify the name of the object to scale. The object must exist.
- 6
- Specify the minimum number of replicas when scaling down.
- 7
- Specify the maximum number of replicas when scaling up.
- 8
- Use the metrics parameter for CPU utilization.
- 9
- Specify cpu for CPU utilization.
- 10
- Set to AverageValue.
- 11
- Set to averageValue with the targeted CPU value.
Create the horizontal pod autoscaler:
$ oc create -f <file-name>.yaml
Verify that the horizontal pod autoscaler was created:
$ oc get hpa cpu-autoscale
Example output
NAME            REFERENCE            TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
cpu-autoscale   Deployment/example   173m/500m   1         10        1          20m
2.4.7. Creating a horizontal pod autoscaler object for memory utilization by using the CLI
Using the OpenShift Container Platform CLI, you can create a horizontal pod autoscaler (HPA) to automatically scale an existing Deployment, DeploymentConfig, ReplicaSet, ReplicationController, or StatefulSet object. The HPA scales the pods associated with that object to maintain the average memory utilization you specify, either a direct value or a percentage of requested memory.
It is recommended to use a Deployment object or ReplicaSet object unless you need a specific feature or behavior provided by other objects.
The HPA increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified memory utilization across all pods.
For memory utilization, you can specify the minimum and maximum number of pods and the average memory utilization your pods should target. If you do not specify a minimum, the pods are given default values from the OpenShift Container Platform server.
Prerequisites
To use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name> command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu and Memory displayed under Usage.
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-129-223.compute.internal -n openshift-kube-scheduler
Example output
Name: openshift-kube-scheduler-ip-10-0-129-223.compute.internal
Namespace: openshift-kube-scheduler
Labels: <none>
Annotations: <none>
API Version: metrics.k8s.io/v1beta1
Containers:
  Name: wait-for-host-port
  Usage:
    Cpu: 0
    Memory: 0
  Name: scheduler
  Usage:
    Cpu: 8m
    Memory: 45440Ki
Kind: PodMetrics
Metadata:
  Creation Timestamp: 2020-02-14T22:21:14Z
  Self Link: /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-129-223.compute.internal
Timestamp: 2020-02-14T22:21:14Z
Window: 5m0s
Events: <none>
Procedure
To create a horizontal pod autoscaler for memory utilization:
Create a YAML file for one of the following:
To scale for a specific memory value, create a HorizontalPodAutoscaler object similar to the following for an existing object:
apiVersion: autoscaling/v2 1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-memory 2
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1 3
    kind: Deployment 4
    name: example 5
  minReplicas: 1 6
  maxReplicas: 10 7
  metrics: 8
  - type: Resource
    resource:
      name: memory 9
      target:
        type: AverageValue 10
        averageValue: 500Mi 11
  behavior: 12
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Max
- 1
- Use the autoscaling/v2 API.
- 2
- Specify a name for this horizontal pod autoscaler object.
- 3
- Specify the API version of the object to scale:
  - For a Deployment, ReplicaSet, or StatefulSet object, use apps/v1.
  - For a ReplicationController, use v1.
  - For a DeploymentConfig, use apps.openshift.io/v1.
- 4
- Specify the type of object. The object must be a Deployment, DeploymentConfig, ReplicaSet, ReplicationController, or StatefulSet.
- 5
- Specify the name of the object to scale. The object must exist.
- 6
- Specify the minimum number of replicas when scaling down.
- 7
- Specify the maximum number of replicas when scaling up.
- 8
- Use the metrics parameter for memory utilization.
- 9
- Specify memory for memory utilization.
- 10
- Set the type to AverageValue.
- 11
- Specify averageValue and a specific memory value.
- 12
- Optional: Specify a scaling policy to control the rate of scaling up or down.
To scale for a percentage, create a HorizontalPodAutoscaler object similar to the following for an existing object:
apiVersion: autoscaling/v2 1
kind: HorizontalPodAutoscaler
metadata:
  name: memory-autoscale 2
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1 3
    kind: Deployment 4
    name: example 5
  minReplicas: 1 6
  maxReplicas: 10 7
  metrics: 8
  - type: Resource
    resource:
      name: memory 9
      target:
        type: Utilization 10
        averageUtilization: 50 11
  behavior: 12
    scaleUp:
      stabilizationWindowSeconds: 180
      policies:
      - type: Pods
        value: 6
        periodSeconds: 120
      - type: Percent
        value: 10
        periodSeconds: 120
      selectPolicy: Max
- 1
- Use the autoscaling/v2 API.
- 2
- Specify a name for this horizontal pod autoscaler object.
- 3
- Specify the API version of the object to scale:
  - For a ReplicationController, use v1.
  - For a DeploymentConfig, use apps.openshift.io/v1.
  - For a Deployment, ReplicaSet, or StatefulSet object, use apps/v1.
- 4
- Specify the type of object. The object must be a Deployment, DeploymentConfig, ReplicaSet, ReplicationController, or StatefulSet.
- 5
- Specify the name of the object to scale. The object must exist.
- 6
- Specify the minimum number of replicas when scaling down.
- 7
- Specify the maximum number of replicas when scaling up.
- 8
- Use the metrics parameter for memory utilization.
- 9
- Specify memory for memory utilization.
- 10
- Set to Utilization.
- 11
- Specify averageUtilization and a target average memory utilization over all the pods, represented as a percent of requested memory. The target pods must have memory requests configured.
- 12
- Optional: Specify a scaling policy to control the rate of scaling up or down.
Create the horizontal pod autoscaler:
$ oc create -f <file-name>.yaml
For example:
$ oc create -f hpa.yaml
Example output
horizontalpodautoscaler.autoscaling/hpa-resource-metrics-memory created
Verify that the horizontal pod autoscaler was created:
$ oc get hpa hpa-resource-metrics-memory
Example output
NAME                          REFERENCE            TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
hpa-resource-metrics-memory   Deployment/example   2441216/500Mi   1         10        1          20m
$ oc describe hpa hpa-resource-metrics-memory
Example output
Name: hpa-resource-metrics-memory
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Wed, 04 Mar 2020 16:31:37 +0530
Reference: Deployment/example
Metrics: ( current / target )
resource memory on pods: 2441216 / 500Mi
Min replicas: 1
Max replicas: 10
ReplicationController pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from memory resource
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 6m34s horizontal-pod-autoscaler New size: 1; reason: All metrics below target
2.4.8. Understanding horizontal pod autoscaler status conditions by using the CLI
You can use the status conditions set to determine whether or not the horizontal pod autoscaler (HPA) is able to scale and whether or not it is currently restricted in any way.
The HPA status conditions are available with the v2 version of the autoscaling API.
The HPA responds with the following status conditions:
The AbleToScale condition indicates whether the HPA is able to fetch and update metrics, as well as whether any backoff-related conditions could prevent scaling.
- A True condition indicates scaling is allowed.
- A False condition indicates scaling is not allowed for the reason specified.
The ScalingActive condition indicates whether the HPA is enabled (for example, the replica count of the target is not zero) and is able to calculate desired metrics.
- A True condition indicates metrics are working properly.
- A False condition generally indicates a problem with fetching metrics.
The ScalingLimited condition indicates that the desired scale was capped by the maximum or minimum of the horizontal pod autoscaler.
- A True condition indicates that you need to raise or lower the minimum or maximum replica count in order to scale.
- A False condition indicates that the requested scaling is allowed.
$ oc describe hpa cm-test
Example output
Name: cm-test
Namespace: prom
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 16 Jun 2017 18:09:22 +0000
Reference: ReplicationController/cm-test
Metrics: ( current / target )
"http_requests" on pods: 66m / 500m
Min replicas: 1
Max replicas: 4
ReplicationController pods: 1 current / 1 desired
Conditions: 1
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric http_request
ScalingLimited False DesiredWithinRange the desired replica count is within the acceptable range
Events:
- 1
- The horizontal pod autoscaler status messages.
The following is an example of a pod that is unable to scale:
Example output
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale False FailedGetScale the HPA controller was unable to get the target's current scale: no matches for kind "ReplicationController" in group "apps"
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetScale 6s (x3 over 36s) horizontal-pod-autoscaler no matches for kind "ReplicationController" in group "apps"
The following is an example of a pod that could not obtain the needed metrics for scaling:
Example output
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
The following is an example of a pod where the requested autoscaling was less than the required minimums:
Example output
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric http_request
ScalingLimited False DesiredWithinRange the desired replica count is within the acceptable range
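The interpretation rules for the three conditions can be summarized in a small Python helper. This is a sketch under stated assumptions: the function name is hypothetical, and the conditions are assumed to be already loaded as dictionaries whose type, status, and reason keys mirror the columns in the example output.

```python
def summarize_conditions(conditions):
    """Translate HPA status conditions into short findings."""
    by_type = {c["type"]: c for c in conditions}
    findings = []
    able = by_type.get("AbleToScale")
    if able and able["status"] == "False":
        findings.append(f"scaling not allowed: {able['reason']}")
    active = by_type.get("ScalingActive")
    if active and active["status"] == "False":
        findings.append(f"problem fetching metrics: {active['reason']}")
    limited = by_type.get("ScalingLimited")
    if limited and limited["status"] == "True":
        findings.append(f"capped by minimum or maximum replicas: {limited['reason']}")
    return findings or ["no scaling problems reported"]


# Conditions taken from the example where metrics could not be obtained.
example = [
    {"type": "AbleToScale", "status": "True", "reason": "SucceededGetScale"},
    {"type": "ScalingActive", "status": "False", "reason": "FailedGetResourceMetric"},
]
print(summarize_conditions(example))
# prints ['problem fetching metrics: FailedGetResourceMetric']
```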
2.4.8.1. Viewing horizontal pod autoscaler status conditions by using the CLI
You can view the status conditions set on a pod by the horizontal pod autoscaler (HPA).
The horizontal pod autoscaler status conditions are available with the v2 version of the autoscaling API.
Prerequisites
To use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name> command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu and Memory displayed under Usage.
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Example output
Name: openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace: openshift-kube-scheduler
Labels: <none>
Annotations: <none>
API Version: metrics.k8s.io/v1beta1
Containers:
  Name: wait-for-host-port
  Usage:
    Memory: 0
  Name: scheduler
  Usage:
    Cpu: 8m
    Memory: 45440Ki
Kind: PodMetrics
Metadata:
  Creation Timestamp: 2019-05-23T18:47:56Z
  Self Link: /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp: 2019-05-23T18:47:56Z
Window: 1m0s
Events: <none>
Procedure
To view the status conditions on a pod, use the following command with the name of the pod:
$ oc describe hpa <pod-name>
For example:
$ oc describe hpa cm-test
The conditions appear in the Conditions field in the output.
Example output
Name: cm-test
Namespace: prom
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 16 Jun 2017 18:09:22 +0000
Reference: ReplicationController/cm-test
Metrics: ( current / target )
"http_requests" on pods: 66m / 500m
Min replicas: 1
Max replicas: 4
ReplicationController pods: 1 current / 1 desired
Conditions: 1
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric http_request
ScalingLimited False DesiredWithinRange the desired replica count is within the acceptable range
2.4.9. Additional resources
- For more information on replication controllers and deployment controllers, see Understanding deployments and deployment configs.
- For an example on the usage of HPA, see Horizontal Pod Autoscaling of Quarkus Application Based on Memory Utilization.
2.5. Automatically adjust pod resource levels with the vertical pod autoscaler
The OpenShift Container Platform Vertical Pod Autoscaler Operator (VPA) automatically reviews the historic and current CPU and memory resources for containers in pods and can update the resource limits and requests based on the usage values it learns. The VPA uses individual custom resources (CR) to update all of the pods associated with a workload object, such as a Deployment, DeploymentConfig, StatefulSet, Job, DaemonSet, ReplicaSet, or ReplicationController, in a project.
The VPA helps you to understand the optimal CPU and memory usage for your pods and can automatically maintain pod resources through the pod lifecycle.
2.5.1. About the Vertical Pod Autoscaler Operator
The Vertical Pod Autoscaler Operator (VPA) is implemented as an API resource and a custom resource (CR). The CR determines the actions that the VPA Operator should take with the pods associated with a specific workload object, such as a daemon set, replication controller, and so forth, in a project.
The VPA Operator consists of three components, each of which has its own pod in the VPA namespace:
- Recommender
- The VPA recommender monitors the current and past resource consumption and, based on this data, determines the optimal CPU and memory resources for the pods in the associated workload object.
- Updater
- The VPA updater checks if the pods in the associated workload object have the correct resources. If the resources are correct, the updater takes no action. If the resources are not correct, the updater kills the pods so that they can be recreated by their controllers with the updated requests.
- Admission controller
- The VPA admission controller sets the correct resource requests on each new pod in the associated workload object, whether the pod is new or was recreated by its controller due to the VPA updater actions.
You can use the default recommender or use your own alternative recommender to autoscale based on your own algorithms.
The default recommender automatically computes historic and current CPU and memory usage for the containers in those pods and uses this data to determine optimized resource limits and requests to ensure that these pods are operating efficiently at all times. For example, the default recommender suggests reduced resources for pods that are requesting more resources than they are using and increased resources for pods that are not requesting enough.
The VPA then automatically deletes any pods that are out of alignment with these recommendations one at a time, so that your applications can continue to serve requests with no downtime. The workload objects then re-deploy the pods with the original resource limits and requests. The VPA uses a mutating admission webhook to update the pods with optimized resource limits and requests before the pods are admitted to a node. If you do not want the VPA to delete pods, you can view the VPA resource limits and requests and manually update the pods as needed.
By default, workload objects must specify a minimum of two replicas in order for the VPA to automatically delete their pods. Workload objects that specify fewer replicas than this minimum are not deleted. If you manually delete these pods, when the workload object redeploys the pods, the VPA does update the new pods with its recommendations. You can change this minimum by modifying the VerticalPodAutoscalerController object as shown in Changing the VPA minimum value.
For example, if you have a pod that uses 50% of the CPU but only requests 10%, the VPA determines that the pod is consuming more CPU than requested and deletes the pod. The workload object, such as a replica set, restarts the pod and the VPA updates the new pod with its recommended resources.
Developers can use the VPA to help ensure that pods stay up during periods of high demand by scheduling pods onto nodes that have appropriate resources for each pod.
Administrators can use the VPA to better utilize cluster resources, such as preventing pods from reserving more CPU resources than needed. The VPA monitors the resources that workloads are actually using and adjusts the resource requirements so capacity is available to other workloads. The VPA also maintains the ratios between limits and requests that are specified in initial container configuration.
If you stop running the VPA or delete a specific VPA CR in your cluster, the resource requests for the pods already modified by the VPA do not change. Any new pods get the resources defined in the workload object, not the previous recommendations made by the VPA.
2.5.2. Installing the Vertical Pod Autoscaler Operator
You can use the OpenShift Container Platform web console to install the Vertical Pod Autoscaler Operator (VPA).
Procedure
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose VerticalPodAutoscaler from the list of available Operators, and click Install.
- On the Install Operator page, ensure that the Operator recommended namespace option is selected. This installs the Operator in the mandatory openshift-vertical-pod-autoscaler namespace, which is automatically created if it does not exist.
- Click Install.
Verification
Verify the installation by listing the VPA Operator components:
- Navigate to Workloads → Pods.
- Select the openshift-vertical-pod-autoscaler project from the drop-down menu and verify that there are four pods running.
- Navigate to Workloads → Deployments to verify that there are four deployments running.
Optional: Verify the installation in the OpenShift Container Platform CLI using the following command:
$ oc get all -n openshift-vertical-pod-autoscaler
The output shows four pods and four deployments:
Example output
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/vertical-pod-autoscaler-operator-85b4569c47-2gmhc   1/1     Running   0          3m13s
pod/vpa-admission-plugin-default-67644fc87f-xq7k9       1/1     Running   0          2m56s
pod/vpa-recommender-default-7c54764b59-8gckt            1/1     Running   0          2m56s
pod/vpa-updater-default-7f6cc87858-47vw9                1/1     Running   0          2m56s

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/vpa-webhook   ClusterIP   172.30.53.206   <none>        443/TCP   2m56s

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vertical-pod-autoscaler-operator   1/1     1            1           3m13s
deployment.apps/vpa-admission-plugin-default       1/1     1            1           2m56s
deployment.apps/vpa-recommender-default            1/1     1            1           2m56s
deployment.apps/vpa-updater-default                1/1     1            1           2m56s

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/vertical-pod-autoscaler-operator-85b4569c47   1         1         1       3m13s
replicaset.apps/vpa-admission-plugin-default-67644fc87f       1         1         1       2m56s
replicaset.apps/vpa-recommender-default-7c54764b59            1         1         1       2m56s
replicaset.apps/vpa-updater-default-7f6cc87858                1         1         1       2m56s
2.5.3. Moving the Vertical Pod Autoscaler Operator components
The Vertical Pod Autoscaler Operator (VPA) and each component have their own pods in the VPA namespace on the control plane nodes. You can move the VPA Operator and component pods to infrastructure or worker nodes by adding a node selector to the VPA subscription and the VerticalPodAutoscalerController CR.
You can create and use infrastructure nodes to host only infrastructure components, such as the default router, the integrated container image registry, and the components for cluster metrics and monitoring. These infrastructure nodes are not counted toward the total number of subscriptions that are required to run the environment. For more information, see Creating infrastructure machine sets.
You can move the components to the same node or separate nodes as appropriate for your organization.
The following example shows the default deployment of the VPA pods to the control plane nodes.
Example output
NAME                                                READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
vertical-pod-autoscaler-operator-6c75fcc9cd-5pb6z   1/1     Running   0          7m59s   10.128.2.24   c416-tfsbj-master-1   <none>           <none>
vpa-admission-plugin-default-6cb78d6f8b-rpcrj       1/1     Running   0          5m37s   10.129.2.22   c416-tfsbj-master-1   <none>           <none>
vpa-recommender-default-66846bd94c-dsmpp            1/1     Running   0          5m37s   10.129.2.20   c416-tfsbj-master-0   <none>           <none>
vpa-updater-default-db8b58df-2nkvf                  1/1     Running   0          5m37s   10.129.2.21   c416-tfsbj-master-1   <none>           <none>
Procedure
Move the VPA Operator pod by adding a node selector to the Subscription custom resource (CR) for the VPA Operator:
Edit the CR:
$ oc edit Subscription vertical-pod-autoscaler -n openshift-vertical-pod-autoscaler
Add a node selector to match the node role label on the node where you want to install the VPA Operator pod:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/vertical-pod-autoscaler.openshift-vertical-pod-autoscaler: ""
  name: vertical-pod-autoscaler
# ...
spec:
  config:
    nodeSelector:
      node-role.kubernetes.io/<node_role>: "" 1
Note
If the infra node uses taints, you need to add a toleration to the Subscription CR.
For example:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/vertical-pod-autoscaler.openshift-vertical-pod-autoscaler: ""
  name: vertical-pod-autoscaler
# ...
spec:
  config:
    nodeSelector:
      node-role.kubernetes.io/infra: ""
    tolerations: 1
    - key: "node-role.kubernetes.io/infra"
      operator: "Exists"
      effect: "NoSchedule"
- 1
- Specifies a toleration for a taint on the node where you want to move the VPA Operator pod.
Move each VPA component by adding node selectors to the VerticalPodAutoscalerController custom resource (CR):
Edit the CR:
$ oc edit VerticalPodAutoscalerController default -n openshift-vertical-pod-autoscaler
Add node selectors to match the node role label on the node where you want to install the VPA components:
apiVersion: autoscaling.openshift.io/v1
kind: VerticalPodAutoscalerController
metadata:
  name: default
  namespace: openshift-vertical-pod-autoscaler
# ...
spec:
  deploymentOverrides:
    admission:
      container:
        resources: {}
      nodeSelector:
        node-role.kubernetes.io/<node_role>: "" 1
    recommender:
      container:
        resources: {}
      nodeSelector:
        node-role.kubernetes.io/<node_role>: "" 2
    updater:
      container:
        resources: {}
      nodeSelector:
        node-role.kubernetes.io/<node_role>: "" 3
Note
If a target node uses taints, you need to add a toleration to the VerticalPodAutoscalerController CR.
For example:
apiVersion: autoscaling.openshift.io/v1
kind: VerticalPodAutoscalerController
metadata:
  name: default
  namespace: openshift-vertical-pod-autoscaler
# ...
spec:
  deploymentOverrides:
    admission:
      container:
        resources: {}
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      tolerations: 1
      - key: "my-example-node-taint-key"
        operator: "Exists"
        effect: "NoSchedule"
    recommender:
      container:
        resources: {}
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      tolerations: 2
      - key: "my-example-node-taint-key"
        operator: "Exists"
        effect: "NoSchedule"
    updater:
      container:
        resources: {}
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      tolerations: 3
      - key: "my-example-node-taint-key"
        operator: "Exists"
        effect: "NoSchedule"
- 1
- Specifies a toleration for the admission controller pod for a taint on the node where you want to install the pod.
- 2
- Specifies a toleration for the recommender pod for a taint on the node where you want to install the pod.
- 3
- Specifies a toleration for the updater pod for a taint on the node where you want to install the pod.
Verification
You can verify the pods have moved by using the following command:
$ oc get pods -n openshift-vertical-pod-autoscaler -o wide
The pods are no longer deployed to the control plane nodes.
Example output
NAME                                                READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
vertical-pod-autoscaler-operator-6c75fcc9cd-5pb6z   1/1     Running   0          7m59s   10.128.2.24   c416-tfsbj-infra-eastus3-2bndt   <none>           <none>
vpa-admission-plugin-default-6cb78d6f8b-rpcrj       1/1     Running   0          5m37s   10.129.2.22   c416-tfsbj-infra-eastus1-lrgj8   <none>           <none>
vpa-recommender-default-66846bd94c-dsmpp            1/1     Running   0          5m37s   10.129.2.20   c416-tfsbj-infra-eastus1-lrgj8   <none>           <none>
vpa-updater-default-db8b58df-2nkvf                  1/1     Running   0          5m37s   10.129.2.21   c416-tfsbj-infra-eastus1-lrgj8   <none>           <none>
2.5.4. About Using the Vertical Pod Autoscaler Operator
To use the Vertical Pod Autoscaler Operator (VPA), you create a VPA custom resource (CR) for a workload object in your cluster. The VPA learns and applies the optimal CPU and memory resources for the pods associated with that workload object. You can use a VPA with a deployment, stateful set, job, daemon set, replica set, or replication controller workload object. The VPA CR must be in the same project as the pods you want to monitor.
You use the VPA CR to associate a workload object and specify which mode the VPA operates in:
- The Auto and Recreate modes automatically apply the VPA CPU and memory recommendations throughout the pod lifetime. The VPA deletes any pods in the project that are out of alignment with its recommendations. When the workload object redeploys the pods, the VPA updates the new pods with its recommendations.
- The Initial mode automatically applies VPA recommendations only at pod creation.
- The Off mode only provides recommended resource limits and requests, allowing you to manually apply the recommendations. The Off mode does not update pods.
You can also use the CR to opt certain containers out of VPA evaluation and updates.
For example, a pod has the following limits and requests:
resources:
  limits:
    cpu: 1
    memory: 500Mi
  requests:
    cpu: 500m
    memory: 100Mi
After you create a VPA that is set to Auto, the VPA learns the resource usage and deletes the pod. When redeployed, the pod uses the new resource limits and requests:
resources:
  limits:
    cpu: 50m
    memory: 1250Mi
  requests:
    cpu: 25m
    memory: 262144k
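As noted earlier in this section, the VPA maintains the ratio between limits and requests from the initial container configuration. The following Python sketch is only an illustration of that arithmetic, not the VPA implementation: the function names `parse_quantity` and `scaled_limit` are hypothetical, and the parser handles only the quantity suffixes that appear in this example. It checks that the new limits above follow from the new requests and the original ratios.

```python
from fractions import Fraction

# Multipliers for the quantity suffixes used in this example only (assumption:
# "m" is milli, "k" is 1000, and "Ki", "Mi", "Gi" are powers of 1024).
SUFFIXES = {"m": Fraction(1, 1000), "k": 1000, "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(text):
    """Convert a quantity string such as '500m' or '100Mi' to an exact number."""
    text = str(text)
    # Check two-letter suffixes before one-letter suffixes.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Fraction(text)


def scaled_limit(original_request, original_limit, new_request):
    """Scale the limit so that the original limit-to-request ratio is kept."""
    ratio = parse_quantity(original_limit) / parse_quantity(original_request)
    return parse_quantity(new_request) * ratio


# Original values and new requests are taken from the example above.
cpu_limit = scaled_limit("500m", "1", "25m")
memory_limit = scaled_limit("100Mi", "500Mi", "262144k")

print(cpu_limit * 1000)        # CPU limit in millicores: 50
print(memory_limit / 1024**2)  # memory limit in Mi: 1250
```

The CPU ratio is 2 (1 core to 500m) and the memory ratio is 5 (500Mi to 100Mi), which is why a 25m CPU request yields a 50m limit and a 262144k memory request yields a 1250Mi limit.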
You can view the VPA recommendations using the following command:
$ oc get vpa <vpa-name> --output yaml
After a few minutes, the output shows the recommendations for CPU and memory requests, similar to the following:
Example output
...
status:
...
  recommendation:
    containerRecommendations:
    - containerName: frontend
      lowerBound:
        cpu: 25m
        memory: 262144k
      target:
        cpu: 25m
        memory: 262144k
      uncappedTarget:
        cpu: 25m
        memory: 262144k
      upperBound:
        cpu: 262m
        memory: "274357142"
    - containerName: backend
      lowerBound:
        cpu: 12m
        memory: 131072k
      target:
        cpu: 12m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: 476m
        memory: "498558823"
...
The output shows the recommended resources (target), the minimum recommended resources (lowerBound), the highest recommended resources (upperBound), and the most recent resource recommendations (uncappedTarget).
The VPA uses the lowerBound and upperBound values to determine if a pod needs to be updated. If a pod has resource requests below the lowerBound values or above the upperBound values, the VPA terminates and recreates the pod with the target values.
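The bound check described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the VPA updater logic: the function name `needs_update` is hypothetical, only one resource (CPU, in millicores) is compared, and the sample requests of 10m, 100m, and 500m are made up for the demonstration. The bounds come from the frontend container in the example output.

```python
def needs_update(request, lower_bound, upper_bound):
    """Return True when the request falls outside [lower_bound, upper_bound]."""
    return request < lower_bound or request > upper_bound


# CPU values in millicores for the frontend container in the example output.
LOWER, TARGET, UPPER = 25, 25, 262

for current in (10, 100, 500):  # hypothetical current CPU requests
    if needs_update(current, LOWER, UPPER):
        action = f"recreate with target {TARGET}m"
    else:
        action = "leave as is"
    print(f"request {current}m: {action}")
```

A request of 100m sits inside the range and is left alone, while 10m and 500m fall outside it and would lead to the pod being recreated with the target value.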
2.5.4.1. Changing the VPA minimum value
By default, workload objects must specify a minimum of two replicas in order for the VPA to automatically delete and update their pods. As a result, workload objects that specify fewer than two replicas are not automatically acted upon by the VPA. The VPA does update new pods from these workload objects if the pods are restarted by some process external to the VPA. You can change this cluster-wide minimum value by modifying the minReplicas parameter in the VerticalPodAutoscalerController custom resource (CR).
For example, if you set minReplicas to 3, the VPA does not delete and update pods for workload objects that specify fewer than three replicas.
If you set minReplicas to 1, the VPA can delete the only pod for a workload object that specifies only one replica. You should use this setting with one-replica objects only if your workload can tolerate downtime whenever the VPA deletes a pod to adjust its resources. To avoid unwanted downtime with one-replica objects, configure the VPA CRs with updateMode set to Initial, which automatically updates the pod only when it is restarted by some process external to the VPA, or Off, which allows you to update the pod manually at an appropriate time for your application.
Example VerticalPodAutoscalerController object
apiVersion: autoscaling.openshift.io/v1
kind: VerticalPodAutoscalerController
metadata:
  creationTimestamp: "2021-04-21T19:29:49Z"
  generation: 2
  name: default
  namespace: openshift-vertical-pod-autoscaler
  resourceVersion: "142172"
  uid: 180e17e9-03cc-427f-9955-3b4d7aeb2d59
spec:
  minReplicas: 3 1
  podMinCPUMillicores: 25
  podMinMemoryMb: 250
  recommendationOnly: false
  safetyMarginFraction: 0.15
- 1
- Specify the minimum number of replicas in a workload object for the VPA to act on. Any objects with replicas fewer than the minimum are not automatically deleted by the VPA.
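The effect of minReplicas can be pictured as a simple threshold. The following Python sketch is an illustration only; the function name `vpa_may_evict` is hypothetical and the real updater also considers other factors, such as the bounds of the current recommendation.

```python
def vpa_may_evict(replicas, min_replicas=2):
    """Return True when the workload has enough replicas for automatic eviction.

    The default of 2 mirrors the default minimum described in this section.
    """
    return replicas >= min_replicas


# With minReplicas set to 3, as in the example object above:
for replicas in (1, 2, 3, 4):
    print(replicas, vpa_may_evict(replicas, min_replicas=3))
```

With the threshold at 3, workloads with one or two replicas are skipped and workloads with three or more are eligible for automatic deletion and update.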
2.5.4.2. Automatically applying VPA recommendations
To use the VPA to automatically update pods, create a VPA CR for a specific workload object with updateMode set to Auto or Recreate.
When the pods are created for the workload object, the VPA constantly monitors the containers to analyze their CPU and memory needs. The VPA deletes any pods that do not meet the VPA recommendations for CPU and memory. When redeployed, the pods use the new resource limits and requests based on the VPA recommendations, honoring any pod disruption budget set for your applications. The recommendations are added to the status field of the VPA CR for reference.
By default, workload objects must specify a minimum of two replicas in order for the VPA to automatically delete their pods. Workload objects that specify fewer replicas than this minimum are not deleted. If you manually delete these pods, when the workload object redeploys the pods, the VPA does update the new pods with its recommendations. You can change this minimum by modifying the VerticalPodAutoscalerController object as shown in Changing the VPA minimum value.
Example VPA CR for the Auto mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment 1
    name: frontend 2
  updatePolicy:
    updateMode: "Auto" 3
- 1
- The type of workload object you want this VPA CR to manage.
- 2
- The name of the workload object you want this VPA CR to manage.
- 3
- Set the mode to Auto or Recreate:
  - Auto. The VPA assigns resource requests on pod creation and updates the existing pods by terminating them when the requested resources differ significantly from the new recommendation.
  - Recreate. The VPA assigns resource requests on pod creation and updates the existing pods by terminating them when the requested resources differ significantly from the new recommendation. This mode should be used rarely, only if you need to ensure that the pods are restarted whenever the resource request changes.
Before a VPA can determine recommendations for resources and apply the recommended resources to new pods, operating pods must exist and be running in the project.
If a workload’s resource usage, such as CPU and memory, is consistent, the VPA can determine recommendations for resources in a few minutes. If a workload’s resource usage is inconsistent, the VPA must collect metrics at various resource usage intervals for the VPA to make an accurate recommendation.
2.5.4.3. Automatically applying VPA recommendations on pod creation
To use the VPA to apply the recommended resources only when a pod is first deployed, create a VPA CR for a specific workload object with updateMode set to Initial.
Then, manually delete any pods associated with the workload object that you want to use the VPA recommendations. In the Initial mode, the VPA does not delete pods and does not update the pods as it learns new resource recommendations.
Example VPA CR for the Initial mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment 1
    name: frontend 2
  updatePolicy:
    updateMode: "Initial" 3
Before a VPA can determine recommended resources and apply the recommendations to new pods, operating pods must exist and be running in the project.
To obtain the most accurate recommendations from the VPA, wait at least 8 days for the pods to run and for the VPA to stabilize.
2.5.4.4. Manually applying VPA recommendations
To use the VPA to only determine the recommended CPU and memory values, create a VPA CR for a specific workload object with updateMode set to Off.
When the pods are created for that workload object, the VPA analyzes the CPU and memory needs of the containers and records those recommendations in the status field of the VPA CR. The VPA does not update the pods as it determines new resource recommendations.
Example VPA CR for the Off mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment 1
    name: frontend 2
  updatePolicy:
    updateMode: "Off" 3
You can view the recommendations using the following command.
$ oc get vpa <vpa-name> --output yaml
With the recommendations, you can edit the workload object to add CPU and memory requests, then delete and redeploy the pods using the recommended resources.
Before a VPA can determine recommended resources and apply the recommendations to new pods, operating pods must exist and be running in the project.
To obtain the most accurate recommendations from the VPA, wait at least 8 days for the pods to run and for the VPA to stabilize.
2.5.4.5. Exempting containers from applying VPA recommendations
If your workload object has multiple containers and you do not want the VPA to evaluate and act on all of the containers, create a VPA CR for a specific workload object and add a resourcePolicy to opt specific containers out.
When the VPA updates the pods with recommended resources, any containers with a resourcePolicy are not updated and the VPA does not present recommendations for those containers in the pod.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment 1
    name: frontend 2
  updatePolicy:
    updateMode: "Auto" 3
  resourcePolicy: 4
    containerPolicies:
    - containerName: my-opt-sidecar
      mode: "Off"
- 1
- The type of workload object you want this VPA CR to manage.
- 2
- The name of the workload object you want this VPA CR to manage.
- 3
- Set the mode to Auto, Recreate, or Off. The Recreate mode should be used rarely, only if you need to ensure that the pods are restarted whenever the resource request changes.
- 4
- Specify the containers you want to opt out and set mode to Off.
For example, a pod has two containers with the same resource requests and limits:
# ...
spec:
  containers:
  - name: frontend
    resources:
      limits:
        cpu: 1
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 100Mi
  - name: backend
    resources:
      limits:
        cpu: "1"
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 100Mi
# ...
After launching a VPA CR with the backend container set to opt-out, the VPA terminates and recreates the pod with the recommended resources applied only to the frontend container:
...
spec:
  containers:
  - name: frontend
    resources:
      limits:
        cpu: 50m
        memory: 1250Mi
      requests:
        cpu: 25m
        memory: 262144k
...
  - name: backend
    resources:
      limits:
        cpu: "1"
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 100Mi
...
2.5.4.6. Performance tuning the VPA Operator
As a cluster administrator, you can tune the performance of your Vertical Pod Autoscaler Operator (VPA) to limit the rate at which the VPA makes requests of the Kubernetes API server and to specify the CPU and memory resources for the VPA recommender, updater, and admission controller component pods.
Additionally, you can configure the VPA Operator to monitor only those workloads that are being managed by a VPA custom resource (CR). By default, the VPA Operator monitors every workload in the cluster. This allows the VPA Operator to accrue and store 8 days of historical data for all workloads, which the Operator can use if a new VPA CR is created for a workload. However, this causes the VPA Operator to use significant CPU and memory, which could cause the Operator to fail, particularly on larger clusters. By configuring the VPA Operator to monitor only workloads with a VPA CR, you can save on CPU and memory resources. One trade-off is that if you have a workload that has been running, and you create a VPA CR to manage that workload, the VPA Operator does not have any historical data for that workload. As a result, the initial recommendations are not as useful as those after the workload had been running for some time.
These tunings allow you to ensure the VPA has sufficient resources to operate at peak efficiency and to prevent throttling and a possible delay in pod admissions.
You can perform the following tunings on the VPA components by editing the VerticalPodAutoscalerController custom resource (CR):
- To prevent throttling and pod admission delays, set the queries-per-second (QPS) and burst rates for VPA requests of the Kubernetes API server by using the kube-api-qps and kube-api-burst parameters.
- To ensure sufficient CPU and memory, set the CPU and memory requests for VPA component pods by using the standard cpu and memory resource requests.
- To configure the VPA Operator to monitor only workloads that are being managed by a VPA CR, set the memory-saver parameter to true for the recommender component.
The following example VPA controller CR sets the VPA API QPS and burst rates, configures the component pod resource requests, and sets memory-saver to true for the recommender:
Example VerticalPodAutoscalerController CR
apiVersion: autoscaling.openshift.io/v1
kind: VerticalPodAutoscalerController
metadata:
  name: default
  namespace: openshift-vertical-pod-autoscaler
spec:
  deploymentOverrides:
    admission: 1
      container:
        args: 2
          - '--kube-api-qps=30.0'
          - '--kube-api-burst=40.0'
        resources:
          requests: 3
            cpu: 40m
            memory: 40Mi
    recommender: 4
      container:
        args:
          - '--kube-api-qps=20.0'
          - '--kube-api-burst=60.0'
          - '--memory-saver=true' 5
        resources:
          requests:
            cpu: 60m
            memory: 60Mi
    updater: 6
      container:
        args:
          - '--kube-api-qps=20.0'
          - '--kube-api-burst=80.0'
        resources:
          requests:
            cpu: 80m
            memory: 80Mi
  minReplicas: 2
  podMinCPUMillicores: 25
  podMinMemoryMb: 250
  recommendationOnly: false
  safetyMarginFraction: 0.15
- 1
- Specifies the tuning parameters for the VPA admission controller.
- 2
- Specifies the API QPS and burst rates for the VPA admission controller.
  - kube-api-qps: Specifies the queries per second (QPS) limit when making requests to the Kubernetes API server. The default is 5.0.
  - kube-api-burst: Specifies the burst limit when making requests to the Kubernetes API server. The default is 10.0.
- 3
- Specifies the CPU and memory requests for the VPA admission controller pod.
- 4
- Specifies the tuning parameters for the VPA recommender.
- 5
- Specifies that the VPA Operator monitors only workloads with a VPA CR. The default is false.
- 6
- Specifies the tuning parameters for the VPA updater.
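To build intuition for how a QPS limit and a burst limit interact, the following Python sketch models a token bucket, which is a common way to implement client-side rate limiting. It is an assumption for illustration and does not quote the VPA source: the `TokenBucket` class and the `allowed_in_spike` helper are hypothetical names, and time is passed in explicitly so that the example is deterministic.

```python
class TokenBucket:
    """Client-side limiter: refills `qps` tokens per second, holds at most `burst`."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the last refill, in seconds

    def allow(self, now):
        """Return True if a request at time `now` may proceed immediately."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def allowed_in_spike(qps, burst, requests=100):
    """Count how many of `requests` simultaneous calls at t=0 are not throttled."""
    bucket = TokenBucket(qps, burst)
    return sum(bucket.allow(0.0) for _ in range(requests))


print(allowed_in_spike(5.0, 10.0))   # default values listed above: 10
print(allowed_in_spike(20.0, 80.0))  # updater values from the example CR: 80
```

In this model, the burst value caps how many requests can go out at once, and the QPS value sets how quickly capacity returns afterward. Raising both reduces the chance that a VPA component is throttled during a spike of API requests.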
You can verify that the settings were applied to each VPA component pod.
Example updater pod
apiVersion: v1
kind: Pod
metadata:
  name: vpa-updater-default-d65ffb9dc-hgw44
  namespace: openshift-vertical-pod-autoscaler
# ...
spec:
  containers:
  - args:
    - --logtostderr
    - --v=1
    - --min-replicas=2
    - --kube-api-qps=20.0
    - --kube-api-burst=80.0
# ...
    resources:
      requests:
        cpu: 80m
        memory: 80Mi
# ...
Example admission controller pod
apiVersion: v1
kind: Pod
metadata:
  name: vpa-admission-plugin-default-756999448c-l7tsd
  namespace: openshift-vertical-pod-autoscaler
# ...
spec:
  containers:
  - args:
    - --logtostderr
    - --v=1
    - --tls-cert-file=/data/tls-certs/tls.crt
    - --tls-private-key=/data/tls-certs/tls.key
    - --client-ca-file=/data/tls-ca-certs/service-ca.crt
    - --webhook-timeout-seconds=10
    - --kube-api-qps=30.0
    - --kube-api-burst=40.0
# ...
    resources:
      requests:
        cpu: 40m
        memory: 40Mi
# ...
Example recommender pod
apiVersion: v1
kind: Pod
metadata:
  name: vpa-recommender-default-74c979dbbc-znrd2
  namespace: openshift-vertical-pod-autoscaler
# ...
spec:
  containers:
  - args:
    - --logtostderr
    - --v=1
    - --recommendation-margin-fraction=0.15
    - --pod-recommendation-min-cpu-millicores=25
    - --pod-recommendation-min-memory-mb=250
    - --kube-api-qps=20.0
    - --kube-api-burst=60.0
    - --memory-saver=true
# ...
    resources:
      requests:
        cpu: 60m
        memory: 60Mi
# ...
2.5.4.7. Using an alternative recommender
You can use your own recommender to autoscale based on your own algorithms. If you do not specify an alternative recommender, OpenShift Container Platform uses the default recommender, which suggests CPU and memory requests based on historical usage. Because there is no universal recommendation policy that applies to all types of workloads, you might want to create and deploy different recommenders for specific workloads.
For example, the default recommender might not accurately predict future resource usage when containers exhibit certain resource behaviors, such as cyclical patterns that alternate between usage spikes and idling as used by monitoring applications, or recurring and repeating patterns used with deep learning applications. Using the default recommender with these usage behaviors might result in significant over-provisioning and Out of Memory (OOM) kills for your applications.
Instructions for how to create a recommender are beyond the scope of this documentation.
Procedure
To use an alternative recommender for your pods:
Create a service account for the alternative recommender and bind that service account to the required cluster role:
apiVersion: v1 1
kind: ServiceAccount
metadata:
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
---
apiVersion: rbac.authorization.k8s.io/v1 2
kind: ClusterRoleBinding
metadata:
  name: system:example-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-reader
subjects:
- kind: ServiceAccount
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
---
apiVersion: rbac.authorization.k8s.io/v1 3
kind: ClusterRoleBinding
metadata:
  name: system:example-vpa-actor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:vpa-actor
subjects:
- kind: ServiceAccount
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
---
apiVersion: rbac.authorization.k8s.io/v1 4
kind: ClusterRoleBinding
metadata:
  name: system:example-vpa-target-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:vpa-target-reader
subjects:
- kind: ServiceAccount
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
- 1
- Creates a service account for the recommender in the namespace where the recommender is deployed.
- 2
- Binds the recommender service account to the metrics-reader role. Specify the namespace where the recommender is to be deployed.
- 3
- Binds the recommender service account to the vpa-actor role. Specify the namespace where the recommender is to be deployed.
- 4
- Binds the recommender service account to the vpa-target-reader role. Specify the namespace where the recommender is to be deployed.
To add the alternative recommender to the cluster, create a Deployment object similar to the following:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alt-vpa-recommender
  namespace: <namespace_name>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alt-vpa-recommender
  template:
    metadata:
      labels:
        app: alt-vpa-recommender
    spec:
      containers: # 1
      - name: recommender
        image: quay.io/example/alt-recommender:latest # 2
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 200m
            memory: 1000Mi
          requests:
            cpu: 50m
            memory: 500Mi
        ports:
        - name: prometheus
          containerPort: 8942
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          seccompProfile:
            type: RuntimeDefault
      serviceAccountName: alt-vpa-recommender-sa # 3
      securityContext:
        runAsNonRoot: true
A new pod is created for the alternative recommender in the same namespace.
$ oc get pods
Example output
NAME                                   READY   STATUS    RESTARTS   AGE
frontend-845d5478d-558zf               1/1     Running   0          4m25s
frontend-845d5478d-7z9gx               1/1     Running   0          4m25s
frontend-845d5478d-b7l4j               1/1     Running   0          4m25s
alt-vpa-recommender-55878867f9-6tp5v   1/1     Running   0          9s
Configure a VPA CR that includes the name of the alternative recommender Deployment object.
Example VPA CR to include the alternative recommender
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
  namespace: <namespace_name>
spec:
  recommenders:
  - name: alt-vpa-recommender # 1
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment # 2
    name: frontend
2.5.5. Using the Vertical Pod Autoscaler Operator
You can use the Vertical Pod Autoscaler Operator (VPA) by creating a VPA custom resource (CR). The CR indicates which pods it should analyze and determines the actions the VPA should take with those pods.
Prerequisites
- The workload object that you want to autoscale must exist.
- If you want to use an alternative recommender, a deployment including that recommender must exist.
Procedure
To create a VPA CR for a specific workload object:
Change to the project where the workload object you want to scale is located.
Create a VPA CR YAML file:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment # 1
    name: frontend # 2
  updatePolicy:
    updateMode: "Auto" # 3
  resourcePolicy: # 4
    containerPolicies:
    - containerName: my-opt-sidecar
      mode: "Off"
  recommenders: # 5
  - name: my-recommender
- 1
- Specify the type of workload object you want this VPA to manage: Deployment, StatefulSet, Job, DaemonSet, ReplicaSet, or ReplicationController.
- 2
- Specify the name of an existing workload object you want this VPA to manage.
- 3
- Specify the VPA mode:
  - Auto to automatically apply the recommended resources on pods associated with the workload object. The VPA terminates existing pods and creates new pods with the recommended resource limits and requests.
  - Recreate to automatically apply the recommended resources on pods associated with the workload object. The VPA terminates existing pods and creates new pods with the recommended resource limits and requests. The Recreate mode should be used rarely, only if you need to ensure that the pods are restarted whenever the resource request changes.
  - Initial to automatically apply the recommended resources when pods associated with the workload object are created. The VPA does not update the pods as it learns new resource recommendations.
  - Off to only generate resource recommendations for the pods associated with the workload object. The VPA does not update the pods as it learns new resource recommendations and does not apply the recommendations to new pods.
- 4
- Optional. Specify the containers you want to opt out and set the mode to Off.
- 5
- Optional. Specify an alternative recommender.
Create the VPA CR:
$ oc create -f <file-name>.yaml
After a few moments, the VPA learns the resource usage of the containers in the pods associated with the workload object.
You can view the VPA recommendations using the following command:
$ oc get vpa <vpa-name> --output yaml
The output shows the recommendations for CPU and memory requests, similar to the following:
Example output
...
status:
...
  recommendation:
    containerRecommendations:
    - containerName: frontend
      lowerBound: # 1
        cpu: 25m
        memory: 262144k
      target: # 2
        cpu: 25m
        memory: 262144k
      uncappedTarget: # 3
        cpu: 25m
        memory: 262144k
      upperBound: # 4
        cpu: 262m
        memory: "274357142"
    - containerName: backend
      lowerBound:
        cpu: 12m
        memory: 131072k
      target:
        cpu: 12m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: 476m
        memory: "498558823"
...
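The quantities in this output mix plain byte counts, decimal suffixes such as k, and milli-units such as m, which makes them hard to compare by eye. The following Python sketch is not part of the product. It converts a small subset of Kubernetes quantity suffixes to numbers and checks that a target falls between the lower and upper bounds. The helper names parse_quantity and within_bounds are hypothetical, and only the suffixes listed in the table are handled.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (assumption: enough for this output).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '25m', '262144k', or '274357142' to a number."""
    # Try two-character suffixes (Ki, Mi, Gi) before one-character ones.
    for length in (2, 1):
        suffix = text[-length:]
        if suffix in _SUFFIXES and len(text) > length:
            return Decimal(text[:-length]) * _SUFFIXES[suffix]
    return Decimal(text)


def within_bounds(rec: dict, resource: str) -> bool:
    """Check lowerBound <= target <= upperBound for one container recommendation."""
    low = parse_quantity(rec["lowerBound"][resource])
    target = parse_quantity(rec["target"][resource])
    high = parse_quantity(rec["upperBound"][resource])
    return low <= target <= high
```

Passing the frontend entry from the example output as a dictionary lets you confirm that the memory target sits inside the recommended range.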
2.5.6. Uninstalling the Vertical Pod Autoscaler Operator
You can remove the Vertical Pod Autoscaler Operator (VPA) from your OpenShift Container Platform cluster. After uninstalling, the resource requests for the pods already modified by an existing VPA CR do not change. Any new pods get the resources defined in the workload object, not the previous recommendations made by the Vertical Pod Autoscaler Operator.
You can remove a specific VPA CR by using the oc delete vpa <vpa-name> command. Removing a VPA CR has the same effect on resource requests as uninstalling the Vertical Pod Autoscaler Operator.
After removing the VPA Operator, it is recommended that you remove the other components associated with the Operator to avoid potential issues.
Prerequisites
- The Vertical Pod Autoscaler Operator must be installed.
Procedure
- In the OpenShift Container Platform web console, click Operators → Installed Operators.
- Switch to the openshift-vertical-pod-autoscaler project.
- For the VerticalPodAutoscaler Operator, click the Options menu and select Uninstall Operator.
- Optional: To remove all operands associated with the Operator, in the dialog box, select the Delete all operand instances for this operator checkbox.
- Click Uninstall.
Optional: Use the OpenShift CLI to remove the VPA components:
Delete the VPA namespace:
$ oc delete namespace openshift-vertical-pod-autoscaler
Delete the VPA custom resource definition (CRD) objects:
$ oc delete crd verticalpodautoscalercheckpoints.autoscaling.k8s.io
$ oc delete crd verticalpodautoscalercontrollers.autoscaling.openshift.io
$ oc delete crd verticalpodautoscalers.autoscaling.k8s.io
Deleting the CRDs removes the associated roles, cluster roles, and role bindings.
Note: This action removes from the cluster all user-created VPA CRs. If you re-install the VPA, you must create these objects again.
Delete the MutatingWebhookConfiguration object by running the following command:
$ oc delete MutatingWebhookConfiguration vpa-webhook-config
Delete the VPA Operator:
$ oc delete operator/vertical-pod-autoscaler.openshift-vertical-pod-autoscaler
2.6. Providing sensitive data to pods by using secrets
Some applications need sensitive information, such as passwords and user names, that you do not want developers to have.
As an administrator, you can use Secret objects to provide this information without exposing that information in clear text.
2.6.1. Understanding secrets
The Secret object type provides a mechanism to hold sensitive information such as passwords, OpenShift Container Platform client configuration files, private source repository credentials, and so on. Secrets decouple sensitive content from the pods. You can mount secrets into containers using a volume plugin or the system can use secrets to perform actions on behalf of a pod.
Key properties include:
- Secret data can be referenced independently from its definition.
- Secret data volumes are backed by temporary file-storage facilities (tmpfs) and never come to rest on a node.
- Secret data can be shared within a namespace.
YAML Secret object definition
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
  namespace: my-namespace
type: Opaque # 1
data: # 2
  username: <username> # 3
  password: <password>
stringData: # 4
  hostname: myapp.mydomain.com # 5
- 1
- Indicates the structure of the secret’s key names and values.
- 2
- The allowable format for the keys in the data field must meet the guidelines in the DNS_SUBDOMAIN value in the Kubernetes identifiers glossary.
- 3
- The value associated with keys in the data map must be base64 encoded.
- 4
- Entries in the stringData map are converted to base64 and the entry will then be moved to the data map automatically. This field is write-only; the value will only be returned via the data field.
- 5
- The value associated with keys in the stringData map is made up of plain text strings.
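The relationship between the data and stringData maps can be illustrated with a short Python sketch. It mimics, outside the cluster, how plain-text stringData entries end up base64 encoded in data. The function name merge_string_data is hypothetical, and the sketch assumes that a stringData entry takes precedence when the same key appears in both maps.

```python
import base64


def merge_string_data(data: dict, string_data: dict) -> dict:
    """Return the data map as it would look after stringData is folded in."""
    merged = dict(data)  # values here are already base64 encoded
    for key, value in string_data.items():
        # Plain text is encoded as UTF-8 bytes, then base64 encoded.
        merged[key] = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return merged


if __name__ == "__main__":
    result = merge_string_data(
        {"username": "YWRtaW4="},              # base64 of "admin"
        {"hostname": "myapp.mydomain.com"},    # plain text
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```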
You must create a secret before creating the pods that depend on that secret.
When creating secrets:
- Create a secret object with secret data.
- Update the pod’s service account to allow the reference to the secret.
- Create a pod, which consumes the secret as an environment variable or as a file (using a secret volume).
2.6.1.1. Types of secrets
The value in the type field indicates the structure of the secret’s key names and values. The type can be used to enforce the presence of user names and keys in the secret object. If you do not want validation, use the Opaque type, which is the default.
Specify one of the following types to trigger minimal server-side validation to ensure the presence of specific key names in the secret data:
- kubernetes.io/basic-auth: Use with Basic authentication
- kubernetes.io/dockercfg: Use as an image pull secret
- kubernetes.io/dockerconfigjson: Use as an image pull secret
- kubernetes.io/service-account-token: Use to obtain a legacy service account API token
- kubernetes.io/ssh-auth: Use with SSH key authentication
- kubernetes.io/tls: Use with TLS certificate authorities
Specify type: Opaque if you do not want validation, which means the secret does not claim to conform to any convention for key names or values. An opaque secret allows for unstructured key:value pairs that can contain arbitrary values.
You can specify other arbitrary types, such as example.com/my-secret-type. These types are not enforced server-side, but indicate that the creator of the secret intended to conform to the key/value requirements of that type.
For examples of creating different types of secrets, see Understanding how to create secrets.
2.6.1.2. Secret data keys
Secret keys must follow the DNS subdomain naming guidelines.
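A key can be checked before the secret is submitted. The following Python sketch is an approximation under stated assumptions: it accepts keys of at most 253 characters made up of alphanumeric characters, hyphens, underscores, and periods, and it rejects "." and anything that starts with "..". The function name is_valid_secret_key is hypothetical, and the API server remains the authority on what is accepted.

```python
import re

# Assumed character set for secret data keys.
_KEY_PATTERN = re.compile(r"[-._a-zA-Z0-9]+")
# Assumed maximum length, matching the DNS subdomain limit.
_MAX_KEY_LENGTH = 253


def is_valid_secret_key(key: str) -> bool:
    """Return True if the key looks acceptable as a Secret data key."""
    if not key or len(key) > _MAX_KEY_LENGTH:
        return False
    # Reject path-like keys such as "." and "..".
    if key == "." or key.startswith(".."):
        return False
    return _KEY_PATTERN.fullmatch(key) is not None
```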
2.6.1.3. Automatically generated image pull secrets
By default, OpenShift Container Platform creates an image pull secret for each service account.
Prior to OpenShift Container Platform 4.16, a long-lived service account API token secret was also generated for each service account that was created. Starting with OpenShift Container Platform 4.16, this service account API token secret is no longer created.
After upgrading to 4.16, any existing long-lived service account API token secrets are not deleted and will continue to function. For information about detecting long-lived API tokens that are in use in your cluster or deleting them if they are not needed, see the Red Hat Knowledgebase article Long-lived service account API tokens in OpenShift Container Platform.
This image pull secret is necessary to integrate the OpenShift image registry into the cluster’s user authentication and authorization system.
However, if you do not enable the ImageRegistry capability or if you disable the integrated OpenShift image registry in the Cluster Image Registry Operator’s configuration, an image pull secret is not generated for each service account.
When the integrated OpenShift image registry is disabled on a cluster that previously had it enabled, the previously generated image pull secrets are deleted automatically.
2.6.2. Understanding how to create secrets
As an administrator, you must create a secret before developers can create the pods that depend on that secret.
When creating secrets:
Create a secret object that contains the data you want to keep secret. The specific data required for each secret type is described in the following sections.
Example YAML object that creates an opaque secret
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
type: Opaque # 1
data: # 2
  username: <username>
  password: <password>
stringData: # 3
  hostname: myapp.mydomain.com
  secret.properties: |
    property1=valueA
    property2=valueB
Use either the data or stringData fields, not both.
Update the pod’s service account to reference the secret:
YAML of a service account that uses a secret
apiVersion: v1
kind: ServiceAccount
...
secrets:
- name: test-secret
Create a pod, which consumes the secret as an environment variable or as a file (using a secret volume):
YAML of a pod populating files in a volume with secret data
apiVersion: v1
kind: Pod
metadata:
  name: secret-example-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: secret-test-container
    image: busybox
    command: [ "/bin/sh", "-c", "cat /etc/secret-volume/*" ]
    volumeMounts: # 1
    - name: secret-volume
      mountPath: /etc/secret-volume # 2
      readOnly: true # 3
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]
  volumes:
  - name: secret-volume
    secret:
      secretName: test-secret # 4
  restartPolicy: Never
- 1
- Add a volumeMounts field to each container that needs the secret.
- 2
- Specifies an unused directory name where you would like the secret to appear. Each key in the secret data map becomes the filename under mountPath.
- 3
- Set to true. If true, this instructs the driver to provide a read-only volume.
- 4
- Specifies the name of the secret.
YAML of a pod populating environment variables with secret data
apiVersion: v1
kind: Pod
metadata:
  name: secret-example-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: secret-test-container
    image: busybox
    command: [ "/bin/sh", "-c", "export" ]
    env:
    - name: TEST_SECRET_USERNAME_ENV_VAR
      valueFrom:
        secretKeyRef: # 1
          name: test-secret
          key: username
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]
  restartPolicy: Never
- 1
- Specifies the environment variable that consumes the secret key.
YAML of a build config populating environment variables with secret data
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: secret-example-bc
spec:
  strategy:
    sourceStrategy:
      env:
      - name: TEST_SECRET_USERNAME_ENV_VAR
        valueFrom:
          secretKeyRef: # 1
            name: test-secret
            key: username
      from:
        kind: ImageStreamTag
        namespace: openshift
        name: 'cli:latest'
- 1
- Specifies the environment variable that consumes the secret key.
2.6.2.1. Secret creation restrictions
To use a secret, a pod needs to reference the secret. A secret can be used with a pod in three ways:
- To populate environment variables for containers.
- As files in a volume mounted on one or more of its containers.
- By kubelet when pulling images for the pod.
Volume type secrets write data into the container as a file using the volume mechanism. Image pull secrets use service accounts for the automatic injection of the secret into all pods in a namespace.
When a template contains a secret definition, the only way for the template to use the provided secret is to ensure that the secret volume sources are validated and that the specified object reference actually points to a Secret object. Therefore, a secret needs to be created before any pods that depend on it. The most effective way to ensure this is to have it get injected automatically through the use of a service account.
Secret API objects reside in a namespace. They can only be referenced by pods in that same namespace.
Individual secrets are limited to 1MB in size. This is to discourage the creation of large secrets that could exhaust apiserver and kubelet memory. However, creation of a number of smaller secrets could also exhaust memory.
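The size of a secret can be estimated before it is created. The following Python sketch sums the decoded size of every value in a data map and compares it with a limit. The constant MAX_SECRET_BYTES is an assumption that interprets the limit above as 1 MiB (1048576 bytes), and the function names are hypothetical.

```python
import base64

# Assumption: the 1MB limit is interpreted as 1 MiB of decoded data.
MAX_SECRET_BYTES = 1024 * 1024


def secret_size(data: dict) -> int:
    """Total decoded size, in bytes, of the values in a Secret data map."""
    return sum(len(base64.b64decode(value)) for value in data.values())


def fits_limit(data: dict) -> bool:
    """Return True if the decoded data does not exceed the assumed limit."""
    return secret_size(data) <= MAX_SECRET_BYTES
```

Because many small secrets can also exhaust memory, a check like this is only a first guard and does not replace quota planning.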
2.6.2.2. Creating an opaque secret
As an administrator, you can create an opaque secret, which allows you to store unstructured key:value pairs that can contain arbitrary values.
Procedure
Create a Secret object in a YAML file on a control plane node. For example:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque # 1
data:
  username: <username>
  password: <password>
- 1
- Specifies an opaque secret.
Use the following command to create a Secret object:
$ oc create -f <filename>.yaml
To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
- Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.
2.6.2.3. Creating a legacy service account token secret
As an administrator, you can create a legacy service account token secret, which allows you to distribute a service account token to applications that must authenticate to the API.
It is recommended to obtain bound service account tokens using the TokenRequest API instead of using legacy service account token secrets. You should create a service account token secret only if you cannot use the TokenRequest API and if the security exposure of a nonexpiring token in a readable API object is acceptable to you.
Bound service account tokens are more secure than service account token secrets for the following reasons:
- Bound service account tokens have a bounded lifetime.
- Bound service account tokens contain audiences.
- Bound service account tokens can be bound to pods or secrets and the bound tokens are invalidated when the bound object is removed.
Workloads are automatically injected with a projected volume to obtain a bound service account token. If your workload needs an additional service account token, add an additional projected volume in your workload manifest.
For more information, see "Configuring bound service account tokens using volume projection".
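The lifetime and audience properties listed above are carried as claims inside the token. The following Python sketch decodes the payload of a JSON Web Token so that claims such as exp and aud can be inspected. It does not verify the signature, so it is suitable for inspection only and must not be used to make trust decisions. The function name jwt_claims is hypothetical.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Return the payload claims of a JWT without verifying its signature."""
    # A JWT has three dot-separated parts: header, payload, signature.
    payload = token.split(".")[1]
    # base64url values in a JWT omit padding, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

A token without an exp claim does not expire, which is the exposure described above for legacy service account token secrets.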
Procedure
Create a Secret object in a YAML file on a control plane node:
Example Secret object
apiVersion: v1
kind: Secret
metadata:
  name: secret-sa-sample
  annotations:
    kubernetes.io/service-account.name: "sa-name" # 1
type: kubernetes.io/service-account-token # 2
Use the following command to create the Secret object:
$ oc create -f <filename>.yaml
To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
- Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.
2.6.2.4. Creating a basic authentication secret
As an administrator, you can create a basic authentication secret, which allows you to store the credentials needed for basic authentication. When using this secret type, the data parameter of the Secret object must contain the following keys encoded in the base64 format:
- username: the user name for authentication
- password: the password or token for authentication
You can use the stringData parameter to use clear text content.
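If you prefer to generate the data values rather than encode them by hand, a short script can build the manifest. The following Python sketch produces a basic authentication secret as a dictionary and prints it as JSON. The function name basic_auth_secret and the sample password are hypothetical, and writing the output to a file for use with oc create -f is an assumption about your workflow.

```python
import base64
import json


def basic_auth_secret(name: str, username: str, password: str) -> dict:
    """Build a kubernetes.io/basic-auth Secret manifest with base64 data."""

    def encode(value: str) -> str:
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/basic-auth",
        "data": {"username": encode(username), "password": encode(password)},
    }


if __name__ == "__main__":
    # "example-password" is a placeholder, not a real credential.
    manifest = basic_auth_secret("secret-basic-auth", "admin", "example-password")
    print(json.dumps(manifest, indent=2))
```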
Procedure
Create a Secret object in a YAML file on a control plane node:
Example secret object
apiVersion: v1
kind: Secret
metadata:
  name: secret-basic-auth
type: kubernetes.io/basic-auth # 1
stringData: # 2
  username: admin
  password: <password>
Use the following command to create the Secret object:
$ oc create -f <filename>.yaml
To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
- Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.
2.6.2.5. Creating an SSH authentication secret
As an administrator, you can create an SSH authentication secret, which allows you to store data used for SSH authentication. When using this secret type, the data parameter of the Secret object must contain the SSH credential to use.
Procedure
Create a Secret object in a YAML file on a control plane node:
Example secret object
apiVersion: v1
kind: Secret
metadata:
  name: secret-ssh-auth
type: kubernetes.io/ssh-auth # 1
data:
  ssh-privatekey: | # 2
    MIIEpQIBAAKCAQEAulqb/Y ...
Use the following command to create the Secret object:
$ oc create -f <filename>.yaml
To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
- Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.
2.6.2.6. Creating a Docker configuration secret
As an administrator, you can create a Docker configuration secret, which allows you to store the credentials for accessing a container image registry.
- kubernetes.io/dockercfg. Use this secret type to store your local Docker configuration file. The data parameter of the secret object must contain the contents of a .dockercfg file encoded in the base64 format.
- kubernetes.io/dockerconfigjson. Use this secret type to store your local Docker configuration JSON file. The data parameter of the secret object must contain the contents of a .docker/config.json file encoded in the base64 format.
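The value for a kubernetes.io/dockerconfigjson secret can be generated from registry credentials. The following Python sketch builds a minimal configuration document with a single auths entry and base64 encodes it. The function name dockerconfigjson_value is hypothetical, and the registry host and credentials in the example are placeholders.

```python
import base64
import json


def dockerconfigjson_value(registry: str, username: str, password: str) -> str:
    """Return the base64 value to store under the .dockerconfigjson key."""
    # The auth field is the base64 encoding of "username:password".
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    return base64.b64encode(json.dumps(config).encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Placeholder registry and credentials for illustration only.
    print(dockerconfigjson_value("registry.example.com", "user", "example-password"))
```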
Procedure
Create a Secret object in a YAML file on a control plane node.
Example Docker configuration secret object
apiVersion: v1
kind: Secret
metadata:
  name: secret-docker-cfg
  namespace: my-project
type: kubernetes.io/dockercfg # 1
data:
  .dockercfg: bm5ubm5ubm5ubm5ubm5ubm5ubm5ubmdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cgYXV0aCBrZXlzCg== # 2
Example Docker configuration JSON secret object
apiVersion: v1
kind: Secret
metadata:
  name: secret-docker-json
  namespace: my-project
type: kubernetes.io/dockerconfigjson # 1
data:
  .dockerconfigjson: bm5ubm5ubm5ubm5ubm5ubm5ubm5ubmdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cgYXV0aCBrZXlzCg== # 2
Use the following command to create the Secret object:
$ oc create -f <filename>.yaml
To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
- Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.
2.6.2.7. Creating a secret using the web console
You can create secrets using the web console.
Procedure
- Navigate to Workloads → Secrets.
Click Create → From YAML.
Edit the YAML manually to your specifications, or drag and drop a file into the YAML editor. For example:
apiVersion: v1
kind: Secret
metadata:
  name: example
  namespace: <namespace>
type: Opaque # 1
data:
  username: <base64 encoded username>
  password: <base64 encoded password>
stringData: # 2
  hostname: myapp.mydomain.com
- 1
- This example specifies an opaque secret; however, you may see other secret types such as service account token secret, basic authentication secret, SSH authentication secret, or a secret that uses Docker configuration.
- 2
- Entries in the stringData map are converted to base64 and the entry will then be moved to the data map automatically. This field is write-only; the value will only be returned via the data field.
- Click Create.
Click Add Secret to workload.
- From the drop-down menu, select the workload to add.
- Click Save.
2.6.3. Understanding how to update secrets
When you modify the value of a secret, the value used by an already running pod does not change dynamically. To change a secret, you must delete the original pod and create a new pod (perhaps with an identical PodSpec).
Updating a secret follows the same workflow as deploying a new container image. For pods that are managed by a Deployment object, you can use the oc rollout restart command to replace the running pods.
The resourceVersion value in a secret is not specified when it is referenced. Therefore, if a secret is updated at the same time as pods are starting, the version of the secret that is used for the pod is not defined.
Currently, it is not possible to check the resource version of a secret object that was used when a pod was created. It is planned that pods will report this information, so that a controller could restart ones using an old resourceVersion. In the interim, do not update the data of existing secrets, but create new ones with distinct names.
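One way to follow the advice to create new secrets with distinct names is to derive the name from the content. The following Python sketch appends a short hash of the data to a base name, so that any change to the data yields a new name. The function name versioned_secret_name and the ten-character suffix length are hypothetical choices, not a product convention.

```python
import hashlib
import json


def versioned_secret_name(base: str, data: dict) -> str:
    """Return a name such as 'test-secret-<hash>' derived from the data."""
    # sort_keys makes the serialization, and so the hash, independent of key order.
    serialized = json.dumps(data, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(serialized).hexdigest()
    return f"{base}-{digest[:10]}"
```

Pods that reference the new name are recreated with the new content, and pods that still reference the old name continue to see the old content until they are replaced.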
2.6.4. Creating and using secrets
As an administrator, you can create a service account token secret. This allows you to distribute a service account token to applications that must authenticate to the API.
Procedure
Create a service account in your namespace by running the following command:
$ oc create sa <service_account_name> -n <your_namespace>
Save the following YAML example to a file named service-account-token-secret.yaml. The example includes a Secret object configuration that you can use to generate a service account token:
apiVersion: v1
kind: Secret
metadata:
  name: <secret_name> # 1
  annotations:
    kubernetes.io/service-account.name: "sa-name" # 2
type: kubernetes.io/service-account-token # 3
Generate the service account token by applying the file:
$ oc apply -f service-account-token-secret.yaml
Get the service account token from the secret by running the following command:
$ oc get secret <sa_token_secret> -o jsonpath='{.data.token}' | base64 --decode 1
Example output
ayJhbGciOiJSUzI1NiIsImtpZCI6IklOb2dtck1qZ3hCSWpoNnh5YnZhSE9QMkk3YnRZMVZoclFfQTZfRFp1YlUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImJ1aWxkZXItdG9rZW4tdHZrbnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiYnVpbGRlciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjNmZGU2MGZmLTA1NGYtNDkyZi04YzhjLTNlZjE0NDk3MmFmNyIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpkZWZhdWx0OmJ1aWxkZXIifQ.OmqFTDuMHC_lYvvEUrjr1x453hlEEHYcxS9VKSzmRkP1SiVZWPNPkTWlfNRp6bIUZD3U6aN3N7dMSN0eI5hu36xPgpKTdvuckKLTCnelMx6cxOdAbrcw1mCmOClNscwjS1KO1kzMtYnnq8rXHiMJELsNlhnRyyIXRTtNBsy4t64T3283s3SLsancyx0gy0ujx-Ch3uKAKdZi5iT-I8jnnQ-ds5THDs2h65RJhgglQEmSxpHrLGZFmyHAQI-_SjvmHZPXEc482x3SkaQHNLqpmrpJorNqh1M8ZHKzlujhZgVooMvJmWPXTb2vnvi3DGn2XI-hZxl1yD2yGH1RBpYUHA
- 1
- Replace <sa_token_secret> with the name of your service token secret.
Use your service account token to authenticate with the API of your cluster:
$ curl -X GET <openshift_cluster_api> --header "Authorization: Bearer <token>" 1 2
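The same authenticated call can be prepared from a script. The following Python sketch uses only the standard library to build a GET request that carries the token in an Authorization header. The function name api_request is hypothetical, and the URL in the usage note is a placeholder for your cluster API endpoint. The sketch builds the request without sending it.

```python
import urllib.request


def api_request(api_url: str, token: str) -> urllib.request.Request:
    """Build a GET request that authenticates with a bearer token."""
    return urllib.request.Request(
        api_url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

To send the request, pass the returned object to urllib.request.urlopen, for example with api_request("https://api.example.com:6443/apis", token). Certificate trust for the cluster API must be configured separately.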
2.6.5. About using signed certificates with secrets
To secure communication to your service, you can configure OpenShift Container Platform to generate a signed serving certificate/key pair that you can add into a secret in a project.
A service serving certificate secret is intended to support complex middleware applications that need out-of-the-box certificates. It has the same settings as the server certificates generated by the administrator tooling for nodes and masters.
Service spec configured for a service serving certificate secret
apiVersion: v1
kind: Service
metadata:
  name: registry
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: registry-cert # 1
# ...
- 1
- Specify the name for the certificate
Other pods can trust cluster-created certificates (which are only signed for internal DNS names), by using the CA bundle in the /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt file that is automatically mounted in their pod.
The signature algorithm for this feature is x509.SHA256WithRSA. To manually rotate, delete the generated secret. A new certificate is created.
2.6.5.1. Generating signed certificates for use with secrets
To use a signed serving certificate/key pair with a pod, create or edit the service to add the service.beta.openshift.io/serving-cert-secret-name annotation, then add the secret to the pod.
Procedure
To create a service serving certificate secret:
- Edit the Service spec for your service.
- Add the service.beta.openshift.io/serving-cert-secret-name annotation with the name you want to use for your secret.
kind: Service
apiVersion: v1
metadata:
  name: my-service
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: my-cert # 1
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
The certificate and key are in PEM format, stored in tls.crt and tls.key respectively.
Create the service:
$ oc create -f <file-name>.yaml
View the secret to make sure it was created:
View a list of all secrets:
$ oc get secrets
Example output
NAME      TYPE                DATA   AGE
my-cert   kubernetes.io/tls   2      9m
View details on your secret:
$ oc describe secret my-cert
Example output
Name:         my-cert
Namespace:    openshift-console
Labels:       <none>
Annotations:  service.beta.openshift.io/expiry: 2023-03-08T23:22:40Z
              service.beta.openshift.io/originating-service-name: my-service
              service.beta.openshift.io/originating-service-uid: 640f0ec3-afc2-4380-bf31-a8c784846a11

Type:  kubernetes.io/tls

Data
====
tls.key:  1679 bytes
tls.crt:  2595 bytes
Edit your Pod spec with that secret.
apiVersion: v1
kind: Pod
metadata:
  name: my-service-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: my-volume
      mountPath: "/etc/my-path"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]
  volumes:
  - name: my-volume
    secret:
      secretName: my-cert
      items:
      - key: username
        path: my-group/my-username
        mode: 511
When it is available, your pod will run. The certificate will be good for the internal service DNS name, <service.name>.<service.namespace>.svc.
The certificate/key pair is automatically replaced when it gets close to expiration. View the expiration date in the service.beta.openshift.io/expiry annotation on the secret, which is in RFC3339 format.
Note: In most cases, the service DNS name <service.name>.<service.namespace>.svc is not externally routable. The primary use of <service.name>.<service.namespace>.svc is for intracluster or intraservice communication, and with re-encrypt routes.
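The service.beta.openshift.io/expiry annotation can be read by a script to report how long a certificate remains valid. The following Python sketch parses a UTC timestamp in the form shown in the example output and returns the number of days until expiry. The function name days_until_expiry is hypothetical, and the sketch assumes that the value always ends in Z.

```python
from datetime import datetime, timezone


def days_until_expiry(expiry: str, now: datetime) -> float:
    """Return the days between now and an expiry such as 2023-03-08T23:22:40Z."""
    # Assumption: the annotation is a UTC timestamp with a trailing Z.
    expires = datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds() / 86400


if __name__ == "__main__":
    remaining = days_until_expiry("2023-03-08T23:22:40Z", datetime.now(timezone.utc))
    print(f"{remaining:.1f} days until expiry")
```

A negative result means that the timestamp is in the past.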
2.6.6. Troubleshooting secrets
If service certificate generation fails, the service’s service.beta.openshift.io/serving-cert-generation-error annotation contains a message similar to the following:
secret/ssl-key references serviceUID 62ad25ca-d703-11e6-9d6f-0e9c0057b608, which does not match 77b6dd80-d716-11e6-9d6f-0e9c0057b60
The service that generated the certificate no longer exists, or has a different serviceUID. You must force certificate regeneration by removing the old secret and clearing the following annotations on the service: service.beta.openshift.io/serving-cert-generation-error and service.beta.openshift.io/serving-cert-generation-error-num.
Delete the secret:
$ oc delete secret <secret_name>
Clear the annotations:
$ oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-
$ oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-num-
The command that removes an annotation has a - after the name of the annotation to be removed.
2.7. Providing sensitive data to pods by using an external secrets store
Some applications need sensitive information, such as passwords and user names, that you do not want developers to have.
As an alternative to using Kubernetes Secret objects to provide sensitive information, you can use an external secrets store to store the sensitive information. You can use the Secrets Store CSI Driver Operator to integrate with an external secrets store and mount the secret content as a pod volume.
The Secrets Store CSI Driver Operator is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
2.7.1. About the Secrets Store CSI Driver Operator
Kubernetes secrets are stored with Base64 encoding. etcd provides encryption at rest for these secrets, but when secrets are retrieved, they are decrypted and presented to the user. If role-based access control is not configured properly on your cluster, anyone with API or etcd access can retrieve or modify a secret. Additionally, anyone who is authorized to create a pod in a namespace can use that access to read any secret in that namespace.
To store and manage your secrets securely, you can configure the OpenShift Container Platform Secrets Store Container Storage Interface (CSI) Driver Operator to mount secrets from an external secret management system, such as Azure Key Vault, by using a provider plugin. Applications can then use the secret, but the secret does not persist on the system after the application pod is destroyed.
The Secrets Store CSI Driver Operator, secrets-store.csi.k8s.io, enables OpenShift Container Platform to mount multiple secrets, keys, and certificates stored in enterprise-grade external secrets stores into pods as a volume. The Secrets Store CSI Driver Operator communicates with the provider using gRPC to fetch the mount contents from the specified external secrets store. After the volume is attached, the data in it is mounted into the container’s file system. Secrets store volumes are mounted in-line.
2.7.1.1. Secrets store providers
The following secrets store providers are available for use with the Secrets Store CSI Driver Operator:
- AWS Secrets Manager
- AWS Systems Manager Parameter Store
- Azure Key Vault
- HashiCorp Vault
2.7.1.2. Automatic rotation
The Secrets Store CSI driver periodically rotates the content in the mounted volume with the content from the external secrets store. If a secret is updated in the external secrets store, the secret will be updated in the mounted volume. The Secrets Store CSI Driver Operator polls for updates every 2 minutes.
If you enabled synchronization of mounted content as Kubernetes secrets, the Kubernetes secrets are also rotated.
Applications consuming the secret data must watch for updates to the secrets.
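One way an application could watch for rotated content is to compare a checksum of the mounted file against the last value it saw. The following is a minimal sketch under that assumption; the helper names secret_checksum and secret_changed are hypothetical, and a temporary file stands in for a file under the pod's mount path.

```shell
# Hypothetical helper: print a checksum for the current content of a secret file.
secret_checksum() {
  cksum < "$1" | cut -d ' ' -f 1
}

# Hypothetical helper: return 0 if the file no longer matches the recorded checksum.
secret_changed() {
  [ "$(secret_checksum "$1")" != "$2" ]
}

# Demonstration: a temporary file stands in for the mounted secret file.
demo_file="$(mktemp)"
printf '%s' 'old-value' > "$demo_file"
last_sum="$(secret_checksum "$demo_file")"

# Simulate the driver rotating the mounted content.
printf '%s' 'new-value' > "$demo_file"

if secret_changed "$demo_file" "$last_sum"; then
  echo "secret content changed; reload configuration here"
  last_sum="$(secret_checksum "$demo_file")"
fi
rm -f "$demo_file"
```

In a real pod, a check like this would typically run in a loop with a sleep between iterations. Because the Operator polls the external store every 2 minutes, checking much more often than that is unlikely to surface updates sooner.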
2.7.2. Installing the Secrets Store CSI driver
Prerequisites
- Access to the OpenShift Container Platform web console.
- Administrator access to the cluster.
Procedure
To install the Secrets Store CSI driver:
Install the Secrets Store CSI Driver Operator:
- Log in to the web console.
- Click Operators → OperatorHub.
- Locate the Secrets Store CSI Driver Operator by typing "Secrets Store CSI" in the filter box.
- Click the Secrets Store CSI Driver Operator button.
- On the Secrets Store CSI Driver Operator page, click Install.
On the Install Operator page, ensure that:
- All namespaces on the cluster (default) is selected.
- Installed Namespace is set to openshift-cluster-csi-drivers.
Click Install.
After the installation finishes, the Secrets Store CSI Driver Operator is listed in the Installed Operators section of the web console.
Create the ClusterCSIDriver instance for the driver (secrets-store.csi.k8s.io):
- Click Administration → CustomResourceDefinitions → ClusterCSIDriver.
- On the Instances tab, click Create ClusterCSIDriver.
- Use the following YAML file:
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: secrets-store.csi.k8s.io
spec:
  managementState: Managed
- Click Create.
2.7.3. Mounting secrets from an external secrets store to a CSI volume
After installing the Secrets Store CSI Driver Operator, you can mount secrets from one of the following external secrets stores to a CSI volume:
2.7.3.1. Mounting secrets from AWS Secrets Manager
You can use the Secrets Store CSI Driver Operator to mount secrets from AWS Secrets Manager to a CSI volume in OpenShift Container Platform. To mount secrets from AWS Secrets Manager, your cluster must be installed on AWS and use AWS Security Token Service (STS).
Prerequisites
- Your cluster is installed on AWS and uses AWS Security Token Service (STS).
- You have installed the Secrets Store CSI Driver Operator. See Installing the Secrets Store CSI driver for instructions.
- You have configured AWS Secrets Manager to store the required secrets.
- You have extracted and prepared the ccoctl binary.
- You have installed the jq CLI tool.
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Install the AWS Secrets Manager provider:
Create a YAML file with the following configuration for the provider resources:
Important: The AWS Secrets Manager provider for the Secrets Store CSI driver is an upstream provider. This configuration is modified from the configuration provided in the upstream AWS documentation so that it works properly with OpenShift Container Platform. Changes to this configuration might impact functionality.
Example aws-provider.yaml file
apiVersion: v1
kind: ServiceAccount
metadata:
  name: csi-secrets-store-provider-aws
  namespace: openshift-cluster-csi-drivers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-secrets-store-provider-aws-cluster-role
rules:
- apiGroups: [""]
  resources: ["serviceaccounts/token"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: csi-secrets-store-provider-aws-cluster-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: csi-secrets-store-provider-aws-cluster-role
subjects:
- kind: ServiceAccount
  name: csi-secrets-store-provider-aws
  namespace: openshift-cluster-csi-drivers
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: openshift-cluster-csi-drivers
  name: csi-secrets-store-provider-aws
  labels:
    app: csi-secrets-store-provider-aws
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: csi-secrets-store-provider-aws
  template:
    metadata:
      labels:
        app: csi-secrets-store-provider-aws
    spec:
      serviceAccountName: csi-secrets-store-provider-aws
      hostNetwork: false
      containers:
        - name: provider-aws-installer
          image: public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws:1.0.r2-50-g5b4aca1-2023.06.09.21.19
          imagePullPolicy: Always
          args:
            - --provider-volume=/etc/kubernetes/secrets-store-csi-providers
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
            limits:
              cpu: 50m
              memory: 100Mi
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: "/etc/kubernetes/secrets-store-csi-providers"
              name: providervol
            - name: mountpoint-dir
              mountPath: /var/lib/kubelet/pods
              mountPropagation: HostToContainer
      tolerations:
      - operator: Exists
      volumes:
        - name: providervol
          hostPath:
            path: "/etc/kubernetes/secrets-store-csi-providers"
        - name: mountpoint-dir
          hostPath:
            path: /var/lib/kubelet/pods
            type: DirectoryOrCreate
      nodeSelector:
        kubernetes.io/os: linux
Grant privileged access to the csi-secrets-store-provider-aws service account by running the following command:
$ oc adm policy add-scc-to-user privileged -z csi-secrets-store-provider-aws -n openshift-cluster-csi-drivers
Create the provider resources by running the following command:
$ oc apply -f aws-provider.yaml
Grant permission to allow the service account to read the AWS secret object:
Create a directory to contain the credentials request by running the following command:
$ mkdir credentialsrequest-dir-aws
Create a YAML file with the following configuration for the credentials request:
Example credentialsrequest.yaml file
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: aws-provider-test
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - "secretsmanager:GetSecretValue"
      - "secretsmanager:DescribeSecret"
      effect: Allow
      resource: "arn:*:secretsmanager:*:*:secret:testSecret-??????"
  secretRef:
    name: aws-creds
    namespace: my-namespace
  serviceAccountNames:
  - aws-provider
Retrieve the OIDC provider by running the following command:
$ oc get --raw=/.well-known/openid-configuration | jq -r '.issuer'
Example output
https://<oidc_provider_name>
Copy the OIDC provider name <oidc_provider_name> from the output to use in the next step.
Use the ccoctl tool to process the credentials request by running the following command:
$ ccoctl aws create-iam-roles \
    --name my-role --region=<aws_region> \
    --credentials-requests-dir=credentialsrequest-dir-aws \
    --identity-provider-arn arn:aws:iam::<aws_account_id>:oidc-provider/<oidc_provider_name> \
    --output-dir=credrequests-ccoctl-output
Example output
2023/05/15 18:10:34 Role arn:aws:iam::<aws_account_id>:role/my-role-my-namespace-aws-creds created
2023/05/15 18:10:34 Saved credentials configuration to: credrequests-ccoctl-output/manifests/my-namespace-aws-creds-credentials.yaml
2023/05/15 18:10:35 Updated Role policy for Role my-role-my-namespace-aws-creds
Copy the <aws_role_arn> from the output to use in the next step. For example, arn:aws:iam::<aws_account_id>:role/my-role-my-namespace-aws-creds.
Bind the service account with the role ARN by running the following command:
$ oc annotate -n my-namespace sa/aws-provider eks.amazonaws.com/role-arn="<aws_role_arn>"
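The two copy steps above could also be scripted. The following is a minimal sketch, assuming the issuer URL begins with https:// and the ccoctl log lines have the shape shown in the example output. The helper names oidc_provider_name and created_role_arn, the example issuer, and the account ID 111122223333 are hypothetical placeholders.

```shell
# Hypothetical helper: strip the https:// scheme from the issuer URL.
oidc_provider_name() {
  printf '%s\n' "${1#https://}"
}

# Hypothetical helper: read ccoctl log lines on stdin and print the ARN
# from the line of the form "<date> <time> Role <arn> created".
created_role_arn() {
  awk '$3 == "Role" && $NF == "created" { print $4 }'
}

# Placeholder issuer; in practice this would come from the jq command above.
issuer='https://oidc.example.com/my-cluster'
oidc_provider_name "$issuer"

# Placeholder log lines modeled on the example output above.
printf '%s\n' \
  '2023/05/15 18:10:34 Role arn:aws:iam::111122223333:role/my-role-my-namespace-aws-creds created' \
  '2023/05/15 18:10:35 Updated Role policy for Role my-role-my-namespace-aws-creds' \
  | created_role_arn
```

If the log format of your ccoctl version differs from the example output, the awk pattern would need to be adjusted accordingly.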
Create a secret provider class to define your secrets store provider:
Create a YAML file that defines the SecretProviderClass object:
Example secret-provider-class-aws.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-aws-provider        # Name for the secret provider class
  namespace: my-namespace      # Namespace for the secret provider class
spec:
  provider: aws                # Provider name
  parameters:                  # Provider-specific configuration parameters
    objects: |
      - objectName: "testSecret"
        objectType: "secretsmanager"
Create the SecretProviderClass object by running the following command:
$ oc create -f secret-provider-class-aws.yaml
Create a deployment to use this secret provider class:
Create a YAML file that defines the Deployment object:
Example deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-aws-deployment      # Name for the deployment
  namespace: my-namespace      # Must be the same namespace as the secret provider class
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-storage
  template:
    metadata:
      labels:
        app: my-storage
    spec:
      containers:
      - name: busybox
        image: k8s.gcr.io/e2e-test-images/busybox:1.29
        command:
          - "/bin/sleep"
          - "10000"
        volumeMounts:
        - name: secrets-store-inline
          mountPath: "/mnt/secrets-store"
          readOnly: true
      volumes:
        - name: secrets-store-inline
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "my-aws-provider"   # Name of the secret provider class
Create the Deployment object by running the following command:
$ oc create -f deployment.yaml
Verification
Verify that you can access the secrets from AWS Secrets Manager in the pod volume mount:
List the secrets in the pod mount:
$ oc exec my-aws-deployment-<hash> -n my-namespace -- ls /mnt/secrets-store/
Example output
testSecret
View a secret in the pod mount:
$ oc exec my-aws-deployment-<hash> -n my-namespace -- cat /mnt/secrets-store/testSecret
Example output
<secret_value>
2.7.3.2. Mounting secrets from AWS Systems Manager Parameter Store
You can use the Secrets Store CSI Driver Operator to mount secrets from AWS Systems Manager Parameter Store to a CSI volume in OpenShift Container Platform. To mount secrets from AWS Systems Manager Parameter Store, your cluster must be installed on AWS and use AWS Security Token Service (STS).
Prerequisites
- Your cluster is installed on AWS and uses AWS Security Token Service (STS).
- You have installed the Secrets Store CSI Driver Operator. See Installing the Secrets Store CSI driver for instructions.
- You have configured AWS Systems Manager Parameter Store to store the required secrets.
- You have extracted and prepared the ccoctl binary.
- You have installed the jq CLI tool.
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Install the AWS Systems Manager Parameter Store provider:
Create a YAML file with the following configuration for the provider resources:
Important: The AWS Systems Manager Parameter Store provider for the Secrets Store CSI driver is an upstream provider. This configuration is modified from the configuration provided in the upstream AWS documentation so that it works properly with OpenShift Container Platform. Changes to this configuration might impact functionality.
Example aws-provider.yaml file
apiVersion: v1
kind: ServiceAccount
metadata:
  name: csi-secrets-store-provider-aws
  namespace: openshift-cluster-csi-drivers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-secrets-store-provider-aws-cluster-role
rules:
- apiGroups: [""]
  resources: ["serviceaccounts/token"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: csi-secrets-store-provider-aws-cluster-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: csi-secrets-store-provider-aws-cluster-role
subjects:
- kind: ServiceAccount
  name: csi-secrets-store-provider-aws
  namespace: openshift-cluster-csi-drivers
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: openshift-cluster-csi-drivers
  name: csi-secrets-store-provider-aws
  labels:
    app: csi-secrets-store-provider-aws
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: csi-secrets-store-provider-aws
  template:
    metadata:
      labels:
        app: csi-secrets-store-provider-aws
    spec:
      serviceAccountName: csi-secrets-store-provider-aws
      hostNetwork: false
      containers:
        - name: provider-aws-installer
          image: public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws:1.0.r2-50-g5b4aca1-2023.06.09.21.19
          imagePullPolicy: Always
          args:
            - --provider-volume=/etc/kubernetes/secrets-store-csi-providers
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
            limits:
              cpu: 50m
              memory: 100Mi
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: "/etc/kubernetes/secrets-store-csi-providers"
              name: providervol
            - name: mountpoint-dir
              mountPath: /var/lib/kubelet/pods
              mountPropagation: HostToContainer
      tolerations:
      - operator: Exists
      volumes:
        - name: providervol
          hostPath:
            path: "/etc/kubernetes/secrets-store-csi-providers"
        - name: mountpoint-dir
          hostPath:
            path: /var/lib/kubelet/pods
            type: DirectoryOrCreate
      nodeSelector:
        kubernetes.io/os: linux
Grant privileged access to the csi-secrets-store-provider-aws service account by running the following command:
$ oc adm policy add-scc-to-user privileged -z csi-secrets-store-provider-aws -n openshift-cluster-csi-drivers
Create the provider resources by running the following command:
$ oc apply -f aws-provider.yaml
Grant permission to allow the service account to read the AWS secret object:
Create a directory to contain the credentials request by running the following command:
$ mkdir credentialsrequest-dir-aws
Create a YAML file with the following configuration for the credentials request:
Example credentialsrequest.yaml file
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: aws-provider-test
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - action:
      - "ssm:GetParameter"
      - "ssm:GetParameters"
      effect: Allow
      resource: "arn:*:ssm:*:*:parameter/testParameter*"
  secretRef:
    name: aws-creds
    namespace: my-namespace
  serviceAccountNames:
  - aws-provider
Retrieve the OIDC provider by running the following command:
$ oc get --raw=/.well-known/openid-configuration | jq -r '.issuer'
Example output
https://<oidc_provider_name>
Copy the OIDC provider name <oidc_provider_name> from the output to use in the next step.
Use the ccoctl tool to process the credentials request by running the following command:
$ ccoctl aws create-iam-roles \
    --name my-role --region=<aws_region> \
    --credentials-requests-dir=credentialsrequest-dir-aws \
    --identity-provider-arn arn:aws:iam::<aws_account_id>:oidc-provider/<oidc_provider_name> \
    --output-dir=credrequests-ccoctl-output
Example output
2023/05/15 18:10:34 Role arn:aws:iam::<aws_account_id>:role/my-role-my-namespace-aws-creds created
2023/05/15 18:10:34 Saved credentials configuration to: credrequests-ccoctl-output/manifests/my-namespace-aws-creds-credentials.yaml
2023/05/15 18:10:35 Updated Role policy for Role my-role-my-namespace-aws-creds
Copy the <aws_role_arn> from the output to use in the next step. For example, arn:aws:iam::<aws_account_id>:role/my-role-my-namespace-aws-creds.
Bind the service account with the role ARN by running the following command:
$ oc annotate -n my-namespace sa/aws-provider eks.amazonaws.com/role-arn="<aws_role_arn>"
Create a secret provider class to define your secrets store provider:
Create a YAML file that defines the SecretProviderClass object:
Example secret-provider-class-aws.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-aws-provider        # Name for the secret provider class
  namespace: my-namespace      # Namespace for the secret provider class
spec:
  provider: aws                # Provider name
  parameters:                  # Provider-specific configuration parameters
    objects: |
      - objectName: "testParameter"
        objectType: "ssmparameter"
Create the SecretProviderClass object by running the following command:
$ oc create -f secret-provider-class-aws.yaml
Create a deployment to use this secret provider class:
Create a YAML file that defines the Deployment object:
Example deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-aws-deployment      # Name for the deployment
  namespace: my-namespace      # Must be the same namespace as the secret provider class
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-storage
  template:
    metadata:
      labels:
        app: my-storage
    spec:
      containers:
      - name: busybox
        image: k8s.gcr.io/e2e-test-images/busybox:1.29
        command:
          - "/bin/sleep"
          - "10000"
        volumeMounts:
        - name: secrets-store-inline
          mountPath: "/mnt/secrets-store"
          readOnly: true
      volumes:
        - name: secrets-store-inline
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "my-aws-provider"   # Name of the secret provider class
Create the Deployment object by running the following command:
$ oc create -f deployment.yaml
Verification
Verify that you can access the secrets from AWS Systems Manager Parameter Store in the pod volume mount:
List the secrets in the pod mount:
$ oc exec my-aws-deployment-<hash> -n my-namespace -- ls /mnt/secrets-store/
Example output
testParameter
View a secret in the pod mount:
$ oc exec my-aws-deployment-<hash> -n my-namespace -- cat /mnt/secrets-store/testParameter
Example output
<secret_value>
2.7.3.3. Mounting secrets from Azure Key Vault
You can use the Secrets Store CSI Driver Operator to mount secrets from Azure Key Vault to a CSI volume in OpenShift Container Platform. To mount secrets from Azure Key Vault, your cluster must be installed on Microsoft Azure.
Prerequisites
- Your cluster is installed on Azure.
- You have installed the Secrets Store CSI Driver Operator. See Installing the Secrets Store CSI driver for instructions.
- You have configured Azure Key Vault to store the required secrets.
- You have installed the Azure CLI (az).
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Install the Azure Key Vault provider:
Create a YAML file with the following configuration for the provider resources:
Important: The Azure Key Vault provider for the Secrets Store CSI driver is an upstream provider. This configuration is modified from the configuration provided in the upstream Azure documentation so that it works properly with OpenShift Container Platform. Changes to this configuration might impact functionality.
Example azure-provider.yaml file
apiVersion: v1
kind: ServiceAccount
metadata:
  name: csi-secrets-store-provider-azure
  namespace: openshift-cluster-csi-drivers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-secrets-store-provider-azure-cluster-role
rules:
- apiGroups: [""]
  resources: ["serviceaccounts/token"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: csi-secrets-store-provider-azure-cluster-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: csi-secrets-store-provider-azure-cluster-role
subjects:
- kind: ServiceAccount
  name: csi-secrets-store-provider-azure
  namespace: openshift-cluster-csi-drivers
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: openshift-cluster-csi-drivers
  name: csi-secrets-store-provider-azure
  labels:
    app: csi-secrets-store-provider-azure
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: csi-secrets-store-provider-azure
  template:
    metadata:
      labels:
        app: csi-secrets-store-provider-azure
    spec:
      serviceAccountName: csi-secrets-store-provider-azure
      hostNetwork: true
      containers:
        - name: provider-azure-installer
          image: mcr.microsoft.com/oss/azure/secrets-store/provider-azure:v1.4.1
          imagePullPolicy: IfNotPresent
          args:
            - --endpoint=unix:///provider/azure.sock
            - --construct-pem-chain=true
            - --healthz-port=8989
            - --healthz-path=/healthz
            - --healthz-timeout=5s
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8989
            failureThreshold: 3
            initialDelaySeconds: 5
            timeoutSeconds: 10
            periodSeconds: 30
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
            limits:
              cpu: 50m
              memory: 100Mi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsUser: 0
            capabilities:
              drop:
              - ALL
          volumeMounts:
            - mountPath: "/provider"
              name: providervol
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
      volumes:
        - name: providervol
          hostPath:
            path: "/var/run/secrets-store-csi-providers"
      tolerations:
      - operator: Exists
      nodeSelector:
        kubernetes.io/os: linux
Grant privileged access to the csi-secrets-store-provider-azure service account by running the following command:
$ oc adm policy add-scc-to-user privileged -z csi-secrets-store-provider-azure -n openshift-cluster-csi-drivers
Create the provider resources by running the following command:
$ oc apply -f azure-provider.yaml
Create a service principal to access the key vault:
Set the service principal client secret as an environment variable by running the following command:
$ SERVICE_PRINCIPAL_CLIENT_SECRET="$(az ad sp create-for-rbac --name https://$KEYVAULT_NAME --query 'password' -otsv)"
Set the service principal client ID as an environment variable by running the following command:
$ SERVICE_PRINCIPAL_CLIENT_ID="$(az ad sp list --display-name https://$KEYVAULT_NAME --query '[0].appId' -otsv)"
Create a generic secret with the service principal client secret and ID by running the following command:
$ oc create secret generic secrets-store-creds -n my-namespace --from-literal clientid=${SERVICE_PRINCIPAL_CLIENT_ID} --from-literal clientsecret=${SERVICE_PRINCIPAL_CLIENT_SECRET}
Apply the secrets-store.csi.k8s.io/used=true label to allow the provider to find this nodePublishSecretRef secret:
$ oc -n my-namespace label secret secrets-store-creds secrets-store.csi.k8s.io/used=true
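Both environment variables in this step are filled from Azure CLI output, so it can be worth confirming that neither is empty before running the oc create secret generic command. The following is a minimal sketch of such a guard; the helper name require_vars is hypothetical and is not part of the oc or az tooling.

```shell
# Hypothetical helper: fail if any of the named variables is unset or empty.
require_vars() {
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is unset or empty" >&2
      return 1
    fi
  done
}

# Example usage, run before creating the secret:
# require_vars SERVICE_PRINCIPAL_CLIENT_ID SERVICE_PRINCIPAL_CLIENT_SECRET || exit 1
```

The helper takes variable names rather than values so that the error message can say which variable is missing.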
Create a secret provider class to define your secrets store provider:
Create a YAML file that defines the SecretProviderClass object:
Example secret-provider-class-azure.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-azure-provider      # Name for the secret provider class
  namespace: my-namespace      # Namespace for the secret provider class
spec:
  provider: azure              # Provider name
  parameters:                  # Provider-specific configuration parameters
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    userAssignedIdentityID: ""
    keyvaultName: "kvname"
    objects: |
      array:
        - |
          objectName: secret1
          objectType: secret
    tenantId: "tid"
Create the SecretProviderClass object by running the following command:
$ oc create -f secret-provider-class-azure.yaml
Create a deployment to use this secret provider class:
Create a YAML file that defines the Deployment object:
Example deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-azure-deployment    # Specify the name for the deployment.
  namespace: my-namespace      # Specify the namespace for the deployment. This must be the same namespace as the secret provider class.
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-storage
  template:
    metadata:
      labels:
        app: my-storage
    spec:
      containers:
      - name: busybox
        image: k8s.gcr.io/e2e-test-images/busybox:1.29
        command:
          - "/bin/sleep"
          - "10000"
        volumeMounts:
        - name: secrets-store-inline
          mountPath: "/mnt/secrets-store"
          readOnly: true
      volumes:
        - name: secrets-store-inline
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "my-azure-provider"   # Specify the name of the secret provider class.
            nodePublishSecretRef:
              name: secrets-store-creds                  # Specify the name of the Kubernetes secret that contains the service principal credentials to access Azure Key Vault.
Create the Deployment object by running the following command:
$ oc create -f deployment.yaml
Verification
Verify that you can access the secrets from Azure Key Vault in the pod volume mount:
List the secrets in the pod mount:
$ oc exec my-azure-deployment-<hash> -n my-namespace -- ls /mnt/secrets-store/
Example output
secret1
View a secret in the pod mount:
$ oc exec my-azure-deployment-<hash> -n my-namespace -- cat /mnt/secrets-store/secret1
Example output
my-secret-value
2.7.3.4. Mounting secrets from HashiCorp Vault
You can use the Secrets Store CSI Driver Operator to mount secrets from HashiCorp Vault to a CSI volume in OpenShift Container Platform.
Mounting secrets from HashiCorp Vault by using the Secrets Store CSI Driver Operator has been tested with the following cloud providers:
- Amazon Web Services (AWS)
- Microsoft Azure
Other cloud providers might work, but have not been tested yet. Additional cloud providers might be tested in the future.
Prerequisites
- You have installed the Secrets Store CSI Driver Operator. See Installing the Secrets Store CSI driver for instructions.
- You have installed Helm.
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Add the HashiCorp Helm repository by running the following command:
$ helm repo add hashicorp https://helm.releases.hashicorp.com
Update all repositories to ensure that Helm is aware of the latest versions by running the following command:
$ helm repo update
Install the HashiCorp Vault provider:
Create a new project for Vault by running the following command:
$ oc new-project vault
Label the vault namespace for pod security admission by running the following command:
$ oc label ns vault security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged --overwrite
Grant privileged access to the vault service account by running the following command:
$ oc adm policy add-scc-to-user privileged -z vault -n vault
Grant privileged access to the vault-csi-provider service account by running the following command:
$ oc adm policy add-scc-to-user privileged -z vault-csi-provider -n vault
Deploy HashiCorp Vault by running the following command:
$ helm install vault hashicorp/vault --namespace=vault \
  --set "server.dev.enabled=true" \
  --set "injector.enabled=false" \
  --set "csi.enabled=true" \
  --set "global.openshift=true" \
  --set "injector.agentImage.repository=docker.io/hashicorp/vault" \
  --set "server.image.repository=docker.io/hashicorp/vault" \
  --set "csi.image.repository=docker.io/hashicorp/vault-csi-provider" \
  --set "csi.agent.image.repository=docker.io/hashicorp/vault" \
  --set "csi.daemonSet.providersDir=/var/run/secrets-store-csi-providers"
Patch the vault-csi-provider daemon set to set the securityContext to privileged by running the following command:
$ oc patch daemonset -n vault vault-csi-provider --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/securityContext", "value": {"privileged": true} }]'
Verify that the vault-csi-provider pods have started properly by running the following command:
$ oc get pods -n vault
Example output
NAME                       READY   STATUS    RESTARTS   AGE
vault-0                    1/1     Running   0          24m
vault-csi-provider-87rgw   1/2     Running   0          5s
vault-csi-provider-bd6hp   1/2     Running   0          4s
vault-csi-provider-smlv7   1/2     Running   0          5s
Configure HashiCorp Vault to store the required secrets:
Create a secret by running the following command:
$ oc exec vault-0 --namespace=vault -- vault kv put secret/example1 testSecret1=my-secret-value
Verify that the secret is readable at the path secret/example1 by running the following command:
$ oc exec vault-0 --namespace=vault -- vault kv get secret/example1
Example output
= Secret Path =
secret/data/example1

======= Metadata =======
Key                Value
---                -----
created_time       2024-04-05T07:05:16.713911211Z
custom_metadata    <nil>
deletion_time      n/a
destroyed          false
version            1

=== Data ===
Key            Value
---            -----
testSecret1    my-secret-value
Configure Vault to use Kubernetes authentication:
Enable the Kubernetes auth method by running the following command:
$ oc exec vault-0 --namespace=vault -- vault auth enable kubernetes
Example output
Success! Enabled kubernetes auth method at: kubernetes/
Configure the Kubernetes auth method:
Set the token reviewer as an environment variable by running the following command:
$ TOKEN_REVIEWER_JWT="$(oc exec vault-0 --namespace=vault -- cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
Set the Kubernetes service IP address as an environment variable by running the following command:
$ KUBERNETES_SERVICE_IP="$(oc get svc kubernetes -o go-template="{{ .spec.clusterIP }}")"
Update the Kubernetes auth method by running the following command:
$ oc exec -i vault-0 --namespace=vault -- vault write auth/kubernetes/config \
    issuer="https://kubernetes.default.svc.cluster.local" \
    token_reviewer_jwt="${TOKEN_REVIEWER_JWT}" \
    kubernetes_host="https://${KUBERNETES_SERVICE_IP}:443" \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Example output
Success! Data written to: auth/kubernetes/config
Create a policy for the application by running the following command:
$ oc exec -i vault-0 --namespace=vault -- vault policy write csi -<<EOF
path "secret/data/*" {
  capabilities = ["read"]
}
EOF
Example output
Success! Uploaded policy: csi
Create an authentication role to access the application by running the following command:
$ oc exec -i vault-0 --namespace=vault -- vault write auth/kubernetes/role/csi \
    bound_service_account_names=default \
    bound_service_account_namespaces=default,test-ns,negative-test-ns,my-namespace \
    policies=csi \
    ttl=20m
Example output
Success! Data written to: auth/kubernetes/role/csi
Verify that all of the vault pods are running properly by running the following command:
$ oc get pods -n vault
Example output
NAME                       READY   STATUS    RESTARTS   AGE
vault-0                    1/1     Running   0          43m
vault-csi-provider-87rgw   2/2     Running   0          19m
vault-csi-provider-bd6hp   2/2     Running   0          19m
vault-csi-provider-smlv7   2/2     Running   0          19m
Verify that all of the secrets-store-csi-driver pods are running properly by running the following command:
$ oc get pods -n openshift-cluster-csi-drivers | grep -E "secrets"
Example output
secrets-store-csi-driver-node-46d2g                  3/3     Running   0             45m
secrets-store-csi-driver-node-d2jjn                  3/3     Running   0             45m
secrets-store-csi-driver-node-drmt4                  3/3     Running   0             45m
secrets-store-csi-driver-node-j2wlt                  3/3     Running   0             45m
secrets-store-csi-driver-node-v9xv4                  3/3     Running   0             45m
secrets-store-csi-driver-node-vlz28                  3/3     Running   0             45m
secrets-store-csi-driver-operator-84bd699478-fpxrw   1/1     Running   0             47m
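Both verification steps above amount to checking that the two numbers in the READY column match for every pod. The following sketch shows one way that check could be automated. The helper name all_pods_ready is hypothetical, and it assumes the default oc get pods column layout shown in the example output.

```shell
# Hypothetical helper: read pod listing lines on stdin and return 0 only if
# every pod reports the same number on both sides of the READY column.
all_pods_ready() {
  awk '
    $1 == "NAME" { next }                      # skip the header row if present
    NF >= 2 {
      split($2, ready, "/")                    # READY column, for example 1/2
      if (ready[1] != ready[2]) not_ready = 1
    }
    END { exit (not_ready ? 1 : 0) }
  '
}

# Demonstration with lines modeled on the example output above.
printf '%s\n' \
  'NAME                       READY   STATUS    RESTARTS   AGE' \
  'vault-0                    1/1     Running   0          24m' \
  'vault-csi-provider-87rgw   1/2     Running   0          5s' \
  | all_pods_ready || echo "some pods are not fully ready yet"
```

On a cluster, the listing would be piped in directly, for example oc get pods -n vault | all_pods_ready. Note that empty input counts as ready in this sketch, so a real check might also confirm that at least one pod was listed.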
Create a secret provider class to define your secrets store provider:
Create a YAML file that defines the SecretProviderClass object:
Example secret-provider-class-vault.yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-vault-provider      # Name for the secret provider class
  namespace: my-namespace      # Namespace for the secret provider class
spec:
  provider: vault              # Provider name
  parameters:                  # Provider-specific configuration parameters
    roleName: "csi"
    vaultAddress: "http://vault.vault:8200"
    objects: |
      - secretPath: "secret/data/example1"
        objectName: "testSecret1"
        secretKey: "testSecret1"
Create the SecretProviderClass object by running the following command:
$ oc create -f secret-provider-class-vault.yaml
Create a deployment to use this secret provider class:
Create a YAML file that defines the
Deployment
object:Example
deployment.yaml
apiVersion: apps/v1 kind: Deployment metadata: name: busybox-deployment 1 namespace: my-namespace 2 labels: app: busybox spec: replicas: 1 selector: matchLabels: app: busybox template: metadata: labels: app: busybox spec: terminationGracePeriodSeconds: 0 containers: - image: registry.k8s.io/e2e-test-images/busybox:1.29-4 name: busybox imagePullPolicy: IfNotPresent command: - "/bin/sleep" - "10000" volumeMounts: - name: secrets-store-inline mountPath: "/mnt/secrets-store" readOnly: true volumes: - name: secrets-store-inline csi: driver: secrets-store.csi.k8s.io readOnly: true volumeAttributes: secretProviderClass: "my-vault-provider" 3
Create the Deployment object by running the following command:
$ oc create -f deployment.yaml
Verification
Verify that you can access the secrets from your HashiCorp Vault in the pod volume mount:
List the secrets in the pod mount by running the following command. Pods created by the busybox-deployment deployment have names that start with busybox-deployment-:
$ oc exec busybox-deployment-<hash> -n my-namespace -- ls /mnt/secrets-store/
Example output
testSecret1
View a secret in the pod mount by running the following command:
$ oc exec busybox-deployment-<hash> -n my-namespace -- cat /mnt/secrets-store/testSecret1
Example output
my-secret-value
2.7.4. Enabling synchronization of mounted content as Kubernetes secrets
You can enable synchronization to create Kubernetes secrets from the content on a mounted volume. An example where you might want to enable synchronization is to use an environment variable in your deployment to reference the Kubernetes secret.
Do not enable synchronization if you do not want to store your secrets on your OpenShift Container Platform cluster and in etcd. Enable this functionality only if you require it, such as when you want to use environment variables to refer to the secret.
If you enable synchronization, the secrets from the mounted volume are synchronized as Kubernetes secrets after you start a pod that mounts the secrets.
The synchronized Kubernetes secret is deleted when all pods that mounted the content are deleted.
Prerequisites
- You have installed the Secrets Store CSI Driver Operator.
- You have installed a secrets store provider.
- You have created the secret provider class.
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Edit the SecretProviderClass resource by running the following command:
$ oc edit secretproviderclass my-azure-provider # 1
- 1: Replace my-azure-provider with the name of your secret provider class.
Add the secretObjects section with the configuration for the synchronized Kubernetes secrets:
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-azure-provider
  namespace: my-namespace
spec:
  provider: azure
  secretObjects: # 1
    - secretName: tlssecret # 2
      type: kubernetes.io/tls # 3
      labels:
        environment: "test"
      data:
        - objectName: tlskey # 4
          key: tls.key # 5
        - objectName: tlscrt
          key: tls.crt
  parameters:
    usePodIdentity: "false"
    keyvaultName: "kvname"
    objects: |
      array:
        - |
          objectName: tlskey
          objectType: secret
        - |
          objectName: tlscrt
          objectType: secret
    tenantId: "tid"
- 1: Specify the configuration for synchronized Kubernetes secrets.
- 2: Specify the name of the Kubernetes Secret object to create.
- 3: Specify the type of Kubernetes Secret object to create. For example, Opaque or kubernetes.io/tls.
- 4: Specify the object name or alias of the mounted content to synchronize.
- 5: Specify the data field from the specified objectName to populate the Kubernetes secret with.
- Save the file to apply the changes.
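The following is a minimal sketch of how a workload might consume the synchronized secret through an environment variable, which is the use case this section describes. It assumes the secretObjects configuration above (secret name tlssecret, key tls.crt). The pod name, container image, and variable name are hypothetical and are not part of the original procedure. The pod still mounts the CSI volume, because the Kubernetes secret is created only after a pod that mounts the secrets starts.

```yaml
# Sketch only: pod name, image, and TLS_CERT are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: tls-consumer                # hypothetical pod name
  namespace: my-namespace           # same namespace as the SecretProviderClass
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # hypothetical image
      env:
        - name: TLS_CERT            # hypothetical variable name
          valueFrom:
            secretKeyRef:
              name: tlssecret       # secretName from secretObjects
              key: tls.crt          # key from secretObjects.data
      volumeMounts:
        - name: secrets-store-inline
          mountPath: "/mnt/secrets-store"
          readOnly: true
  volumes:
    - name: secrets-store-inline
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: "my-azure-provider"
```

Keeping the volume mount and the secretKeyRef in the same pod specification makes the dependency between the mount and the synchronized secret explicit.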
2.7.5. Viewing the status of secrets in the pod volume mount
You can view detailed information, including the versions, of the secrets in the pod volume mount.
The Secrets Store CSI Driver Operator creates a SecretProviderClassPodStatus resource in the same namespace as the pod. You can review this resource to see detailed information, including versions, about the secrets in the pod volume mount.
Prerequisites
- You have installed the Secrets Store CSI Driver Operator.
- You have installed a secrets store provider.
- You have created the secret provider class.
- You have deployed a pod that mounts a volume from the Secrets Store CSI Driver Operator.
- You have access to the cluster as a user with the cluster-admin role.
Procedure
View detailed information about the secrets in a pod volume mount by running the following command:
$ oc get secretproviderclasspodstatus <secret_provider_class_pod_status_name> -o yaml # 1
- 1: The name of the secret provider class pod status object is in the format of <pod_name>-<namespace>-<secret_provider_class_name>.
Example output
...
status:
  mounted: true
  objects:
  - id: secret/tlscrt
    version: f352293b97da4fa18d96a9528534cb33
  - id: secret/tlskey
    version: 02534bc3d5df481cb138f8b2a13951ef
  podName: busybox-<hash>
  secretProviderClassName: my-azure-provider
  targetPath: /var/lib/kubelet/pods/f0d49c1e-c87a-4beb-888f-37798456a3e7/volumes/kubernetes.io~csi/secrets-store-inline/mount
2.7.6. Uninstalling the Secrets Store CSI Driver Operator
Prerequisites
- Access to the OpenShift Container Platform web console.
- Administrator access to the cluster.
Procedure
To uninstall the Secrets Store CSI Driver Operator:
- Stop all application pods that use the secrets-store.csi.k8s.io provider.
- Remove any third-party provider plug-in for your chosen secret store.
- Remove the Container Storage Interface (CSI) driver and associated manifests:
  - Click Administration → CustomResourceDefinitions → ClusterCSIDriver.
  - On the Instances tab, for secrets-store.csi.k8s.io, on the far left side, click the drop-down menu, and then click Delete ClusterCSIDriver.
  - When prompted, click Delete.
- Verify that the CSI driver pods are no longer running.
- Uninstall the Secrets Store CSI Driver Operator:
  Note: Before you can uninstall the Operator, you must remove the CSI driver.
  - Click Operators → Installed Operators.
  - On the Installed Operators page, scroll or type "Secrets Store CSI" into the Search by name box to find the Operator, and then click it.
  - In the upper right of the Installed Operators > Operator details page, click Actions → Uninstall Operator.
  - When prompted on the Uninstall Operator window, click the Uninstall button to remove the Operator from the namespace. Any applications deployed by the Operator on the cluster need to be cleaned up manually.
After uninstalling, the Secrets Store CSI Driver Operator is no longer listed in the Installed Operators section of the web console.
2.8. Creating and using config maps
The following sections define config maps and how to create and use them.
2.8.1. Understanding config maps
Many applications require configuration by using some combination of configuration files, command line arguments, and environment variables. In OpenShift Container Platform, these configuration artifacts are decoupled from image content to keep containerized applications portable.
The ConfigMap object provides mechanisms to inject containers with configuration data while keeping containers agnostic of OpenShift Container Platform. A config map can be used to store fine-grained information like individual properties or coarse-grained information like entire configuration files or JSON blobs.
The ConfigMap object holds key-value pairs of configuration data that can be consumed in pods or used to store configuration data for system components such as controllers. For example:
ConfigMap Object Definition
kind: ConfigMap
apiVersion: v1
metadata:
  creationTimestamp: 2016-02-18T19:14:38Z
  name: example-config
  namespace: my-namespace
data:
  example.property.1: hello
  example.property.2: world
  example.property.file: |-
    property.1=value-1
    property.2=value-2
    property.3=value-3
binaryData:
  bar: L3Jvb3QvMTAw
You can use the binaryData field when you create a config map from a binary file, such as an image.
Configuration data can be consumed in pods in a variety of ways. A config map can be used to:
- Populate environment variable values in containers
- Set command-line arguments in a container
- Populate configuration files in a volume
Users and system components can store configuration data in a config map.
A config map is similar to a secret, but designed to more conveniently support working with strings that do not contain sensitive information.
Config map restrictions
A config map must be created before its contents can be consumed in pods.
Controllers can be written to tolerate missing configuration data. Consult individual components configured by using config maps on a case-by-case basis.
ConfigMap objects reside in a project. They can only be referenced by pods in the same project.
The Kubelet only supports the use of a config map for pods it gets from the API server. This includes any pods created by using the CLI, or indirectly from a replication controller. It does not include pods created by using the OpenShift Container Platform node’s --manifest-url flag, its --config flag, or its REST API because these are not common ways to create pods.
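Because a config map must exist before a pod can consume it, a pod that references a missing config map normally does not start. One way to tolerate a missing config map, sketched below, is to mark the reference as optional. The pod name, image, and config map name are hypothetical and are used only for illustration.

```yaml
# Sketch only: all names and the image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: optional-config-pod
  namespace: my-namespace          # must be the project that holds the config map
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # hypothetical image
      envFrom:
        - configMapRef:
            name: maybe-missing-config   # hypothetical config map name
            optional: true               # pod starts even if the config map is absent
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config
  volumes:
    - name: config-volume
      configMap:
        name: maybe-missing-config
        optional: true                   # volume is mounted empty if the config map is absent
  restartPolicy: Never
```

With optional set to true, the application has to handle the absence of the variables or files itself, so this approach fits best when sensible defaults exist.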
2.8.2. Creating a config map in the OpenShift Container Platform web console
You can create a config map in the OpenShift Container Platform web console.
Procedure
To create a config map as a cluster administrator:
- In the Administrator perspective, select Workloads → Config Maps.
- At the top right side of the page, select Create Config Map.
- Enter the contents of your config map.
- Select Create.
To create a config map as a developer:
- In the Developer perspective, select Config Maps.
- At the top right side of the page, select Create Config Map.
- Enter the contents of your config map.
- Select Create.
2.8.3. Creating a config map by using the CLI
You can use the following command to create a config map from directories, specific files, or literal values.
Procedure
Create a config map:
$ oc create configmap <configmap_name> [options]
2.8.3.1. Creating a config map from a directory
You can create a config map from a directory by using the --from-file flag. This method allows you to use multiple files within a directory to create a config map.
Each file in the directory is used to populate a key in the config map, where the name of the key is the file name, and the value of the key is the content of the file.
For example, the following command creates a config map with the contents of the example-files directory:
$ oc create configmap game-config --from-file=example-files/
View the keys in the config map:
$ oc describe configmaps game-config
Example output
Name:           game-config
Namespace:      default
Labels:         <none>
Annotations:    <none>
Data
game.properties:        158 bytes
ui.properties:          83 bytes
You can see that the two keys in the map are created from the file names in the directory specified in the command. The content of those keys might be large, so the output of oc describe only shows the names of the keys and their sizes.
Prerequisite
You must have a directory with files that contain the data you want to populate a config map with.
The following procedure uses these example files: game.properties and ui.properties:
$ cat example-files/game.properties
Example output
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
$ cat example-files/ui.properties
Example output
color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
Procedure
Create a config map holding the content of each file in this directory by entering the following command:
$ oc create configmap game-config \
    --from-file=example-files/
Verification
Enter the oc get command for the object with the -o option to see the values of the keys:
$ oc get configmaps game-config -o yaml
Example output
apiVersion: v1
data:
  game.properties: |-
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T18:34:05Z
  name: game-config
  namespace: default
  resourceVersion: "407"
  selflink: /api/v1/namespaces/default/configmaps/game-config
  uid: 30944725-d66e-11e5-8cd0-68f728db1985
2.8.3.2. Creating a config map from a file
You can create a config map from a file by using the --from-file flag. You can pass the --from-file option multiple times to the CLI.
You can also specify the key to set in a config map for content imported from a file by passing a key=value expression to the --from-file option. For example:
$ oc create configmap game-config-3 --from-file=game-special-key=example-files/game.properties
If you create a config map from a file, you can include files containing non-UTF8 data. That data is placed in the binaryData field of the config map without being corrupted. OpenShift Container Platform detects binary files and transparently encodes the file as MIME. On the server, the MIME payload is decoded and stored without corrupting the data.
Prerequisite
You must have a directory with files that contain the data you want to populate a config map with.
The following procedure uses these example files: game.properties and ui.properties:
$ cat example-files/game.properties
Example output
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
$ cat example-files/ui.properties
Example output
color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
Procedure
Create a config map by specifying a specific file:
$ oc create configmap game-config-2 \
    --from-file=example-files/game.properties \
    --from-file=example-files/ui.properties
Create a config map by specifying a key-value pair:
$ oc create configmap game-config-3 \
    --from-file=game-special-key=example-files/game.properties
Verification
Enter the oc get command for the object with the -o option to see the values of the keys from the file:
$ oc get configmaps game-config-2 -o yaml
Example output
apiVersion: v1
data:
  game.properties: |-
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T18:52:05Z
  name: game-config-2
  namespace: default
  resourceVersion: "516"
  selflink: /api/v1/namespaces/default/configmaps/game-config-2
  uid: b4952dc3-d670-11e5-8cd0-68f728db1985
Enter the oc get command for the object with the -o option to see the values of the keys from the key-value pair:
$ oc get configmaps game-config-3 -o yaml
Example output
apiVersion: v1
data:
  game-special-key: |- # 1
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T18:54:22Z
  name: game-config-3
  namespace: default
  resourceVersion: "530"
  selflink: /api/v1/namespaces/default/configmaps/game-config-3
  uid: 05f8da22-d671-11e5-8cd0-68f728db1985
- 1: This is the key that you set in the preceding step.
2.8.3.3. Creating a config map from literal values
You can supply literal values for a config map.
The --from-literal option takes a key=value syntax, which allows literal values to be supplied directly on the command line.
Procedure
Create a config map by specifying a literal value:
$ oc create configmap special-config \
    --from-literal=special.how=very \
    --from-literal=special.type=charm
Verification
Enter the oc get command for the object with the -o option to see the values of the keys:
$ oc get configmaps special-config -o yaml
Example output
apiVersion: v1
data:
  special.how: very
  special.type: charm
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T19:14:38Z
  name: special-config
  namespace: default
  resourceVersion: "651"
  selflink: /api/v1/namespaces/default/configmaps/special-config
  uid: dadce046-d673-11e5-8cd0-68f728db1985
2.8.4. Use cases: Consuming config maps in pods
The following sections describe some use cases for consuming ConfigMap objects in pods.
2.8.4.1. Populating environment variables in containers by using config maps
You can use config maps to populate individual environment variables in containers or to populate environment variables in containers from all keys that form valid environment variable names.
As an example, consider the following config map:
ConfigMap with two environment variables
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
  special.type: charm
ConfigMap with one environment variable
apiVersion: v1
kind: ConfigMap
metadata:
  name: env-config
  namespace: default
data:
  log_level: INFO
Procedure
You can consume the keys of this ConfigMap in a pod using configMapKeyRef sections.
Sample Pod specification configured to inject specific environment variables
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env: # 1
        - name: SPECIAL_LEVEL_KEY # 2
          valueFrom:
            configMapKeyRef:
              name: special-config # 3
              key: special.how # 4
        - name: SPECIAL_TYPE_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config # 5
              key: special.type # 6
              optional: true # 7
      envFrom: # 8
        - configMapRef:
            name: env-config # 9
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
  restartPolicy: Never
- 1: Stanza to pull the specified environment variables from a ConfigMap.
- 2: Name of a pod environment variable that you are injecting a key’s value into.
- 3, 5: Name of the ConfigMap to pull specific environment variables from.
- 4, 6: Environment variable to pull from the ConfigMap.
- 7: Makes the environment variable optional. As optional, the pod will be started even if the specified ConfigMap and keys do not exist.
- 8: Stanza to pull all environment variables from a ConfigMap.
- 9: Name of the ConfigMap to pull all environment variables from.
When this pod is run, the pod logs will include the following output:
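SPECIAL_LEVEL_KEY=very
log_level=INFO
SPECIAL_TYPE_KEY is declared with optional: true, so the pod starts even if the special-config config map or its special.type key does not exist. The variable is missing from the output only in that case. With the example config map shown earlier, which defines special.type, the logs also include SPECIAL_TYPE_KEY=charm.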
2.8.4.2. Setting command-line arguments for container commands with config maps
You can use a config map to set the value of the commands or arguments in a container by using the Kubernetes substitution syntax $(VAR_NAME).
As an example, consider the following config map:
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
  special.type: charm
Procedure
To inject values into a command in a container, you must consume the keys you want to use as environment variables. Then you can refer to them in a container’s command using the $(VAR_NAME) syntax.
Sample pod specification configured to inject specific environment variables
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY)" ] # 1
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how
        - name: SPECIAL_TYPE_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.type
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
  restartPolicy: Never
- 1: Inject the values into a command in a container using the keys you want to use as environment variables.
When this pod is run, the output from the echo command run in the test-container container is as follows:
very charm
2.8.4.3. Injecting content into a volume by using config maps
You can inject content into a volume by using config maps.
Example ConfigMap custom resource (CR)
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
  special.type: charm
Procedure
You have a couple of different options for injecting content into a volume by using config maps.
The most basic way to inject content into a volume by using a config map is to populate the volume with files where the key is the file name and the content of the file is the value of the key:
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      # The file path is part of the "-c" string so that cat receives it as an argument.
      command: [ "/bin/sh", "-c", "cat /etc/config/special.how" ]
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
  volumes:
    - name: config-volume
      configMap:
        name: special-config # 1
  restartPolicy: Never
- 1: File containing key.
When this pod is run, the output of the cat command will be:
very
You can also control the paths within the volume where config map keys are projected:
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      # The file path is part of the "-c" string so that cat receives it as an argument.
      command: [ "/bin/sh", "-c", "cat /etc/config/path/to/special-key" ]
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
  volumes:
    - name: config-volume
      configMap:
        name: special-config
        items:
          - key: special.how
            path: path/to/special-key # 1
  restartPolicy: Never
- 1: Path to config map key.
When this pod is run, the output of the cat command will be:
very
2.9. Using device plugins to access external resources with pods
Device plugins allow you to use a particular device type (GPU, InfiniBand, or other similar computing resources that require vendor-specific initialization and setup) in your OpenShift Container Platform pod without needing to write custom code.
2.9.1. Understanding device plugins
The device plugin provides a consistent and portable solution to consume hardware devices across clusters. The device plugin provides support for these devices through an extension mechanism, which makes these devices available to Containers, provides health checks of these devices, and securely shares them.
OpenShift Container Platform supports the device plugin API, but the device plugin Containers are supported by individual vendors.
A device plugin is a gRPC service running on the nodes (external to the kubelet) that is responsible for managing specific hardware resources. Any device plugin must support the following remote procedure calls (RPCs):
service DevicePlugin {
      // GetDevicePluginOptions returns options to be communicated with Device
      // Manager
      rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
      // ListAndWatch returns a stream of List of Devices
      // Whenever a Device state change or a Device disappears, ListAndWatch
      // returns the new list
      rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
      // Allocate is called during container creation so that the Device
      // Plug-in can run device specific operations and instruct Kubelet
      // of the steps to make the Device available in the container
      rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
      // PreStartContainer is called, if indicated by Device Plug-in during
      // registration phase, before each container start. Device plug-in
      // can run device specific operations such as resetting the device
      // before making devices available to the container
      rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}
Example device plugins
For easy device plugin reference implementation, there is a stub device plugin in the Device Manager code: vendor/k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin/device_plugin_stub.go.
2.9.1.1. Methods for deploying a device plugin
- Daemon sets are the recommended approach for device plugin deployments.
- Upon start, the device plugin will try to create a UNIX domain socket at /var/lib/kubelet/device-plugins/ on the node to serve RPCs from Device Manager.
- Because device plugins must manage hardware resources, access the host file system, and create sockets, they must run in a privileged security context.
- More specific details regarding deployment steps can be found with each device plugin implementation.
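The deployment points above can be sketched as a daemon set. This is an illustrative outline only, not a vendor manifest: the names, namespace, and image are hypothetical, and a real device plugin on OpenShift Container Platform likely also needs a service account that is permitted to run privileged pods, which is omitted here.

```yaml
# Sketch only: names, namespace, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin
  namespace: example-device-plugin
spec:
  selector:
    matchLabels:
      app: example-device-plugin
  template:
    metadata:
      labels:
        app: example-device-plugin
    spec:
      containers:
        - name: plugin
          image: registry.example.com/vendor/device-plugin:latest  # hypothetical image
          securityContext:
            privileged: true          # needed for hardware, host file system, and socket access
          volumeMounts:
            - name: device-plugins
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugins
          hostPath:
            path: /var/lib/kubelet/device-plugins   # directory where the plugin creates its socket
```

Running the plugin as a daemon set places one copy on every matching node, so each node can advertise its own devices.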
2.9.2. Understanding the Device Manager
Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plugins known as device plugins.
You can advertise specialized hardware without requiring any upstream code changes.
OpenShift Container Platform supports the device plugin API, but the device plugin Containers are supported by individual vendors.
Device Manager advertises devices as Extended Resources. User pods can consume devices, advertised by Device Manager, using the same Limit/Request mechanism, which is used for requesting any other Extended Resource.
Upon start, the device plugin registers itself with Device Manager by invoking Register on the /var/lib/kubelet/device-plugins/kubelet.sock socket and starts a gRPC service at /var/lib/kubelet/device-plugins/<plugin>.sock for serving Device Manager requests.
Device Manager, while processing a new registration request, invokes the ListAndWatch remote procedure call (RPC) at the device plugin service. In response, Device Manager gets a list of Device objects from the plugin over a gRPC stream. Device Manager will keep watching on the stream for new updates from the plugin. On the plugin side, the plugin will also keep the stream open and whenever there is a change in the state of any of the devices, a new device list is sent to the Device Manager over the same streaming connection.
While handling a new pod admission request, Kubelet passes the requested Extended Resources to the Device Manager for device allocation. Device Manager checks in its database to verify if a corresponding plugin exists or not. If the plugin exists and there are free allocatable devices according to its local cache, the Allocate RPC is invoked at that particular device plugin.
Additionally, device plugins can also perform several other device-specific operations, such as driver installation, device initialization, and device resets. These functionalities vary from implementation to implementation.
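As a rough illustration of the Limit/Request mechanism described above, a pod can ask for a device by naming the Extended Resource in its resource limits. The resource name vendor.example.com/device, the pod name, and the image are hypothetical. The actual resource name is whatever the installed device plugin advertises.

```yaml
# Sketch only: the resource name, pod name, and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # hypothetical image
      resources:
        limits:
          vendor.example.com/device: 1   # Extended Resource advertised by a device plugin
  restartPolicy: Never
```

Extended Resources are requested in whole units, and a pod that requests a device stays pending until a node advertises a free one.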
2.9.3. Enabling Device Manager
Enable Device Manager to implement a device plugin to advertise specialized hardware without any upstream code changes.
Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plugins known as device plugins.
Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure by viewing the machine config:
# oc describe machineconfig <name>
For example:
# oc describe machineconfig 00-worker
Example output
Name:         00-worker
Namespace:
Labels:       machineconfiguration.openshift.io/role=worker # 1
- 1: Label required for the Device Manager.
Procedure
Create a custom resource (CR) for your configuration change.
Sample configuration for a Device Manager CR
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: devicemgr
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io: devicemgr
  kubeletConfig:
    feature-gates:
      - DevicePlugins=true
Create the Device Manager:
$ oc create -f devicemgr.yaml
Example output
kubeletconfig.machineconfiguration.openshift.io/devicemgr created
- Ensure that Device Manager was actually enabled by confirming that /var/lib/kubelet/device-plugins/kubelet.sock is created on the node. This is the UNIX domain socket on which the Device Manager gRPC server listens for new plugin registrations. This sock file is created when the Kubelet is started only if Device Manager is enabled.
2.10. Including pod priority in pod scheduling decisions
You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node. Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.
To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.
2.10.1. Understanding pod priority
When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher-priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, the scheduler continues to schedule other lower-priority pods.
2.10.1.1. Pod priority classes
You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Numbers larger than one billion are reserved for critical pods that must not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.
$ oc get priorityclasses
Example output
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-node-critical      2000001000   false            72m
system-cluster-critical   2000000000   false            72m
openshift-user-critical   1000000000   false            3d13h
cluster-logging           1000000      false            29s
- system-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs, sdn, and so forth. A number of critical components include the system-node-critical priority class by default, for example:
  - master-api
  - master-controller
  - master-etcd
  - sdn
  - sdn-ovs
  - sync
- system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth. A number of critical components include the system-cluster-critical priority class by default, for example:
  - fluentd
  - metrics-server
  - descheduler
- openshift-user-critical - You can use the priorityClassName field with important pods that cannot bind their resource consumption and do not have predictable resource consumption behavior. Prometheus pods under the openshift-monitoring and openshift-user-workload-monitoring namespaces use the openshift-user-critical priorityClassName. Monitoring workloads use system-critical as their first priorityClass, but this causes problems when monitoring uses excessive memory and the nodes cannot evict them. As a result, monitoring drops priority to give the scheduler flexibility, moving heavy workloads around to keep critical nodes operating.
- cluster-logging - This priority is used by Fluentd to make sure Fluentd pods are scheduled to nodes over other apps.
2.10.1.2. Pod priority names
After you have one or more priority classes, you can create pods that specify a priority class name in a Pod spec. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.
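As a minimal sketch of how a priority class and a pod fit together, the following manifests define a class and then reference it by name. The class name high-priority, its value, the pod name, and the image are hypothetical values chosen for illustration. The value only needs to stay at or below one billion.

```yaml
# Sketch only: class name, value, pod name, and image are hypothetical placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority               # non-namespaced object
value: 1000000                      # higher value means higher priority
globalDefault: false                # do not apply to pods that omit priorityClassName
description: "Example class for important application pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority  # must match an existing priority class, or the pod is rejected
  containers:
    - name: app
      image: registry.example.com/app:latest   # hypothetical image
```

Creating the priority class before the pod avoids the rejection that occurs when the named class is not found.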
2.10.2. Understanding pod preemption
When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.
When the scheduler preempts one or more pods on a node, the nominatedNodeName field of the higher-priority Pod spec is set to the name of the node, along with the nodeName field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the cluster.
After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while the scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and nodeName field of the Pod spec might be different.
Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher priority than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName field of the pending pod, making the pod eligible for another node.
Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.
The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.
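As an illustration of the fields discussed above, the following fragment sketches what a pending higher-priority pod might look like while the scheduler waits for preempted pods to terminate. The pod name, node name, and image are hypothetical. Note that in upstream Kubernetes the nominatedNodeName field is reported under the pod status, while nodeName is set in the pod spec only after the pod is bound to a node.

```yaml
# Hypothetical pending pod, as it might be shown by: oc get pod <pod-name> -o yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-high-priority-pod          # hypothetical name
spec:
  priorityClassName: high-priority         # assumes this priority class already exists
  # nodeName is absent here because the pod is not yet bound to a node
  containers:
  - name: app
    image: registry.example.com/app:latest # hypothetical image
status:
  phase: Pending
  nominatedNodeName: worker-1              # hypothetical node where lower-priority pods are terminating
```

If another node frees up first, the pod can be bound there instead, in which case the final nodeName differs from the nominatedNodeName shown in this sketch.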
2.10.2.1. Non-preempting priority classes
Pods with the preemption policy set to Never are placed in the scheduling queue ahead of lower-priority pods, but they cannot preempt other pods. A non-preempting pod waiting to be scheduled stays in the scheduling queue until sufficient resources are free and it can be scheduled. Non-preempting pods, like other pods, are subject to scheduler back-off. This means that if the scheduler tries unsuccessfully to schedule these pods, they are retried with lower frequency, allowing other pods with lower priority to be scheduled before them.
Non-preempting pods can still be preempted by other, high-priority pods.
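A non-preempting priority class can be sketched as follows. This is a minimal example, assuming a hypothetical class name and an arbitrary priority value; only the preemptionPolicy field distinguishes it from a preempting class.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 1000000                        # arbitrary example value
preemptionPolicy: Never               # pods are queued ahead of lower-priority pods but never preempt them
globalDefault: false
description: "Example class for pods that should not evict other pods."
```

Pods that reference this class by name wait in the scheduling queue until resources are free, as described above.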
2.10.2.2. Pod preemption and other scheduler settings
If you enable pod priority and preemption, consider your other scheduler settings:
- Pod priority and pod disruption budget
- A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
- Pod priority and pod affinity
- Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node, and the pending pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
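For reference, a pod disruption budget of the kind described above might look like the following sketch. The object name, namespace, and label are hypothetical; minAvailable accepts either an absolute number or a percentage string such as "50%".

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb               # hypothetical name
  namespace: example-namespace    # hypothetical namespace
spec:
  minAvailable: 2                 # minimum number of matching replicas that must stay up
  selector:
    matchLabels:
      app: example                # hypothetical label on the protected pods
```

Because preemption honors the budget only on a best-effort basis, this object reduces but does not rule out the preemption of the selected pods.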
2.10.2.3. Graceful termination of preempted pods
When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.
To minimize this gap, configure a small graceful termination period for lower-priority pods.
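The graceful termination period is set per pod with the terminationGracePeriodSeconds field, which defaults to 30 seconds in Kubernetes. A minimal sketch of a lower-priority pod with a short period follows; the pod name, priority class name, and image are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-low-priority-pod           # hypothetical name
spec:
  priorityClassName: low-priority          # hypothetical class; it must already exist in the cluster
  terminationGracePeriodSeconds: 10        # shorter than the 30-second default to narrow the preemption gap
  containers:
  - name: worker
    image: registry.example.com/worker:latest  # hypothetical image
```

Choose a value that still gives the workload enough time to shut down cleanly.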
2.10.3. Configuring priority and preemption
You apply pod priority and preemption by creating a priority class object and associating pods with the priority class by using the priorityClassName field in your pod specs.
You cannot add a priority class directly to an existing scheduled pod.
Procedure
To configure your cluster to use priority and preemption:
Create one or more priority classes:
Create a YAML file similar to the following:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority # 1
value: 1000000 # 2
preemptionPolicy: PreemptLowerPriority # 3
globalDefault: false # 4
description: "This priority class should be used for XYZ service pods only." # 5
- 1
- The name of the priority class object.
- 2
- The priority value of the object.
- 3
- Optional. Specifies whether this priority class is preempting or non-preempting. The preemption policy defaults to PreemptLowerPriority, which allows pods of that priority class to preempt lower-priority pods. If the preemption policy is set to Never, pods in that priority class are non-preempting.
- 4
- Optional. Specifies whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault:true, the priority of pods with no priority class name is zero. Adding a priority class with globalDefault:true affects only pods created after the priority class is added and does not change the priorities of existing pods.
- 5
- Optional. Describes which pods developers should use with this priority class. Enter an arbitrary text string.
Create the priority class:
$ oc create -f <file-name>.yaml
Create a pod spec to include the name of a priority class:
Create a YAML file similar to the following:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]
  priorityClassName: high-priority # 1
- 1
- Specify the priority class to use with this pod.
Create the pod:
$ oc create -f <file-name>.yaml
You can add the priority name directly to the pod configuration or to a pod template.
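When the priority class is set in a pod template rather than in a standalone pod, every pod created from that template inherits it. The following Deployment is a minimal sketch of that approach; the deployment name and app label are hypothetical, and the high-priority class is assumed to exist.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example                # hypothetical label
  template:
    metadata:
      labels:
        app: example
    spec:
      priorityClassName: high-priority   # applied to every pod created from this template
      containers:
      - name: nginx
        image: nginx
```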
2.11. Placing pods on specific nodes using node selectors
A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, every key-value pair in the pod's node selector must be present as a label on the node.
If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.
2.11.1. Using node selectors to control pod placement
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You add labels to a node, a compute machine set, or a machine config. Adding the label to the compute machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.
To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet object, DaemonSet object, StatefulSet object, Deployment object, or DeploymentConfig object. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the pod spec. If the pod does not have a controlling object, you must delete the pod, edit the pod spec, and recreate the pod.
You cannot add a node selector directly to an existing scheduled pod.
Prerequisites
To add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-7pwkc pod is controlled by the router-default-66d5cf9464 replica set:
$ oc describe pod router-default-66d5cf9464-7pwkc
Example output
kind: Pod
apiVersion: v1
metadata:
# ...
Name:               router-default-66d5cf9464-7pwkc
Namespace:          openshift-ingress
# ...
Controlled By:      ReplicaSet/router-default-66d5cf9464
# ...
The web console lists the controlling object under ownerReferences in the pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: router-default-66d5cf9464-7pwkc
# ...
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: router-default-66d5cf9464
      uid: d81dd094-da26-11e9-a48a-128e7edf0312
      controller: true
      blockOwnerDeletion: true
# ...
Procedure
Add labels to a node by using a compute machine set or editing the node directly:
Use a MachineSet object to add labels to nodes managed by the compute machine set when a node is created:
Run the following command to add labels to a MachineSet object:
$ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>":"<value>","<key>":"<value>"}}]' -n openshift-machine-api
For example:
$ oc patch MachineSet abc612-msrtw-worker-us-east-1c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
Tip: You can alternatively apply the following YAML to add labels to a compute machine set:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: xf2bd-infra-us-east-2a
  namespace: openshift-machine-api
spec:
  template:
    spec:
      metadata:
        labels:
          region: "east"
          type: "user-node"
# ...
Verify that the labels are added to the MachineSet object by using the oc edit command:
For example:
$ oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api
Example MachineSet object
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
# ...
spec:
# ...
  template:
    metadata:
# ...
    spec:
      metadata:
        labels:
          region: east
          type: user-node
# ...
Add labels directly to a node:
Edit the Node object for the node:
$ oc label nodes <name> <key>=<value>
For example, to label a node:
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
Tip: You can alternatively apply the following YAML to add labels to a node:
kind: Node
apiVersion: v1
metadata:
  name: hello-node-6fbccf8d9
  labels:
    type: "user-node"
    region: "east"
# ...
Verify that the labels are added to the node:
$ oc get nodes -l type=user-node,region=east
Example output
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-142-25.ec2.internal   Ready    worker   17m   v1.29.4
Add the matching node selector to a pod:
To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:
Example ReplicaSet object with labels
kind: ReplicaSet
apiVersion: apps/v1
metadata:
  name: hello-node-6fbccf8d9
# ...
spec:
# ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
        pod-template-hash: 66d5cf9464
    spec:
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/worker: ''
        type: user-node # 1
# ...
- 1
- Add the node selector.
To add a node selector to a specific, new pod, add the selector to the Pod object directly:
Example Pod object with a node selector
apiVersion: v1
kind: Pod
metadata:
  name: hello-node-6fbccf8d9
# ...
spec:
  nodeSelector:
    region: east
    type: user-node
# ...
Note: You cannot add a node selector directly to an existing scheduled pod.
2.12. Run Once Duration Override Operator
2.12.1. Run Once Duration Override Operator overview
You can use the Run Once Duration Override Operator to specify a maximum time limit that run-once pods can be active for.
The Run Once Duration Override Operator is not currently available for OpenShift Container Platform 4.16. The Operator is planned to be released in the near future.
2.12.1.1. About the Run Once Duration Override Operator
OpenShift Container Platform relies on run-once pods to perform tasks such as deploying a pod or performing a build. Run-once pods are pods that have a restartPolicy of Never or OnFailure.
Cluster administrators can use the Run Once Duration Override Operator to force a limit on the time that those run-once pods can be active. After the time limit expires, the cluster tries to actively terminate those pods. The main reason to have such a limit is to prevent tasks such as builds from running for an excessive amount of time.
To apply the run-once duration override from the Run Once Duration Override Operator to run-once pods, you must enable it on each applicable namespace.
If both the run-once pod and the Run Once Duration Override Operator have their activeDeadlineSeconds value set, the lower of the two values is used.
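To make the "lower of the two values" rule concrete, the following sketch shows a run-once pod that sets its own activeDeadlineSeconds value. The pod name and namespace are hypothetical, and the example assumes that the namespace has the override enabled with the Operator's predefined value of 3600 seconds.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-run-once          # hypothetical name
  namespace: example-namespace    # hypothetical namespace, assumed to have the override enabled
spec:
  restartPolicy: Never            # makes this a run-once pod
  activeDeadlineSeconds: 1200     # lower than the assumed override of 3600, so 1200 remains in effect
  containers:
  - name: task
    image: busybox:1.25
    command: ["/bin/sh", "-c", "sleep 30"]
```

If the pod instead specified a value higher than the override, such as 7200, the override value of 3600 would be expected to apply.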
2.12.2. Run Once Duration Override Operator release notes
Cluster administrators can use the Run Once Duration Override Operator to force a limit on the time that run-once pods can be active. After the time limit expires, the cluster tries to terminate the run-once pods. The main reason to have such a limit is to prevent tasks such as builds from running for an excessive amount of time.
To apply the run-once duration override from the Run Once Duration Override Operator to run-once pods, you must enable it on each applicable namespace.
These release notes track the development of the Run Once Duration Override Operator for OpenShift Container Platform.
For an overview of the Run Once Duration Override Operator, see About the Run Once Duration Override Operator.
2.12.2.1. Run Once Duration Override Operator 1.1.1
Issued: 2024-07-01
The following advisory is available for the Run Once Duration Override Operator 1.1.1: RHSA-2024:1616
2.12.2.1.1. New features and enhancements
You can install and use the Run Once Duration Override Operator in an OpenShift Container Platform cluster running in FIPS mode.
Important: To enable FIPS mode for your cluster, you must run the installation program from a Red Hat Enterprise Linux (RHEL) computer configured to operate in FIPS mode. For more information about configuring FIPS mode on RHEL, see Installing the system in FIPS mode.
When running Red Hat Enterprise Linux (RHEL) or Red Hat Enterprise Linux CoreOS (RHCOS) booted in FIPS mode, OpenShift Container Platform core components use the RHEL cryptographic libraries that have been submitted to NIST for FIPS 140-2/140-3 Validation on only the x86_64, ppc64le, and s390x architectures.
2.12.2.1.2. Bug fixes
- This release of the Run Once Duration Override Operator addresses several Common Vulnerabilities and Exposures (CVEs).
2.12.3. Overriding the active deadline for run-once pods
You can use the Run Once Duration Override Operator to specify a maximum time limit that run-once pods can be active for. By enabling the run-once duration override on a namespace, all future run-once pods created or updated in that namespace have their activeDeadlineSeconds field set to the value specified by the Run Once Duration Override Operator.
The Run Once Duration Override Operator is not currently available for OpenShift Container Platform 4.16. The Operator is planned to be released in the near future.
2.12.3.1. Installing the Run Once Duration Override Operator
You can use the web console to install the Run Once Duration Override Operator.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
Procedure
- Log in to the OpenShift Container Platform web console.
Create the required namespace for the Run Once Duration Override Operator.
- Navigate to Administration → Namespaces and click Create Namespace.
- Enter openshift-run-once-duration-override-operator in the Name field and click Create.
Install the Run Once Duration Override Operator.
- Navigate to Operators → OperatorHub.
- Enter Run Once Duration Override Operator into the filter box.
- Select the Run Once Duration Override Operator and click Install.
On the Install Operator page:
- The Update channel is set to stable, which installs the latest stable release of the Run Once Duration Override Operator.
- Select A specific namespace on the cluster.
- Choose openshift-run-once-duration-override-operator from the dropdown menu under Installed namespace.
Select an Update approval strategy.
- The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.
- The Manual strategy requires a user with appropriate credentials to approve the Operator update.
- Click Install.
Create a RunOnceDurationOverride instance.
- From the Operators → Installed Operators page, click Run Once Duration Override Operator.
- Select the Run Once Duration Override tab and click Create RunOnceDurationOverride.
Edit the settings as necessary.
Under the runOnceDurationOverride section, you can update the spec.activeDeadlineSeconds value, if required. The predefined value is 3600 seconds, or 1 hour.
- Click Create.
Verification
- Log in to the OpenShift CLI.
Verify all pods are created and running properly.
$ oc get pods -n openshift-run-once-duration-override-operator
Example output
NAME                                                   READY   STATUS    RESTARTS   AGE
run-once-duration-override-operator-7b88c676f6-lcxgc   1/1     Running   0          7m46s
runoncedurationoverride-62blp                          1/1     Running   0          41s
runoncedurationoverride-h8h8b                          1/1     Running   0          41s
runoncedurationoverride-tdsqk                          1/1     Running   0          41s
2.12.3.2. Enabling the run-once duration override on a namespace
To apply the run-once duration override from the Run Once Duration Override Operator to run-once pods, you must enable it on each applicable namespace.
Prerequisites
- The Run Once Duration Override Operator is installed.
Procedure
- Log in to the OpenShift CLI.
Add the label to enable the run-once duration override to your namespace:
$ oc label namespace <namespace> \
    runoncedurationoverrides.admission.runoncedurationoverride.openshift.io/enabled=true
Replace <namespace> with the namespace to enable the run-once duration override on.
After you enable the run-once duration override on this namespace, future run-once pods that are created in this namespace will have their activeDeadlineSeconds field set to the override value from the Run Once Duration Override Operator. Existing pods in this namespace will also have their activeDeadlineSeconds value set when they are updated next.
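The same label can also be declared in the namespace manifest instead of being added with oc label. The following is a minimal sketch with a hypothetical namespace name; the label value is quoted because label values must be strings.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-namespace   # hypothetical namespace name
  labels:
    # Enables the run-once duration override for pods in this namespace
    runoncedurationoverrides.admission.runoncedurationoverride.openshift.io/enabled: "true"
```

Keeping the label in a manifest can be convenient when namespaces are managed declaratively, for example from a Git repository.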
Verification
Create a test run-once pod in the namespace that you enabled the run-once duration override on:
apiVersion: v1
kind: Pod
metadata:
  name: example
  namespace: <namespace> # the namespace where you enabled the run-once duration override
spec:
  restartPolicy: Never # run-once pods use a restart policy of Never or OnFailure
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: busybox
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
      image: busybox:1.25
      command:
        - /bin/sh
        - -ec
        - |
          while sleep 5; do date; done
Verify that the pod has its activeDeadlineSeconds field set:
$ oc get pods -n <namespace> -o yaml | grep activeDeadlineSeconds
Example output
activeDeadlineSeconds: 3600
2.12.3.3. Updating the run-once active deadline override value
You can customize the override value that the Run Once Duration Override Operator applies to run-once pods. The predefined value is 3600 seconds, or 1 hour.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have installed the Run Once Duration Override Operator.
Procedure
- Log in to the OpenShift CLI.
Edit the RunOnceDurationOverride resource:
$ oc edit runoncedurationoverride cluster
Update the activeDeadlineSeconds field:
apiVersion: operator.openshift.io/v1
kind: RunOnceDurationOverride
metadata:
# ...
spec:
  runOnceDurationOverride:
    spec:
      activeDeadlineSeconds: 1800 # 1
# ...
- 1
- Set the activeDeadlineSeconds field to the desired value, in seconds.
- Save the file to apply the changes.
Any future run-once pods created in namespaces where the run-once duration override is enabled will have their activeDeadlineSeconds field set to this new value. Existing run-once pods in these namespaces will receive this new value when they are updated.
2.12.4. Uninstalling the Run Once Duration Override Operator
You can remove the Run Once Duration Override Operator from OpenShift Container Platform by uninstalling the Operator and removing its related resources.
The Run Once Duration Override Operator is not currently available for OpenShift Container Platform 4.16. The Operator is planned to be released in the near future.
2.12.4.1. Uninstalling the Run Once Duration Override Operator
You can use the web console to uninstall the Run Once Duration Override Operator. Uninstalling the Run Once Duration Override Operator does not unset the activeDeadlineSeconds field for run-once pods, but it will no longer apply the override value to future run-once pods.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
- You have installed the Run Once Duration Override Operator.
Procedure
- Log in to the OpenShift Container Platform web console.
- Navigate to Operators → Installed Operators.
- Select openshift-run-once-duration-override-operator from the Project dropdown list.
Delete the RunOnceDurationOverride instance.
- Click Run Once Duration Override Operator and select the Run Once Duration Override tab.
- Click the Options menu next to the cluster entry and select Delete RunOnceDurationOverride.
- In the confirmation dialog, click Delete.
Uninstall the Run Once Duration Override Operator.
- Navigate to Operators → Installed Operators.
- Click the Options menu next to the Run Once Duration Override Operator entry and click Uninstall Operator.
- In the confirmation dialog, click Uninstall.
2.12.4.2. Uninstalling Run Once Duration Override Operator resources
Optionally, after uninstalling the Run Once Duration Override Operator, you can remove its related resources from your cluster.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
- You have uninstalled the Run Once Duration Override Operator.
Procedure
- Log in to the OpenShift Container Platform web console.
Remove CRDs that were created when the Run Once Duration Override Operator was installed:
- Navigate to Administration → CustomResourceDefinitions.
- Enter RunOnceDurationOverride in the Name field to filter the CRDs.
- Click the Options menu next to the RunOnceDurationOverride CRD and select Delete CustomResourceDefinition.
- In the confirmation dialog, click Delete.
Delete the openshift-run-once-duration-override-operator namespace.
- Navigate to Administration → Namespaces.
- Enter openshift-run-once-duration-override-operator into the filter box.
- Click the Options menu next to the openshift-run-once-duration-override-operator entry and select Delete Namespace.
- In the confirmation dialog, enter openshift-run-once-duration-override-operator and click Delete.
Remove the run-once duration override label from the namespaces that it was enabled on.
- Navigate to Administration → Namespaces.
- Select your namespace.
- Click Edit next to the Labels field.
- Remove the runoncedurationoverrides.admission.runoncedurationoverride.openshift.io/enabled=true label and click Save.
Chapter 3. Automatically scaling pods with the Custom Metrics Autoscaler Operator
3.1. Release notes
3.1.1. Custom Metrics Autoscaler Operator release notes
The release notes for the Custom Metrics Autoscaler Operator for Red Hat OpenShift describe new features and enhancements, deprecated features, and known issues.
The Custom Metrics Autoscaler Operator uses the Kubernetes-based Event Driven Autoscaler (KEDA) and is built on top of the OpenShift Container Platform horizontal pod autoscaler (HPA).
The Custom Metrics Autoscaler Operator for Red Hat OpenShift is provided as an installable component, with a distinct release cycle from the core OpenShift Container Platform. The Red Hat OpenShift Container Platform Life Cycle Policy outlines release compatibility.
3.1.1.1. Supported versions
The following table defines the Custom Metrics Autoscaler Operator versions for each OpenShift Container Platform version.
Version | OpenShift Container Platform version | General availability |
---|---|---|
2.14.1 | 4.16 | General availability |
2.14.1 | 4.15 | General availability |
2.14.1 | 4.14 | General availability |
2.14.1 | 4.13 | General availability |
2.14.1 | 4.12 | General availability |
3.1.1.2. Custom Metrics Autoscaler Operator 2.14.1 release notes
This release of the Custom Metrics Autoscaler Operator 2.14.1-454 addresses a CVE and provides a new feature and bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHBA-2024:5865.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of Kubernetes-based Event Driven Autoscaler (KEDA).
3.1.1.2.1. New features and enhancements
3.1.1.2.1.1. Support for the Cron trigger with the Custom Metrics Autoscaler Operator
The Custom Metrics Autoscaler Operator can now use the Cron trigger to scale pods based on an hourly schedule. When your specified time frame starts, the Custom Metrics Autoscaler Operator scales pods to your desired amount. When the time frame ends, the Operator scales back down to the previous level.
For more information, see Understanding the Cron trigger.
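As a rough sketch of how such a trigger is declared, the following scaled object uses the upstream KEDA cron trigger fields (timezone, start, end, and desiredReplicas). The object name, namespace, target deployment, schedule, and replica counts are all hypothetical values chosen for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-cron-scaledobject   # hypothetical name
  namespace: example-namespace      # hypothetical namespace
spec:
  scaleTargetRef:
    name: example-deployment        # hypothetical Deployment to scale
  minReplicaCount: 1                # level outside the time frame
  maxReplicaCount: 10
  triggers:
  - type: cron
    metadata:
      timezone: Etc/UTC             # IANA time zone name
      start: "0 6 * * *"            # time frame starts at 06:00
      end: "0 18 * * *"             # time frame ends at 18:00
      desiredReplicas: "10"         # replica count during the time frame
```

In this sketch the deployment would be scaled to 10 replicas between 06:00 and 18:00 UTC and scaled back afterward.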
3.1.1.2.2. Bug fixes
- Previously, if you made changes to audit configuration parameters in the KedaController custom resource, the keda-metrics-server-audit-policy config map would not get updated. As a consequence, you could not change the audit configuration parameters after the initial deployment of the Custom Metrics Autoscaler. With this fix, changes to the audit configuration now render properly in the config map, allowing you to change the audit configuration any time after installation. (OCPBUGS-32521)
3.1.2. Release notes for past releases of the Custom Metrics Autoscaler Operator
The following release notes are for previous versions of the Custom Metrics Autoscaler Operator.
For the current version, see Custom Metrics Autoscaler Operator release notes.
3.1.2.1. Custom Metrics Autoscaler Operator 2.13.1 release notes
This release of the Custom Metrics Autoscaler Operator 2.13.1-421 provides a new feature and a bug fix for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHBA-2024:4837.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of Kubernetes-based Event Driven Autoscaler (KEDA).
3.1.2.1.1. New features and enhancements
3.1.2.1.1.1. Support for custom certificates with the Custom Metrics Autoscaler Operator
The Custom Metrics Autoscaler Operator can now use custom service CA certificates to connect securely to TLS-enabled metrics sources, such as an external Kafka cluster or an external Prometheus service. By default, the Operator uses automatically generated service certificates to connect to on-cluster services only. There is a new field in the KedaController object that allows you to load custom server CA certificates for connecting to external services by using config maps.
For more information, see Custom CA certificates for the Custom Metrics Autoscaler.
3.1.2.1.2. Bug fixes
- Previously, the custom-metrics-autoscaler and custom-metrics-autoscaler-adapter images were missing time zone information. As a consequence, scaled objects with cron triggers failed to work because the controllers were unable to find time zone information. With this fix, the image builds are updated to include time zone information. As a result, scaled objects containing cron triggers now function properly. Scaled objects containing cron triggers are currently not supported for the custom metrics autoscaler. (OCPBUGS-34018)
3.1.2.2. Custom Metrics Autoscaler Operator 2.12.1-394 release notes
This release of the Custom Metrics Autoscaler Operator 2.12.1-394 provides a bug fix for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHSA-2024:2901.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of Kubernetes-based Event Driven Autoscaler (KEDA).
3.1.2.2.1. Bug fixes
- Previously, the protojson.Unmarshal function entered into an infinite loop when unmarshaling certain forms of invalid JSON. This condition could occur when unmarshaling into a message that contains a google.protobuf.Any value or when the UnmarshalOptions.DiscardUnknown option is set. This release fixes this issue. (OCPBUGS-30305)
- Previously, when parsing a multipart form, either explicitly with the Request.ParseMultipartForm method or implicitly with the Request.FormValue, Request.PostFormValue, or Request.FormFile method, the limits on the total size of the parsed form were not applied to the memory consumed. This could cause memory exhaustion. With this fix, the parsing process now correctly limits the maximum size of form lines while reading a single form line. (OCPBUGS-30360)
- Previously, when following an HTTP redirect to a domain that is not on a matching subdomain or on an exact match of the initial domain, an HTTP client would not forward sensitive headers, such as Authorization or Cookie. For example, a redirect from example.com to www.example.com would forward the Authorization header, but a redirect to www.example.org would not forward the header. This release fixes this issue. (OCPBUGS-30365)
- Previously, verifying a certificate chain that contains a certificate with an unknown public key algorithm caused the certificate verification process to panic. This condition affected all crypto and Transport Layer Security (TLS) clients and servers that set the Config.ClientAuth parameter to the VerifyClientCertIfGiven or RequireAndVerifyClientCert value. The default behavior is for TLS servers to not verify client certificates. This release fixes this issue. (OCPBUGS-30370)
- Previously, if errors returned from the MarshalJSON method contained user-controlled data, an attacker could have used the data to break the contextual auto-escaping behavior of the HTML template package. This condition would allow for subsequent actions to inject unexpected content into the templates. This release fixes this issue. (OCPBUGS-30397)
- Previously, the net/http and golang.org/x/net/http2 Go packages did not limit the number of CONTINUATION frames for an HTTP/2 request. This condition could result in excessive CPU consumption. This release fixes this issue. (OCPBUGS-30894)
3.1.2.3. Custom Metrics Autoscaler Operator 2.12.1-384 release notes
This release of the Custom Metrics Autoscaler Operator 2.12.1-384 provides a bug fix for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHBA-2024:2043.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.
3.1.2.3.1. Bug fixes
- Previously, the custom-metrics-autoscaler and custom-metrics-autoscaler-adapter images were missing time zone information. As a consequence, scaled objects with cron triggers failed to work because the controllers were unable to find time zone information. With this fix, the image builds are updated to include time zone information. As a result, scaled objects containing cron triggers now function properly. (OCPBUGS-32395)
3.1.2.4. Custom Metrics Autoscaler Operator 2.12.1-376 release notes
This release of the Custom Metrics Autoscaler Operator 2.12.1-376 provides security updates and bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHSA-2024:1812.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.
3.1.2.4.1. Bug fixes
- Previously, if invalid values such as nonexistent namespaces were specified in scaled object metadata, the underlying scaler clients would not free, or close, their client descriptors, resulting in a slow memory leak. This fix properly closes the underlying client descriptors when there are errors, preventing memory from leaking. (OCPBUGS-30145)
- Previously, the ServiceMonitor custom resource (CR) for the keda-metrics-apiserver pod was not functioning, because the CR referenced an incorrect metrics port name of http. This fix corrects the ServiceMonitor CR to reference the proper port name of metrics. As a result, the Service Monitor functions properly. (OCPBUGS-25806)
3.1.2.5. Custom Metrics Autoscaler Operator 2.11.2-322 release notes
This release of the Custom Metrics Autoscaler Operator 2.11.2-322 provides security updates and bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for this release: RHSA-2023:6144.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.
3.1.2.5.1. Bug fixes
- Because the Custom Metrics Autoscaler Operator version 2.11.2-311 was released without a required volume mount in the Operator deployment, the Custom Metrics Autoscaler Operator pod would restart every 15 minutes. This fix adds the required volume mount to the Operator deployment. As a result, the Operator no longer restarts every 15 minutes. (OCPBUGS-22361)
3.1.2.6. Custom Metrics Autoscaler Operator 2.11.2-311 release notes
This release of the Custom Metrics Autoscaler Operator 2.11.2-311 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.11.2-311 were released in RHBA-2023:5981.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.
3.1.2.6.1. New features and enhancements
3.1.2.6.1.1. Red Hat OpenShift Service on AWS (ROSA) and OpenShift Dedicated are now supported
The Custom Metrics Autoscaler Operator 2.11.2-311 can be installed on OpenShift ROSA and OpenShift Dedicated managed clusters. Previous versions of the Custom Metrics Autoscaler Operator could be installed only in the openshift-keda namespace. This prevented the Operator from being installed on OpenShift ROSA and OpenShift Dedicated clusters. This version of Custom Metrics Autoscaler allows installation to other namespaces such as openshift-operators or keda, enabling installation into ROSA and Dedicated clusters.
3.1.2.6.2. Bug fixes
- Previously, if the Custom Metrics Autoscaler Operator was installed and configured, but not in use, the OpenShift CLI reported the couldn’t get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1 error after any oc command was entered. The message, although harmless, could have caused confusion. With this fix, the Got empty response for: external.metrics… error no longer appears inappropriately. (OCPBUGS-15779)
- Previously, any annotation or label change to objects managed by the Custom Metrics Autoscaler was reverted by the Custom Metrics Autoscaler Operator any time the Keda Controller was modified, for example after a configuration change. This caused continuous changing of labels in your objects. The Custom Metrics Autoscaler now uses its own annotation to manage labels and annotations, and annotations and labels are no longer inappropriately reverted. (OCPBUGS-15590)
3.1.2.7. Custom Metrics Autoscaler Operator 2.10.1-267 release notes
This release of the Custom Metrics Autoscaler Operator 2.10.1-267 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.10.1-267 were released in RHBA-2023:4089.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.
3.1.2.7.1. Bug fixes
- Previously, the custom-metrics-autoscaler and custom-metrics-autoscaler-adapter images did not contain time zone information. Because of this, scaled objects with cron triggers failed to work, as the controllers were unable to find time zone information. With this fix, the image builds now include time zone information. As a result, scaled objects containing cron triggers now function properly. (OCPBUGS-15264)
- Previously, the Custom Metrics Autoscaler Operator would attempt to take ownership of all managed objects, including objects in other namespaces and cluster-scoped objects. Because of this, the Custom Metrics Autoscaler Operator was unable to create the role binding for reading the credentials necessary to be an API server. This caused errors in the kube-system namespace. With this fix, the Custom Metrics Autoscaler Operator skips adding the ownerReference field to any object in another namespace or any cluster-scoped object. As a result, the role binding is now created without any errors. (OCPBUGS-15038)
- Previously, the Custom Metrics Autoscaler Operator added an ownerReferences field to the openshift-keda namespace. While this did not cause functionality problems, the presence of this field could have caused confusion for cluster administrators. With this fix, the Custom Metrics Autoscaler Operator does not add the ownerReference field to the openshift-keda namespace. As a result, the openshift-keda namespace no longer has a superfluous ownerReference field. (OCPBUGS-15293)
- Previously, if you used a Prometheus trigger configured with an authentication method other than pod identity, and the podIdentity parameter was set to none, the trigger would fail to scale. With this fix, the Custom Metrics Autoscaler for OpenShift now properly handles the none pod identity provider type. As a result, a Prometheus trigger configured with an authentication method other than pod identity, and with the podIdentity parameter set to none, now properly scales. (OCPBUGS-15274)
3.1.2.8. Custom Metrics Autoscaler Operator 2.10.1 release notes
This release of the Custom Metrics Autoscaler Operator 2.10.1 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.10.1 were released in RHEA-2023:3199.
Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.
3.1.2.8.1. New features and enhancements
3.1.2.8.1.1. Custom Metrics Autoscaler Operator general availability
The Custom Metrics Autoscaler Operator is now generally available as of Custom Metrics Autoscaler Operator version 2.10.1.
Scaling by using a scaled job is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
3.1.2.8.1.2. Performance metrics
You can now use the Prometheus Query Language (PromQL) to query metrics on the Custom Metrics Autoscaler Operator.
3.1.2.8.1.3. Pausing the custom metrics autoscaling for scaled objects
You can now pause the autoscaling of a scaled object, as needed, and resume autoscaling when ready.
3.1.2.8.1.4. Replica fall back for scaled objects
You can now specify the number of replicas to fall back to if a scaled object fails to get metrics from the source.
3.1.2.8.1.5. Customizable HPA naming for scaled objects
You can now specify a custom name for the horizontal pod autoscaler in scaled objects.
3.1.2.8.1.6. Activation and scaling thresholds
Because the horizontal pod autoscaler (HPA) cannot scale to or from 0 replicas, the Custom Metrics Autoscaler Operator does that scaling, after which the HPA performs the scaling. You can now specify when the HPA takes over autoscaling, based on the number of replicas. This allows for more flexibility with your scaling policies.
3.1.2.9. Custom Metrics Autoscaler Operator 2.8.2-174 release notes
This release of the Custom Metrics Autoscaler Operator 2.8.2-174 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.8.2-174 were released in RHEA-2023:1683.
The Custom Metrics Autoscaler Operator version 2.8.2-174 is a Technology Preview feature.
3.1.2.9.1. New features and enhancements
3.1.2.9.1.1. Operator upgrade support
You can now upgrade from a prior version of the Custom Metrics Autoscaler Operator. See "Changing the update channel for an Operator" in the "Additional resources" for information on upgrading an Operator.
3.1.2.9.1.2. must-gather support
You can now collect data about the Custom Metrics Autoscaler Operator and its components by using the OpenShift Container Platform must-gather tool. Currently, the process for using the must-gather tool with the Custom Metrics Autoscaler differs from the process for other Operators. See "Gathering debugging data" in the "Additional resources" for more information.
3.1.2.10. Custom Metrics Autoscaler Operator 2.8.2 release notes
This release of the Custom Metrics Autoscaler Operator 2.8.2 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.8.2 were released in RHSA-2023:1042.
The Custom Metrics Autoscaler Operator version 2.8.2 is a Technology Preview feature.
3.1.2.10.1. New features and enhancements
3.1.2.10.1.1. Audit Logging
You can now gather and view audit logs for the Custom Metrics Autoscaler Operator and its associated components. Audit logs are security-relevant chronological sets of records that document the sequence of activities that have affected the system by individual users, administrators, or other components of the system.
3.1.2.10.1.2. Scale applications based on Apache Kafka metrics
You can now use the KEDA Apache Kafka trigger/scaler to scale deployments based on an Apache Kafka topic.
3.1.2.10.1.3. Scale applications based on CPU metrics
You can now use the KEDA CPU trigger/scaler to scale deployments based on CPU metrics.
3.1.2.10.1.4. Scale applications based on memory metrics
You can now use the KEDA memory trigger/scaler to scale deployments based on memory metrics.
3.2. Custom Metrics Autoscaler Operator overview
As a developer, you can use Custom Metrics Autoscaler Operator for Red Hat OpenShift to specify how OpenShift Container Platform should automatically increase or decrease the number of pods for a deployment, stateful set, custom resource, or job based on custom metrics that are not based only on CPU or memory.
The Custom Metrics Autoscaler Operator is an optional Operator, based on the Kubernetes Event Driven Autoscaler (KEDA), that allows workloads to be scaled using additional metrics sources other than pod metrics.
The custom metrics autoscaler currently supports only the Prometheus, CPU, memory, and Apache Kafka metrics.
The Custom Metrics Autoscaler Operator scales your pods up and down based on custom, external metrics from specific applications. Your other applications continue to use other scaling methods. You configure triggers, also known as scalers, which are the source of events and metrics that the custom metrics autoscaler uses to determine how to scale. The custom metrics autoscaler uses a metrics API to convert the external metrics to a form that OpenShift Container Platform can use. The custom metrics autoscaler creates a horizontal pod autoscaler (HPA) that performs the actual scaling.
To use the custom metrics autoscaler, you create a ScaledObject or ScaledJob object for a workload, which is a custom resource (CR) that defines the scaling metadata. You specify the deployment or job to scale, the source of the metrics to scale on (trigger), and other parameters such as the minimum and maximum replica counts allowed.
You can create only one scaled object or scaled job for each workload that you want to scale. Also, you cannot use a scaled object or scaled job and the horizontal pod autoscaler (HPA) on the same workload.
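As an illustration of the object described above, a scaled object for a deployment might look like the following minimal sketch. The names example-scaledobject, my-namespace, and example-app are hypothetical, and the trigger values are placeholders rather than recommendations. The Kafka trigger fields mirror those described later in this chapter.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject   # hypothetical name
  namespace: my-namespace      # hypothetical namespace
spec:
  scaleTargetRef:              # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical deployment name
  minReplicaCount: 1           # lower bound on replicas
  maxReplicaCount: 10          # upper bound on replicas
  triggers:                    # source of the metrics to scale on
  - type: kafka
    metadata:
      topic: my-topic
      bootstrapServers: my-cluster-kafka-bootstrap.openshift-operators.svc:9092
      consumerGroup: my-group
      lagThreshold: '10'
```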
The custom metrics autoscaler, unlike the HPA, can scale to zero. If you set the minReplicaCount value in the custom metrics autoscaler CR to 0, the custom metrics autoscaler scales the workload down from 1 to 0 replicas or up from 0 replicas to 1. This is known as the activation phase. After scaling up to 1 replica, the HPA takes control of the scaling. This is known as the scaling phase.
Some triggers allow you to change the number of replicas that are scaled by the custom metrics autoscaler. In all cases, the parameter to configure the activation phase uses the same phrase, prefixed with activation. For example, if the threshold parameter configures scaling, activationThreshold would configure activation. Configuring the activation and scaling phases allows you more flexibility with your scaling policies. For example, you can configure a higher activation phase to prevent scaling up or down if the metric is particularly low.
The activation value takes priority over the scaling value if the two lead to different decisions. For example, if the threshold is set to 10, the activationThreshold is 50, and the metric reports 40, the scaler is not active and the pods are scaled to zero, even if the HPA requires 4 instances.
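The interaction between the two values can be sketched with a Prometheus trigger as follows. This is an illustrative fragment only: the object name, namespace, target deployment, and query are assumptions, and authentication settings are omitted.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: activation-example     # hypothetical name
  namespace: my-namespace      # hypothetical namespace
spec:
  scaleTargetRef:
    name: example-app          # hypothetical deployment to scale
  minReplicaCount: 0           # allows the activation phase to scale to zero
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: my-namespace
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      threshold: '10'            # scaling phase: target value used by the HPA
      activationThreshold: '50'  # activation phase: scaler is inactive at or below this value
```

With these values, a metric of 40 leaves the scaler inactive, so the workload stays at zero replicas even though 40 divided by the threshold of 10 would otherwise call for 4 replicas.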
Figure 3.1. Custom metrics autoscaler workflow
- You create or modify a scaled object custom resource for a workload on a cluster. The object contains the scaling configuration for that workload. Prior to accepting the new object, the OpenShift API server sends it to the custom metrics autoscaler admission webhooks process to ensure that the object is valid. If validation succeeds, the API server persists the object.
- The custom metrics autoscaler controller watches for new or modified scaled objects. When the OpenShift API server notifies the controller of a change, the controller monitors any external trigger sources, also known as data sources, that are specified in the object for changes to the metrics data. One or more scalers request scaling data from the external trigger source. For example, for a Kafka trigger type, the controller uses the Kafka scaler to communicate with a Kafka instance to obtain the data requested by the trigger.
- The controller creates a horizontal pod autoscaler object for the scaled object. As a result, the Horizontal Pod Autoscaler (HPA) Operator starts monitoring the scaling data associated with the trigger. The HPA requests scaling data from the cluster OpenShift API server endpoint.
- The OpenShift API server endpoint is served by the custom metrics autoscaler metrics adapter. When the metrics adapter receives a request for custom metrics, it uses a gRPC connection to the controller to request the most recent trigger data received from the scaler.
- The HPA makes scaling decisions based upon the data received from the metrics adapter and scales the workload up or down by increasing or decreasing the replicas.
- As it operates, a workload can affect the scaling metrics. For example, if a workload is scaled up to handle work in a Kafka queue, the queue size decreases after the workload processes all the work. As a result, the workload is scaled down.
- If the metrics are in a range specified by the minReplicaCount value, the custom metrics autoscaler controller disables all scaling, and leaves the replica count at a fixed level. If the metrics exceed that range, the custom metrics autoscaler controller enables scaling and allows the HPA to scale the workload. While scaling is disabled, the HPA does not take any action.
3.2.1. Custom CA certificates for the Custom Metrics Autoscaler
By default, the Custom Metrics Autoscaler Operator uses an automatically generated service CA certificate to connect to on-cluster services.
If you want to use off-cluster services that require custom CA certificates, you can add the required certificates to a config map. Then, add the config map to the KedaController custom resource as described in Installing the custom metrics autoscaler. The Operator loads those certificates on start-up and registers them as trusted by the Operator.
The config maps can contain one or more certificate files that contain one or more PEM-encoded CA certificates. Or, you can use separate config maps for each certificate file.
If you later update the config map to add additional certificates, you must restart the keda-operator-* pod for the changes to take effect.
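For illustration, a config map that holds one CA certificate file might look like the following sketch. The name thanos-cert and the key ca-cert.pem match the examples used elsewhere in this chapter; the certificate body is a placeholder that you replace with your own PEM data.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-cert
  namespace: openshift-keda    # same namespace where the Operator is installed
data:
  ca-cert.pem: |               # one file that can hold one or more PEM-encoded CA certificates
    -----BEGIN CERTIFICATE-----
    <PEM-encoded CA certificate>
    -----END CERTIFICATE-----
```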
3.3. Installing the custom metrics autoscaler
You can use the OpenShift Container Platform web console to install the Custom Metrics Autoscaler Operator.
The installation creates the following five CRDs:
- ClusterTriggerAuthentication
- KedaController
- ScaledJob
- ScaledObject
- TriggerAuthentication
3.3.1. Installing the custom metrics autoscaler
You can use the following procedure to install the Custom Metrics Autoscaler Operator.
Prerequisites
- Remove any previously installed Technology Preview versions of the Custom Metrics Autoscaler Operator.
- Remove any versions of the community-based KEDA.
  Also, remove the KEDA 1.x custom resource definitions by running the following commands:
  $ oc delete crd scaledobjects.keda.k8s.io
  $ oc delete crd triggerauthentications.keda.k8s.io
- Optional: If you need the Custom Metrics Autoscaler Operator to connect to off-cluster services, such as an external Kafka cluster or an external Prometheus service, put any required service CA certificates into a config map. The config map must exist in the same namespace where the Operator is installed. For example:
  $ oc create configmap -n openshift-keda thanos-cert --from-file=ca-cert.pem
Procedure
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose Custom Metrics Autoscaler from the list of available Operators, and click Install.
- On the Install Operator page, ensure that the All namespaces on the cluster (default) option is selected for Installation Mode. This installs the Operator in all namespaces.
- Ensure that the openshift-keda namespace is selected for Installed Namespace. OpenShift Container Platform creates the namespace, if not present in your cluster.
- Click Install.
Verify the installation by listing the Custom Metrics Autoscaler Operator components:
- Navigate to Workloads → Pods.
- Select the openshift-keda project from the drop-down menu and verify that the custom-metrics-autoscaler-operator-* pod is running.
- Navigate to Workloads → Deployments to verify that the custom-metrics-autoscaler-operator deployment is running.
Optional: Verify the installation in the OpenShift CLI by using the following command:
$ oc get all -n openshift-keda
The output appears similar to the following:
Example output
NAME                                                      READY   STATUS    RESTARTS   AGE
pod/custom-metrics-autoscaler-operator-5fd8d9ffd8-xt4xp   1/1     Running   0          18m

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/custom-metrics-autoscaler-operator   1/1     1            1           18m

NAME                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/custom-metrics-autoscaler-operator-5fd8d9ffd8   1         1         1       18m
Install the KedaController custom resource, which creates the required CRDs:
- In the OpenShift Container Platform web console, click Operators → Installed Operators.
- Click Custom Metrics Autoscaler.
- On the Operator Details page, click the KedaController tab.
- On the KedaController tab, click Create KedaController and edit the file.
kind: KedaController
apiVersion: keda.sh/v1alpha1
metadata:
  name: keda
  namespace: openshift-keda
spec:
  watchNamespace: '' # 1
  operator:
    logLevel: info # 2
    logEncoder: console # 3
    caConfigMaps: # 4
    - thanos-cert
    - kafka-cert
  metricsServer:
    logLevel: '0' # 5
    auditConfig: # 6
      logFormat: "json"
      logOutputVolumeClaim: "persistentVolumeClaimName"
      policy:
        rules:
        - level: Metadata
        omitStages: ["RequestReceived"]
        omitManagedFields: false
      lifetime:
        maxAge: "2"
        maxBackup: "1"
        maxSize: "50"
  serviceAccount: {}
- 1: Specifies a single namespace in which the Custom Metrics Autoscaler Operator should scale applications. Leave it empty to scale applications in all namespaces. This field should have a namespace or be empty. The default value is empty.
- 2: Specifies the level of verbosity for the Custom Metrics Autoscaler Operator log messages. The allowed values are debug, info, error. The default is info.
- 3: Specifies the logging format for the Custom Metrics Autoscaler Operator log messages. The allowed values are console or json. The default is console.
- 4: Optional: Specifies one or more config maps with CA certificates, which the Custom Metrics Autoscaler Operator can use to connect securely to TLS-enabled metrics sources.
- 5: Specifies the logging level for the Custom Metrics Autoscaler Metrics Server. The allowed values are 0 for info and 4 for debug. The default is 0.
- 6: Activates audit logging for the Custom Metrics Autoscaler Operator and specifies the audit policy to use, as described in the "Configuring audit logging" section.
- Click Create to create the KEDA controller.
3.4. Understanding custom metrics autoscaler triggers
Triggers, also known as scalers, provide the metrics that the Custom Metrics Autoscaler Operator uses to scale your pods.
The custom metrics autoscaler currently supports only the Prometheus, CPU, memory, and Apache Kafka triggers.
You use a ScaledObject or ScaledJob custom resource to configure triggers for specific objects, as described in the sections that follow.
3.4.1. Understanding the Prometheus trigger
You can scale pods based on Prometheus metrics, which can use the installed OpenShift Container Platform monitoring or an external Prometheus server as the metrics source. See "Additional resources" for information on the configurations required to use the OpenShift Container Platform monitoring as a source for metrics.
If Prometheus is collecting metrics from the application that the custom metrics autoscaler is scaling, do not set the minimum replicas to 0 in the custom resource. If there are no application pods, the custom metrics autoscaler does not have any metrics to scale on.
Example scaled object with a Prometheus target
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prom-scaledobject
  namespace: my-namespace
spec:
  # ...
  triggers:
  - type: prometheus # 1
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092 # 2
      namespace: kedatest # 3
      metricName: http_requests_total # 4
      threshold: '5' # 5
      query: sum(rate(http_requests_total{job="test-app"}[1m])) # 6
      authModes: basic # 7
      cortexOrgID: my-org # 8
      ignoreNullValues: "false" # 9
      unsafeSsl: "false" # 10
- 1: Specifies Prometheus as the trigger type.
- 2: Specifies the address of the Prometheus server. This example uses OpenShift Container Platform monitoring.
- 3: Optional: Specifies the namespace of the object you want to scale. This parameter is mandatory if using OpenShift Container Platform monitoring as a source for the metrics.
- 4: Specifies the name to identify the metric in the external.metrics.k8s.io API. If you are using more than one trigger, all metric names must be unique.
- 5: Specifies the value that triggers scaling. Must be specified as a quoted string value.
- 6: Specifies the Prometheus query to use.
- 7: Specifies the authentication method to use. Prometheus scalers support bearer authentication (bearer), basic authentication (basic), or TLS authentication (tls). You configure the specific authentication parameters in a trigger authentication, as discussed in a following section. As needed, you can also use a secret.
- 8: Optional: Specifies the organization ID to pass to multi-tenant Cortex storage of Prometheus.
- 9: Optional: Specifies how the trigger should proceed if the Prometheus target is lost.
  - If true, the trigger continues to operate if the Prometheus target is lost. This is the default behavior.
  - If false, the trigger returns an error if the Prometheus target is lost.
- 10: Optional: Specifies whether the certificate check should be skipped. For example, you might skip the check if you use self-signed certificates at the Prometheus endpoint.
  - If false, the certificate check is performed. This is the default behavior.
  - If true, the certificate check is not performed.
3.4.1.1. Configuring the custom metrics autoscaler to use OpenShift Container Platform monitoring
You can use the installed OpenShift Container Platform Prometheus monitoring as a source for the metrics used by the custom metrics autoscaler. However, there are some additional configurations you must perform.
These steps are not required for an external Prometheus source.
You must perform the following tasks, as described in this section:
- Create a service account.
- Create a secret that generates a token for the service account.
- Create the trigger authentication.
- Create a role.
- Add that role to the service account.
- Reference the token in the trigger authentication object used by Prometheus.
Prerequisites
- OpenShift Container Platform monitoring must be installed.
- Monitoring of user-defined workloads must be enabled in OpenShift Container Platform monitoring, as described in the Creating a user-defined workload monitoring config map section.
- The Custom Metrics Autoscaler Operator must be installed.
Procedure
Change to the project with the object you want to scale:
$ oc project my-project
Create a service account and token, if your cluster does not have one:
Create a service account object by using the following command:
$ oc create serviceaccount thanos # 1
- 1: Specifies the name of the service account.
Create a secret YAML to generate a service account token:
apiVersion: v1
kind: Secret
metadata:
  name: thanos-token
  annotations:
    kubernetes.io/service-account.name: thanos # 1
type: kubernetes.io/service-account-token
- 1: Specifies the name of the service account.
Create the secret object by using the following command:
$ oc create -f <file_name>.yaml
Use the following command to locate the token assigned to the service account:
$ oc describe serviceaccount thanos # 1
- 1: Specifies the name of the service account.
Example output
Name:                thanos
Namespace:           my-project
Labels:              <none>
Annotations:         <none>
Image pull secrets:  thanos-dockercfg-nnwgj
Mountable secrets:   thanos-dockercfg-nnwgj
Tokens:              thanos-token # 1
Events:              <none>
- 1: Use this token in the trigger authentication.
Create a trigger authentication with the service account token:
Create a YAML file similar to the following:
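apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-prometheus
spec:
  secretTargetRef:           # reads the authentication values from a secret
  - parameter: bearerToken   # authentication parameter supplied from the secret
    name: thanos-token       # secret that holds the service account token
    key: token               # key in the secret that contains the token
  - parameter: ca
    name: thanos-token
    key: ca.crt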
Create the CR object:
$ oc create -f <file-name>.yaml
Create a role for reading Thanos metrics:
Create a YAML file with the following parameters:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: thanos-metrics-reader
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
Create the CR object:
$ oc create -f <file-name>.yaml
Create a role binding for reading Thanos metrics:
Create a YAML file similar to the following:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: thanos-metrics-reader   # name of the role binding
  namespace: my-project         # namespace of the object to scale
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: thanos-metrics-reader
subjects:
- kind: ServiceAccount
  name: thanos                  # service account to bind to the role
  namespace: my-project         # namespace of the service account
Create the CR object:
$ oc create -f <file-name>.yaml
You can now deploy a scaled object or scaled job to enable autoscaling for your application, as described in "Understanding how to add custom metrics autoscalers". To use OpenShift Container Platform monitoring as the source, in the trigger, or scaler, you must include the following parameters:
- triggers.type must be prometheus
- triggers.metadata.serverAddress must be https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
- triggers.metadata.authModes must be bearer
- triggers.metadata.namespace must be set to the namespace of the object to scale
- triggers.authenticationRef must point to the trigger authentication resource specified in the previous step
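Taken together, the required parameters might be combined as in the following sketch. The deployment name example-app is hypothetical, and the metric name, threshold, and query are placeholder values borrowed from the earlier Prometheus example. The trigger authentication name matches the one created in this procedure.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prom-scaledobject
  namespace: my-project
spec:
  scaleTargetRef:
    name: example-app              # hypothetical deployment to scale
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: my-project        # namespace of the object to scale
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: bearer
    authenticationRef:
      name: keda-trigger-auth-prometheus   # trigger authentication created in this procedure
```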
3.4.2. Understanding the CPU trigger
You can scale pods based on CPU metrics. This trigger uses cluster metrics as the source for metrics.
The custom metrics autoscaler scales the pods associated with an object to maintain the CPU usage that you specify. The autoscaler increases or decreases the number of replicas between the minimum and maximum numbers to maintain the specified CPU utilization across all pods. The CPU trigger considers the CPU utilization of the entire pod. If the pod has multiple containers, the CPU trigger considers the total CPU utilization of all containers in the pod.
- This trigger cannot be used with the ScaledJob custom resource.
- When using a CPU trigger to scale an object, the object does not scale to 0, even if you are using multiple triggers.
Example scaled object with a CPU target
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cpu-scaledobject
  namespace: my-namespace
spec:
  # ...
  triggers:
  - type: cpu # 1
    metricType: Utilization # 2
    metadata:
      value: '60' # 3
  minReplicaCount: 1 # 4
- 1: Specifies CPU as the trigger type.
- 2: Specifies the type of metric to use, either Utilization or AverageValue.
- 3: Specifies the value that triggers scaling. Must be specified as a quoted string value.
  - When using Utilization, the target value is the average of the resource metrics across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
  - When using AverageValue, the target value is the average of the metrics across all relevant pods.
- 4: Specifies the minimum number of replicas when scaling down. For a CPU trigger, enter a value of 1 or greater, because the HPA cannot scale to zero if you are using only CPU metrics.
3.4.3. Understanding the memory trigger
You can scale pods based on memory metrics. This trigger uses cluster metrics as the source for metrics.
The custom metrics autoscaler scales the pods associated with an object to maintain the average memory usage that you specify. The autoscaler increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified memory utilization across all pods. The memory trigger considers the memory utilization of the entire pod. If the pod has multiple containers, the memory utilization is the sum of all of the containers.
- This trigger cannot be used with the ScaledJob custom resource.
- When using a memory trigger to scale an object, the object does not scale to 0, even if you are using multiple triggers.
Example scaled object with a memory target
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: memory-scaledobject
  namespace: my-namespace
spec:
  # ...
  triggers:
  - type: memory # 1
    metricType: Utilization # 2
    metadata:
      value: '60' # 3
      containerName: api # 4
- 1: Specifies memory as the trigger type.
- 2: Specifies the type of metric to use, either Utilization or AverageValue.
- 3: Specifies the value that triggers scaling. Must be specified as a quoted string value.
  - When using Utilization, the target value is the average of the resource metrics across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
  - When using AverageValue, the target value is the average of the metrics across all relevant pods.
- 4: Optional: Specifies an individual container to scale, based on the memory utilization of only that container, rather than the entire pod. In this example, only the container named api is to be scaled.
3.4.4. Understanding the Kafka trigger
You can scale pods based on an Apache Kafka topic or other services that support the Kafka protocol. The custom metrics autoscaler does not scale higher than the number of Kafka partitions, unless you set the allowIdleConsumers parameter to true in the scaled object or scaled job.
If the number of consumers in a consumer group exceeds the number of partitions in a topic, the extra consumers remain idle. To avoid this, by default the number of replicas does not exceed:
- The number of partitions on a topic, if a topic is specified
- The number of partitions of all topics in the consumer group, if no topic is specified
- The maxReplicaCount specified in the scaled object or scaled job CR
You can use the allowIdleConsumers parameter to disable these default behaviors.
Example scaled object with a Kafka target
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: my-namespace
spec:
# ...
  triggers:
  - type: kafka # <1>
    metadata:
      topic: my-topic # <2>
      bootstrapServers: my-cluster-kafka-bootstrap.openshift-operators.svc:9092 # <3>
      consumerGroup: my-group # <4>
      lagThreshold: '10' # <5>
      activationLagThreshold: '5' # <6>
      offsetResetPolicy: latest # <7>
      allowIdleConsumers: true # <8>
      scaleToZeroOnInvalidOffset: false # <9>
      excludePersistentLag: false # <10>
      version: '1.0.0' # <11>
      partitionLimitation: '1,2,10-20,31' # <12>
- 1
- Specifies Kafka as the trigger type.
- 2
- Specifies the name of the Kafka topic on which Kafka is processing the offset lag.
- 3
- Specifies a comma-separated list of Kafka brokers to connect to.
- 4
- Specifies the name of the Kafka consumer group used for checking the offset on the topic and processing the related lag.
- 5
- Optional: Specifies the average target value that triggers scaling. Must be specified as a quoted string value. The default is 5.
- 6
- Optional: Specifies the target value for the activation phase. Must be specified as a quoted string value.
- 7
- Optional: Specifies the Kafka offset reset policy for the Kafka consumer. The available values are latest and earliest. The default is latest.
- 8
- Optional: Specifies whether the number of Kafka replicas can exceed the number of partitions on a topic.
  - If true, the number of Kafka replicas can exceed the number of partitions on a topic. This allows for idle Kafka consumers.
  - If false, the number of Kafka replicas cannot exceed the number of partitions on a topic. This is the default.
- 9
- Specifies how the trigger behaves when a Kafka partition does not have a valid offset.
  - If true, the consumers are scaled to zero for that partition.
  - If false, the scaler keeps a single consumer for that partition. This is the default.
- 10
- Optional: Specifies whether the trigger includes or excludes partition lag for partitions whose current offset is the same as the current offset of the previous polling cycle.
  - If true, the scaler excludes partition lag in these partitions.
  - If false, the trigger includes all consumer lag in all partitions. This is the default.
- 11
- Optional: Specifies the version of your Kafka brokers. Must be specified as a quoted string value. The default is 1.0.0.
- 12
- Optional: Specifies a comma-separated list of partition IDs to scope the scaling on. If set, only the listed IDs are considered when calculating lag. Must be specified as a quoted string value. The default is to consider all partitions.
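The interaction between lagThreshold, partitionLimitation, allowIdleConsumers, and maxReplicaCount can be sketched in Python as follows. This is a simplified illustration, not the scaler's actual implementation: it assumes the replica estimate is the total scoped lag divided by the lag threshold, and that the partition cap is the number of partitions on the topic. The function names and the lag_by_partition input are hypothetical.

```python
import math


def parse_partition_limitation(spec):
    """Expand a string such as '1,2,10-20,31' into a set of partition IDs."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            ids.update(range(int(low), int(high) + 1))
        else:
            ids.add(int(part))
    return ids


def kafka_replica_estimate(lag_by_partition, lag_threshold, max_replica_count,
                           allow_idle_consumers=False, partition_limitation=None):
    """Estimate replicas from per-partition lag (a dict of partition ID to lag)."""
    scoped = lag_by_partition
    if partition_limitation:
        wanted = parse_partition_limitation(partition_limitation)
        # Only the listed partition IDs are considered when calculating lag.
        scoped = {p: lag for p, lag in lag_by_partition.items() if p in wanted}
    replicas = math.ceil(sum(scoped.values()) / lag_threshold)
    if not allow_idle_consumers:
        # Assumption: cap at the number of partitions on the topic.
        replicas = min(replicas, len(lag_by_partition))
    return min(replicas, max_replica_count)


lag = {0: 40, 1: 30, 2: 50}
print(kafka_replica_estimate(lag, lag_threshold=10, max_replica_count=100))
print(kafka_replica_estimate(lag, 10, 100, allow_idle_consumers=True))
```

In this sketch, three partitions with a combined lag of 120 and a threshold of 10 would call for 12 replicas, but the default behavior caps the result at the 3 partitions unless idle consumers are allowed.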
3.4.5. Understanding the Cron trigger
You can scale pods based on a time range.
When the time range starts, the custom metrics autoscaler scales the pods associated with an object from the configured minimum number of pods to the specified number of desired pods. At the end of the time range, the pods are scaled back to the configured minimum. The time period must be configured in cron format.
The custom metrics autoscaler with Cron trigger scales pods based only on the specified times and cannot scale pods on a daily, weekly, monthly, or yearly schedule.
The following example scales the pods associated with this scaled object from 0
to 100
from 6:00 AM to 6:30 PM India Standard Time.
Example scaled object with a Cron trigger
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: my-deployment
  minReplicaCount: 0 # <1>
  maxReplicaCount: 100 # <2>
  cooldownPeriod: 300
  triggers:
  - type: cron # <3>
    metadata:
      timezone: Asia/Kolkata # <4>
      start: "0 6 * * *" # <5>
      end: "30 18 * * *" # <6>
      desiredReplicas: "100" # <7>
- 1
- Specifies the minimum number of pods to scale down to at the end of the time frame.
- 2
- Specifies the maximum number of replicas when scaling up. This value should be the same as desiredReplicas. The default is 100.
- 3
- Specifies a Cron trigger.
- 4
- Specifies the timezone for the time frame. This value must be from the IANA Time Zone Database.
- 5
- Specifies the start of the time frame.
- 6
- Specifies the end of the time frame.
- 7
- Specifies the number of pods to scale to between the start and end of the time frame. This value should be the same as maxReplicaCount.
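The effect of the example time frame can be approximated with a short Python sketch. It assumes a window that starts and ends on the same day (06:00 to 18:30), treats the end time as exclusive, and models Asia/Kolkata as a fixed UTC+05:30 offset, which holds because that zone has no daylight saving time. The function name and its parameters are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Asia/Kolkata is UTC+05:30 all year, so a fixed offset is sufficient here.
IST = timezone(timedelta(hours=5, minutes=30))


def cron_window_replicas(now_utc, min_replicas=0, desired_replicas=100,
                         start=time(6, 0), end=time(18, 30), tz=IST):
    """Return the replica count the example Cron trigger would aim for at now_utc."""
    local = now_utc.astimezone(tz).time()
    # Inside the window the target is desiredReplicas; outside it, the minimum.
    return desired_replicas if start <= local < end else min_replicas


# 01:00 UTC is 06:30 in India, which falls inside the window.
print(cron_window_replicas(datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)))
```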
3.5. Understanding custom metrics autoscaler trigger authentications
A trigger authentication allows you to include authentication information in a scaled object or a scaled job that can be used by the associated containers. You can use trigger authentications to pass OpenShift Container Platform secrets, platform-native pod authentication mechanisms, environment variables, and so on.
You define a TriggerAuthentication
object in the same namespace as the object that you want to scale. That trigger authentication can be used only by objects in that namespace.
Alternatively, to share credentials between objects in multiple namespaces, you can create a ClusterTriggerAuthentication
object that can be used across all namespaces.
Trigger authentications and cluster trigger authentications use the same configuration. However, a cluster trigger authentication requires an additional kind parameter in the authentication reference of the scaled object.
Example trigger authentication with a secret
kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: secret-triggerauthentication
  namespace: my-namespace # <1>
spec:
  secretTargetRef: # <2>
  - parameter: user-name # <3>
    name: my-secret # <4>
    key: USER_NAME # <5>
  - parameter: password
    name: my-secret
    key: USER_PASSWORD
- 1
- Specifies the namespace of the object you want to scale.
- 2
- Specifies that this trigger authentication uses a secret for authorization.
- 3
- Specifies the authentication parameter to supply by using the secret.
- 4
- Specifies the name of the secret to use.
- 5
- Specifies the key in the secret to use with the specified parameter.
Example cluster trigger authentication with a secret
kind: ClusterTriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata: # <1>
  name: secret-cluster-triggerauthentication
spec:
  secretTargetRef: # <2>
  - parameter: user-name # <3>
    name: secret-name # <4>
    key: USER_NAME # <5>
  - parameter: user-password
    name: secret-name
    key: USER_PASSWORD
- 1
- Note that no namespace is used with a cluster trigger authentication.
- 2
- Specifies that this trigger authentication uses a secret for authorization.
- 3
- Specifies the authentication parameter to supply by using the secret.
- 4
- Specifies the name of the secret to use.
- 5
- Specifies the key in the secret to use with the specified parameter.
Example trigger authentication with a token
kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: token-triggerauthentication
  namespace: my-namespace # <1>
spec:
  secretTargetRef: # <2>
  - parameter: bearerToken # <3>
    name: my-token-2vzfq # <4>
    key: token # <5>
  - parameter: ca
    name: my-token-2vzfq
    key: ca.crt
- 1
- Specifies the namespace of the object you want to scale.
- 2
- Specifies that this trigger authentication uses a secret for authorization.
- 3
- Specifies the authentication parameter to supply by using the token.
- 4
- Specifies the name of the token to use.
- 5
- Specifies the key in the token to use with the specified parameter.
Example trigger authentication with an environment variable
kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: env-var-triggerauthentication
  namespace: my-namespace # <1>
spec:
  env: # <2>
  - parameter: access_key # <3>
    name: ACCESS_KEY # <4>
    containerName: my-container # <5>
- 1
- Specifies the namespace of the object you want to scale.
- 2
- Specifies that this trigger authentication uses environment variables for authorization.
- 3
- Specifies the parameter to set with this variable.
- 4
- Specifies the name of the environment variable.
- 5
- Optional: Specifies a container that requires authentication. The container must be in the same resource as referenced by scaleTargetRef in the scaled object.
Example trigger authentication with pod authentication providers
kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: pod-id-triggerauthentication
  namespace: my-namespace # <1>
spec:
  podIdentity: # <2>
    provider: aws-eks # <3>
Additional resources
- For information about OpenShift Container Platform secrets, see Providing sensitive data to pods.
3.5.1. Using trigger authentications
You use trigger authentications and cluster trigger authentications by creating the authentication with a custom resource and then adding a reference to it in a scaled object or scaled job.
Prerequisites
- The Custom Metrics Autoscaler Operator must be installed.
- If you are using a secret, the Secret object must exist, for example:
Example secret
apiVersion: v1
kind: Secret
metadata:
  name: my-secret
data:
  USER_NAME: <base64_USER_NAME>
  USER_PASSWORD: <base64_USER_PASSWORD>
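Values under the data field of a Secret object must be base64-encoded. The following Python sketch shows one way to produce the encoded strings for placeholders such as <base64_USER_NAME>. The key names and the sample credentials are illustrative assumptions; use the keys that your trigger authentication references in its key fields.

```python
import base64


def encode_secret_data(values):
    """Return a mapping with each value base64-encoded for a Secret's data field."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


# Hypothetical credentials, for illustration only.
data = encode_secret_data({"USER_NAME": "admin", "USER_PASSWORD": "s3cr3t"})
for key, encoded in data.items():
    print(f"{key}: {encoded}")
```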
Procedure
Create the TriggerAuthentication or ClusterTriggerAuthentication object.
Create a YAML file that defines the object:
Example trigger authentication with a secret
kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: prom-triggerauthentication
  namespace: my-namespace
spec:
  secretTargetRef:
  - parameter: user-name
    name: my-secret
    key: USER_NAME
  - parameter: password
    name: my-secret
    key: USER_PASSWORD
Create the TriggerAuthentication object:
$ oc create -f <filename>.yaml
Create or edit a ScaledObject YAML file that uses the trigger authentication:
Create a YAML file that defines the object:
Example scaled object with a trigger authentication
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:
    name: example-deployment
  maxReplicaCount: 100
  minReplicaCount: 0
  pollingInterval: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest # replace <NAMESPACE>
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: "basic"
    authenticationRef:
      name: prom-triggerauthentication # <1>
      kind: TriggerAuthentication # <2>
Example scaled object with a cluster trigger authentication
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:
    name: example-deployment
  maxReplicaCount: 100
  minReplicaCount: 0
  pollingInterval: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest # replace <NAMESPACE>
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: "basic"
    authenticationRef:
      name: prom-cluster-triggerauthentication # <1>
      kind: ClusterTriggerAuthentication # <2>
Create the scaled object by running the following command:
$ oc apply -f <filename>
3.6. Pausing the custom metrics autoscaler for a scaled object
You can pause and restart the autoscaling of a workload, as needed.
For example, you might want to pause autoscaling before performing cluster maintenance or to avoid resource starvation by removing non-mission-critical workloads.
3.6.1. Pausing a custom metrics autoscaler
You can pause the autoscaling of a scaled object by adding the autoscaling.keda.sh/paused-replicas
annotation to the custom metrics autoscaler for that scaled object. The custom metrics autoscaler scales the replicas for that workload to the specified value and pauses autoscaling until the annotation is removed.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4"
# ...
Procedure
Use the following command to edit the ScaledObject CR for your workload:
$ oc edit ScaledObject scaledobject
Add the autoscaling.keda.sh/paused-replicas annotation, set to the number of replicas that you want while autoscaling is paused:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4" # <1>
  creationTimestamp: "2023-02-08T14:41:01Z"
  generation: 1
  name: scaledobject
  namespace: my-project
  resourceVersion: '65729'
  uid: f5aec682-acdf-4232-a783-58b5b82f5dd0
- 1
- Specifies that the Custom Metrics Autoscaler Operator is to scale the replicas to the specified value and stop autoscaling.
3.6.2. Restarting the custom metrics autoscaler for a scaled object
You can restart a paused custom metrics autoscaler by removing the autoscaling.keda.sh/paused-replicas annotation for that ScaledObject.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4"
# ...
Procedure
Use the following command to edit the ScaledObject CR for your workload:
$ oc edit ScaledObject scaledobject
Remove the autoscaling.keda.sh/paused-replicas annotation.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4" # <1>
  creationTimestamp: "2023-02-08T14:41:01Z"
  generation: 1
  name: scaledobject
  namespace: my-project
  resourceVersion: '65729'
  uid: f5aec682-acdf-4232-a783-58b5b82f5dd0
- 1
- Remove this annotation to restart a paused custom metrics autoscaler.
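As an alternative to editing the object interactively, the pause and restart changes can be expressed as JSON merge patches. The Python sketch below only builds the patch documents; it assumes you would pass the result to a command such as oc patch scaledobject <name> --type=merge -p '<patch>', which is not part of the procedures above. The helper names are hypothetical.

```python
import json

PAUSED_ANNOTATION = "autoscaling.keda.sh/paused-replicas"


def pause_patch(replicas):
    """Build a merge patch that sets the paused-replicas annotation."""
    # Annotation values are strings, so the replica count is quoted.
    return json.dumps({"metadata": {"annotations": {PAUSED_ANNOTATION: str(replicas)}}})


def resume_patch():
    """Build a merge patch that removes the paused-replicas annotation."""
    # In a JSON merge patch, a null value deletes the key.
    return json.dumps({"metadata": {"annotations": {PAUSED_ANNOTATION: None}}})


print(pause_patch(4))
print(resume_patch())
```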
3.7. Gathering audit logs
You can gather audit logs, which are a security-relevant, chronological set of records documenting the sequence of activities by individual users, administrators, or other system components that have affected the system.
For example, audit logs can help you understand where an autoscaling request is coming from. This is key information when backends are getting overloaded by autoscaling requests made by user applications and you need to determine which is the troublesome application.
3.7.1. Configuring audit logging
You can configure auditing for the Custom Metrics Autoscaler Operator by editing the KedaController
custom resource. The logs are sent to an audit log file on a volume that is secured by using a persistent volume claim in the KedaController
CR.
Prerequisites
- The Custom Metrics Autoscaler Operator must be installed.
Procedure
Edit the KedaController custom resource to add the auditConfig stanza:
kind: KedaController
apiVersion: keda.sh/v1alpha1
metadata:
  name: keda
  namespace: openshift-keda
spec:
# ...
  metricsServer:
# ...
    auditConfig:
      logFormat: "json" # <1>
      logOutputVolumeClaim: "pvc-audit-log" # <2>
      policy:
        rules: # <3>
        - level: Metadata
        omitStages: "RequestReceived" # <4>
        omitManagedFields: false # <5>
      lifetime: # <6>
        maxAge: "2"
        maxBackup: "1"
        maxSize: "50"
- 1
- Specifies the output format of the audit log, either legacy or json.
- 2
- Specifies an existing persistent volume claim for storing the log data. All requests coming to the API server are logged to this persistent volume claim. If you leave this field empty, the log data is sent to stdout.
- 3
- Specifies which events should be recorded and what data they should include:
  - None: Do not log events.
  - Metadata: Log only the metadata for the request, such as user, timestamp, and so forth. Do not log the request text and the response text. This is the default.
  - Request: Log only the metadata and the request text but not the response text. This option does not apply for non-resource requests.
  - RequestResponse: Log event metadata, request text, and response text. This option does not apply for non-resource requests.
- 4
- Specifies stages for which no event is created.
- 5
- Specifies whether to omit the managed fields of the request and response bodies from being written to the API audit log, either true to omit the fields or false to include the fields.
- 6
- Specifies the size and lifespan of the audit logs.
  - maxAge: The maximum number of days to retain audit log files, based on the timestamp encoded in their filename.
  - maxBackup: The maximum number of audit log files to retain. Set to 0 to retain all audit log files.
  - maxSize: The maximum size in megabytes of an audit log file before it gets rotated.
Verification
View the audit log file directly:
Obtain the name of the keda-metrics-apiserver-* pod:
$ oc get pod -n openshift-keda
Example output
NAME                                                  READY   STATUS    RESTARTS   AGE
custom-metrics-autoscaler-operator-5cb44cd75d-9v4lv   1/1     Running   0          8m20s
keda-metrics-apiserver-65c7cc44fd-rrl4r               1/1     Running   0          2m55s
keda-operator-776cbb6768-zpj5b                        1/1     Running   0          2m55s
View the log data by using a command similar to the following:
$ oc logs keda-metrics-apiserver-<hash> | grep -i metadata # <1>
- 1
- Optional: You can use the grep command to specify the log level to display: Metadata, Request, RequestResponse.
For example:
$ oc logs keda-metrics-apiserver-65c7cc44fd-rrl4r|grep -i metadata
Example output
... {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4c81d41b-3dab-4675-90ce-20b87ce24013","stage":"ResponseComplete","requestURI":"/healthz","verb":"get","user":{"username":"system:anonymous","groups":["system:unauthenticated"]},"sourceIPs":["10.131.0.1"],"userAgent":"kube-probe/1.28","responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2023-02-16T13:00:03.554567Z","stageTimestamp":"2023-02-16T13:00:03.555032Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} ...
Alternatively, you can view a specific log:
Use a command similar to the following to log in to the keda-metrics-apiserver-* pod:
$ oc rsh pod/keda-metrics-apiserver-<hash> -n openshift-keda
For example:
$ oc rsh pod/keda-metrics-apiserver-65c7cc44fd-rrl4r -n openshift-keda
Change to the /var/audit-policy/ directory:
sh-4.4$ cd /var/audit-policy/
List the available logs:
sh-4.4$ ls
Example output
log-2023.02.17-14:50 policy.yaml
View the log, as needed:
sh-4.4$ cat <log_name>/<pvc_name> | grep -i <log_level> # <1>
- 1
- Optional: You can use the grep command to specify the log level to display: Metadata, Request, RequestResponse.
For example:
sh-4.4$ cat log-2023.02.17-14:50/pvc-audit-log|grep -i Request
Example output
... {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"63e7f68c-04ec-4f4d-8749-bf1656572a41","stage":"ResponseComplete","requestURI":"/openapi/v2","verb":"get","user":{"username":"system:aggregator","groups":["system:authenticated"]},"sourceIPs":["10.128.0.1"],"responseStatus":{"metadata":{},"code":304},"requestReceivedTimestamp":"2023-02-17T13:12:55.035478Z","stageTimestamp":"2023-02-17T13:12:55.038346Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:discovery\" of ClusterRole \"system:discovery\" to Group \"system:authenticated\""}} ...
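Because grep -i matches a word anywhere in the line, a pattern such as metadata or Request also matches field names like "metadata":{} or requestURI that appear in most events. If you need to filter strictly on the audit level, a JSON-aware filter is more precise. The following Python sketch is one possible approach. The function name is hypothetical, and it assumes the log was written with logFormat set to json, one event per line.

```python
import json


def filter_audit_events(lines, level):
    """Return the audit events whose "level" field equals the given level."""
    events = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and anything that is not a JSON object.
        if not line.startswith("{"):
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("level") == level:
            events.append(event)
    return events


sample = [
    '{"kind":"Event","level":"Metadata","requestURI":"/healthz","verb":"get"}',
    '{"kind":"Event","level":"Request","requestURI":"/openapi/v2","verb":"get"}',
    'not a json line',
]
for event in filter_audit_events(sample, "Request"):
    print(event["requestURI"], event["verb"])
```

To use the function on real data, you could read lines from sys.stdin or from a log file copied out of the pod.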
3.8. Gathering debugging data
When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.
To help troubleshoot your issue, provide the following information:
- Data gathered using the must-gather tool.
- The unique cluster ID.
You can use the must-gather
tool to collect data about the Custom Metrics Autoscaler Operator and its components, including the following items:
- The openshift-keda namespace and its child objects.
- The Custom Metrics Autoscaler Operator installation objects.
- The Custom Metrics Autoscaler Operator CRD objects.
3.8.1. Gathering debugging data
The following command runs the must-gather
tool for the Custom Metrics Autoscaler Operator:
$ oc adm must-gather --image="$(oc get packagemanifests openshift-custom-metrics-autoscaler-operator \
-n openshift-marketplace \
-o jsonpath='{.status.channels[?(@.name=="stable")].currentCSVDesc.annotations.containerImage}')"
The standard OpenShift Container Platform must-gather
command, oc adm must-gather
, does not collect Custom Metrics Autoscaler Operator data.
Prerequisites
- You are logged in to OpenShift Container Platform as a user with the cluster-admin role.
- The OpenShift Container Platform CLI (oc) is installed.
Procedure
Navigate to the directory where you want to store the must-gather data.
Note
If your cluster is using a restricted network, you must take additional steps. If your mirror registry has a trusted CA, you must first add the trusted CA to the cluster. For all clusters on restricted networks, you must import the default must-gather image as an image stream by running the following command.
$ oc import-image is/must-gather -n openshift
Perform one of the following:
- To get only the Custom Metrics Autoscaler Operator must-gather data, use the following command:
$ oc adm must-gather --image="$(oc get packagemanifests openshift-custom-metrics-autoscaler-operator \
-n openshift-marketplace \
-o jsonpath='{.status.channels[?(@.name=="stable")].currentCSVDesc.annotations.containerImage}')"
The custom image for the must-gather command is pulled directly from the Operator package manifests, so that it works on any cluster where the Custom Metrics Autoscaler Operator is available.
- To gather the default must-gather data in addition to the Custom Metrics Autoscaler Operator information:
Use the following command to obtain the Custom Metrics Autoscaler Operator image and set it as an environment variable:
$ IMAGE="$(oc get packagemanifests openshift-custom-metrics-autoscaler-operator \
-n openshift-marketplace \
-o jsonpath='{.status.channels[?(@.name=="stable")].currentCSVDesc.annotations.containerImage}')"
Use the oc adm must-gather command with the Custom Metrics Autoscaler Operator image:
$ oc adm must-gather --image-stream=openshift/must-gather --image=${IMAGE}
Example 3.1. Example must-gather output for the Custom Metrics Autoscaler
└── openshift-keda
    ├── apps
    │   ├── daemonsets.yaml
    │   ├── deployments.yaml
    │   ├── replicasets.yaml
    │   └── statefulsets.yaml
    ├── apps.openshift.io
    │   └── deploymentconfigs.yaml
    ├── autoscaling
    │   └── horizontalpodautoscalers.yaml
    ├── batch
    │   ├── cronjobs.yaml
    │   └── jobs.yaml
    ├── build.openshift.io
    │   ├── buildconfigs.yaml
    │   └── builds.yaml
    ├── core
    │   ├── configmaps.yaml
    │   ├── endpoints.yaml
    │   ├── events.yaml
    │   ├── persistentvolumeclaims.yaml
    │   ├── pods.yaml
    │   ├── replicationcontrollers.yaml
    │   ├── secrets.yaml
    │   └── services.yaml
    ├── discovery.k8s.io
    │   └── endpointslices.yaml
    ├── image.openshift.io
    │   └── imagestreams.yaml
    ├── k8s.ovn.org
    │   ├── egressfirewalls.yaml
    │   └── egressqoses.yaml
    ├── keda.sh
    │   ├── kedacontrollers
    │   │   └── keda.yaml
    │   ├── scaledobjects
    │   │   └── example-scaledobject.yaml
    │   └── triggerauthentications
    │       └── example-triggerauthentication.yaml
    ├── monitoring.coreos.com
    │   └── servicemonitors.yaml
    ├── networking.k8s.io
    │   └── networkpolicies.yaml
    ├── openshift-keda.yaml
    ├── pods
    │   ├── custom-metrics-autoscaler-operator-58bd9f458-ptgwx
    │   │   ├── custom-metrics-autoscaler-operator
    │   │   │   └── custom-metrics-autoscaler-operator
    │   │   │       └── logs
    │   │   │           ├── current.log
    │   │   │           ├── previous.insecure.log
    │   │   │           └── previous.log
    │   │   └── custom-metrics-autoscaler-operator-58bd9f458-ptgwx.yaml
    │   ├── custom-metrics-autoscaler-operator-58bd9f458-thbsh
    │   │   └── custom-metrics-autoscaler-operator
    │   │       └── custom-metrics-autoscaler-operator
    │   │           └── logs
    │   ├── keda-metrics-apiserver-65c7cc44fd-6wq4g
    │   │   ├── keda-metrics-apiserver
    │   │   │   └── keda-metrics-apiserver
    │   │   │       └── logs
    │   │   │           ├── current.log
    │   │   │           ├── previous.insecure.log
    │   │   │           └── previous.log
    │   │   └── keda-metrics-apiserver-65c7cc44fd-6wq4g.yaml
    │   └── keda-operator-776cbb6768-fb6m5
    │       ├── keda-operator
    │       │   └── keda-operator
    │       │       └── logs
    │       │           ├── current.log
    │       │           ├── previous.insecure.log
    │       │           └── previous.log
    │       └── keda-operator-776cbb6768-fb6m5.yaml
    ├── policy
    │   └── poddisruptionbudgets.yaml
    └── route.openshift.io
        └── routes.yaml
Create a compressed file from the must-gather directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:
$ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/ # <1>
- 1
- Replace must-gather.local.5421342344627712289/ with the actual directory name.
- Attach the compressed file to your support case on the Red Hat Customer Portal.
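Because the numeric suffix of the must-gather directory differs on every run, it can be convenient to locate the directory programmatically before compressing it. The following Python sketch is one way to do that with the standard library. It assumes the directory name follows the must-gather.local.<number> pattern shown in the tar example, and the function name is hypothetical.

```python
import tarfile
from pathlib import Path


def archive_must_gather(workdir=".", output="must-gather.tar.gz"):
    """Compress the most recently modified must-gather.local.* directory in workdir."""
    workdir = Path(workdir)
    candidates = [p for p in workdir.glob("must-gather.local.*") if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no must-gather.local.* directory found in {workdir}")
    # If several runs exist, pick the one that was modified last.
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    archive = workdir / output
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(newest, arcname=newest.name)
    return archive
```

Calling archive_must_gather() from the directory where you ran the must-gather tool returns the path of the compressed file to attach.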
3.9. Viewing Operator metrics
The Custom Metrics Autoscaler Operator exposes ready-to-use metrics that it pulls from the on-cluster monitoring component. You can query the metrics by using the Prometheus Query Language (PromQL) to analyze and diagnose issues. All metrics are reset when the controller pod restarts.
3.9.1. Accessing performance metrics
You can access the metrics and run queries by using the OpenShift Container Platform web console.
Procedure
- Select the Administrator perspective in the OpenShift Container Platform web console.
- Select Observe → Metrics.
- To create a custom query, add your PromQL query to the Expression field.
- To add multiple queries, select Add Query.
3.9.1.1. Provided Operator metrics
The Custom Metrics Autoscaler Operator exposes the following metrics, which you can view by using the OpenShift Container Platform web console.
Metric name | Description |
---|---|
|
Whether the particular scaler is active or inactive. A value of |
| The current value for each scaler’s metric, which is used by the Horizontal Pod Autoscaler (HPA) in computing the target average. |
| The latency of retrieving the current metric from each scaler. |
| The number of errors that have occurred for each scaler. |
| The total number of errors encountered for all scalers. |
| The number of errors that have occurred for each scaled object. |
| The total number of Custom Metrics Autoscaler custom resources in each namespace for each custom resource type. |
| The total number of triggers by trigger type. |
Custom Metrics Autoscaler Admission webhook metrics
The Custom Metrics Autoscaler Admission webhook also exposes the following Prometheus metrics.
Metric name | Description |
---|---|
| The number of scaled object validations. |
| The number of validation errors. |
3.10. Understanding how to add custom metrics autoscalers
To add a custom metrics autoscaler, create a ScaledObject
custom resource for a deployment, stateful set, or custom resource. Create a ScaledJob
custom resource for a job.
You can create only one scaled object for each workload that you want to scale. Also, you cannot use a scaled object and the horizontal pod autoscaler (HPA) on the same workload.
3.10.1. Adding a custom metrics autoscaler to a workload
You can create a custom metrics autoscaler for a workload that is created by a Deployment
, StatefulSet
, or custom resource
object.
Prerequisites
- The Custom Metrics Autoscaler Operator must be installed.
- If you use a custom metrics autoscaler for scaling based on CPU or memory:
  - Your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name> command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with CPU and Memory displayed under Usage.
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Example output
Name:         openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  wait-for-host-port
  Usage:
    Memory:  0
  Name:      scheduler
  Usage:
    Cpu:     8m
    Memory:  45440Ki
Kind:        PodMetrics
Metadata:
  Creation Timestamp:  2019-05-23T18:47:56Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp:             2019-05-23T18:47:56Z
Window:                1m0s
Events:                <none>
The pods associated with the object you want to scale must include specified memory and CPU limits. For example:
Example pod spec
apiVersion: v1
kind: Pod
# ...
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
# ...
Procedure
Create a YAML file similar to the following. Only the name <2>, object name <4>, and object kind <5> are required:
Example scaled object
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "0" # <1>
  name: scaledobject # <2>
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1 # <3>
    name: example-deployment # <4>
    kind: Deployment # <5>
    envSourceContainerName: .spec.template.spec.containers[0] # <6>
  cooldownPeriod: 200 # <7>
  maxReplicaCount: 100 # <8>
  minReplicaCount: 0 # <9>
  metricsServer: # <10>
    auditConfig:
      logFormat: "json"
      logOutputVolumeClaim: "persistentVolumeClaimName"
      policy:
        rules:
        - level: Metadata
        omitStages: "RequestReceived"
        omitManagedFields: false
      lifetime:
        maxAge: "2"
        maxBackup: "1"
        maxSize: "50"
  fallback: # <11>
    failureThreshold: 3
    replicas: 6
  pollingInterval: 30 # <12>
  advanced:
    restoreToOriginalReplicaCount: false # <13>
    horizontalPodAutoscalerConfig:
      name: keda-hpa-scale-down # <14>
      behavior: # <15>
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
  triggers:
  - type: prometheus # <16>
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: basic
    authenticationRef: # <17>
      name: prom-triggerauthentication
      kind: TriggerAuthentication
- 1
- Optional: Specifies that the Custom Metrics Autoscaler Operator is to scale the replicas to the specified value and stop autoscaling, as described in the "Pausing the custom metrics autoscaler for a workload" section.
- 2
- Specifies a name for this custom metrics autoscaler.
- 3
- Optional: Specifies the API version of the target resource. The default is
apps/v1
. - 4
- Specifies the name of the object that you want to scale.
- 5
- Specifies the
kind
asDeployment
,StatefulSet
orCustomResource
. - 6
- Optional: Specifies the name of the container in the target resource, from which the custom metrics autoscaler gets environment variables holding secrets and so forth. The default is
.spec.template.spec.containers[0]
. - 7
- Optional. Specifies the period in seconds to wait after the last trigger is reported before scaling the deployment back to
0
if theminReplicaCount
is set to0
. The default is300
. - 8
- Optional: Specifies the maximum number of replicas when scaling up. The default is
100
. - 9
- Optional: Specifies the minimum number of replicas when scaling down.
- 10
- Optional: Specifies the parameters for audit logs. as described in the "Configuring audit logging" section.
- 11
- Optional: Specifies the number of replicas to fall back to if a scaler fails to get metrics from the source for the number of times defined by the
failureThreshold
parameter. For more information on fallback behavior, see the KEDA documentation. - 12
- Optional: Specifies the interval in seconds to check each trigger on. The default is
30
. - 13
- Optional: Specifies whether to scale back the target resource to the original replica count after the scaled object is deleted. The default is
false
, which keeps the replica count as it is when the scaled object is deleted. - 14
- Optional: Specifies a name for the horizontal pod autoscaler. The default is
keda-hpa-{scaled-object-name}
. - 15
- Optional: Specifies a scaling policy to use to control the rate to scale pods up or down, as described in the "Scaling policies" section.
- 16
- Specifies the trigger to use as the basis for scaling, as described in the "Understanding the custom metrics autoscaler triggers" section. This example uses OpenShift Container Platform monitoring.
- 17
- Optional: Specifies a trigger authentication or a cluster trigger authentication. For more information, see Understanding the custom metrics autoscaler trigger authentication in the Additional resources section.
-
Enter
TriggerAuthentication
to use a trigger authentication. This is the default. -
Enter
ClusterTriggerAuthentication
to use a cluster trigger authentication.
Create the custom metrics autoscaler by running the following command:
$ oc create -f <filename>.yaml
Verification
View the command output to verify that the custom metrics autoscaler was created:
$ oc get scaledobject <scaled_object_name>
Example output
NAME           SCALETARGETKIND      SCALETARGETNAME      MIN   MAX   TRIGGERS     AUTHENTICATION               READY   ACTIVE   FALLBACK   AGE
scaledobject   apps/v1.Deployment   example-deployment   0     50    prometheus   prom-triggerauthentication   True    True     True       17s
Note the following fields in the output:
-
TRIGGERS
: Indicates the trigger, or scaler, that is being used. -
AUTHENTICATION
: Indicates the name of any trigger authentication being used. READY
: Indicates whether the scaled object is ready to start scaling:-
If
True
, the scaled object is ready. -
If
False
, the scaled object is not ready because of a problem in one or more of the objects you created.
ACTIVE
: Indicates whether scaling is taking place:-
If
True
, scaling is taking place. -
If
False
, scaling is not taking place because there are no metrics or there is a problem in one or more of the objects you created.
FALLBACK
: Indicates whether the custom metrics autoscaler is able to get metrics from the source:-
If
False
, the custom metrics autoscaler is getting metrics. -
If
True
, the custom metrics autoscaler is not getting metrics because there are no metrics or there is a problem in one or more of the objects you created.
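As a rough illustration of how the READY, ACTIVE, and FALLBACK fields might be checked in a script, the following sketch parses one header line and one row of the command output. The function names `parse_row` and `describe` are hypothetical, and the parsing assumes that every column has a value (a blank AUTHENTICATION column, for example, would shift the fields).

```python
def parse_row(header_line, row_line):
    """Map column names to values for a single output row.

    Assumes whitespace-separated columns with no empty cells.
    """
    return dict(zip(header_line.split(), row_line.split()))


def describe(status):
    """Return a list of notes for fields that indicate a problem."""
    notes = []
    if status.get("READY") != "True":
        notes.append("not ready: check the objects you created")
    if status.get("ACTIVE") != "True":
        notes.append("not scaling: no metrics, or a problem in an object you created")
    # FALLBACK is True when metrics are NOT being retrieved from the source.
    if status.get("FALLBACK") == "True":
        notes.append("fallback: metrics are not being retrieved from the source")
    return notes
```

An empty list from `describe` means that none of the three fields reported a problem.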
3.10.2. Adding a custom metrics autoscaler to a job
You can create a custom metrics autoscaler for any Job
object.
Scaling by using a scaled job is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- The Custom Metrics Autoscaler Operator must be installed.
Procedure
Create a YAML file similar to the following:
kind: ScaledJob
apiVersion: keda.sh/v1alpha1
metadata:
  name: scaledjob
  namespace: my-namespace
spec:
  jobTargetRef:
    activeDeadlineSeconds: 600 # 1
    backoffLimit: 6 # 2
    parallelism: 1 # 3
    completions: 1 # 4
    template: # 5
      metadata:
        name: pi
      spec:
        containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
  maxReplicaCount: 100 # 6
  pollingInterval: 30 # 7
  successfulJobsHistoryLimit: 5 # 8
  failedJobsHistoryLimit: 5 # 9
  envSourceContainerName: # 10
  rolloutStrategy: gradual # 11
  scalingStrategy: # 12
    strategy: "custom"
    customScalingQueueLengthDeduction: 1
    customScalingRunningJobPercentage: "0.5"
    pendingPodConditions:
    - "Ready"
    - "PodScheduled"
    - "AnyOtherCustomPodCondition"
    multipleScalersCalculation: "max"
  triggers:
  - type: prometheus # 13
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: "bearer"
    authenticationRef: # 14
      name: prom-cluster-triggerauthentication
- 1
- Specifies the maximum duration the job can run.
- 2
- Specifies the number of retries for a job. The default is
6
. - 3
- Optional: Specifies how many pod replicas a job should run in parallel; defaults to
1
.-
For non-parallel jobs, leave unset. When unset, the default is
1
.
- 4
- Optional: Specifies how many successful pod completions are needed to mark a job completed.
-
For non-parallel jobs, leave unset. When unset, the default is
1
. - For parallel jobs with a fixed completion count, specify the number of completions.
-
For parallel jobs with a work queue, leave unset. When unset, the default is the value of the
parallelism
parameter.
- 5
- Specifies the template for the pod the controller creates.
- 6
- Optional: Specifies the maximum number of replicas when scaling up. The default is
100
. - 7
- Optional: Specifies the interval in seconds to check each trigger on. The default is
30
. - 8
- Optional: Specifies how many successfully finished jobs should be kept. The default is
100
. - 9
- Optional: Specifies how many failed jobs should be kept. The default is
100
. - 10
- Optional: Specifies the name of the container in the target resource, from which the custom metrics autoscaler gets environment variables holding secrets and so forth. The default is
.spec.template.spec.containers[0]
. - 11
- Optional: Specifies whether existing jobs are terminated whenever a scaled job is being updated:
-
default
: The autoscaler terminates an existing job if its associated scaled job is updated. The autoscaler recreates the job with the latest specs. -
gradual
: The autoscaler does not terminate an existing job if its associated scaled job is updated. The autoscaler creates new jobs with the latest specs.
- 12
- Optional: Specifies a scaling strategy:
default
, custom
, or accurate
. The default is default
. For more information, see the link in the "Additional resources" section that follows. - 13
- Specifies the trigger to use as the basis for scaling, as described in the "Understanding the custom metrics autoscaler triggers" section.
- 14
- Optional: Specifies a trigger authentication or a cluster trigger authentication. For more information, see Understanding the custom metrics autoscaler trigger authentication in the Additional resources section.
-
Enter
TriggerAuthentication
to use a trigger authentication. This is the default. -
Enter
ClusterTriggerAuthentication
to use a cluster trigger authentication.
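The defaults described for the parallelism and completions parameters (callouts 3 and 4) can be summarized in a short sketch. The function name `effective_job_counts` is hypothetical, and the code only restates the defaulting rules listed above; it does not call any cluster API.

```python
def effective_job_counts(parallelism=None, completions=None):
    """Return (parallelism, completions) after applying the stated defaults.

    - parallelism defaults to 1 when unset.
    - completions defaults to the parallelism value when unset, which is
      1 for a non-parallel job.
    """
    effective_parallelism = 1 if parallelism is None else parallelism
    if completions is None:
        effective_completions = effective_parallelism
    else:
        effective_completions = completions
    return effective_parallelism, effective_completions
```

Leaving both values unset models a non-parallel job, setting only the parallelism models a parallel job with a work queue, and setting both models a parallel job with a fixed completion count.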
Create the custom metrics autoscaler by running the following command:
$ oc create -f <filename>.yaml
Verification
View the command output to verify that the custom metrics autoscaler was created:
$ oc get scaledjob <scaled_job_name>
Example output
NAME        MAX   TRIGGERS     AUTHENTICATION               READY   ACTIVE   AGE
scaledjob   100   prometheus   prom-triggerauthentication   True    True     8s
Note the following fields in the output:
-
TRIGGERS
: Indicates the trigger, or scaler, that is being used. -
AUTHENTICATION
: Indicates the name of any trigger authentication being used. READY
: Indicates whether the scaled job is ready to start scaling:-
If
True
, the scaled job is ready. -
If
False
, the scaled job is not ready because of a problem in one or more of the objects you created.
ACTIVE
: Indicates whether scaling is taking place:-
If
True
, scaling is taking place. -
If
False
, scaling is not taking place because there are no metrics or there is a problem in one or more of the objects you created.
3.10.3. Additional resources
3.11. Removing the Custom Metrics Autoscaler Operator
You can remove the custom metrics autoscaler from your OpenShift Container Platform cluster. After removing the Custom Metrics Autoscaler Operator, remove other components associated with the Operator to avoid potential issues.
Delete the KedaController
custom resource (CR) first. If you do not delete the KedaController
CR, OpenShift Container Platform can hang when you delete the openshift-keda
project. If you delete the Custom Metrics Autoscaler Operator before deleting the CR, you are not able to delete the CR.
3.11.1. Uninstalling the Custom Metrics Autoscaler Operator
Use the following procedure to remove the custom metrics autoscaler from your OpenShift Container Platform cluster.
Prerequisites
- The Custom Metrics Autoscaler Operator must be installed.
Procedure
- In the OpenShift Container Platform web console, click Operators → Installed Operators.
- Switch to the openshift-keda project.
Remove the
KedaController
custom resource:
- Find the CustomMetricsAutoscaler Operator and click the KedaController tab.
- Find the custom resource, and then click Delete KedaController.
- Click Uninstall.
Remove the Custom Metrics Autoscaler Operator:
- Click Operators → Installed Operators.
- Find the CustomMetricsAutoscaler Operator, click the Options menu, and select Uninstall Operator.
- Click Uninstall.
Optional: Use the OpenShift CLI to remove the custom metrics autoscaler components:
Delete the custom metrics autoscaler CRDs:
-
clustertriggerauthentications.keda.sh
-
kedacontrollers.keda.sh
-
scaledjobs.keda.sh
-
scaledobjects.keda.sh
-
triggerauthentications.keda.sh
$ oc delete crd clustertriggerauthentications.keda.sh kedacontrollers.keda.sh scaledjobs.keda.sh scaledobjects.keda.sh triggerauthentications.keda.sh
Deleting the CRDs removes the associated roles, cluster roles, and role bindings. However, there might be a few cluster roles that must be manually deleted.
-
List any custom metrics autoscaler cluster roles:
$ oc get clusterrole | grep keda.sh
Delete the listed custom metrics autoscaler cluster roles. For example:
$ oc delete clusterrole keda.sh-v1alpha1-admin
List any custom metrics autoscaler cluster role bindings:
$ oc get clusterrolebinding | grep keda.sh
Delete the listed custom metrics autoscaler cluster role bindings. For example:
$ oc delete clusterrolebinding keda.sh-v1alpha1-admin
Delete the custom metrics autoscaler project:
$ oc delete project openshift-keda
Delete the Custom Metrics Autoscaler Operator:
$ oc delete operator/openshift-custom-metrics-autoscaler-operator.openshift-keda
Chapter 4. Controlling pod placement onto nodes (scheduling)
4.1. Controlling pod placement using the scheduler
Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.
The scheduler watches for new pods as they are created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API.
- Default pod scheduling
- OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customization tools to determine the best fit for a pod.
- Advanced pod scheduling
In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required or has a preference to run on a particular node or alongside a specific pod.
You can control pod placement by using the following scheduling features:
4.1.1. About the default scheduler
The default OpenShift Container Platform pod scheduler is responsible for determining the placement of new pods onto nodes within the cluster. It reads data from the pod and finds a node that is a good fit based on configured profiles. It is completely independent and exists as a standalone solution. It does not modify the pod; it creates a binding for the pod that ties the pod to the particular node.
4.1.1.1. Understanding default scheduling
The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:
- Filters the nodes
- The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates, or filters.
- Prioritizes the filtered list of nodes
- This is achieved by passing each node through a series of priority, or scoring, functions that assign it a score from 0 to 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take a simple weight (a positive numeric value) for each scoring function. The node score provided by each scoring function is multiplied by that function's weight (the default weight for most scoring functions is 1), and the weighted scores from all scoring functions are then added together for each node. Administrators can use the weight attribute to give higher importance to some scoring functions.
- Selects the best fit node
- The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.
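The three steps above can be sketched in a few lines of Python. This is a simplified model for illustration, not the scheduler's code: the `schedule` function, the dictionary-based pod and node representations, and the example filter and scoring functions are all assumptions made for this sketch.

```python
import random


def schedule(pod, nodes, filters, scorers, rng=random):
    """Return the name of the node chosen for `pod`, or None if none fits.

    filters: functions (pod, node) -> bool; a node must pass all of them.
    scorers: (function, weight) pairs; each function returns a score 0..10.
    """
    # Step 1: filter out nodes that fail any constraint.
    feasible = [node for node in nodes if all(f(pod, node) for f in filters)]
    if not feasible:
        return None

    # Step 2: multiply each score by its weight and add the results per node.
    totals = {
        node["name"]: sum(weight * score(pod, node) for score, weight in scorers)
        for node in feasible
    }

    # Step 3: take the highest total; break ties at random.
    best = max(totals.values())
    return rng.choice([name for name, total in totals.items() if total == best])


# Hypothetical filter and scoring functions, for illustration only.
def has_enough_cpu(pod, node):
    return node["free_cpu"] >= pod["cpu"]


def most_free_cpu(pod, node):
    return min(10, node["free_cpu"] - pod["cpu"])


nodes = [
    {"name": "node-a", "free_cpu": 2},
    {"name": "node-b", "free_cpu": 8},
    {"name": "node-c", "free_cpu": 1},
]
chosen = schedule({"cpu": 2}, nodes, [has_enough_cpu], [(most_free_cpu, 1)])
```

In this example, `node-c` is removed by the filter, and `node-b` receives the higher weighted score, so it is selected. Raising the weight of one scoring function relative to the others increases its influence on the total.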
4.1.2. Scheduler use cases
One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.
4.1.2.1. Infrastructure topological levels
Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes. For example: region=r1
, zone=z1
, rack=s1
.
These label names have no particular meaning and administrators are free to name their infrastructure levels anything, such as city/building/room. Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as: regions
→ zones
→ racks
). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.
4.1.2.2. Affinity
Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.
If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
4.1.2.3. Anti-affinity
Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.
If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
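To make the difference between affinity and anti-affinity at a topological level concrete, the following sketch models nodes as dictionaries with labels. The function names, the node structure, and the `zone` label are hypothetical; the code illustrates the two policies described above and is not how the scheduler is implemented.

```python
from collections import Counter


def affinity_candidates(nodes, placed, level):
    """Nodes that share the `level` label value with the service's pods.

    `placed` lists the node of each pod already scheduled for the service.
    An empty result means that the pod is not scheduled.
    """
    used = {node["labels"][level] for node in placed}
    if not used:
        return list(nodes)
    return [node for node in nodes if node["labels"][level] in used]


def anti_affinity_choice(nodes, placed, level):
    """Node whose `level` label value hosts the fewest pods of the service."""
    counts = Counter(node["labels"][level] for node in placed)
    return min(nodes, key=lambda node: counts[node["labels"][level]])


n1 = {"name": "n1", "labels": {"zone": "z1"}}
n2 = {"name": "n2", "labels": {"zone": "z1"}}
n3 = {"name": "n3", "labels": {"zone": "z2"}}
same_zone = affinity_candidates([n1, n2, n3], placed=[n1], level="zone")
spread_to = anti_affinity_choice([n1, n2, n3], placed=[n1], level="zone")
```

With one pod already on `n1`, affinity at the `zone` level restricts the next pod to nodes in `z1`, while anti-affinity prefers `n3` in the unused zone `z2`.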
4.2. Scheduling pods using a scheduler profile
You can configure OpenShift Container Platform to use a scheduling profile to schedule pods onto nodes within the cluster.