Chapter 6. Working with clusters

6.1. Viewing system event information in an OpenShift Container Platform cluster

Events in OpenShift Container Platform are modeled based on events that happen to API objects in an OpenShift Container Platform cluster.

6.1.1. Understanding events

Events allow OpenShift Container Platform to record information about real-world events in a resource-agnostic manner. They also allow developers and administrators to consume information about system components in a unified way.

6.1.2. Viewing events using the CLI

You can get a list of events in a given project using the CLI.

Procedure

To view events in a project, use the following command:

$ oc get events [-n <project>] 1

1: The name of the project.

For example:

$ oc get events -n openshift-config

LAST SEEN   TYPE      REASON                   OBJECT                      MESSAGE
97m         Normal    Scheduled                pod/dapi-env-test-pod       Successfully assigned openshift-config/dapi-env-test-pod to ip-10-0-171-202.ec2.internal
97m         Normal    Pulling                  pod/dapi-env-test-pod       pulling image "gcr.io/google_containers/busybox"
97m         Normal    Pulled                   pod/dapi-env-test-pod       Successfully pulled image "gcr.io/google_containers/busybox"
97m         Normal    Created                  pod/dapi-env-test-pod       Created container
9m5s        Warning   FailedCreatePodSandBox   pod/dapi-volume-test-pod    Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dapi-volume-test-pod_openshift-config_6bc60c1f-452e-11e9-9140-0eec59c23068_0(748c7a40db3d08c07fb4f9eba774bd5effe5f0d5090a242432a73eee66ba9e22): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/33366/ns/net": no such file or directory
8m31s       Normal    Scheduled                pod/dapi-volume-test-pod    Successfully assigned openshift-config/dapi-volume-test-pod to ip-10-0-171-202.ec2.internal
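When the event list is long, it can help to filter it in a script. The following is a minimal sketch, assuming the events were exported with `oc get events -n <project> -o json` and that each item carries the standard `type`, `reason`, and `involvedObject` fields. The helper name `warning_events` and the sample data are illustrative only.

```python
import json

# Trimmed sample shaped like `oc get events -o json` output (illustrative data).
SAMPLE = """
{"items": [
  {"type": "Normal", "reason": "Scheduled",
   "involvedObject": {"kind": "Pod", "name": "dapi-env-test-pod"}},
  {"type": "Warning", "reason": "FailedCreatePodSandBox",
   "involvedObject": {"kind": "Pod", "name": "dapi-volume-test-pod"}}
]}
"""


def warning_events(events_json):
    """Return (object, reason) pairs for every event of type Warning."""
    pairs = []
    for event in json.loads(events_json).get("items", []):
        if event.get("type") != "Warning":
            continue
        obj = event["involvedObject"]
        pairs.append((obj["kind"].lower() + "/" + obj["name"], event["reason"]))
    return pairs


if __name__ == "__main__":
    for obj, reason in warning_events(SAMPLE):
        print(obj, reason)
```

The returned reasons can then be looked up in the event tables in this section.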

To view events in your project from the OpenShift Container Platform console:
1. Launch the OpenShift Container Platform console.
2. Click Home → Events and select your project.
3. Navigate to the resource whose events you want to see. For example: Home → Projects → <project-name> → <resource-name>.
  Many objects, such as pods and deployments, have their own Events tab as well, which shows events related to that object.

6.1.3. List of events

This section describes the events of OpenShift Container Platform.

Table 6.1. Configuration Events
Name	Description
`FailedValidation`	Failed pod configuration validation.

Table 6.2. Container Events
Name	Description
`BackOff`	Back-off restarting the failed container.
`Created`	Container created.
`Failed`	Pull/Create/Start failed.
`Killing`	Killing the container.
`Started`	Container started.
`Preempting`	Preempting other pods.
`ExceededGracePeriod`	Container runtime did not stop the pod within specified grace period.

Table 6.3. Health Events
Name	Description
`Unhealthy`	Container is unhealthy.

Table 6.4. Image Events
Name	Description
`BackOff`	Back off container start, image pull.
`ErrImageNeverPull`	The image’s NeverPull Policy is violated.
`Failed`	Failed to pull the image.
`InspectFailed`	Failed to inspect the image.
`Pulled`	Successfully pulled the image or the container image is already present on the machine.
`Pulling`	Pulling the image.

Table 6.5. Image Manager Events
Name	Description
`FreeDiskSpaceFailed`	Free disk space failed.
`InvalidDiskCapacity`	Invalid disk capacity.

Table 6.6. Node Events
Name	Description
`FailedMount`	Volume mount failed.
`HostNetworkNotSupported`	Host network not supported.
`HostPortConflict`	Host/port conflict.
`InsufficientFreeCPU`	Insufficient free CPU.
`InsufficientFreeMemory`	Insufficient free memory.
`KubeletSetupFailed`	Kubelet setup failed.
`NilShaper`	Undefined shaper.
`NodeNotReady`	Node is not ready.
`NodeNotSchedulable`	Node is not schedulable.
`NodeReady`	Node is ready.
`NodeSchedulable`	Node is schedulable.
`NodeSelectorMismatching`	Node selector mismatch.
`OutOfDisk`	Out of disk.
`Rebooted`	Node rebooted.
`Starting`	Starting kubelet.
`FailedAttachVolume`	Failed to attach volume.
`FailedDetachVolume`	Failed to detach volume.
`VolumeResizeFailed`	Failed to expand/reduce volume.
`VolumeResizeSuccessful`	Successfully expanded/reduced volume.
`FileSystemResizeFailed`	Failed to expand/reduce file system.
`FileSystemResizeSuccessful`	Successfully expanded/reduced file system.
`FailedUnMount`	Failed to unmount volume.
`FailedMapVolume`	Failed to map a volume.
`FailedUnmapDevice`	Failed to unmap device.
`AlreadyMountedVolume`	Volume is already mounted.
`SuccessfulDetachVolume`	Volume is successfully detached.
`SuccessfulMountVolume`	Volume is successfully mounted.
`SuccessfulUnMountVolume`	Volume is successfully unmounted.
`ContainerGCFailed`	Container garbage collection failed.
`ImageGCFailed`	Image garbage collection failed.
`FailedNodeAllocatableEnforcement`	Failed to enforce System Reserved Cgroup limit.
`NodeAllocatableEnforced`	Enforced System Reserved Cgroup limit.
`UnsupportedMountOption`	Unsupported mount option.
`SandboxChanged`	Pod sandbox changed.
`FailedCreatePodSandBox`	Failed to create pod sandbox.
`FailedPodSandBoxStatus`	Failed pod sandbox status.

Table 6.7. Pod Worker Events
Name	Description
`FailedSync`	Pod sync failed.

Table 6.8. System Events
Name	Description
`SystemOOM`	There is an OOM (out of memory) situation on the cluster.

Table 6.9. Pod Events
Name	Description
`FailedKillPod`	Failed to stop a pod.
`FailedCreatePodContainer`	Failed to create a pod container.
`Failed`	Failed to make pod data directories.
`NetworkNotReady`	Network is not ready.
`FailedCreate`	Error creating: `<error-msg>`.
`SuccessfulCreate`	Created pod: `<pod-name>`.
`FailedDelete`	Error deleting: `<error-msg>`.
`SuccessfulDelete`	Deleted pod: `<pod-id>`.

Table 6.10. Horizontal Pod AutoScaler Events
Name	Description
`SelectorRequired`	Selector is required.
`InvalidSelector`	Could not convert selector into a corresponding internal selector object.
`FailedGetObjectMetric`	HPA was unable to compute the replica count.
`InvalidMetricSourceType`	Unknown metric source type.
`ValidMetricFound`	HPA was able to successfully calculate a replica count.
`FailedConvertHPA`	Failed to convert the given HPA.
`FailedGetScale`	HPA controller was unable to get the target’s current scale.
`SucceededGetScale`	HPA controller was able to get the target’s current scale.
`FailedComputeMetricsReplicas`	Failed to compute desired number of replicas based on listed metrics.
`FailedRescale`	New size: `<size>`; reason: `<msg>`; error: `<error-msg>`.
`SuccessfulRescale`	New size: `<size>`; reason: `<msg>`.
`FailedUpdateStatus`	Failed to update status.

Table 6.11. Network Events (openshift-sdn)
Name	Description
`Starting`	Starting OpenShift-SDN.
`NetworkFailed`	The pod’s network interface has been lost and the pod will be stopped.

Table 6.12. Network Events (kube-proxy)
Name	Description
`NeedPods`	The service-port `<serviceName>:<port>` needs pods.

Table 6.13. Volume Events
Name	Description
`FailedBinding`	There are no persistent volumes available and no storage class is set.
`VolumeMismatch`	Volume size or class is different from what is requested in claim.
`VolumeFailedRecycle`	Error creating recycler pod.
`VolumeRecycled`	Occurs when volume is recycled.
`RecyclerPod`	Occurs when pod is recycled.
`VolumeDelete`	Occurs when volume is deleted.
`VolumeFailedDelete`	Error when deleting the volume.
`ExternalProvisioning`	Occurs when volume for the claim is provisioned either manually or via external software.
`ProvisioningFailed`	Failed to provision volume.
`ProvisioningCleanupFailed`	Error cleaning provisioned volume.
`ProvisioningSucceeded`	Occurs when the volume is provisioned successfully.
`WaitForFirstConsumer`	Delay binding until pod scheduling.

Table 6.14. Lifecycle hooks
Name	Description
`FailedPostStartHook`	Handler failed for pod start.
`FailedPreStopHook`	Handler failed for pre-stop.
`UnfinishedPreStopHook`	Pre-stop hook unfinished.

Table 6.15. Deployments
Name	Description
`DeploymentCancellationFailed`	Failed to cancel deployment.
`DeploymentCancelled`	Cancelled deployment.
`DeploymentCreated`	Created new replication controller.
`IngressIPRangeFull`	No available Ingress IP to allocate to service.

Table 6.16. Scheduler Events
Name	Description
`FailedScheduling`	Failed to schedule pod: `<pod-namespace>/<pod-name>`. This event is raised for multiple reasons, for example, `AssumePodVolumes` failed or binding was rejected.
`Preempted`	By `<preemptor-namespace>/<preemptor-name>` on node `<node-name>`.
`Scheduled`	Successfully assigned `<pod-name>` to `<node-name>`.

Table 6.17. DaemonSet Events
Name	Description
`SelectingAll`	This daemon set is selecting all pods. A non-empty selector is required.
`FailedPlacement`	Failed to place pod on `<node-name>`.
`FailedDaemonPod`	Found failed daemon pod `<pod-name>` on node `<node-name>`, will try to kill it.

Table 6.18. LoadBalancer Service Events
Name	Description
`CreatingLoadBalancerFailed`	Error creating load balancer.
`DeletingLoadBalancer`	Deleting load balancer.
`EnsuringLoadBalancer`	Ensuring load balancer.
`EnsuredLoadBalancer`	Ensured load balancer.
`UnAvailableLoadBalancer`	There are no available nodes for `LoadBalancer` service.
`LoadBalancerSourceRanges`	Lists the new `LoadBalancerSourceRanges`. For example, `<old-source-range> <new-source-range>`.
`LoadbalancerIP`	Lists the new IP address. For example, `<old-ip> <new-ip>`.
`ExternalIP`	Lists external IP address. For example, `Added: <external-ip>`.
`UID`	Lists the new UID. For example, `<old-service-uid> <new-service-uid>`.
`ExternalTrafficPolicy`	Lists the new `ExternalTrafficPolicy`. For example, `<old-policy> <new-policy>`.
`HealthCheckNodePort`	Lists the new `HealthCheckNodePort`. For example, `<old-node-port> <new-node-port>`.
`UpdatedLoadBalancer`	Updated load balancer with new hosts.
`LoadBalancerUpdateFailed`	Error updating load balancer with new hosts.
`DeletingLoadBalancerFailed`	Error deleting load balancer.
`DeletedLoadBalancer`	Deleted load balancer.

6.2. Estimating the number of pods your OpenShift Container Platform nodes can hold

As a cluster administrator, you can use the cluster capacity tool to see how many more pods can be scheduled before the current resources are exhausted, and to ensure that any future pods can be scheduled. This capacity comes from the individual node hosts in a cluster, and includes CPU, memory, disk space, and other resources.

6.2.1. Understanding the OpenShift Container Platform cluster capacity tool

To provide a more accurate estimate, the cluster capacity tool simulates a sequence of scheduling decisions to determine how many instances of an input pod can be scheduled on the cluster before its resources are exhausted.

Note

The remaining allocatable capacity is a rough estimation, because it does not count all of the resources being distributed among nodes. It analyzes only the remaining resources and estimates the available capacity that is still consumable in terms of a number of instances of a pod with given requirements that can be scheduled in a cluster.

Also, pods might only have scheduling support on particular sets of nodes based on their selection and affinity criteria. As a result, estimating which remaining pods a cluster can schedule can be difficult.
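The arithmetic behind such an estimate can be illustrated with a simplified sketch. It is not how the cluster capacity tool is implemented (the tool simulates scheduling decisions), and it ignores selectors, affinity, and every resource other than CPU and memory. The function name and the node figures are hypothetical.

```python
def pods_that_fit(free_cpu_millicores, free_memory_mib,
                  pod_cpu_millicores, pod_memory_mib):
    """Instances of a pod that fit in one node's remaining CPU and memory."""
    by_cpu = free_cpu_millicores // pod_cpu_millicores
    by_memory = free_memory_mib // pod_memory_mib
    # The scarcer resource decides how many instances fit.
    return min(by_cpu, by_memory)


# Hypothetical remaining capacity per node: (CPU in millicores, memory in MiB).
nodes = {"node-a": (3900, 8192), "node-b": (2000, 1024)}

# Pod requirements taken from the sample pod spec: 150m CPU, 100Mi memory.
per_node = {
    name: pods_that_fit(cpu, mem, 150, 100) for name, (cpu, mem) in nodes.items()
}
total = sum(per_node.values())

if __name__ == "__main__":
    print(per_node, total)
```

In this sketch node-a is limited by CPU and node-b by memory, which is why a per-node breakdown is more informative than a single cluster-wide total.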

You can run the cluster capacity analysis tool as a stand-alone utility from the command line, or as a job in a pod inside an OpenShift Container Platform cluster. Running it as a job inside a pod enables you to run it multiple times without intervention.

6.2.2. Running the cluster capacity tool on the command line

You can run the OpenShift Container Platform cluster capacity tool from the command line to estimate the number of pods that can be scheduled onto your cluster.

Prerequisites

Download and install the cluster-capacity tool.

Create a sample pod specification file, which the tool uses for estimating resource usage. The podspec specifies its resource requirements as limits or requests. The cluster capacity tool takes the pod’s resource requirements into account for its estimation analysis.

An example of the pod specification input is:

apiVersion: v1
kind: Pod
metadata:
  name: small-pod
  labels:
    app: guestbook
    tier: frontend
spec:
  containers:
  - name: php-redis
    image: gcr.io/google-samples/gb-frontend:v4
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 150m
        memory: 100Mi
      requests:
        cpu: 150m
        memory: 100Mi

Procedure

To run the tool on the command line:

Run the following command:
```
$ ./cluster-capacity --kubeconfig <path-to-kubeconfig> \ 1
    --podspec <path-to-pod-spec> 2
```
1: Specify the path to your Kubernetes configuration file.
2: Specify the path to the sample pod specification file.

You can also add the --verbose option to output a detailed description of how many pods can be scheduled on each node in the cluster:
```
$ ./cluster-capacity --kubeconfig <path-to-kubeconfig> \
    --podspec <path-to-pod-spec> --verbose
```

View the output, which looks similar to the following:

small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 52 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the
following predicates:: Insufficient cpu (2).

Pod distribution among nodes:
small-pod
	- 192.168.124.214: 26 instance(s)
	- 192.168.124.120: 26 instance(s)

In the above example, the number of estimated pods that can be scheduled onto the cluster is 52.
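If the report needs to feed another script, the totals can be extracted with a few regular expressions. This is a sketch that assumes the report keeps the textual layout shown above; the tool itself may offer other output formats, which are not covered here. The function name `parse_report` is illustrative.

```python
import re

# Report text copied from the example output above.
REPORT = """\
small-pod pod requirements:
\t- CPU: 150m
\t- Memory: 100Mi

The cluster can schedule 52 instance(s) of the pod small-pod.

Pod distribution among nodes:
small-pod
\t- 192.168.124.214: 26 instance(s)
\t- 192.168.124.120: 26 instance(s)
"""


def parse_report(text):
    """Return (total_instances, {node: instances}) from a capacity report."""
    total = re.search(r"can schedule (\d+) instance\(s\)", text)
    per_node = {
        node: int(count)
        for node, count in re.findall(
            r"^\s*- (\S+): (\d+) instance\(s\)$", text, re.MULTILINE
        )
    }
    return (int(total.group(1)) if total else None), per_node


if __name__ == "__main__":
    print(parse_report(REPORT))
```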

6.2.3. Running the cluster capacity tool as a job inside a pod

Running the cluster capacity tool as a job inside a pod lets you run it multiple times without user intervention. Running the tool as a job involves using a ConfigMap.

Prerequisites

Download and install the cluster-capacity tool.

Procedure

To run the cluster capacity tool:

Create the cluster role:

$ cat << EOF| oc create -f -
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-capacity-role
rules:
- apiGroups: [""]
  resources: ["pods", "nodes", "persistentvolumeclaims", "persistentvolumes", "services"]
  verbs: ["get", "watch", "list"]
EOF

Create the service account:
```
$ oc create sa cluster-capacity-sa
```

Add the role to the service account:

$ oc adm policy add-cluster-role-to-user cluster-capacity-role \
    system:serviceaccount:default:cluster-capacity-sa

Define and create the pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: small-pod
  labels:
    app: guestbook
    tier: frontend
spec:
  containers:
  - name: php-redis
    image: gcr.io/google-samples/gb-frontend:v4
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 150m
        memory: 100Mi
      requests:
        cpu: 150m
        memory: 100Mi

The cluster capacity analysis uses a ConfigMap named cluster-capacity-configmap to mount the input pod spec file pod.yaml into a volume named test-volume at the path /test-pod.
If you haven’t created a ConfigMap, create one before creating the job:
```
$ oc create configmap cluster-capacity-configmap \
    --from-file=pod.yaml=pod.yaml
```

Create the job using the following example of a job specification file:

apiVersion: batch/v1
kind: Job
metadata:
  name: cluster-capacity-job
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: cluster-capacity-pod
    spec:
        containers:
        - name: cluster-capacity
          image: openshift/origin-cluster-capacity
          imagePullPolicy: "Always"
          volumeMounts:
          - mountPath: /test-pod
            name: test-volume
          env:
          - name: CC_INCLUSTER 1
            value: "true"
          command:
          - "/bin/sh"
          - "-ec"
          - |
            /bin/cluster-capacity --podspec=/test-pod/pod.yaml --verbose
        restartPolicy: "Never"
        serviceAccountName: cluster-capacity-sa
        volumes:
        - name: test-volume
          configMap:
            name: cluster-capacity-configmap

1: A required environment variable letting the cluster capacity tool know that it is running inside a cluster as a pod.
The pod.yaml key of the ConfigMap is the same as the pod specification file name, though it is not required. By doing this, the input pod spec file can be accessed inside the pod as /test-pod/pod.yaml.

Run the cluster capacity image as a job in a pod:
```
$ oc create -f cluster-capacity-job.yaml
```

Check the job logs to find the number of pods that can be scheduled in the cluster:

$ oc logs jobs/cluster-capacity-job
small-pod pod requirements:
        - CPU: 150m
        - Memory: 100Mi

The cluster can schedule 52 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the
following predicates:: Insufficient cpu (2).

Pod distribution among nodes:
small-pod
        - 192.168.124.214: 26 instance(s)
        - 192.168.124.120: 26 instance(s)

6.3. Configuring cluster memory to meet container memory and risk requirements

As a cluster administrator, you can help your clusters operate efficiently through managing application memory by:

- Determining the memory and risk requirements of a containerized application component and configuring the container memory parameters to suit those requirements.
- Configuring containerized application runtimes (for example, OpenJDK) to adhere optimally to the configured container memory parameters.
- Diagnosing and resolving memory-related error conditions associated with running in a container.

6.3.1. Understanding managing application memory

Before proceeding, read the overview of how OpenShift Container Platform manages compute resources.

For each kind of resource (memory, CPU, storage), OpenShift Container Platform allows optional request and limit values to be placed on each container in a pod.

Note the following about memory requests and memory limits:

Memory request
- The memory request value, if specified, influences the OpenShift Container Platform scheduler. The scheduler considers the memory request when scheduling a container to a node, then fences off the requested memory on the chosen node for the use of the container.
- If a node’s memory is exhausted, OpenShift Container Platform prioritizes evicting its containers whose memory usage most exceeds their memory request. In serious cases of memory exhaustion, the node OOM killer may select and kill a process in a container based on a similar metric.
- The cluster administrator can assign quota or assign default values for the memory request value.
- The cluster administrator may override the memory request values that a developer specifies, in order to manage cluster overcommit.
Memory limit
- The memory limit value, if specified, provides a hard limit on the memory that can be allocated across all the processes in a container.
- If the memory allocated by all of the processes in a container exceeds the memory limit, the node OOM killer will immediately select and kill a process in the container.
- If both memory request and limit are specified, the memory limit value must be greater than or equal to the memory request.
- The cluster administrator can assign quota or assign default values for the memory limit value.

6.3.1.1. Managing application memory strategy

The steps for sizing application memory on OpenShift Container Platform are as follows:

Determine expected container memory usage
Determine expected mean and peak container memory usage, empirically if necessary (for example, by separate load testing). Remember to consider all the processes that may potentially run in parallel in the container: for example, does the main application spawn any ancillary scripts?
Determine risk appetite
Determine risk appetite for eviction. If the risk appetite is low, the container should request memory according to the expected peak usage plus a percentage safety margin. If the risk appetite is higher, it may be more appropriate to request memory according to the expected mean usage.
Set container memory request
Set container memory request based on the above. The more accurately the request represents the application memory usage, the better. If the request is too high, cluster and quota usage will be inefficient. If the request is too low, the chances of application eviction increase.
Set container memory limit, if required
Set container memory limit, if required. Setting a limit has the effect of immediately killing a container process if the combined memory usage of all processes in the container exceeds the limit, and is therefore a mixed blessing. On the one hand, it may make unanticipated excess memory usage obvious early ("fail fast"); on the other hand it also terminates processes abruptly.
Note that some OpenShift Container Platform clusters may require a limit value to be set; some may override the request based on the limit; and some application images rely on a limit value being set as this is easier to detect than a request value.
If the memory limit is set, it should not be set to less than the expected peak container memory usage plus a percentage safety margin.
Ensure application is tuned
Ensure application is tuned with respect to configured request and limit values, if appropriate. This step is particularly relevant to applications which pool memory, such as the JVM. The rest of this page discusses this.
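The sizing steps above can be sketched as a small calculation. This is only an illustration: the 20% safety margin, the function name, and the usage figures are assumptions, and real values should come from load testing.

```python
def size_memory(mean_mib, peak_mib, low_risk_appetite, safety_margin_percent=20):
    """Return (request_mib, limit_mib) following the sizing steps above.

    The 20% default margin is an assumed value, not a recommendation.
    """
    # Expected peak usage plus a percentage safety margin, rounded up.
    peak_with_margin = (peak_mib * (100 + safety_margin_percent) + 99) // 100
    # Low risk appetite: request for the peak. Higher appetite: request for the mean.
    request = peak_with_margin if low_risk_appetite else mean_mib
    # If a limit is set, it should not be below peak usage plus the margin.
    limit = peak_with_margin
    return request, limit


if __name__ == "__main__":
    print(size_memory(mean_mib=256, peak_mib=400, low_risk_appetite=True))
    print(size_memory(mean_mib=256, peak_mib=400, low_risk_appetite=False))
```

With a low risk appetite the request and the limit coincide, which trades cluster efficiency for a lower chance of eviction.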

6.3.2. Understanding OpenJDK settings for OpenShift Container Platform

The default OpenJDK settings do not work well with containerized environments. As a result, some additional Java memory settings must always be provided whenever running the OpenJDK in a container.

The JVM memory layout is complex and version dependent, and describing it in detail is beyond the scope of this documentation. However, as a starting point for running OpenJDK in a container, at least the following three memory-related tasks are key:

- Overriding the JVM maximum heap size.
- Encouraging the JVM to release unused memory to the operating system, if appropriate.
- Ensuring all JVM processes within a container are appropriately configured.

Optimally tuning JVM workloads for running in a container is beyond the scope of this documentation, and may involve setting multiple additional JVM options.

6.3.2.1. Understanding how to override the JVM maximum heap size

For many Java workloads, the JVM heap is the largest single consumer of memory. Currently, the OpenJDK defaults to allowing up to 1/4 (1/-XX:MaxRAMFraction) of the compute node’s memory to be used for the heap, regardless of whether the OpenJDK is running in a container or not. It is therefore essential to override this behavior, especially if a container memory limit is also set.

There are at least two ways the above can be achieved:

If the container memory limit is set and the experimental options are supported by the JVM, set -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap.
This sets -XX:MaxRAM to the container memory limit, and the maximum heap size (-XX:MaxHeapSize / -Xmx) to 1/-XX:MaxRAMFraction (1/4 by default).
Directly override one of -XX:MaxRAM, -XX:MaxHeapSize or -Xmx.
This option involves hard-coding a value, but has the advantage of allowing a safety margin to be calculated.
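The two approaches can be compared with a little arithmetic. The sketch below assumes a 512Mi container memory limit and, for the hard-coded option, a hypothetical budget of 50% of the limit for the heap; both helper names are illustrative.

```python
MIB = 1024 * 1024


def default_max_heap_bytes(max_ram_bytes, max_ram_fraction=4):
    """Maximum heap when the JVM takes 1/MaxRAMFraction of MaxRAM."""
    return max_ram_bytes // max_ram_fraction


def xmx_flag(limit_bytes, heap_percent):
    """Build a hard-coded -Xmx option from a chosen share of the limit."""
    heap_mib = limit_bytes * heap_percent // 100 // MIB
    return f"-Xmx{heap_mib}m"


limit = 512 * MIB  # assumed container memory limit

if __name__ == "__main__":
    # With MaxRAM set to the container limit, the default fraction gives 128 MiB.
    print(default_max_heap_bytes(limit) // MIB)
    # A hard-coded heap at an assumed 50% of the limit leaves the rest as margin.
    print(xmx_flag(limit, 50))
```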

6.3.2.2. Understanding how to encourage the JVM to release unused memory to the operating system

By default, the OpenJDK does not aggressively return unused memory to the operating system. This may be appropriate for many containerized Java workloads, but notable exceptions include workloads where additional active processes co-exist with a JVM within a container, whether those additional processes are native, additional JVMs, or a combination of the two.

The OpenShift Container Platform Jenkins maven slave image uses the following JVM arguments to encourage the JVM to release unused memory to the operating system:

`-XX:+UseParallelGC
-XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4
-XX:AdaptiveSizePolicyWeight=90`.

These arguments are intended to return heap memory to the operating system whenever allocated memory exceeds 110% of in-use memory (-XX:MaxHeapFreeRatio), spending up to 20% of CPU time in the garbage collector (-XX:GCTimeRatio). At no time will the application heap allocation be less than the initial heap allocation (overridden by -XX:InitialHeapSize / -Xms). Detailed additional information is available in Tuning Java’s footprint in OpenShift (Part 1), Tuning Java’s footprint in OpenShift (Part 2), and OpenJDK and Containers.
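The 20% figure follows from how -XX:GCTimeRatio is defined: for a value N, the goal for the share of total time spent in garbage collection is 1/(1+N). A minimal sketch of that arithmetic, with an illustrative function name:

```python
from fractions import Fraction


def gc_time_share(gc_time_ratio):
    """Goal for the share of total time spent in GC: 1 / (1 + GCTimeRatio)."""
    return Fraction(1, 1 + gc_time_ratio)


if __name__ == "__main__":
    share = gc_time_share(4)  # -XX:GCTimeRatio=4, as in the arguments above
    print(f"GC time goal: up to {float(share):.0%} of total time")
```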

6.3.2.3. Understanding how to ensure all JVM processes within a container are appropriately configured

In the case that multiple JVMs run in the same container, it is essential to ensure that they are all configured appropriately. For many workloads it will be necessary to grant each JVM a percentage memory budget, leaving a perhaps substantial additional safety margin.

Many Java tools use different environment variables (JAVA_OPTS, GRADLE_OPTS, MAVEN_OPTS, and so on) to configure their JVMs and it can be challenging to ensure that the right settings are being passed to the right JVM.

The JAVA_TOOL_OPTIONS environment variable is always respected by the OpenJDK, and values specified in JAVA_TOOL_OPTIONS will be overridden by other options specified on the JVM command line. To ensure that these options are used by default for all JVM workloads run in the slave image, the OpenShift Container Platform Jenkins maven slave image sets:

`JAVA_TOOL_OPTIONS="-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap -Dsun.zip.disableMemoryMapping=true"`

This does not guarantee that additional options are not required, but is intended to be a helpful starting point.

6.3.3. Finding the memory request and limit from within a pod

An application wishing to dynamically discover its memory request and limit from within a pod should use the Downward API.

Procedure

Configure the pod to add the MEMORY_REQUEST and MEMORY_LIMIT stanzas:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: fedora:latest
    command:
    - sleep
    - "3600"
    env:
    - name: MEMORY_REQUEST 1
      valueFrom:
        resourceFieldRef:
          containerName: test
          resource: requests.memory
    - name: MEMORY_LIMIT 2
      valueFrom:
        resourceFieldRef:
          containerName: test
          resource: limits.memory
    resources:
      requests:
        memory: 384Mi
      limits:
        memory: 512Mi

1: Add this stanza to discover the application memory request value.
2: Add this stanza to discover the application memory limit value.

Create the pod:
```
$ oc create -f <file-name>.yaml
```
Access the pod using a remote shell:
```
$ oc rsh test
```

Check that the requested values were applied:

$ env | grep MEMORY | sort
MEMORY_LIMIT=536870912
MEMORY_REQUEST=402653184

Note

The memory limit value can also be read from inside the container from the /sys/fs/cgroup/memory/memory.limit_in_bytes file.
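An application can then read these variables at startup. The following is a minimal sketch, assuming the variable names MEMORY_REQUEST and MEMORY_LIMIT from the pod specification above and byte values as shown in the env output; the function name is illustrative.

```python
import os

MIB = 1024 * 1024


def memory_settings_mib(environ=os.environ):
    """Return (request, limit) in MiB from the Downward API variables.

    A value is None when the corresponding variable is not set.
    """
    def to_mib(name):
        value = environ.get(name)
        return int(value) // MIB if value else None

    return to_mib("MEMORY_REQUEST"), to_mib("MEMORY_LIMIT")


if __name__ == "__main__":
    request, limit = memory_settings_mib()
    print(f"request={request} MiB, limit={limit} MiB")
```

Passing a dictionary instead of os.environ makes the helper easy to exercise outside a pod.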

6.3.4. Understanding OOM kill policy

OpenShift Container Platform may kill a process in a container if the total memory usage of all the processes in the container exceeds the memory limit, or in serious cases of node memory exhaustion.

When a process is OOM killed, this may or may not result in the container exiting immediately. If the container PID 1 process receives the SIGKILL, the container will exit immediately. Otherwise, the container behavior is dependent on the behavior of the other processes.

For example, if a container process exits with code 137, it received a SIGKILL signal.

If the container does not exit immediately, an OOM kill is detectable as follows:

Access the pod using a remote shell:
```
$ oc rsh test
```

If an OOM kill occurred, the oom_kill counter in /sys/fs/cgroup/memory/memory.oom_control is incremented:

$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
oom_kill 0
$ sed -e '' </dev/zero  # provoke an OOM kill
Killed
$ echo $?
137
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
oom_kill 1
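The same check can be scripted from inside the container. This sketch assumes the file layout shown above, where the file contains a line of the form `oom_kill <count>`; the sample text and function names are illustrative. It also decodes the exit code: 137 is 128 plus the SIGKILL signal number, 9.

```python
SIGKILL = 9  # signal number for SIGKILL on Linux

# Illustrative contents of /sys/fs/cgroup/memory/memory.oom_control.
SAMPLE_OOM_CONTROL = "oom_kill_disable 0\nunder_oom 0\noom_kill 1\n"


def oom_kill_count(oom_control_text):
    """Return the oom_kill counter, or None if the line is missing."""
    for line in oom_control_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return None


def killed_by_sigkill(exit_code):
    """A shell reports a process killed by signal N with exit code 128 + N."""
    return exit_code == 128 + SIGKILL


if __name__ == "__main__":
    print(oom_kill_count(SAMPLE_OOM_CONTROL), killed_by_sigkill(137))
```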

If one or more processes in a pod are OOM killed, when the pod subsequently exits, whether immediately or not, it will have phase Failed and reason OOMKilled. An OOM killed pod may be restarted depending on the value of restartPolicy. If not restarted, controllers such as the ReplicationController will notice the pod’s failed status and create a new pod to replace the old one.

If not restarted, the pod status is as follows:

$ oc get pod test
NAME      READY     STATUS      RESTARTS   AGE
test      0/1       OOMKilled   0          1m

$ oc get pod test -o yaml
...
status:
  containerStatuses:
  - name: test
    ready: false
    restartCount: 0
    state:
      terminated:
        exitCode: 137
        reason: OOMKilled
  phase: Failed

If restarted, its status is as follows:

$ oc get pod test
NAME      READY     STATUS    RESTARTS   AGE
test      1/1       Running   1          1m

$ oc get pod test -o yaml
...
status:
  containerStatuses:
  - name: test
    ready: true
    restartCount: 1
    lastState:
      terminated:
        exitCode: 137
        reason: OOMKilled
    state:
      running:
  phase: Running
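Both cases can be detected from the pod status in one pass by checking the current and the last container state. The sketch below works on a dictionary shaped like the `status` stanza above (for example, loaded from `oc get pod test -o json`); the function name and the sample dictionaries are illustrative.

```python
def oom_killed_containers(pod_status):
    """Return names of containers whose current or last state is OOMKilled."""
    names = []
    for container in pod_status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (container.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(container["name"])
                break
    return names


# Status of a pod that was not restarted (mirrors the first example above).
NOT_RESTARTED = {
    "containerStatuses": [
        {"name": "test", "restartCount": 0,
         "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}}
    ],
    "phase": "Failed",
}

# Status of a pod that was restarted (mirrors the second example above).
RESTARTED = {
    "containerStatuses": [
        {"name": "test", "restartCount": 1,
         "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
         "state": {"running": None}}
    ],
    "phase": "Running",
}

if __name__ == "__main__":
    print(oom_killed_containers(NOT_RESTARTED), oom_killed_containers(RESTARTED))
```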

6.3.5. Understanding pod eviction

OpenShift Container Platform may evict a pod from its node when the node’s memory is exhausted. Depending on the extent of memory exhaustion, the eviction may or may not be graceful. Graceful eviction implies the main process (PID 1) of each container receiving a SIGTERM signal, then some time later a SIGKILL signal if the process has not exited already. Non-graceful eviction implies the main process of each container immediately receiving a SIGKILL signal.

An evicted pod will have phase Failed and reason Evicted. It will not be restarted, regardless of the value of restartPolicy. However, controllers such as the ReplicationController will notice the pod’s failed status and create a new pod to replace the old one.

$ oc get pod test
NAME      READY     STATUS    RESTARTS   AGE
test      0/1       Evicted   0          1m

$ oc get pod test -o yaml
...
status:
  message: 'Pod The node was low on resource: [MemoryPressure].'
  phase: Failed
  reason: Evicted

6.4. Configuring your cluster to place pods on overcommitted nodes

In an overcommitted state, the sum of the container compute resource requests and limits exceeds the resources available on the system. Overcommitment might be desirable in development environments where a tradeoff of guaranteed performance for capacity is acceptable.

Note

In OpenShift Container Platform, overcommitment is enabled by default. See Disabling overcommitment for a node.

6.4.1. Understanding overcommitment

Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.

OpenShift Container Platform administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.

Note

These overrides have no effect if no limits have been set on containers. Create a LimitRange object with default limits (per individual project, or in the project template) in order to ensure that the overrides apply.

After these overrides, the container limits and requests must still be validated by any LimitRange objects in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRanges with caution.

6.4.2. Understanding resource requests and overcommitment

For each compute resource, a container may specify a resource request and limit. Scheduling decisions are made based on the request to ensure that a node has enough capacity available to meet the requested value. If a container specifies limits, but omits requests, the requests are defaulted to the limits. A container is not able to exceed the specified limit on the node.

The enforcement of limits is dependent upon the compute resource type. If a container makes no request or limit, the container is scheduled to a node with no resource guarantees. In practice, the container is able to consume as much of the specified resource as is available with the lowest local priority. In low resource situations, containers that specify no resource requests are given the lowest quality of service.

Scheduling is based on resources requested, while quota and hard limits refer to resource limits, which can be set higher than requested resources. The difference between request and limit determines the level of overcommit. For instance, if a container is given a memory request of 1Gi and a memory limit of 2Gi, it is scheduled based on the 1Gi request being available on the node but could use up to 2Gi, so it is 200% overcommitted.
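The arithmetic behind the 200% figure can be sketched as follows. This is only an illustration: the function name `overcommit_percent` is hypothetical, and it assumes the overcommit level is expressed as the limit taken as a percentage of the request, which is what the 1Gi/2Gi example implies.

```python
def overcommit_percent(request: float, limit: float) -> float:
    """Return the limit as a percentage of the request.

    Both values must use the same unit (for example Gi).
    Hypothetical helper, not part of any OpenShift tooling.
    """
    if request <= 0:
        raise ValueError("request must be positive")
    return limit / request * 100


# The example from the text: a 1Gi request with a 2Gi limit.
print(overcommit_percent(1, 2))
```

A container whose request equals its limit comes out at 100%, meaning it is not overcommitted under this definition.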

6.4.2.1. Understanding Buffer Chunk Limiting for Fluentd

If the Fluentd logger is unable to keep up with a high number of logs, it will need to switch to file buffering to reduce memory usage and prevent data loss.

Fluentd file buffering stores records in chunks. Chunks are stored in buffers.

The Fluentd buffer_chunk_limit is determined by the environment variable BUFFER_SIZE_LIMIT, which has the default value 8m. The file buffer size per output is determined by the environment variable FILE_BUFFER_LIMIT, which has the default value 256Mi. The permanent volume size must be larger than FILE_BUFFER_LIMIT multiplied by the number of outputs.

On the Fluentd pods, the permanent volume /var/lib/fluentd should be provided by, for example, a PVC or hostmount. That area is then used for the file buffers.

The buffer_type and buffer_path are configured in the Fluentd configuration files as follows:

$ egrep "buffer_type|buffer_path" *.conf
output-es-config.conf:
  buffer_type file
  buffer_path /var/lib/fluentd/buffer-output-es-config
output-es-ops-config.conf:
  buffer_type file
  buffer_path /var/lib/fluentd/buffer-output-es-ops-config

The Fluentd buffer_queue_limit is the value of the variable BUFFER_QUEUE_LIMIT. This value is 32 by default.

The environment variable BUFFER_QUEUE_LIMIT is calculated as (FILE_BUFFER_LIMIT / (number_of_outputs * BUFFER_SIZE_LIMIT)).

If the BUFFER_QUEUE_LIMIT variable has the default set of values:

FILE_BUFFER_LIMIT = 256Mi
number_of_outputs = 1
BUFFER_SIZE_LIMIT = 8Mi

The value of buffer_queue_limit will be 32. To change the buffer_queue_limit, you must change the value of FILE_BUFFER_LIMIT.

In this formula, number_of_outputs is 1 if all the logs are sent to a single resource, and it is incremented by 1 for each additional resource. For example, the value of number_of_outputs is:

1 - if all logs are sent to a single Elasticsearch pod
2 - if application logs are sent to an Elasticsearch pod and ops logs are sent to another Elasticsearch pod
4 - if application logs are sent to an Elasticsearch pod, ops logs are sent to another Elasticsearch pod, and both of them are forwarded to other Fluentd instances
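The BUFFER_QUEUE_LIMIT formula can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, both sizes are passed as plain integers in Mi, and the result is rounded down to a whole number of chunks (the source does not say how a fractional result is handled).

```python
def buffer_queue_limit(file_buffer_limit_mi: int,
                       number_of_outputs: int,
                       buffer_size_limit_mi: int) -> int:
    """FILE_BUFFER_LIMIT / (number_of_outputs * BUFFER_SIZE_LIMIT).

    Sizes are in Mi. Integer division is an assumption of this sketch.
    """
    if number_of_outputs < 1 or buffer_size_limit_mi < 1:
        raise ValueError("outputs and chunk size must be at least 1")
    return file_buffer_limit_mi // (number_of_outputs * buffer_size_limit_mi)


# Default values from the text: 256Mi, one output, 8Mi chunks.
print(buffer_queue_limit(256, 1, 8))
```

With the defaults and two outputs, the same call yields 16, which shows why FILE_BUFFER_LIMIT has to grow if the queue limit should stay at 32 as outputs are added.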

6.4.3. Understanding compute resources and containers

The node-enforced behavior for compute resources is specific to the resource type.

6.4.3.1. Understanding container CPU requests

A container is guaranteed the amount of CPU it requests and is additionally able to consume excess CPU available on the node, up to any limit specified by the container. If multiple containers are attempting to use excess CPU, CPU time is distributed based on the amount of CPU requested by each container.

For example, if one container requested 500m of CPU time and another container requested 250m of CPU time, then any extra CPU time available on the node is distributed among the containers in a 2:1 ratio. If a container specified a limit, it will be throttled not to use more CPU than the specified limit. CPU requests are enforced using the CFS shares support in the Linux kernel. By default, CPU limits are enforced using the CFS quota support in the Linux kernel over a 100ms measuring interval, though this can be disabled.
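The proportional split described above can be illustrated with a simplified model. The sketch below is not how the kernel's CFS scheduler is implemented; it only reproduces the ratio arithmetic, ignores limits, and uses a hypothetical function name and container names.

```python
from typing import Dict


def split_excess_cpu(requests_m: Dict[str, int], excess_m: int) -> Dict[str, float]:
    """Divide excess CPU (in millicores) in proportion to each request.

    Simplified model: limits and idle containers are not considered.
    """
    total = sum(requests_m.values())
    if total <= 0:
        raise ValueError("at least one container must request CPU")
    return {name: excess_m * req / total for name, req in requests_m.items()}


# Two containers requesting 500m and 250m compete for 300m of spare CPU.
print(split_excess_cpu({"a": 500, "b": 250}, 300))
```

Container `a` receives twice as much of the spare CPU as container `b`, matching the 2:1 ratio of their requests.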

6.4.3.2. Understanding container memory requests

A container is guaranteed the amount of memory it requests. A container can use more memory than requested, but once it exceeds its requested amount, it could be terminated in a low memory situation on the node. If a container uses less memory than requested, it will not be terminated unless system tasks or daemons need more memory than was accounted for in the node’s resource reservation. If a container specifies a limit on memory, it is immediately terminated if it exceeds the limit amount.
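The three cases in this paragraph can be summarized as a small decision function. This is an illustrative sketch only: the function name and the returned strings are hypothetical, and the real outcome also depends on node state that is not modeled here.

```python
from typing import Optional


def memory_outcome(usage: int, request: int, limit: Optional[int] = None) -> str:
    """Classify a container's memory usage against its request and limit.

    All values use the same unit. A limit of None means no limit was set.
    """
    if limit is not None and usage > limit:
        # Exceeding the limit leads to immediate termination.
        return "terminated: limit exceeded"
    if usage > request:
        # Above the request, the container can be terminated
        # if the node runs low on memory.
        return "at risk: above request"
    # At or below the request, termination only happens if system
    # daemons need more memory than the node reserved for them.
    return "protected: within request"


print(memory_outcome(usage=300, request=256, limit=512))
```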

6.4.4. Understanding overcommitment and quality of service classes

A node is overcommitted when it has a pod scheduled that makes no request, or when the sum of limits across all pods on that node exceeds available machine capacity.

In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resource than is available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class.

For each compute resource, a container is classified into one of three QoS classes, listed in decreasing order of priority:

Table 6.19. Quality of Service Classes
Priority	Class Name	Description
1 (highest)	Guaranteed	If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed.
2	Burstable	If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable.
3 (lowest)	BestEffort	If requests and limits are not set for any of the resources, then the container is classified as BestEffort.

Memory is an incompressible resource, so in low memory situations, containers that have the lowest priority are terminated first:

Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
Burstable containers under system memory pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.
BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of memory.
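The classification rules in Table 6.19 can be sketched for a single compute resource as follows. This is a simplified reading of the table, not the kubelet's actual implementation; the function name is hypothetical, and a value of 0 or None is treated as "not set", as the table describes.

```python
from typing import Optional


def qos_class(request: Optional[int], limit: Optional[int]) -> str:
    """Return the QoS class for one resource of one container."""
    # Treat 0 the same as "not set".
    request = request or None
    limit = limit or None
    if request is None and limit is None:
        return "BestEffort"
    # A limit with no request counts as Guaranteed, because requests
    # default to the limit when they are omitted.
    if limit is not None and (request is None or request == limit):
        return "Guaranteed"
    return "Burstable"


for req, lim in [(512, 512), (256, 512), (None, None)]:
    print(req, lim, qos_class(req, lim))
```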

6.4.4.1. Understanding how to reserve memory across quality of service tiers

You can use the qos-reserved parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources to exclude pods from lower QoS classes from using resources requested by pods in higher QoS classes.

OpenShift Container Platform uses the qos-reserved parameter as follows:

A value of qos-reserved=memory=100% will prevent the Burstable and BestEffort QoS classes from consuming memory that was requested by a higher QoS class. This increases the risk of inducing OOM on BestEffort and Burstable workloads in favor of increasing memory resource guarantees for Guaranteed and Burstable workloads.
A value of qos-reserved=memory=50% will allow the Burstable and BestEffort QoS classes to consume half of the memory requested by a higher QoS class.
A value of qos-reserved=memory=0% will allow the Burstable and BestEffort QoS classes to consume up to the full node allocatable amount if available, but increases the risk that a Guaranteed workload will not have access to requested memory. This condition effectively disables this feature.
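The effect of the three percentages can be approximated with simple arithmetic. The sketch below is only a model of the behavior described in this list: the function name is hypothetical, and it assumes the percentage applies linearly to the memory requested by the higher QoS class.

```python
def memory_open_to_lower_classes(requested_mi: float, reserved_percent: float) -> float:
    """Memory (in Mi) requested by a higher QoS class that lower
    classes may still consume for a given qos-reserved percentage."""
    if not 0 <= reserved_percent <= 100:
        raise ValueError("reserved_percent must be between 0 and 100")
    return requested_mi * (100 - reserved_percent) / 100


# A higher-class pod has requested 1024Mi.
for percent in (100, 50, 0):
    print(percent, memory_open_to_lower_classes(1024, percent))
```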

6.4.5. Understanding swap memory and QoS

You can disable swap by default on your nodes in order to preserve quality of service (QoS) guarantees. Otherwise, physical resources on a node can oversubscribe, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.

For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.

Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, so pods do not receive the memory they requested at scheduling time. As a result, additional pods are placed on the node, which further increases memory pressure and ultimately increases your risk of experiencing a system out of memory (OOM) event.

Important

If swap is enabled, any out-of-resource handling eviction thresholds for available memory will not work as expected. Take advantage of out-of-resource handling to allow pods to be evicted from a node when it is under memory pressure, and rescheduled on an alternative node that has no such pressure.

6.4.6. Understanding nodes overcommitment

In an overcommitted environment, it is important to properly configure your node to provide best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.

You can view the current setting by running the following commands on your nodes:

$ sysctl -a |grep commit

vm.overcommit_memory = 1

$ sysctl -a |grep panic
vm.panic_on_oom = 0

Note

The above flags should already be set on nodes, and no further action is required.
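If you want to verify these two values across many nodes, the sysctl output can be checked programmatically. The following is a minimal sketch: the function names are hypothetical, and it assumes you have already captured the `sysctl` output as text in the `key = value` format shown above.

```python
from typing import Dict

# Values this section says OpenShift Container Platform sets on nodes.
EXPECTED = {"vm.overcommit_memory": "1", "vm.panic_on_oom": "0"}


def parse_sysctl(output: str) -> Dict[str, str]:
    """Parse 'key = value' lines into a dictionary."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def unexpected_settings(output: str) -> Dict[str, str]:
    """Return expected keys whose value is missing or different."""
    found = parse_sysctl(output)
    return {key: found.get(key, "<missing>")
            for key, value in EXPECTED.items()
            if found.get(key) != value}


sample = "vm.overcommit_memory = 1\nvm.panic_on_oom = 0\n"
print(unexpected_settings(sample))
```

An empty dictionary means both parameters have the values described in this section.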

You can also perform the following configurations for each node:

Disable or enforce CPU limits using CPU CFS quotas
Reserve resources for system processes
Reserve memory across quality of service tiers

6.4.6.1. Disabling or enforcing CPU limits using CPU CFS quotas

Nodes by default enforce specified CPU limits using the Completely Fair Scheduler (CFS) quota support in the Linux kernel.

Prerequisites

Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:

View the Machine Config Pool:

$ oc describe machineconfigpool <name>

For example:

$ oc describe machineconfigpool worker

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: 2019-02-08T14:52:39Z
  generation: 1
  labels:
    custom-kubelet: small-pods 1

1: If a label has been added it appears under labels.

If the label is not present, add a key/value pair:

$ oc label machineconfigpool worker custom-kubelet=small-pods

Procedure

Create a Custom Resource (CR) for your configuration change.

Sample configuration for disabling CPU limits

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: disable-cpu-units 1
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods 2
  kubeletConfig:
    cpu-cfs-quota: 3
      - "false"

1: Assign a name to the CR.
2: Specify the label to apply the configuration change.
3: Set the cpu-cfs-quota parameter to false.

If CPU limit enforcement is disabled, it is important to understand the impact that will have on your node:

If a container makes a request for CPU, it will continue to be enforced by CFS shares in the Linux kernel.
If a container makes no explicit request for CPU, but it does specify a limit, the request will default to the specified limit, and be enforced by CFS shares in the Linux kernel.
If a container specifies both a request and a limit for CPU, the request will be enforced by CFS shares in the Linux kernel, and the limit will have no impact on the node.
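The three cases above can be expressed as a small function. This is an illustrative model, not kubelet code: the function name and the returned keys `shares_m` and `quota_m` are hypothetical, and CPU values are in millicores.

```python
from typing import Dict, Optional


def cpu_enforcement(request_m: Optional[int],
                    limit_m: Optional[int],
                    cfs_quota_enabled: bool) -> Dict[str, Optional[int]]:
    """Return what the node enforces for a container's CPU settings."""
    # With no explicit request, the request defaults to the limit.
    effective_request = request_m if request_m is not None else limit_m
    # The limit is enforced through CFS quota only while quota
    # enforcement is enabled; otherwise it has no effect on the node.
    enforced_limit = limit_m if cfs_quota_enabled else None
    return {"shares_m": effective_request, "quota_m": enforced_limit}


# Request 250m, limit 500m, with CPU limit enforcement disabled.
print(cpu_enforcement(250, 500, cfs_quota_enabled=False))
```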

6.4.6.2. Reserving resources for system processes

To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by system daemons, such as sshd, that are required to run on your node for your cluster to function. In particular, it is recommended that you reserve incompressible resources such as memory.

Procedure

To explicitly reserve resources for non-pod processes, allocate node resources by specifying resources available for scheduling. For more details, see Allocating Resources for Nodes.

6.4.7. Disabling overcommitment for a node

When enabled, overcommitment can be disabled on each node.

Procedure

To disable overcommitment on a node, run the following command on that node:

$ sysctl -w vm.overcommit_memory=0

6.4.8. Disabling overcommitment for a project

When enabled, overcommitment can be disabled per-project. For example, you can allow infrastructure components to be configured independently of overcommitment.

Procedure

To disable overcommitment in a project:

Edit the project object file.

Add the following annotation:

quota.openshift.io/cluster-resource-override-enabled: "false"

Create the project object:
```
$ oc create -f <file-name>.yaml
```

6.5. Enabling features using feature gates

As an administrator, you can turn on features that are in Technology Preview status.

6.5.1. Understanding feature gates and Technology Preview features

You can use the Feature Gate Custom Resource to enable Technology Preview features throughout your cluster. This allows you, for example, to enable Technology Preview features on test clusters where you can fully test them while ensuring they are disabled on production clusters.

Important

After turning Technology Preview features on using feature gates, they cannot be turned off and cluster upgrades are prevented.

For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.

6.5.2. Features that are affected by FeatureGates

The following Technology Preview features are included in OpenShift Container Platform:

FeatureGate	Description	Default
`ExperimentalCriticalPodAnnotation`	Enables annotating specific Pods as critical so that their scheduling is guaranteed.	True
`RotateKubeletServerCertificate`	Enables the rotation of the server TLS certificate on the cluster.	True
`SupportPodPidsLimit`	Enables support for limiting the number of processes (PIDs) running in a Pod.	True
`MachineHealthCheck`	Enables automatically repairing unhealthy machines in a machine pool.	False
`CSIBlockVolume`	Enables external CSI drivers to implement raw block volume support.	False
`LocalStorageCapacityIsolation`	Enable the consumption of local ephemeral storage and also the `sizeLimit` property of an `emptyDir` volume.	False

You can enable the MachineHealthCheck and CSIBlockVolume features by editing the Feature Gate Custom Resource. Turning on these features cannot be undone and prevents you from upgrading your cluster.

The LocalStorageCapacityIsolation feature cannot be enabled.

6.5.3. Enabling Technology Preview features using feature gates

You can turn on the MachineHealthCheck and CSIBlockVolume Technology Preview features for all nodes in the cluster by editing the Feature Gate Custom Resource, named cluster, in the openshift-config project.

Important

Turning on Technology Preview features using the Feature Gate Custom Resource cannot be undone and prevents upgrades.

Procedure

To turn on the Technology Preview features for the entire cluster:

In the OpenShift Container Platform web console, switch to the Administration → Custom Resource Definitions page.
On the Custom Resource Definitions page, click FeatureGate.
On the Custom Resource Definitions page, click Actions → View Instances.
On the Feature Gates page, click Create Feature Gates.

Add the featureSet parameter:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: "TechPreviewNoUpgrade" 1

1: Add the featureSet: "TechPreviewNoUpgrade" parameter.

Click Save.
