Search

Cluster Administration

download PDF
OpenShift Container Platform 3.11

OpenShift Container Platform 3.11 Cluster Administration

Red Hat OpenShift Documentation Team

Abstract

OpenShift Cluster Administration topics cover the day to day tasks for managing your OpenShift cluster and other advanced configuration topics.

Chapter 1. Overview

 
These Cluster Administration topics cover the day-to-day tasks for managing your OpenShift Container Platform cluster and other advanced configuration topics.

Chapter 2. Managing nodes

2.1. Overview

You can manage nodes in your instance using the CLI.

When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. The master uses the information from node objects to validate nodes with health checks.

2.2. Listing nodes

To list all nodes that are known to the master:

$ oc get nodes

Example Output

NAME                   STATUS    ROLES     AGE       VERSION
master.example.com     Ready     master    7h        v1.9.1+a0ce1bc657
node1.example.com      Ready     compute   7h        v1.9.1+a0ce1bc657
node2.example.com      Ready     compute   7h        v1.9.1+a0ce1bc657

To list all nodes with information on a project’s pod deployment with node information:

$ oc get nodes -o wide

Example Output

NAME                           STATUS    ROLES     AGE       VERSION           EXTERNAL-IP      OS-IMAGE                                      KERNEL-VERSION          CONTAINER-RUNTIME
ip-172-18-0-39.ec2.internal    Ready     infra     1d        v1.10.0+b81c8f8   54.172.185.130   Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   docker://1.13.1
ip-172-18-10-95.ec2.internal   Ready     master    1d        v1.10.0+b81c8f8   54.88.22.81      Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   docker://1.13.1
ip-172-18-8-35.ec2.internal    Ready     compute   1d        v1.10.0+b81c8f8   34.230.50.57     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   docker://1.13.1

To list only information about a single node, replace <node> with the full node name:

$ oc get node <node>

The STATUS column in the output of these commands can show nodes with the following conditions:

Table 2.1. Node Conditions
ConditionDescription

Ready

The node is passing the health checks performed from the master by returning StatusOK.

NotReady

The node is not passing the health checks performed from the master.

SchedulingDisabled

Pods cannot be scheduled for placement on the node.

Note

The STATUS column can also show Unknown for a node if the CLI cannot find any node condition.

To get more detailed information about a specific node, including the reason for the current condition:

$ oc describe node <node>

For example:

$ oc describe node node1.example.com

Example Output

Name:               node1.example.com 1
Roles:              compute 2
Labels:             beta.kubernetes.io/arch=amd64 3
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=m01.example.com
                    node-role.kubernetes.io/compute=true
                    node-role.kubernetes.io/infra=true
                    node-role.kubernetes.io/master=true
                    zone=default
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true  4
CreationTimestamp:  Thu, 24 May 2018 11:46:56 -0400
Taints:             <none>   5
Unschedulable:      false
Conditions:                  6
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Tue, 17 Jul 2018 11:47:30 -0400   Tue, 10 Jul 2018 15:45:16 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Tue, 17 Jul 2018 11:47:30 -0400   Tue, 10 Jul 2018 15:45:16 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 17 Jul 2018 11:47:30 -0400   Tue, 10 Jul 2018 16:03:54 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Tue, 17 Jul 2018 11:47:30 -0400   Mon, 16 Jul 2018 15:10:25 -0400   KubeletReady                 kubelet is posting ready status
  PIDPressure      False   Tue, 17 Jul 2018 11:47:30 -0400   Thu, 05 Jul 2018 10:06:51 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
Addresses:                   7
  InternalIP:  192.168.122.248
  Hostname:    node1.example.com
Capacity:                    8
 cpu:            2
 hugepages-2Mi:  0
 memory:         8010336Ki
 pods:           40
Allocatable:
 cpu:            2
 hugepages-2Mi:  0
 memory:         7907936Ki
 pods:           40
System Info:                 9
 Machine ID:                         b3adb9acbc49fc1f9a7d6
 System UUID:                        B3ADB9A-B0CB-C49FC1F9A7D6
 Boot ID:                            9359d15aec9-81a20aef5876
 Kernel Version:                     3.10.0-693.21.1.el7.x86_64
 OS Image:                           OpenShift Enterprise
 Operating System:                   linux
 Architecture:                       amd64
 Container Runtime Version:          docker://1.13.1
 Kubelet Version:                    v1.10.0+b81c8f8
 Kube-Proxy Version:                 v1.10.0+b81c8f8
ExternalID:                          node1.example.com
Non-terminated Pods:                 (14 in total)       10
  Namespace                          Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                          ----                                  ------------  ----------  ---------------  -------------
  default                            docker-registry-2-w252l               100m (5%)     0 (0%)      256Mi (3%)       0 (0%)
  default                            registry-console-2-dpnc9              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                            router-2-5snb2                        100m (5%)     0 (0%)      256Mi (3%)       0 (0%)
  kube-service-catalog               apiserver-jh6gt                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-service-catalog               controller-manager-z4t5j              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                        master-api-m01.example.com            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                        master-controllers-m01.example.com    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                        master-etcd-m01.example.com           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-ansible-service-broker   asb-1-hnn5t                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-node                     sync-dvhvs                            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-sdn                      ovs-zjs5k                             100m (5%)     200m (10%)  300Mi (3%)       400Mi (5%)
  openshift-sdn                      sdn-zr4cb                             100m (5%)     0 (0%)      200Mi (2%)       0 (0%)
  openshift-template-service-broker  apiserver-s9n7t                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-web-console              webconsole-785689b664-q7s9j           100m (5%)     0 (0%)      100Mi (1%)       0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  500m (25%)    200m (10%)  1112Mi (14%)     400Mi (5%)
Events:                                                  11
  Type     Reason                   Age                From                      Message
  ----     ------                   ----               ----                      -------
  Normal   NodeHasSufficientPID     6d (x5 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  6d                 kubelet, m01.example.com  Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientDisk    6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientDisk
  Normal   NodeHasSufficientPID     6d                 kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientPID
  Normal   Starting                 6d                 kubelet, m01.example.com  Starting kubelet.
 ...

1
The name of the node.
2
The role of the node, either master, compute, or infra.
3
The labels applied to the node.
4
The annotations applied to the node.
5
The taints applied to the node.
6
7
The IP address and host name of the node.
8
9
Information about the node host.
10
The pods on the node.
11
The events reported by the node.

2.3. Viewing nodes

You can display usage statistics about nodes, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.

To view the usage statistics:

$ oc adm top nodes

Example Output

NAME       CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
node-1     297m         29%       4263Mi          55%
node-0     55m          5%        1201Mi          15%
infra-1    85m          8%        1319Mi          17%
infra-0    182m         18%       2524Mi          32%
master-0   178m         8%        2584Mi          16%

To view the usage statistics for nodes with labels:

$ oc adm top node --selector=''

You must choose the selector (label query) to filter on. Supports =, ==, and !=.

Note

You must have cluster-reader permission to view the usage statistics.

Note

The metrics-server must be installed to view the usage statistics. See Requirements for Using Horizontal Pod Autoscalers.

2.4. Adding hosts

You can add new hosts to your cluster by running the scaleup.yml playbook. This playbook queries the master, generates and distributes new certificates for the new hosts, and then runs the configuration playbooks on only the new hosts. Before running the scaleup.yml playbook, complete all prerequisite host preparation steps.

Important

The scaleup.yml playbook configures only the new host. It does not update NO_PROXY in master services, and it does not restart master services.

You must have an existing inventory file, for example /etc/ansible/hosts, that is representative of your current cluster configuration in order to run the scaleup.yml playbook. If you previously used the atomic-openshift-installer command to run your installation, you can check ~/.config/openshift/hosts for the last inventory file that the installer generated and use that file as your inventory file. You can modify this file as required. You must then specify the file location with -i when you run the ansible-playbook.

Important

See the cluster maximums section for the recommended maximum number of nodes.

Procedure
  1. Ensure you have the latest playbooks by updating the openshift-ansible package:

    # yum update openshift-ansible
  2. Edit your /etc/ansible/hosts file and add new_<host_type> to the [OSEv3:children] section. For example, to add a new node host, add new_nodes:

    [OSEv3:children]
    masters
    nodes
    new_nodes

    To add new master hosts, add new_masters.

  3. Create a [new_<host_type>] section to specify host information for the new hosts. Format this section like an existing section, as shown in the following example of adding a new node:

    [nodes]
    master[1:3].example.com
    node1.example.com openshift_node_group_name='node-config-compute'
    node2.example.com openshift_node_group_name='node-config-compute'
    infra-node1.example.com openshift_node_group_name='node-config-infra'
    infra-node2.example.com openshift_node_group_name='node-config-infra'
    
    [new_nodes]
    node3.example.com openshift_node_group_name='node-config-infra'

    See Configuring Host Variables for more options.

    When adding new masters, add hosts to both the [new_masters] section and the [new_nodes] section to ensure that the new master host is part of the OpenShift SDN:

    [masters]
    master[1:2].example.com
    
    [new_masters]
    master3.example.com
    
    [nodes]
    master[1:2].example.com
    node1.example.com openshift_node_group_name='node-config-compute'
    node2.example.com openshift_node_group_name='node-config-compute'
    infra-node1.example.com openshift_node_group_name='node-config-infra'
    infra-node2.example.com openshift_node_group_name='node-config-infra'
    
    [new_nodes]
    master3.example.com
    Important

    If you label a master host with the node-role.kubernetes.io/infra=true label and have no other dedicated infrastructure nodes, you must also explicitly mark the host as schedulable by adding openshift_schedulable=true to the entry. Otherwise, the registry and router pods cannot be placed anywhere.

  4. Change to the playbook directory and run the openshift_node_group.yml playbook. If your inventory file is located somewhere other than the default of /etc/ansible/hosts, specify the location with the -i option:

    $ cd /usr/share/ansible/openshift-ansible
    $ ansible-playbook [-i /path/to/file] \
      playbooks/openshift-master/openshift_node_group.yml

    This creates the ConfigMap for the new node groups, and ultimately, the configuration file of the node on the host.

    Note

    Running the openshift_node_group.yaml playbook only updates new nodes. It cannot be run to update existing nodes in a cluster.

  5. Run the scaleup.yml playbook. If your inventory file is located somewhere other than the default of /etc/ansible/hosts, specify the location with the -i option.

    • For additional nodes:

      $ ansible-playbook [-i /path/to/file] \
          playbooks/openshift-node/scaleup.yml
    • For additional masters:

      $ ansible-playbook [-i /path/to/file] \
          playbooks/openshift-master/scaleup.yml
  6. Set the node label to logging-infra-fluentd=true, if you deployed the EFK stack in your cluster:

    # oc label node/new-node.example.com logging-infra-fluentd=true
  7. After the playbook runs, verify the installation.
  8. Move any hosts that you defined in the [new_<host_type>] section to their appropriate section. By moving these hosts, subsequent playbook runs that use this inventory file treat the nodes correctly. You can keep the empty [new_<host_type>] section. For example, when adding new nodes:

    [nodes]
    master[1:3].example.com
    node1.example.com openshift_node_group_name='node-config-compute'
    node2.example.com openshift_node_group_name='node-config-compute'
    node3.example.com openshift_node_group_name='node-config-compute'
    infra-node1.example.com openshift_node_group_name='node-config-infra'
    infra-node2.example.com openshift_node_group_name='node-config-infra'
    
    [new_nodes]

2.5. Deleting nodes

When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node itself are not deleted. Any bare pods not backed by a replication controller would be inaccessible to OpenShift Container Platform, pods backed by replication controllers would be rescheduled to other available nodes, and local manifest pods would need to be manually deleted.

To delete a node from the OpenShift Container Platform cluster:

  1. Evacuate pods from the node you are preparing to delete.
  2. Delete the node object:

    $ oc delete node <node>
  3. Check that the node has been removed from the node list:

    $ oc get nodes

    Pods should now be only scheduled for the remaining nodes that are in Ready state.

  4. If you want to uninstall all OpenShift Container Platform content from the node host, including all pods and containers, continue to Uninstalling Nodes and follow the procedure using the uninstall.yml playbook. The procedure assumes general understanding of the cluster installation process using Ansible.

2.6. Updating labels on nodes

To add or update labels on a node:

$ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>

To see more detailed usage:

$ oc label -h

2.7. Listing pods on nodes

To list all or selected pods on one or more nodes:

$ oc adm manage-node <node1> <node2> \
    --list-pods [--pod-selector=<pod_selector>] [-o json|yaml]

To list all or selected pods on selected nodes:

$ oc adm manage-node --selector=<node_selector> \
    --list-pods [--pod-selector=<pod_selector>] [-o json|yaml]

2.8. Marking nodes as unschedulable or schedulable

By default, healthy nodes with a Ready status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.

To mark a node or nodes as unschedulable:

$ oc adm manage-node <node1> <node2> --schedulable=false

For example:

$ oc adm manage-node node1.example.com --schedulable=false

Example Output

NAME                 LABELS                                        STATUS
node1.example.com    kubernetes.io/hostname=node1.example.com      Ready,SchedulingDisabled

To mark a currently unschedulable node or nodes as schedulable:

$ oc adm manage-node <node1> <node2> --schedulable

Alternatively, instead of specifying specific node names (e.g., <node1> <node2>), you can use the --selector=<node_selector> option to mark selected nodes as schedulable or unschedulable.

2.9. Evacuating pods on nodes

Evacuating pods allows you to migrate all or selected pods from a given node. Nodes must first be marked unschedulable to perform pod evacuation.

Only pods backed by a replication controller can be evacuated; the replication controllers create new pods on other nodes and remove the existing pods from the specified node(s). Bare pods, meaning those not backed by a replication controller, are unaffected by default. You can evacuate a subset of pods by specifying a pod-selector. Pod selector is based on labels, so all the pods with the specified label will be evacuated.

To evacuate all or selected pods on a node:

$ oc adm drain <node> [--pod-selector=<pod_selector>]

You can force deletion of bare pods by using the --force option. When set to true, deletion continues even if there are pods not managed by a replication controller, ReplicaSet, job, daemonset, or StatefulSet:

$ oc adm drain <node> --force=true

You can use --grace-period to set a period of time in seconds for each pod to terminate gracefully. If negative, the default value specified in the pod is used:

$ oc adm drain <node> --grace-period=-1

You can use --ignore-daemonsets and set it to true to ignore daemonset-managed pods:

$ oc adm drain <node> --ignore-daemonsets=true

You can use --timeout to set the length of time to wait before giving up. A value of 0 sets an infinite length of time:

$ oc adm drain <node> --timeout=5s

You can use --delete-local-data and set it to true to continue deletion even if there are pods using emptyDir (local data that is deleted when the node is drained):

$ oc adm drain <node> --delete-local-data=true

To list objects that will be migrated without actually performing the evacuation, use the --dry-run option and set it to true:

$ oc adm drain <node> --dry-run=true

Instead of specifying a specific node name, you can use the --selector=<node_selector> option to evacuate pods on nodes that match the selector.

2.10. Rebooting nodes

To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.

Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.

2.10.1. Infrastructure nodes

Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes available to run infrastructure. The scenario below demonstrates a common mistake that can lead to service interruptions for the applications running on OpenShift Container Platform when only two nodes are available.

  • Node A is marked unschedulable and all pods are evacuated.
  • The registry pod running on that node is now redeployed on node B. This means node B is now running both registry pods.
  • Node B is now marked unschedulable and is evacuated.
  • The service exposing the two pod endpoints on node B, for a brief period of time, loses all endpoints until they are redeployed to node A.

The same process using three infrastructure nodes does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back in to rotation is left running zero registries. The other two nodes will run two and one registries respectively. The best solution is to rely on pod anti-affinity. This is an alpha feature in Kubernetes that is available for testing now, but is not yet supported for production workloads.

2.10.2. Using pod anti-affinity

Pod anti-affinity is slightly different than node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 100 3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: docker-registry 4
              operator: In 5
              values:
              - default
          topologyKey: kubernetes.io/hostname
1
Stanza to configure pod anti-affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4
Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

This example assumes the container image registry pod has a label of docker-registry=default. Pod anti-affinity can use any Kubernetes match expression.

The last required step is to enable the MatchInterPodAffinity scheduler predicate in /etc/origin/master/scheduler.json. With this in place, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.

2.10.3. Handling nodes running routers

In most cases, a pod running an OpenShift Container Platform router will expose a host port. The PodFitsPorts scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers are relying on IP failover for high availability, there is nothing else that is needed. For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service’s responsibility to react to router pod restarts.

In rare cases, a router pod might not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.

2.11. Modifying Nodes

During installation, OpenShift Container Platform creates a configmap in the openshift-node project for each type of node group:

  • node-config-master
  • node-config-infra
  • node-config-compute
  • node-config-all-in-one
  • node-config-master-infra

To make configuration changes to an existing node, edit the appropriate configuration map. A sync pod on each node watches for changes in the configuration maps. During installation, the sync pods are created by using sync Daemonsets, and a /etc/origin/node/node-config.yaml file, where the node configuration parameters reside, is added to each node. When a sync pod detects configuration map change, it updates the node-config.yaml on all nodes in that node group and restarts the atomic-openshift-node.service on the appropriate nodes.

$ oc get cm -n openshift-node

Example Output

NAME                       DATA      AGE
node-config-all-in-one     1         1d
node-config-compute        1         1d
node-config-infra          1         1d
node-config-master         1         1d
node-config-master-infra   1         1d

Sample configuration map for the node-config-compute group

apiVersion: v1
authConfig:      1
  authenticationCacheSize: 1000
  authenticationCacheTTL: 5m
  authorizationCacheSize: 1000
  authorizationCacheTTL: 5m
dnsBindAddress: 127.0.0.1:53
dnsDomain: cluster.local
dnsIP: 0.0.0.0               2
dnsNameservers: null
dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
dockerConfig:
  dockerShimRootDirectory: /var/lib/dockershim
  dockerShimSocket: /var/run/dockershim.sock
  execHandlerName: native
enableUnidling: true
imageConfig:
  format: registry.reg-aws.openshift.com/openshift3/ose-${component}:${version}
  latest: false
iptablesSyncPeriod: 30s
kind: NodeConfig
kubeletArguments: 3
  bootstrap-kubeconfig:
  - /etc/origin/node/bootstrap.kubeconfig
  cert-dir:
  - /etc/origin/node/certificates
  cloud-config:
  - /etc/origin/cloudprovider/aws.conf
  cloud-provider:
  - aws
  enable-controller-attach-detach:
  - 'true'
  feature-gates:
  - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
  node-labels:
  - node-role.kubernetes.io/compute=true
  pod-manifest-path:
  - /etc/origin/node/pods  4
  rotate-certificates:
  - 'true'
masterClientConnectionOverrides:
  acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
  burst: 40
  contentType: application/vnd.kubernetes.protobuf
  qps: 20
masterKubeConfig: node.kubeconfig
networkConfig:   5
  mtu: 8951
  networkPluginName: redhat/openshift-ovs-subnet  6
servingInfo:                   7
  bindAddress: 0.0.0.0:10250
  bindNetwork: tcp4
  clientCA: client-ca.crt 8
volumeConfig:
  localQuota:
    perFSGroup: null
volumeDirectory: /var/lib/origin/openshift.local.volumes

1
Authentication and authorization configuration options.
2
IP address prepended to a pod’s /etc/resolv.conf.
3
Key value pairs that are passed directly to the Kubelet that match the Kubelet’s command line arguments.
4
The path to the pod manifest file or directory. A directory must contain one or more manifest files. OpenShift Container Platform uses the manifest files to create pods on the node.
5
The pod network settings on the node.
6
Software defined network (SDN) plug-in. Set to redhat/openshift-ovs-subnet for the ovs-subnet plug-in; redhat/openshift-ovs-multitenant for the ovs-multitenant plug-in; or redhat/openshift-ovs-networkpolicy for the ovs-networkpolicy plug-in.
7
Certificate information for the node.
8
Optional: PEM-encoded certificate bundle. If set, a valid client certificate must be presented and validated against the certificate authorities in the specified file before the request headers are checked for user names.
Note

Do not manually modify the /etc/origin/node/node-config.yaml file.

2.11.1. Configuring Node Resources

You can configure node resources by adding kubelet arguments to the node configuration map.

  1. Edit the configuration map:

    $ oc edit cm node-config-compute -n openshift-node
  2. Add the kubeletArguments section and specify your options:

    kubeletArguments:
      max-pods: 1
        - "40"
      resolv-conf: 2
        - "/etc/resolv.conf"
      image-gc-high-threshold: 3
        - "90"
      image-gc-low-threshold: 4
        - "80"
      kube-api-qps: 5
        - "20"
      kube-api-burst: 6
        - "40"
    1
    2
    Resolver configuration file used as the basis for the container DNS resolution configuration.
    3
    The percent of disk usage after which image garbage collection is always run. Default: 90%
    4
    The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Default: 80%
    5
    The queries per second (QPS) to use while talking with the Kubernetes API server.
    6
    The burst to use while talking with the Kubernetes API server.

    To view all available kubelet options:

    $ hyperkube kubelet -h

2.11.2. Setting maximum pods per node

Note

See the Cluster maximums page for the maximum supported limits for each version of OpenShift Container Platform.

In the /etc/origin/node/node-config.yaml file, one parameter controls the maximum number of pods that can be scheduled to a node: max-pods. When the max-pods option is in use, it limits the number of pods on a node. Exceeding this value can result in:

  • Increased CPU utilization on both OpenShift Container Platform and Docker.
  • Slow pod scheduling.
  • Potential out-of-memory scenarios (depends on the amount of memory in the node).
  • Exhausting the pool of IP addresses.
  • Resource overcommitting, leading to poor user application performance.
Note

In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.

max-pods sets the number of pods the node can run to a fixed value, regardless of the properties of the node. Cluster Limits documents maximum supported values for max-pods.

kubeletArguments:
  max-pods:
    - "250"

Using the above example, the default value for max-pods is 250.

2.12. Resetting Docker storage

As you download container images and run and delete containers, Docker does not always free up mapped disk space. As a result, over time you can run out of space on a node, which might prevent OpenShift Container Platform from being able to create new pods or cause pod creation to take several minutes.

For example, the following shows pods that are still in the ContainerCreating state after six minutes and the events log shows a FailedSync event.

$ oc get pod

Example Output

NAME                               READY     STATUS              RESTARTS   AGE
cakephp-mysql-persistent-1-build   0/1       ContainerCreating   0          6m
mysql-1-9767d                      0/1       ContainerCreating   0          2m
mysql-1-deploy                     0/1       ContainerCreating   0          6m

$ oc get events

Example Output

LASTSEEN   FIRSTSEEN   COUNT     NAME                               KIND                    SUBOBJECT                     TYPE      REASON                         SOURCE                                                 MESSAGE
6m         6m          1         cakephp-mysql-persistent-1-build   Pod                                                   Normal    Scheduled                      default-scheduler                                      Successfully assigned cakephp-mysql-persistent-1-build to ip-172-31-71-195.us-east-2.compute.internal
2m         5m          4         cakephp-mysql-persistent-1-build   Pod                                                   Warning   FailedSync                     kubelet, ip-172-31-71-195.us-east-2.compute.internal   Error syncing pod
2m         4m          4         cakephp-mysql-persistent-1-build   Pod                                                   Normal    SandboxChanged                 kubelet, ip-172-31-71-195.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.

One solution to this problem is to reset Docker storage to remove artifacts not needed by Docker.

On the node where you want to restart Docker storage:

  1. Run the following command to mark the node as unschedulable:

    $ oc adm manage-node <node> --schedulable=false
  2. Run the following command to shut down Docker and the atomic-openshift-node service:

    $ systemctl stop docker atomic-openshift-node
  3. Run the following command to remove the local volume directory:

    $ rm -rf /var/lib/origin/openshift.local.volumes

    This command clears the local image cache. As a result, images, including ose-* images, will need to be re-pulled. This might result in slower pod start times while the image store recovers.

  4. Remove the /var/lib/docker directory:

    $ rm -rf /var/lib/docker
  5. Run the following command to reset the Docker storage:

    $ docker-storage-setup --reset
  6. Run the following command to recreate the Docker storage:

    $ docker-storage-setup
  7. Recreate the /var/lib/docker directory:

    $ mkdir /var/lib/docker
  8. Run the following command to restart Docker and the atomic-openshift-node service:

    $ systemctl start docker atomic-openshift-node
  9. Restart the node service by rebooting the host:

    # systemctl restart atomic-openshift-node.service
  10. Run the following command to mark the node as schedulable:

    $ oc adm manage-node <node> --schedulable=true

Chapter 3. Restoring OpenShift Container Platform components

3.1. Overview

In OpenShift Container Platform, you can restore your cluster and its components by recreating cluster elements, including nodes and applications, from separate storage.

To restore a cluster, you must first back it up.

Important

The following process describes a generic way of restoring applications and the OpenShift Container Platform cluster. It cannot take into account custom requirements. You might need to take additional actions to restore your cluster.

3.2. Restoring a cluster

To restore a cluster, first reinstall OpenShift Container Platform.

Procedure
  1. Reinstall OpenShift Container Platform in the same way that you originally installed OpenShift Container Platform.
  2. Run all of your custom post-installation steps, such as changing services outside of the control of OpenShift Container Platform or installing extra services like monitoring agents.

3.3. Restoring a master host backup

After creating a backup of important master host files, if they become corrupted or accidentally removed, you can restore the files by copying the files back to master, ensuring they contain the proper content, and restarting the affected services.

Procedure
  1. Restore the /etc/origin/master/master-config.yaml file:

    # MYBACKUPDIR=*/backup/$(hostname)/$(date +%Y%m%d)*
    # cp /etc/origin/master/master-config.yaml /etc/origin/master/master-config.yaml.old
    # cp /backup/$(hostname)/$(date +%Y%m%d)/origin/master/master-config.yaml /etc/origin/master/master-config.yaml
    # master-restart api
    # master-restart controllers
    Warning

    Restarting the master services can lead to downtime. However, you can remove the master host from the highly available load balancer pool, then perform the restore operation. Once the service has been properly restored, you can add the master host back to the load balancer pool.

    Note

    Perform a full reboot of the affected instance to restore the iptables configuration.

  2. If you cannot restart OpenShift Container Platform because packages are missing, reinstall the packages.

    1. Get the list of the current installed packages:

      $ rpm -qa | sort > /tmp/current_packages.txt
    2. View the differences between the package lists:

      $ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
      
      > ansible-2.4.0.0-5.el7.noarch
    3. Reinstall the missing packages:

      # yum reinstall -y <packages> 1
      1
      Replace <packages> with the packages that are different between the package lists.
  3. Restore a system certificate by copying the certificate to the /etc/pki/ca-trust/source/anchors/ directory and execute update-ca-trust:

    $ MYBACKUPDIR=*/backup/$(hostname)/$(date +%Y%m%d)*
    $ sudo cp ${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors/<certificate> /etc/pki/ca-trust/source/anchors/ 1
    $ sudo update-ca-trust
    1
    Replace <certificate> with the file name of the system certificate to restore.
    Note

    Always ensure the user ID and group ID are restored when the files are copied back, as well as the SELinux context.

3.4. Restoring a node host backup

After creating a backup of important node host files, if they become corrupted or accidentally removed, you can restore the file by copying back the file, ensuring it contains the proper content and restart the affected services.

Procedure
  1. Restore the /etc/origin/node/node-config.yaml file:

    # MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
    # cp /etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml.old
    # cp /backup/$(hostname)/$(date +%Y%m%d)/etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml
    # reboot
Warning

Restarting the services can lead to downtime. See Node maintenance, for tips on how to ease the process.

Note

Perform a full reboot of the affected instance to restore the iptables configuration.

  1. If you cannot restart OpenShift Container Platform because packages are missing, reinstall the packages.

    1. Get the list of the current installed packages:

      $ rpm -qa | sort > /tmp/current_packages.txt
    2. View the differences between the package lists:

      $ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
      
      > ansible-2.4.0.0-5.el7.noarch
    3. Reinstall the missing packages:

      # yum reinstall -y <packages> 1
      1
      Replace <packages> with the packages that are different between the package lists.
  2. Restore a system certificate by copying the certificate to the /etc/pki/ca-trust/source/anchors/ directory and execute update-ca-trust:

    $ MYBACKUPDIR=*/backup/$(hostname)/$(date +%Y%m%d)*
    $ sudo cp ${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors/<certificate> /etc/pki/ca-trust/source/anchors/
    $ sudo update-ca-trust
    Replace <certificate> with the file name of the system certificate to restore.
    Note

    Always ensure proper user ID and group ID are restored when the files are copied back, as well as the SELinux context.

3.5. Restoring etcd

3.5.1. Restoring the etcd configuration file

If an etcd host has become corrupted and the /etc/etcd/etcd.conf file is lost, restore it using the following procedure:

  1. Access your etcd host:

    $ ssh master-0 1
    1
    Replace master-0 with the name of your etcd host.
  2. Copy the backup etcd.conf file to /etc/etcd/:

    # cp /backup/etcd-config-<timestamp>/etcd/etcd.conf /etc/etcd/etcd.conf
  3. Set the required permissions and selinux context on the file:

    # restorecon -RvF /etc/etcd/etcd.conf

In this example, the backup file is stored in the /backup/etcd-config-<timestamp>/etcd/etcd.conf path where it can be used as an external NFS share, S3 bucket, or other storage solution.

After the etcd configuration file is restored, you must restart the static pod. This is done after you restore the etcd data.

3.5.2. Restoring etcd data

Before restoring etcd on a static pod:

  • etcdctl binaries must be available or, in containerized installations, the rhel7/etcd container must be available.

    You can install the etcdctl binary with the etcd package by running the following commands:

    # yum install etcd

    The package also installs the systemd service. Disable and mask the service so that it does not run as a systemd service when etcd runs in static pod. By disabling and masking the service, you ensure that you do not accidentally start it and prevent it from automatically restarting when you reboot the system.

    # systemctl disable etcd.service
    # systemctl mask etcd.service

To restore etcd on a static pod:

  1. If the pod is running, stop the etcd pod by moving the pod manifest YAML file to another directory:

    # mkdir -p /etc/origin/node/pods-stopped
    # mv /etc/origin/node/pods/etcd.yaml /etc/origin/node/pods-stopped
  2. Move all old data:

    # mv /var/lib/etcd /var/lib/etcd.old

    You use the etcdctl to recreate the data in the node where you restore the pod.

  3. Restore the etcd snapshot to the mount path for the etcd pod:

    # export ETCDCTL_API=3
    # etcdctl snapshot restore /etc/etcd/backup/etcd/snapshot.db \
    	 --data-dir /var/lib/etcd/ \
    	 --name ip-172-18-3-48.ec2.internal \
    	 --initial-cluster "ip-172-18-3-48.ec2.internal=https://172.18.3.48:2380" \
    	 --initial-cluster-token "etcd-cluster-1" \
    	 --initial-advertise-peer-urls https://172.18.3.48:2380 \
    	 --skip-hash-check=true

    Obtain the appropriate values for your cluster from your backup etcd.conf file.

  4. Set required permissions and selinux context on the data directory:

    # restorecon -RvF /var/lib/etcd/
  5. Restart the etcd pod by moving the pod manifest YAML file to the required directory:

    # mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/

3.6. Adding an etcd node

After you restore etcd, you can add more etcd nodes to the cluster. You can either add an etcd host by using an Ansible playbook or by manual steps.

3.6.1. Adding a new etcd host using Ansible

Procedure
  1. In the Ansible inventory file, create a new group named [new_etcd] and add the new host. Then, add the new_etcd group as a child of the [OSEv3] group:

    [OSEv3:children]
    masters
    nodes
    etcd
    new_etcd 1
    
    ... [OUTPUT ABBREVIATED] ...
    
    [etcd]
    master-0.example.com
    master-1.example.com
    master-2.example.com
    
    [new_etcd] 2
    etcd0.example.com 3
    1 2 3
    Add these lines.
    Note

    Replace the old etcd host entry with the new etcd host entry in the inventory file. While replacing the older etcd host, you must create a copy of /etc/etcd/ca/ directory. Alternatively, you can redeploy etcd ca and certs before scaling up the etcd hosts.

  2. From the host that installed OpenShift Container Platform and hosts the Ansible inventory file, change to the playbook directory and run the etcd scaleup playbook:

    $ cd /usr/share/ansible/openshift-ansible
    $ ansible-playbook  playbooks/openshift-etcd/scaleup.yml
  3. After the playbook runs, modify the inventory file to reflect the current status by moving the new etcd host from the [new_etcd] group to the [etcd] group:

    [OSEv3:children]
    masters
    nodes
    etcd
    new_etcd
    
    ... [OUTPUT ABBREVIATED] ...
    
    [etcd]
    master-0.example.com
    master-1.example.com
    master-2.example.com
    etcd0.example.com
  4. If you use Flannel, modify the flanneld service configuration on every OpenShift Container Platform host, located at /etc/sysconfig/flanneld, to include the new etcd host:

    FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,https://etcd0.example.com:2379
  5. Restart the flanneld service:

    # systemctl restart flanneld.service

3.6.2. Manually adding a new etcd host

If you do not run etcd as static pods on master nodes, you might need to add another etcd host.

Procedure
Modify the current etcd cluster

To create the etcd certificates, run the openssl command, replacing the values with those from your environment.

  1. Create some environment variables:

    export NEW_ETCD_HOSTNAME="*etcd0.example.com*"
    export NEW_ETCD_IP="192.168.55.21"
    
    export CN=$NEW_ETCD_HOSTNAME
    export SAN="IP:${NEW_ETCD_IP}, DNS:${NEW_ETCD_HOSTNAME}"
    export PREFIX="/etc/etcd/generated_certs/etcd-$CN/"
    export OPENSSLCFG="/etc/etcd/ca/openssl.cnf"
    Note

    The custom openssl extensions used as etcd_v3_ca_* include the $SAN environment variable as subjectAltName. See /etc/etcd/ca/openssl.cnf for more information.

  2. Create the directory to store the configuration and certificates:

    # mkdir -p ${PREFIX}
  3. Create the server certificate request and sign it: (server.csr and server.crt)

    # openssl req -new -config ${OPENSSLCFG} \
        -keyout ${PREFIX}server.key  \
        -out ${PREFIX}server.csr \
        -reqexts etcd_v3_req -batch -nodes \
        -subj /CN=$CN
    
    # openssl ca -name etcd_ca -config ${OPENSSLCFG} \
        -out ${PREFIX}server.crt \
        -in ${PREFIX}server.csr \
        -extensions etcd_v3_ca_server -batch
  4. Create the peer certificate request and sign it: (peer.csr and peer.crt)

    # openssl req -new -config ${OPENSSLCFG} \
        -keyout ${PREFIX}peer.key \
        -out ${PREFIX}peer.csr \
        -reqexts etcd_v3_req -batch -nodes \
        -subj /CN=$CN
    
    # openssl ca -name etcd_ca -config ${OPENSSLCFG} \
      -out ${PREFIX}peer.crt \
      -in ${PREFIX}peer.csr \
      -extensions etcd_v3_ca_peer -batch
  5. Copy the current etcd configuration and ca.crt files from the current node as examples to modify later:

    # cp /etc/etcd/etcd.conf ${PREFIX}
    # cp /etc/etcd/ca.crt ${PREFIX}
  6. While still on the surviving etcd host, add the new host to the cluster. To add additional etcd members to the cluster, you must first adjust the default localhost peer in the peerURLs value for the first member:

    1. Get the member ID for the first member using the member list command:

      # etcdctl --cert-file=/etc/etcd/peer.crt \
          --key-file=/etc/etcd/peer.key \
          --ca-file=/etc/etcd/ca.crt \
          --peers="https://172.18.1.18:2379,https://172.18.9.202:2379,https://172.18.0.75:2379" \ 1
          member list
      1
      Ensure that you specify the URLs of only active etcd members in the --peers parameter value.
    2. Obtain the IP address where etcd listens for cluster peers:

      $ ss -l4n | grep 2380
    3. Update the value of peerURLs using the etcdctl member update command by passing the member ID and IP address obtained from the previous steps:

      # etcdctl --cert-file=/etc/etcd/peer.crt \
          --key-file=/etc/etcd/peer.key \
          --ca-file=/etc/etcd/ca.crt \
          --peers="https://172.18.1.18:2379,https://172.18.9.202:2379,https://172.18.0.75:2379" \
          member update 511b7fb6cc0001 https://172.18.1.18:2380
    4. Re-run the member list command and ensure the peer URLs no longer include localhost.
  7. Add the new host to the etcd cluster. Note that the new host is not yet configured, so the status stays as unstarted until the you configure the new host.

    Warning

    You must add each member and bring it online one at a time. When you add each additional member to the cluster, you must adjust the peerURLs list for the current peers. The peerURLs list grows by one for each member added. The etcdctl member add command outputs the values that you must set in the etcd.conf file as you add each member, as described in the following instructions.

    # etcdctl -C https://${CURRENT_ETCD_HOST}:2379 \
      --ca-file=/etc/etcd/ca.crt     \
      --cert-file=/etc/etcd/peer.crt     \
      --key-file=/etc/etcd/peer.key member add ${NEW_ETCD_HOSTNAME} https://${NEW_ETCD_IP}:2380 1
    
    Added member named 10.3.9.222 with ID 4e1db163a21d7651 to cluster
    
    ETCD_NAME="<NEW_ETCD_HOSTNAME>"
    ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<CLUSTERMEMBER1_NAME>=https:/<CLUSTERMEMBER2_IP>:2380,<CLUSTERMEMBER2_NAME>=https:/<CLUSTERMEMBER2_IP>:2380,<CLUSTERMEMBER3_NAME>=https:/<CLUSTERMEMBER3_IP>:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    1
    In this line, 10.3.9.222 is a label for the etcd member. You can specify the host name, IP address, or a simple name.
  8. Update the sample ${PREFIX}/etcd.conf file.

    1. Replace the following values with the values generated in the previous step:

      • ETCD_NAME
      • ETCD_INITIAL_CLUSTER
      • ETCD_INITIAL_CLUSTER_STATE
    2. Modify the following variables with the new host IP from the output of the previous step. You can use ${NEW_ETCD_IP} as the value.

      ETCD_LISTEN_PEER_URLS
      ETCD_LISTEN_CLIENT_URLS
      ETCD_INITIAL_ADVERTISE_PEER_URLS
      ETCD_ADVERTISE_CLIENT_URLS
    3. If you previously used the member system as an etcd node, you must overwrite the current values in the /etc/etcd/etcd.conf file.
    4. Check the file for syntax errors or missing IP addresses, otherwise the etcd service might fail:

      # vi ${PREFIX}/etcd.conf
  9. On the node that hosts the installation files, update the [etcd] hosts group in the /etc/ansible/hosts inventory file. Remove the old etcd hosts and add the new ones.
  10. Create a tgz file that contains the certificates, the sample configuration file, and the ca and copy it to the new host:

    # tar -czvf /etc/etcd/generated_certs/${CN}.tgz -C ${PREFIX} .
    # scp /etc/etcd/generated_certs/${CN}.tgz ${CN}:/tmp/
Modify the new etcd host
  1. Install iptables-services to provide iptables utilities to open the required ports for etcd:

    # yum install -y iptables-services
  2. Create the OS_FIREWALL_ALLOW firewall rules to allow etcd to communicate:

    • Port 2379/tcp for clients
    • Port 2380/tcp for peer communication

      # systemctl enable iptables.service --now
      # iptables -N OS_FIREWALL_ALLOW
      # iptables -t filter -I INPUT -j OS_FIREWALL_ALLOW
      # iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2379 -j ACCEPT
      # iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2380 -j ACCEPT
      # iptables-save | tee /etc/sysconfig/iptables
      Note

      In this example, a new chain OS_FIREWALL_ALLOW is created, which is the standard naming the OpenShift Container Platform installer uses for firewall rules.

      Warning

      If the environment is hosted in an IaaS environment, modify the security groups for the instance to allow incoming traffic to those ports as well.

  3. Install etcd:

    # yum install -y etcd

    Ensure version etcd-2.3.7-4.el7.x86_64 or greater is installed,

  4. Ensure the etcd service is not running by removing the etcd pod definition:

    # mkdir -p /etc/origin/node/pods-stopped
    # mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
  5. Remove any etcd configuration and data:

    # rm -Rf /etc/etcd/*
    # rm -Rf /var/lib/etcd/*
  6. Extract the certificates and configuration files:

    # tar xzvf /tmp/etcd0.example.com.tgz -C /etc/etcd/
  7. Start etcd on the new host:

    # systemctl enable etcd --now
  8. Verify that the host is part of the cluster and the current cluster health:

    • If you use the v2 etcd api, run the following command:

      # etcdctl --cert-file=/etc/etcd/peer.crt \
                --key-file=/etc/etcd/peer.key \
                --ca-file=/etc/etcd/ca.crt \
                --peers="https://*master-0.example.com*:2379,\
                https://*master-1.example.com*:2379,\
                https://*master-2.example.com*:2379,\
                https://*etcd0.example.com*:2379"\
                cluster-health
      member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
      member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
      member 8b8904727bf526a5 is healthy: got healthy result from https://192.168.55.21:2379
      member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
      cluster is healthy
    • If you use the v3 etcd api, run the following command:

      # ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
                --key=/etc/etcd/peer.key \
                --cacert="/etc/etcd/ca.crt" \
                --endpoints="https://*master-0.example.com*:2379,\
                  https://*master-1.example.com*:2379,\
                  https://*master-2.example.com*:2379,\
                  https://*etcd0.example.com*:2379"\
                  endpoint health
      https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
      https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
      https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
      https://etcd0.example.com:2379 is healthy: successfully committed proposal: took = 1.498829ms
Modify each OpenShift Container Platform master
  1. Modify the master configuration in the etcClientInfo section of the /etc/origin/master/master-config.yaml file on every master. Add the new etcd host to the list of the etcd servers OpenShift Container Platform uses to store the data, and remove any failed etcd hosts:

    etcdClientInfo:
      ca: master.etcd-ca.crt
      certFile: master.etcd-client.crt
      keyFile: master.etcd-client.key
      urls:
        - https://master-0.example.com:2379
        - https://master-1.example.com:2379
        - https://master-2.example.com:2379
        - https://etcd0.example.com:2379
  2. Restart the master API service:

    • On every master:

      # master-restart api
      # master-restart controllers
      Warning

      The number of etcd nodes must be odd, so you must add at least two hosts.

  3. If you use Flannel, modify the flanneld service configuration located at /etc/sysconfig/flanneld on every OpenShift Container Platform host to include the new etcd host:

    FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,https://etcd0.example.com:2379
  4. Restart the flanneld service:

    # systemctl restart flanneld.service

3.7. Bringing OpenShift Container Platform services back online

After you finish your changes, bring OpenShift Container Platform back online.

Procedure
  1. On each OpenShift Container Platform master, restore your master and node configuration from backup and enable and restart all relevant services:

    # cp ${MYBACKUPDIR}/etc/origin/node/pods/* /etc/origin/node/pods/
    # cp ${MYBACKUPDIR}/etc/origin/master/master.env /etc/origin/master/master.env
    # cp ${MYBACKUPDIR}/etc/origin/master/master-config.yaml.<timestamp> /etc/origin/master/master-config.yaml
    # cp ${MYBACKUPDIR}/etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
    # cp ${MYBACKUPDIR}/etc/origin/master/scheduler.json.<timestamp> /etc/origin/master/scheduler.json
    # master-restart api
    # master-restart controllers
  2. On each OpenShift Container Platform node, update the node configuration maps as needed, and enable and restart the atomic-openshift-node service:

    # cp /etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
    # systemctl enable atomic-openshift-node
    # systemctl start atomic-openshift-node

3.8. Restoring a project

To restore a project, create the new project, then restore any exported files by running oc create -f <file_name>.

Procedure
  1. Create the project:

    $ oc new-project <project_name> 1
    1
    This <project_name> value must match the name of the project that was backed up.
  2. Import the project objects:

    $ oc create -f project.yaml
  3. Import any other resources that you exported when backing up the project, such as role bindings, secrets, service accounts, and persistent volume claims:

    $ oc create -f <object>.yaml

    Some resources might fail to import if they require another object to exist. If this occurs, review the error message to identify which resources must be imported first.

Warning

Some resources, such as pods and default service accounts, can fail to be created.

3.9. Restoring application data

You can restore application data by using the oc rsync command, assuming rsync is installed within the container image. The Red Hat rhel7 base image contains rsync. Therefore, all images that are based on rhel7 contain it as well. See Troubleshooting and Debugging CLI Operations - rsync.

Warning

This is a generic restoration of application data and does not take into account application-specific backup procedures, for example, special export and import procedures for database systems.

Other means of restoration might exist depending on the type of the persistent volume you use, for example, Cinder, NFS, or Gluster.

Procedure

Example of restoring a Jenkins deployment’s application data

  1. Verify the backup:

    $ ls -la /tmp/jenkins-backup/
    total 8
    drwxrwxr-x.  3 user     user   20 Sep  6 11:14 .
    drwxrwxrwt. 17 root     root 4096 Sep  6 11:16 ..
    drwxrwsrwx. 12 user     user 4096 Sep  6 11:14 jenkins
  2. Use the oc rsync tool to copy the data into the running pod:

    $ oc rsync /tmp/jenkins-backup/jenkins jenkins-1-37nux:/var/lib
    Note

    Depending on the application, you may be required to restart the application.

  3. Optionally, restart the application with new data:

    $ oc delete pod jenkins-1-37nux

    Alternatively, you can scale down the deployment to 0, and then up again:

    $ oc scale --replicas=0 dc/jenkins
    $ oc scale --replicas=1 dc/jenkins

3.10. Restoring Persistent Volume Claims

This topic describes two methods for restoring data. The first involves deleting the file, then placing the file back in the expected location. The second example shows migrating persistent volume claims. The migration would occur in the event that the storage needs to be moved or in a disaster scenario when the backend storage no longer exists.

Check with the restore procedures for the specific application on any steps required to restore data to the application.

3.10.1. Restoring files to an existing PVC

Procedure
  1. Delete the file:

    $ oc rsh demo-2-fxx6d
    sh-4.2$ ls */opt/app-root/src/uploaded/*
    lost+found  ocp_sop.txt
    sh-4.2$ *rm -rf /opt/app-root/src/uploaded/ocp_sop.txt*
    sh-4.2$ *ls /opt/app-root/src/uploaded/*
    lost+found
  2. Replace the file from the server that contains the rsync backup of the files that were in the pvc:

    $ oc rsync uploaded demo-2-fxx6d:/opt/app-root/src/
  3. Validate that the file is back on the pod by using oc rsh to connect to the pod and view the contents of the directory:

    $ oc rsh demo-2-fxx6d
    sh-4.2$ *ls /opt/app-root/src/uploaded/*
    lost+found  ocp_sop.txt

3.10.2. Restoring data to a new PVC

The following steps assume that a new pvc has been created.

Procedure
  1. Overwrite the currently defined claim-name:

    $ oc set volume dc/demo --add --name=persistent-volume \
    		--type=persistentVolumeClaim --claim-name=filestore \ --mount-path=/opt/app-root/src/uploaded --overwrite
  2. Validate that the pod is using the new PVC:

    $ oc describe dc/demo
    Name:		demo
    Namespace:	test
    Created:	3 hours ago
    Labels:		app=demo
    Annotations:	openshift.io/generated-by=OpenShiftNewApp
    Latest Version:	3
    Selector:	app=demo,deploymentconfig=demo
    Replicas:	1
    Triggers:	Config, Image(demo@latest, auto=true)
    Strategy:	Rolling
    Template:
      Labels:	app=demo
    		deploymentconfig=demo
      Annotations:	openshift.io/container.demo.image.entrypoint=["container-entrypoint","/bin/sh","-c","$STI_SCRIPTS_PATH/usage"]
    		openshift.io/generated-by=OpenShiftNewApp
      Containers:
       demo:
        Image:	docker-registry.default.svc:5000/test/demo@sha256:0a9f2487a0d95d51511e49d20dc9ff6f350436f935968b0c83fcb98a7a8c381a
        Port:	8080/TCP
        Volume Mounts:
          /opt/app-root/src/uploaded from persistent-volume (rw)
        Environment Variables:	<none>
      Volumes:
       persistent-volume:
        Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
        *ClaimName:	filestore*
        ReadOnly:	false
    ...omitted...
  3. Now that the deployment configuration uses the new pvc, run oc rsync to place the files onto the new pvc:

    $ oc rsync uploaded demo-3-2b8gs:/opt/app-root/src/
    sending incremental file list
    uploaded/
    uploaded/ocp_sop.txt
    uploaded/lost+found/
    
    sent 181 bytes  received 39 bytes  146.67 bytes/sec
    total size is 32  speedup is 0.15
  4. Validate that the file is back on the pod by using oc rsh to connect to the pod and view the contents of the directory:

    $ oc rsh demo-3-2b8gs
    sh-4.2$ ls /opt/app-root/src/uploaded/
    lost+found  ocp_sop.txt

Chapter 4. Replacing a master host

You can replace a failed master host.

First, remove the failed master host from your cluster, and then add a replacement master host. If the failed master host ran etcd, scale up etcd by adding etcd to the new master host.

Important

You must complete all sections of this topic.

4.1. Deprecating a master host

Master hosts run important services, such as the OpenShift Container Platform API and controllers services. In order to deprecate a master host, these services must be stopped.

The OpenShift Container Platform API service is an active/active service, so stopping the service does not affect the environment as long as the requests are sent to a separate master server. However, the OpenShift Container Platform controllers service is an active/passive service, where the services use etcd to decide the active master.

Deprecating a master host in a multi-master architecture includes removing the master from the load balancer pool to avoid new connections attempting to use that master. This process depends heavily on the load balancer used. The steps below show the details of removing the master from haproxy. In the event that OpenShift Container Platform is running on a cloud provider, or using a F5 appliance, see the specific product documents to remove the master from rotation.

Procedure
  1. Remove the backend section in the /etc/haproxy/haproxy.cfg configuration file. For example, if deprecating a master named master-0.example.com using haproxy, ensure the host name is removed from the following:

    backend mgmt8443
        balance source
        mode tcp
        # MASTERS 8443
        server master-1.example.com 192.168.55.12:8443 check
        server master-2.example.com 192.168.55.13:8443 check
  2. Then, restart the haproxy service.

    $ sudo systemctl restart haproxy
  3. Once the master is removed from the load balancer, disable the API and controller services by moving definition files out of the static pods dir /etc/origin/node/pods:

    # mkdir -p /etc/origin/node/pods/disabled
    # mv /etc/origin/node/pods/controller.yaml /etc/origin/node/pods/disabled/:
    +
  4. Because the master host is a schedulable OpenShift Container Platform node, follow the steps in the Deprecating a node host section.
  5. Remove the master host from the [masters] and [nodes] groups in the /etc/ansible/hosts Ansible inventory file to avoid issues if running any Ansible tasks using that inventory file.

    Warning

    Deprecating the first master host listed in the Ansible inventory file requires extra precautions.

    The /etc/origin/master/ca.serial.txt file is generated on only the first master listed in the Ansible host inventory. If you deprecate the first master host, copy the /etc/origin/master/ca.serial.txt file to the rest of master hosts before the process.

    Important

    In OpenShift Container Platform 3.11 clusters running multiple masters, one of the master nodes includes additional CA certificates in /etc/origin/master, /etc/etcd/ca, and /etc/etcd/generated_certs. These are required for application node and etcd node scale-up operations and must be restored on another master node if the CA host master is being deprecated.

  6. The kubernetes service includes the master host IPs as endpoints. To verify that the master has been properly deprecated, review the kubernetes service output and see if the deprecated master has been removed:

    $ oc describe svc kubernetes -n default
    Name:			kubernetes
    Namespace:		default
    Labels:			component=apiserver
    			provider=kubernetes
    Annotations:		<none>
    Selector:		<none>
    Type:			ClusterIP
    IP:			10.111.0.1
    Port:			https	443/TCP
    Endpoints:		192.168.55.12:8443,192.168.55.13:8443
    Port:			dns	53/UDP
    Endpoints:		192.168.55.12:8053,192.168.55.13:8053
    Port:			dns-tcp	53/TCP
    Endpoints:		192.168.55.12:8053,192.168.55.13:8053
    Session Affinity:	ClientIP
    Events:			<none>

    After the master has been successfully deprecated, the host where the master was previously running can be safely deleted.

4.2. Adding hosts

You can add new hosts to your cluster by running the scaleup.yml playbook. This playbook queries the master, generates and distributes new certificates for the new hosts, and then runs the configuration playbooks on only the new hosts. Before running the scaleup.yml playbook, complete all prerequisite host preparation steps.

Important

The scaleup.yml playbook configures only the new host. It does not update NO_PROXY in master services, and it does not restart master services.

You must have an existing inventory file, for example /etc/ansible/hosts, that is representative of your current cluster configuration in order to run the scaleup.yml playbook. If you previously used the atomic-openshift-installer command to run your installation, you can check ~/.config/openshift/hosts for the last inventory file that the installer generated and use that file as your inventory file. You can modify this file as required. You must then specify the file location with -i when you run the ansible-playbook.

Important

See the cluster maximums section for the recommended maximum number of nodes.

Procedure
  1. Ensure you have the latest playbooks by updating the openshift-ansible package:

    # yum update openshift-ansible
  2. Edit your /etc/ansible/hosts file and add new_<host_type> to the [OSEv3:children] section. For example, to add a new node host, add new_nodes:

    [OSEv3:children]
    masters
    nodes
    new_nodes

    To add new master hosts, add new_masters.

  3. Create a [new_<host_type>] section to specify host information for the new hosts. Format this section like an existing section, as shown in the following example of adding a new node:

    [nodes]
    master[1:3].example.com
    node1.example.com openshift_node_group_name='node-config-compute'
    node2.example.com openshift_node_group_name='node-config-compute'
    infra-node1.example.com openshift_node_group_name='node-config-infra'
    infra-node2.example.com openshift_node_group_name='node-config-infra'
    
    [new_nodes]
    node3.example.com openshift_node_group_name='node-config-infra'

    See Configuring Host Variables for more options.

    When adding new masters, add hosts to both the [new_masters] section and the [new_nodes] section to ensure that the new master host is part of the OpenShift SDN:

    [masters]
    master[1:2].example.com
    
    [new_masters]
    master3.example.com
    
    [nodes]
    master[1:2].example.com
    node1.example.com openshift_node_group_name='node-config-compute'
    node2.example.com openshift_node_group_name='node-config-compute'
    infra-node1.example.com openshift_node_group_name='node-config-infra'
    infra-node2.example.com openshift_node_group_name='node-config-infra'
    
    [new_nodes]
    master3.example.com
    Important

    If you label a master host with the node-role.kubernetes.io/infra=true label and have no other dedicated infrastructure nodes, you must also explicitly mark the host as schedulable by adding openshift_schedulable=true to the entry. Otherwise, the registry and router pods cannot be placed anywhere.

  4. Change to the playbook directory and run the openshift_node_group.yml playbook. If your inventory file is located somewhere other than the default of /etc/ansible/hosts, specify the location with the -i option:

    $ cd /usr/share/ansible/openshift-ansible
    $ ansible-playbook [-i /path/to/file] \
      playbooks/openshift-master/openshift_node_group.yml

    This creates the ConfigMap for the new node groups, and ultimately, the configuration file of the node on the host.

    Note

    Running the openshift_node_group.yaml playbook only updates new nodes. It cannot be run to update existing nodes in a cluster.

  5. Run the scaleup.yml playbook. If your inventory file is located somewhere other than the default of /etc/ansible/hosts, specify the location with the -i option.

    • For additional nodes:

      $ ansible-playbook [-i /path/to/file] \
          playbooks/openshift-node/scaleup.yml
    • For additional masters:

      $ ansible-playbook [-i /path/to/file] \
          playbooks/openshift-master/scaleup.yml
  6. Set the node label to logging-infra-fluentd=true, if you deployed the EFK stack in your cluster:

    # oc label node/new-node.example.com logging-infra-fluentd=true
  7. After the playbook runs, verify the installation.
  8. Move any hosts that you defined in the [new_<host_type>] section to their appropriate section. By moving these hosts, subsequent playbook runs that use this inventory file treat the nodes correctly. You can keep the empty [new_<host_type>] section. For example, when adding new nodes:

    [nodes]
    master[1:3].example.com
    node1.example.com openshift_node_group_name='node-config-compute'
    node2.example.com openshift_node_group_name='node-config-compute'
    node3.example.com openshift_node_group_name='node-config-compute'
    infra-node1.example.com openshift_node_group_name='node-config-infra'
    infra-node2.example.com openshift_node_group_name='node-config-infra'
    
    [new_nodes]

4.3. Scaling etcd

You can scale the etcd cluster vertically by adding more resources to the etcd hosts or horizontally by adding more etcd hosts.

Note

Due to the voting system etcd uses, the cluster must always contain an odd number of members.

Having a cluster with an odd number of etcd hosts can account for fault tolerance. Having an odd number of etcd hosts does not change the number needed for a quorum but increases the tolerance for failure. For example, with a cluster of three members, quorum is two, which leaves a failure tolerance of one. This ensures the cluster continues to operate if two of the members are healthy.

Having an in-production cluster of three etcd hosts is recommended.

The new host requires a fresh Red Hat Enterprise Linux version 7 dedicated host. The etcd storage should be located on an SSD disk to achieve maximum performance and on a dedicated disk mounted in /var/lib/etcd.

Prerequisites
  1. Before you add a new etcd host, perform a backup of both etcd configuration and data to prevent data loss.
  2. Check the current etcd cluster status to avoid adding new hosts to an unhealthy cluster. Run this command:

    # ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
              --key=/etc/etcd/peer.key \
              --cacert="/etc/etcd/ca.crt" \
              --endpoints="https://*master-0.example.com*:2379,\
                https://*master-1.example.com*:2379,\
                https://*master-2.example.com*:2379"
                endpoint health
    https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
    https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
    https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
  3. Before running the scaleup playbook, ensure the new host is registered to the proper Red Hat software channels:

    # subscription-manager register \
        --username=*<username>* --password=*<password>*
    # subscription-manager attach --pool=*<poolid>*
    # subscription-manager repos --disable="*"
    # subscription-manager repos \
        --enable=rhel-7-server-rpms \
        --enable=rhel-7-server-extras-rpms

    etcd is hosted in the rhel-7-server-extras-rpms software channel.

  4. Make sure all unused etcd members are removed from the etcd cluster. This must be completed before running the scaleup playbook.

    1. List the etcd members:

      # etcdctl --cert="/etc/etcd/peer.crt" --key="/etc/etcd/peer.key" \
        --cacert="/etc/etcd/ca.crt" --endpoints=ETCD_LISTEN_CLIENT_URLS member list -w table

      Copy the unused etcd member ID, if applicable.

    2. Remove the unused member by specifying its ID in the following command:

      # etcdctl --cert="/etc/etcd/peer.crt" --key="/etc/etcd/peer.key" \
        --cacert="/etc/etcd/ca.crt" --endpoints=ETCD_LISTEN_CLIENT_URL member remove UNUSED_ETCD_MEMBER_ID
  5. Upgrade etcd and iptables on the current etcd nodes:

    # yum update etcd iptables-services
  6. Back up the /etc/etcd configuration for the etcd hosts.
  7. If the new etcd members will also be OpenShift Container Platform nodes, add the desired number of hosts to the cluster.
  8. The rest of this procedure assumes you added one host, but if you add multiple hosts, perform all steps on each host.

4.3.1. Adding a new etcd host using Ansible

Procedure
  1. In the Ansible inventory file, create a new group named [new_etcd] and add the new host. Then, add the new_etcd group as a child of the [OSEv3] group:

    [OSEv3:children]
    masters
    nodes
    etcd
    new_etcd 1
    
    ... [OUTPUT ABBREVIATED] ...
    
    [etcd]
    master-0.example.com
    master-1.example.com
    master-2.example.com
    
    [new_etcd] 2
    etcd0.example.com 3
    1 2 3
    Add these lines.
    Note

    Replace the old etcd host entry with the new etcd host entry in the inventory file. While replacing the older etcd host, you must create a copy of /etc/etcd/ca/ directory. Alternatively, you can redeploy etcd ca and certs before scaling up the etcd hosts.

  2. From the host that installed OpenShift Container Platform and hosts the Ansible inventory file, change to the playbook directory and run the etcd scaleup playbook:

    $ cd /usr/share/ansible/openshift-ansible
    $ ansible-playbook  playbooks/openshift-etcd/scaleup.yml
  3. After the playbook runs, modify the inventory file to reflect the current status by moving the new etcd host from the [new_etcd] group to the [etcd] group:

    [OSEv3:children]
    masters
    nodes
    etcd
    new_etcd
    
    ... [OUTPUT ABBREVIATED] ...
    
    [etcd]
    master-0.example.com
    master-1.example.com
    master-2.example.com
    etcd0.example.com
  4. If you use Flannel, modify the flanneld service configuration on every OpenShift Container Platform host, located at /etc/sysconfig/flanneld, to include the new etcd host:

    FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,https://etcd0.example.com:2379
  5. Restart the flanneld service:

    # systemctl restart flanneld.service

4.3.2. Manually adding a new etcd host

If you do not run etcd as static pods on master nodes, you might need to add another etcd host.

Procedure
Modify the current etcd cluster

To create the etcd certificates, run the openssl command, replacing the values with those from your environment.

  1. Create some environment variables:

    export NEW_ETCD_HOSTNAME="*etcd0.example.com*"
    export NEW_ETCD_IP="192.168.55.21"
    
    export CN=$NEW_ETCD_HOSTNAME
    export SAN="IP:${NEW_ETCD_IP}, DNS:${NEW_ETCD_HOSTNAME}"
    export PREFIX="/etc/etcd/generated_certs/etcd-$CN/"
    export OPENSSLCFG="/etc/etcd/ca/openssl.cnf"
    Note

    The custom openssl extensions used as etcd_v3_ca_* include the $SAN environment variable as subjectAltName. See /etc/etcd/ca/openssl.cnf for more information.

  2. Create the directory to store the configuration and certificates:

    # mkdir -p ${PREFIX}
  3. Create the server certificate request and sign it: (server.csr and server.crt)

    # openssl req -new -config ${OPENSSLCFG} \
        -keyout ${PREFIX}server.key  \
        -out ${PREFIX}server.csr \
        -reqexts etcd_v3_req -batch -nodes \
        -subj /CN=$CN
    
    # openssl ca -name etcd_ca -config ${OPENSSLCFG} \
        -out ${PREFIX}server.crt \
        -in ${PREFIX}server.csr \
        -extensions etcd_v3_ca_server -batch
  4. Create the peer certificate request and sign it: (peer.csr and peer.crt)

    # openssl req -new -config ${OPENSSLCFG} \
        -keyout ${PREFIX}peer.key \
        -out ${PREFIX}peer.csr \
        -reqexts etcd_v3_req -batch -nodes \
        -subj /CN=$CN
    
    # openssl ca -name etcd_ca -config ${OPENSSLCFG} \
      -out ${PREFIX}peer.crt \
      -in ${PREFIX}peer.csr \
      -extensions etcd_v3_ca_peer -batch
  5. Copy the current etcd configuration and ca.crt files from the current node as examples to modify later:

    # cp /etc/etcd/etcd.conf ${PREFIX}
    # cp /etc/etcd/ca.crt ${PREFIX}
  6. While still on the surviving etcd host, add the new host to the cluster. To add additional etcd members to the cluster, you must first adjust the default localhost peer in the peerURLs value for the first member:

    1. Get the member ID for the first member using the member list command:

      # etcdctl --cert-file=/etc/etcd/peer.crt \
          --key-file=/etc/etcd/peer.key \
          --ca-file=/etc/etcd/ca.crt \
          --peers="https://172.18.1.18:2379,https://172.18.9.202:2379,https://172.18.0.75:2379" \ 1
          member list
      1
      Ensure that you specify the URLs of only active etcd members in the --peers parameter value.
    2. Obtain the IP address where etcd listens for cluster peers:

      $ ss -l4n | grep 2380
    3. Update the value of peerURLs using the etcdctl member update command by passing the member ID and IP address obtained from the previous steps:

      # etcdctl --cert-file=/etc/etcd/peer.crt \
          --key-file=/etc/etcd/peer.key \
          --ca-file=/etc/etcd/ca.crt \
          --peers="https://172.18.1.18:2379,https://172.18.9.202:2379,https://172.18.0.75:2379" \
          member update 511b7fb6cc0001 https://172.18.1.18:2380
    4. Re-run the member list command and ensure the peer URLs no longer include localhost.
  7. Add the new host to the etcd cluster. Note that the new host is not yet configured, so the status stays as unstarted until the you configure the new host.

    Warning

    You must add each member and bring it online one at a time. When you add each additional member to the cluster, you must adjust the peerURLs list for the current peers. The peerURLs list grows by one for each member added. The etcdctl member add command outputs the values that you must set in the etcd.conf file as you add each member, as described in the following instructions.

    # etcdctl -C https://${CURRENT_ETCD_HOST}:2379 \
      --ca-file=/etc/etcd/ca.crt     \
      --cert-file=/etc/etcd/peer.crt     \
      --key-file=/etc/etcd/peer.key member add ${NEW_ETCD_HOSTNAME} https://${NEW_ETCD_IP}:2380 1
    
    Added member named 10.3.9.222 with ID 4e1db163a21d7651 to cluster
    
    ETCD_NAME="<NEW_ETCD_HOSTNAME>"
    ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<CLUSTERMEMBER1_NAME>=https:/<CLUSTERMEMBER2_IP>:2380,<CLUSTERMEMBER2_NAME>=https:/<CLUSTERMEMBER2_IP>:2380,<CLUSTERMEMBER3_NAME>=https:/<CLUSTERMEMBER3_IP>:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    1
    In this line, 10.3.9.222 is a label for the etcd member. You can specify the host name, IP address, or a simple name.
  8. Update the sample ${PREFIX}/etcd.conf file.

    1. Replace the following values with the values generated in the previous step:

      • ETCD_NAME
      • ETCD_INITIAL_CLUSTER
      • ETCD_INITIAL_CLUSTER_STATE
    2. Modify the following variables with the new host IP from the output of the previous step. You can use ${NEW_ETCD_IP} as the value.

      ETCD_LISTEN_PEER_URLS
      ETCD_LISTEN_CLIENT_URLS
      ETCD_INITIAL_ADVERTISE_PEER_URLS
      ETCD_ADVERTISE_CLIENT_URLS
    3. If you previously used the member system as an etcd node, you must overwrite the current values in the /etc/etcd/etcd.conf file.
    4. Check the file for syntax errors or missing IP addresses, otherwise the etcd service might fail:

      # vi ${PREFIX}/etcd.conf
  9. On the node that hosts the installation files, update the [etcd] hosts group in the /etc/ansible/hosts inventory file. Remove the old etcd hosts and add the new ones.
  10. Create a tgz file that contains the certificates, the sample configuration file, and the ca and copy it to the new host:

    # tar -czvf /etc/etcd/generated_certs/${CN}.tgz -C ${PREFIX} .
    # scp /etc/etcd/generated_certs/${CN}.tgz ${CN}:/tmp/
Modify the new etcd host
  1. Install iptables-services to provide iptables utilities to open the required ports for etcd:

    # yum install -y iptables-services
  2. Create the OS_FIREWALL_ALLOW firewall rules to allow etcd to communicate:

    • Port 2379/tcp for clients
    • Port 2380/tcp for peer communication

      # systemctl enable iptables.service --now
      # iptables -N OS_FIREWALL_ALLOW
      # iptables -t filter -I INPUT -j OS_FIREWALL_ALLOW
      # iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2379 -j ACCEPT
      # iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2380 -j ACCEPT
      # iptables-save | tee /etc/sysconfig/iptables
      Note

      In this example, a new chain OS_FIREWALL_ALLOW is created, which is the standard naming the OpenShift Container Platform installer uses for firewall rules.

      Warning

      If the environment is hosted in an IaaS environment, modify the security groups for the instance to allow incoming traffic to those ports as well.

  3. Install etcd:

    # yum install -y etcd

    Ensure version etcd-2.3.7-4.el7.x86_64 or greater is installed,

  4. Ensure the etcd service is not running by removing the etcd pod definition:

    # mkdir -p /etc/origin/node/pods-stopped
    # mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
  5. Remove any etcd configuration and data:

    # rm -Rf /etc/etcd/*
    # rm -Rf /var/lib/etcd/*
  6. Extract the certificates and configuration files:

    # tar xzvf /tmp/etcd0.example.com.tgz -C /etc/etcd/
  7. Start etcd on the new host:

    # systemctl enable etcd --now
  8. Verify that the host is part of the cluster and the current cluster health:

    • If you use the v2 etcd api, run the following command:

      # etcdctl --cert-file=/etc/etcd/peer.crt \
                --key-file=/etc/etcd/peer.key \
                --ca-file=/etc/etcd/ca.crt \
                --peers="https://*master-0.example.com*:2379,\
                https://*master-1.example.com*:2379,\
                https://*master-2.example.com*:2379,\
                https://*etcd0.example.com*:2379"\
                cluster-health
      member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
      member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
      member 8b8904727bf526a5 is healthy: got healthy result from https://192.168.55.21:2379
      member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
      cluster is healthy
    • If you use the v3 etcd api, run the following command:

      # ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
                --key=/etc/etcd/peer.key \
                --cacert="/etc/etcd/ca.crt" \
                --endpoints="https://*master-0.example.com*:2379,\
                  https://*master-1.example.com*:2379,\
                  https://*master-2.example.com*:2379,\
                  https://*etcd0.example.com*:2379"\
                  endpoint health
      https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
      https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
      https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
      https://etcd0.example.com:2379 is healthy: successfully committed proposal: took = 1.498829ms
Modify each OpenShift Container Platform master
  1. Modify the master configuration in the etcClientInfo section of the /etc/origin/master/master-config.yaml file on every master. Add the new etcd host to the list of the etcd servers OpenShift Container Platform uses to store the data, and remove any failed etcd hosts:

    etcdClientInfo:
      ca: master.etcd-ca.crt
      certFile: master.etcd-client.crt
      keyFile: master.etcd-client.key
      urls:
        - https://master-0.example.com:2379
        - https://master-1.example.com:2379
        - https://master-2.example.com:2379
        - https://etcd0.example.com:2379
  2. Restart the master API service:

    • On every master:

      # master-restart api
      # master-restart controllers
      Warning

      The number of etcd nodes must be odd, so you must add at least two hosts.

  3. If you use Flannel, modify the flanneld service configuration located at /etc/sysconfig/flanneld on every OpenShift Container Platform host to include the new etcd host:

    FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,https://etcd0.example.com:2379
  4. Restart the flanneld service:

    # systemctl restart flanneld.service

Chapter 5. Managing Users

5.1. Overview

A user is an entity that interacts with the OpenShift Container Platform API. These can be a developer for developing applications or an administrator for managing the cluster. Users can be assigned to groups, which set the permissions applied to all the group’s members. For example, you can give API access to a group, which give all members of the group API access.

This topic describes the management of user accounts, including how new user accounts are created in OpenShift Container Platform and how they can be deleted.

5.2. Creating a User

The process for creating a user depends on the configured identity provider. By default, OpenShift Container Platform uses the DenyAll identity provider, which denies access for all user names and passwords.

The following process creates a new user, then adds a role to the user:

  1. Create the user account depending on your identity provider. This can depend on the mappingmethod used as part of the identity provider configuration.
  2. Give the new user the desired role:

    # oc create clusterrolebinding <clusterrolebinding_name> \
      --clusterrole=<role> --user=<user>

    Where the --clusterrole option is the desired cluster role. For example, to give the new user cluster-admin privileges, which gives the user access to everything within a cluster:

    # oc create clusterrolebinding registry-controller \
      --clusterrole=cluster-admin --user=admin

    For an explanation and list of roles, see the Cluster Roles and Local Roles section of the Architecture Guide.

As a cluster administrator, you can also manage the access level of each user.

Note

Depending on the identity provider, and on the defined group structure, some roles may be given to users automatically. See the Synching groups with LDAP section for more information.

5.3. Viewing User and Identity Lists

OpenShift Container Platform user configuration is stored in several locations within OpenShift Container Platform. Regardless of the identity provider, OpenShift Container Platform internally stores details like role-based access control (RBAC) information and group membership. To completely remove user information, this data must be removed in addition to the user account.

In OpenShift Container Platform, two object types contain user data outside the identification provider: user and identity.

To get the current list of users:

$ oc get user
NAME      UID                                    FULL NAME   IDENTITIES
demo     75e4b80c-dbf1-11e5-8dc6-0e81e52cc949               htpasswd_auth:demo

To get the current list of identities:

$ oc get identity
NAME                  IDP NAME        IDP USER NAME   USER NAME   USER UID
htpasswd_auth:demo    htpasswd_auth   demo            demo        75e4b80c-dbf1-11e5-8dc6-0e81e52cc949

Note the matching UID between the two object types. If you attempt to change the authentication provider after starting to use OpenShift Container Platform, the user names that overlap will not work because of the entries in the identity list, which will still point to the old authentication method.

5.4. Creating Groups

While a user is an entity making requests to OpenShift Container Platform, users can be organized into one or more groups made up from a set of users. Groups are useful for managing many users at one time, such as for authorization policies, or to grant permissions to multiple users at once.

If your organization is using LDAP, you can synchronize any LDAP records to OpenShift Container Platform so that you can configure groups on one place. This presumes that information about your users is in an LDAP server. See the Synching groups with LDAP section for more information.

If you are not using LDAP, you can use the following procedure to manually create groups.

To create a new group:

# oc adm groups new <group_name> <user1> <user2>

For example, to create the west groups and in it place the john and betty users:

# oc adm groups new west john betty

To verify that the group has been created, and list the users associated with the group, run the following:

# oc get groups
NAME      USERS
west      john, betty

Next steps:

5.5. Managing User and Group Labels

To add a label to a user or group:

$ oc label user/<user_name> <label_name>=<label_value>

For example, if the user name is theuser and the label is level=gold:

$ oc label user/theuser level=gold

To remove the label:

$ oc label user/<user_name> <label_name>-

To show labels for a user or group:

$ oc describe user/<user_name>

5.6. Deleting a User

To delete a user:

  1. Delete the user record:

    $ oc delete user demo
    user "demo" deleted
  2. Delete the user identity.

    The identity of the user is related to the identification provider you use. Get the provider name from the user record in oc get user.

    In this example, the identity provider name is htpasswd_auth. The command is:

    # oc delete identity htpasswd_auth:demo
    identity "htpasswd_auth:demo" deleted

    If you skip this step, the user will not be able to log in again.

After you complete these steps, a new account will be created in OpenShift Container Platform when the user logs in again.

If your intention is to prevent the user from being able to log in again (for example, if an employee has left the company and you want to permanently delete the account), you can also remove the user from your authentication back end (like htpasswd, kerberos, or others) for the configured identity provider.

For example, if you are using htpasswd, delete the entry in the htpasswd file that is configured for OpenShift Container Platform with the user name and password.

For external identification management like Lightweight Directory Access Protocol (LDAP) or Red Hat Identity Management (IdM), use the user management tools to remove the user entry.

Chapter 6. Managing Projects

6.1. Overview

In OpenShift Container Platform, projects are used to group and isolate related objects. As an administrator, you can give developers access to certain projects, allow them to create their own, and give them administrative rights within individual projects.

6.2. Self-provisioning Projects

You can allow developers to create their own projects. There is an endpoint that will provision a project according to a template. The web console and oc new-project command use this endpoint when a developer creates a new project.

6.2.1. Modifying the Template for New Projects

The API server automatically provisions projects based on the template that is identified by the projectRequestTemplate parameter of the master-config.yaml file. If the parameter is not defined, the API server creates a default template that creates a project with the requested name, and assigns the requesting user to the "admin" role for that project.

To create your own custom project template:

  1. Start with the current default project template:

    $ oc adm create-bootstrap-project-template -o yaml > template.yaml
  2. Use a text editor to modify the template.yaml file by adding objects or modifying existing objects.
  3. Load the template:

    $ oc create -f template.yaml -n default
  4. Modify the master-config.yaml file to reference the loaded template:

    ...
    projectConfig:
      projectRequestTemplate: "default/project-request"
      ...

When a project request is submitted, the API substitutes the following parameters into the template:

ParameterDescription

PROJECT_NAME

The name of the project. Required.

PROJECT_DISPLAYNAME

The display name of the project. May be empty.

PROJECT_DESCRIPTION

The description of the project. May be empty.

PROJECT_ADMIN_USER

The username of the administrating user.

PROJECT_REQUESTING_USER

The username of the requesting user.

Access to the API is granted to developers with the self-provisioner role and the self-provisioners cluster role binding. This role is available to all authenticated developers by default.

6.2.2. Disabling Self-provisioning

You can prevent an authenticated user group from self-provisioning new projects.

  1. Log in as a user with cluster-admin privileges.
  2. Review the self-provisionersclusterrolebinding usage. Run the following command, then review the subjects in the self-provisioners section.

    $ oc  describe clusterrolebinding.rbac self-provisioners
    
    Name:		self-provisioners
    Labels:		<none>
    Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
    Role:
      Kind:	ClusterRole
      Name:	self-provisioner
    Subjects:
      Kind	Name				Namespace
      ----	----				---------
      Group	system:authenticated:oauth
  3. Remove the self-provisioner cluster role from the group system:authenticated:oauth.

    • If the self-provisioners cluster role binding binds only the self-provisioner role to the system:authenticated:oauth group, run the following command:

      $ oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}'
    • If the self-provisioners clusterrolebinding binds the self-provisioner role to more users, groups, or serviceaccounts than the system:authenticated:oauth group, run the following command:

      $ oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth
  4. Set the projectRequestMessage parameter value in the master-config.yaml file to instruct developers how to request a new project. This parameter value is a string that will be presented to a user in the web console and command line when the user attempts to self-provision a project. You might use one of the following messages:

    • To request a project, contact your system administrator at projectname@example.com.
    • To request a new project, fill out the project request form located at https://internal.example.com/openshift-project-request.

    Example YAML file

    ...
    projectConfig:
      ProjectRequestMessage: "message"
      ...

  5. Edit the self-provisioners cluster role binding to prevent automatic updates to the role. Automatic updates reset the cluster roles to the default state.

    • To update the role binding from the command line:

      1. Run the following command:

        $ oc edit clusterrolebinding.rbac self-provisioners
      2. In the displayed role binding, set the rbac.authorization.kubernetes.io/autoupdate parameter value to false, as shown in the following example:

        apiVersion: authorization.openshift.io/v1
        kind: ClusterRoleBinding
        metadata:
          annotations:
            rbac.authorization.kubernetes.io/autoupdate: "false"
        ...
    • To update the role binding by using a single command:

      $ oc patch clusterrolebinding.rbac self-provisioners -p '{ "metadata": { "annotations": { "rbac.authorization.kubernetes.io/autoupdate": "false" } } }'

6.3. Using Node Selectors

Node selectors are used in conjunction with labeled nodes to control pod placement.

Note

6.3.1. Setting the Cluster-wide Default Node Selector

As a cluster administrator, you can set the cluster-wide default node selector to restrict pod placement to specific nodes.

Edit the master configuration file at /etc/origin/master/master-config.yaml and add a value for a default node selector. This is applied to the pods created in all projects without a specified nodeSelector value:

...
projectConfig:
  defaultNodeSelector: "type=user-node,region=east"
...

Restart the OpenShift service for the changes to take effect:

# master-restart api
# master-restart controllers

6.3.2. Setting the Project-wide Node Selector

To create an individual project with a node selector, use the --node-selector option when creating a project. For example, if you have an OpenShift Container Platform topology with multiple regions, you can use a node selector to restrict specific OpenShift Container Platform projects to only deploy pods onto nodes in a specific region.

The following creates a new project named myproject and dictates that pods be deployed onto nodes labeled user-node and east:

$ oc adm new-project myproject \
    --node-selector='type=user-node,region=east'

Once this command is run, this becomes the administrator-set node selector for all pods contained in the specified project.

Note

While the new-project subcommand is available for both oc adm and oc, the cluster administrator and developer commands respectively, creating a new project with a node selector is only available with the oc adm command. The new-project subcommand is not available to project developers when self-provisioning projects.

Using the oc adm new-project command adds an annotation section to the project. You can edit a project, and change the openshift.io/node-selector value to override the default:

...
metadata:
  annotations:
    openshift.io/node-selector: type=user-node,region=east
...

You can also override the default value for an existing project namespace by using the following command:

# oc patch namespace myproject -p \
    '{"metadata":{"annotations":{"openshift.io/node-selector":"node-role.kubernetes.io/infra=true"}}}'

If openshift.io/node-selector is set to an empty string (oc adm new-project --node-selector=""), the project will not have an administrator-set node selector, even if the cluster-wide default has been set. This means that, as a cluster administrator, you can set a default to restrict developer projects to a subset of nodes and still enable infrastructure or other projects to schedule the entire cluster.

6.3.3. Developer-specified Node Selectors

OpenShift Container Platform developers can set a node selector on their pod configuration if they wish to restrict nodes even further. This will be in addition to the project node selector, meaning that you can still dictate node selector values for all projects that have a node selector value.

For example, if a project has been created with the above annotation (openshift.io/node-selector: type=user-node,region=east) and a developer sets another node selector on a pod in that project, for example clearance=classified, the pod will only ever be scheduled on nodes that have all three labels (type=user-node, region=east, and clearance=classified). If they set region=west on a pod, their pods would be demanding nodes with labels region=east and region=west, which cannot work. The pods will never be scheduled, because labels can only be set to one value.

6.4. Limiting Number of Self-Provisioned Projects Per User

The number of self-provisioned projects requested by a given user can be limited with the ProjectRequestLimitadmission control plug-in.

Important

If your project request template was created in OpenShift Container Platform 3.1 or earlier using the process described in Modifying the Template for New Projects, then the generated template does not include the annotation openshift.io/requester: ${PROJECT_REQUESTING_USER}, which is used for the ProjectRequestLimitConfig. You must add the annotation.

In order to specify limits for users, a configuration must be specified for the plug-in within the master configuration file at /etc/origin/master/master-config.yaml. The plug-in configuration takes a list of user label selectors and the associated maximum project requests.

Selectors are evaluated in order. The first one matching the current user will be used to determine the maximum number of projects. If a selector is not specified, a limit applies to all users. If a maximum number of projects is not specified, then an unlimited number of projects are allowed for a specific selector.

The following configuration sets a global limit of 2 projects per user while allowing 10 projects for users with a label of level=advanced and unlimited projects for users with a label of level=admin.

admissionConfig:
  pluginConfig:
    ProjectRequestLimit:
      configuration:
        apiVersion: v1
        kind: ProjectRequestLimitConfig
        limits:
        - selector:
            level: admin 1
        - selector:
            level: advanced 2
          maxProjects: 10
        - maxProjects: 2 3
1
For selector level=admin, no maxProjects is specified. This means that users with this label will not have a maximum of project requests.
2
For selector level=advanced, a maximum number of 10 projects will be allowed.
3
For the third entry, no selector is specified. This means that it will be applied to any user that doesn’t satisfy the previous two rules. Because rules are evaluated in order, this rule should be specified last.
Note

Managing User and Group Labels provides further guidance on how to add, remove, or show labels for users and groups.

Once your changes are made, restart OpenShift Container Platform for the changes to take effect.

# master-restart api
# master-restart controllers

6.5. Enabling and Limiting Self-Provisioned Projects Per Service Account

By default, service accounts cannot create projects. However, administrators can enable this capability per service account, and the number of self-provisioned projects requested by any given service account can be limited with the ProjectRequestLimitadmission control plug-in.

Note

If service accounts are allowed to create projects, you cannot trust any labels placed on them because project editors can manipulate those labels.

  1. Create a service account in the project, if it is does not exist:

    $ oc create sa <sa_name>
  2. As a user with cluster-admin privileges, add the self-provisioner cluster role to the service account:

    $ oc adm policy \
        add-cluster-role-to-user self-provisioner \
        system:serviceaccount:<project>:<sa_name>
  3. Edit the master configuration file at /etc/origin/master/master-config.yaml and set the maxProjectsForServiceAccounts parameter value in the ProjectRequestLimit section to the maximum number of projects any given self-provisioner-enabled service account can create.

    For example, the following configuration sets a global limit of three projects per service account:

    admissionConfig:
      pluginConfig:
        ProjectRequestLimit:
          configuration:
            apiVersion: v1
            kind: ProjectRequestLimitConfig
            maxProjectsForServiceAccounts: 3
  4. After you save the changes, restart OpenShift Container Platform for the changes to take effect:

    # master-restart api
    # master-restart controllers
  5. Verify that your changes have been applied by logging in as the service account and creating a new project.

    1. Log in as the service account by using its token:

      $ oc login --token <token>
    2. Create a new project:

      $ oc new-project <project_name>

Chapter 7. Managing Pods

7.1. Overview

This topic describes the management of pods, including limiting their run-once duration, and how much bandwidth they can use.

7.2. Viewing Pods

You can display usage statistics about pods, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.

To view the usage statistics:

$ oc adm top pods
NAME                         CPU(cores)   MEMORY(bytes)
hawkular-cassandra-1-pqx6l   219m         1240Mi
hawkular-metrics-rddnv       20m          1765Mi
heapster-n94r4               3m           37Mi

To view the usage statistics for pods with labels:

$ oc adm top pod --selector=''

You must choose the selector (label query) to filter on. Supports =, ==, and !=.

Note

You must have cluster-reader permission to view the usage statistics.

Note

The metrics-server must be installed to view the usage statistics. See Requirements for Using Horizontal Pod Autoscalers.

7.3. Limiting Run-once Pod Duration

OpenShift Container Platform relies on run-once pods to perform tasks such as deploying a pod or performing a build. Run-once pods are pods that have a RestartPolicy of Never or OnFailure.

The cluster administrator can use the RunOnceDuration admission control plug-in to force a limit on the time that those run-once pods can be active. Once the time limit expires, the cluster will try to actively terminate those pods. The main reason to have such a limit is to prevent tasks such as builds to run for an excessive amount of time.

7.3.1. Configuring the RunOnceDuration Plug-in

The plug-in configuration should include the default active deadline for run-once pods. This deadline is enforced globally, but can be superseded on a per-project basis.

admissionConfig:
  pluginConfig:
    RunOnceDuration:
      configuration:
        apiVersion: v1
        kind: RunOnceDurationConfig
        activeDeadlineSecondsOverride: 3600 1
....
1
Specify the global default for run-once pods in seconds.

7.3.2. Specifying a Custom Duration per Project

In addition to specifying a global maximum duration for run-once pods, an administrator can add an annotation (openshift.io/active-deadline-seconds-override) to a specific project to override the global default.

  • For a new project, define the annotation in the project specification .yaml file.

    apiVersion: v1
    kind: Project
    metadata:
      annotations:
        openshift.io/active-deadline-seconds-override: "1000" 1
      name: myproject
    1
    Overrides the default active deadline seconds for run-once pods to 1000 seconds. Note that the value of the override must be specified in string form.
  • For an existing project,

    • Run oc edit and add the openshift.io/active-deadline-seconds-override: 1000 annotation in the editor.

      $ oc edit namespace <project-name>

      Or

    • Use the oc patch command:

      $ oc patch namespace <project_name> -p '{"metadata":{"annotations":{"openshift.io/active-deadline-seconds-override":"1000"}}}'
7.3.2.1. Deploying an Egress Router Pod

Example 7.1. Example Pod Definition for an Egress Router

apiVersion: v1
kind: Pod
metadata:
  name: egress-1
  labels:
    name: egress-1
  annotations:
    pod.network.openshift.io/assign-macvlan: "true"
spec:
  containers:
  - name: egress-router
    image: openshift3/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE 1
      value: 192.168.12.99
    - name: EGRESS_GATEWAY 2
      value: 192.168.12.1
    - name: EGRESS_DESTINATION 3
      value: 203.0.113.25
  nodeSelector:
    site: springfield-1 4
1
IP address on the node subnet reserved by the cluster administrator for use by this pod.
2
Same value as the default gateway used by the node itself.
3
Connections to the pod are redirected to 203.0.113.25, with a source IP address of 192.168.12.99
4
The pod will only be deployed to nodes with the label site springfield-1.

The pod.network.openshift.io/assign-macvlan annotation creates a Macvlan network interface on the primary network interface, and then moves it into the pod’s network name space before starting the egress-router container.

Note

Preserve the quotation marks around "true". Omitting them will result in errors.

The pod contains a single container, using the openshift3/ose-egress-router image, and that container is run privileged so that it can configure the Macvlan interface and set up iptables rules.

The environment variables tell the egress-router image what addresses to use; it will configure the Macvlan interface to use EGRESS_SOURCE as its IP address, with EGRESS_GATEWAY as its gateway.

NAT rules are set up so that connections to any TCP or UDP port on the pod’s cluster IP address are redirected to the same port on EGRESS_DESTINATION.

If only some of the nodes in your cluster are capable of claiming the specified source IP address and using the specified gateway, you can specify a nodeName or nodeSelector indicating which nodes are acceptable.

7.3.2.2. Deploying an Egress Router Service

Though not strictly necessary, you normally want to create a service pointing to the egress router:

apiVersion: v1
kind: Service
metadata:
  name: egress-1
spec:
  ports:
  - name: http
    port: 80
  - name: https
    port: 443
  type: ClusterIP
  selector:
    name: egress-1

Your pods can now connect to this service. Their connections are redirected to the corresponding ports on the external server, using the reserved egress IP address.

7.3.3. Limiting Pod Access with Egress Firewall

As an OpenShift Container Platform cluster administrator, you can use egress policy to limit the external addresses that some or all pods can access from within the cluster, so that:

  • A pod can only talk to internal hosts, and cannot initiate connections to the public Internet.

    Or,

  • A pod can only talk to the public Internet, and cannot initiate connections to internal hosts (outside the cluster).

    Or,

  • A pod cannot reach specified internal subnets/hosts that it should have no reason to contact.

For example, you can configure projects with different egress policies, allowing <project A> access to a specified IP range, but denying the same access to <project B>.

Caution

You must have the ovs-multitenant plug-in enabled in order to limit pod access via egress policy.

Project administrators can neither create EgressNetworkPolicy objects, nor edit the ones you create in their project. There are also several other restrictions on where EgressNetworkPolicy can be created:

  1. The default project (and any other project that has been made global via oc adm pod-network make-projects-global) cannot have egress policy.
  2. If you merge two projects together (via oc adm pod-network join-projects), then you cannot use egress policy in any of the joined projects.
  3. No project may have more than one egress policy object.

Violating any of these restrictions will result in broken egress policy for the project, and may cause all external network traffic to be dropped.

7.3.3.1. Configuring Pod Access Limits

To configure pod access limits, you must use the oc command or the REST API. You can use oc [create|replace|delete] to manipulate EgressNetworkPolicy objects. The api/swagger-spec/oapi-v1.json file has API-level details on how the objects actually work.

To configure pod access limits:

  1. Navigate to the project you want to affect.
  2. Create a JSON file for the pod limit policy:

    # oc create -f <policy>.json
  3. Configure the JSON file with policy details. For example:

    {
        "kind": "EgressNetworkPolicy",
        "apiVersion": "v1",
        "metadata": {
            "name": "default"
        },
        "spec": {
            "egress": [
                {
                    "type": "Allow",
                    "to": {
                        "cidrSelector": "1.2.3.0/24"
                    }
                },
                {
                    "type": "Allow",
                    "to": {
                        "dnsName": "www.foo.com"
                    }
                },
                {
                    "type": "Deny",
                    "to": {
                        "cidrSelector": "0.0.0.0/0"
                    }
                }
            ]
        }
    }

    When the example above is added in a project, it allows traffic to IP range 1.2.3.0/24 and domain name www.foo.com, but denies access to all other external IP addresses. (Traffic to other pods is not affected because the policy only applies to external traffic.)

    The rules in an EgressNetworkPolicy are checked in order, and the first one that matches takes effect. If the three rules in the above example were reversed, then traffic would not be allowed to 1.2.3.0/24 and www.foo.com because the 0.0.0.0/0 rule would be checked first, and it would match and deny all traffic.

    Domain name updates are reflected within 30 minutes. In the above example, suppose www.foo.com resolved to 10.11.12.13, but later it was changed to 20.21.22.23. Then, OpenShift Container Platform will take up to 30 minutes to adapt to these DNS updates.

7.4. Limiting the Bandwidth Available to Pods

You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. The limits you place on a pod do not affect the bandwidth of other pods.

To limit the bandwidth on a pod:

  1. Write an object definition JSON file, and specify the data traffic speed using kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations. For example, to limit both pod egress and ingress bandwidth to 10M/s:

    Limited Pod Object Definition

    {
        "kind": "Pod",
        "spec": {
            "containers": [
                {
                    "image": "openshift/hello-openshift",
                    "name": "hello-openshift"
                }
            ]
        },
        "apiVersion": "v1",
        "metadata": {
            "name": "iperf-slow",
            "annotations": {
                "kubernetes.io/ingress-bandwidth": "10M",
                "kubernetes.io/egress-bandwidth": "10M"
            }
        }
    }

  2. Create the pod using the object definition:

    $ oc create -f <file_or_dir_path>

7.5. Setting Pod Disruption Budgets

A pod disruption budget is part of the Kubernetes API, which can be managed with oc commands like other object types. They allow the specification of safety constraints on pods during operations, such as draining a node for maintenance.

PodDisruptionBudget is an API object that specifies the minimum number or percentage of replicas that must be up at a time. Setting these in projects can be helpful during node maintenance (such as scaling a cluster down or a cluster upgrade) and is only honored on voluntary evictions (not on node failures).

A PodDisruptionBudget object’s configuration consists of the following key parts:

  • A label selector, which is a label query over a set of pods.
  • An availability level, which specifies the minimum number of pods that must be available simultaneously.

The following is an example of a PodDisruptionBudget resource:

apiVersion: policy/v1beta1 1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  selector:  2
    matchLabels:
      foo: bar
  minAvailable: 2  3
1
PodDisruptionBudget is part of the policy/v1beta1 API group.
2
A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.
3
The minimum number of pods that must be available simultaneously. This can be either an integer or a string specifying a percentage (for example, 20%).

If you created a YAML file with the above object definition, you could add it to project with the following:

$ oc create -f </path/to/file> -n <project_name>

You can check for pod disruption budgets across all projects with the following:

$ oc get poddisruptionbudget --all-namespaces

NAMESPACE         NAME          MIN-AVAILABLE   SELECTOR
another-project   another-pdb   4               bar=foo
test-project      my-pdb        2               foo=bar

The PodDisruptionBudget is considered healthy when there are at least minAvailable pods running in the system. Every pod above that limit can be evicted.

Note

Depending on your pod priority and preemption settings, lower-priority pods might be removed despite their pod disruption budget requirements.

7.6. Configuring Critical Pods

There are a number of core components, such as DNS, that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster may stop working properly if a critical add-on is evicted. You can make a pod critical by adding the scheduler.alpha.kubernetes.io/critical-pod annotation to the pod specification so that the descheduler will not remove these pods.

spec:
  template:
    metadata:
      name: critical-pod
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: "true"

Chapter 8. Managing Networking

8.1. Overview

This topic describes the management of the overall cluster network, including project isolation and outbound traffic control.

Pod-level networking features, such as per-pod bandwidth limits, are discussed in Managing Pods.

8.2. Managing Pod Networks

When your cluster is configured to use the ovs-multitenant SDN plug-in, you can manage the separate pod overlay networks for projects using the administrator CLI. See the Configuring the SDN section for plug-in configuration steps, if necessary.

8.2.1. Joining Project Networks

To join projects to an existing project network:

$ oc adm pod-network join-projects --to=<project1> <project2> <project3>

In the above example, all the pods and services in <project2> and <project3> can now access any pods and services in <project1> and vice versa. Services can be accessed either by IP or fully qualified DNS name (<service>.<pod_namespace>.svc.cluster.local). For example, to access a service named db in a project myproject, use db.myproject.svc.cluster.local.

Alternatively, instead of specifying specific project names, you can use the --selector=<project_selector> option.

To verify the networks you have joined together:

$ oc get netnamespaces

Then look at the NETID column. Projects in the same pod-network will have the same NetID.

8.3. Isolating Project Networks

To isolate the project network in the cluster and vice versa, run:

$ oc adm pod-network isolate-projects <project1> <project2>

In the above example, all of the pods and services in <project1> and <project2> can not access any pods and services from other non-global projects in the cluster and vice versa.

Alternatively, instead of specifying specific project names, you can use the --selector=<project_selector> option.

8.3.1. Making Project Networks Global

To allow projects to access all pods and services in the cluster and vice versa:

$ oc adm pod-network make-projects-global <project1> <project2>

In the above example, all the pods and services in <project1> and <project2> can now access any pods and services in the cluster and vice versa.

Alternatively, instead of specifying specific project names, you can use the --selector=<project_selector> option.

8.4. Disabling Host Name Collision Prevention For Routes and Ingress Objects

In OpenShift Container Platform, host name collision prevention for routes and ingress objects is enabled by default. This means that users without the cluster-admin role can set the host name in a route or ingress object only on creation and cannot change it afterwards. However, you can relax this restriction on routes and ingress objects for some or all users.

Warning

Because OpenShift Container Platform uses the object creation timestamp to determine the oldest route or ingress object for a given host name, a route or ingress object can hijack a host name of a newer route if the older route changes its host name, or if an ingress object is introduced.

As an OpenShift Container Platform cluster administrator, you can edit the host name in a route even after creation. You can also create a role to allow specific users to do so:

$ oc create clusterrole route-editor --verb=update --resource=routes.route.openshift.io/custom-host

You can then bind the new role to a user:

$ oc adm policy add-cluster-role-to-user route-editor user

You can also disable host name collision prevention for ingress objects. Doing so lets users without the cluster-admin role edit a host name for ingress objects after creation. This is useful to OpenShift Container Platform installations that depend upon Kubernetes behavior, including allowing the host names in ingress objects be edited.

  1. Add the following to the master.yaml file:

    admissionConfig:
      pluginConfig:
        openshift.io/IngressAdmission:
          configuration:
            apiVersion: v1
            allowHostnameChanges: true
            kind: IngressAdmissionConfig
          location: ""
  2. Restart the master services for the changes to take effect:

    $ master-restart api
    $ master-restart controllers

8.5. Controlling Egress Traffic

As a cluster administrator you can allocate a number of static IP addresses to a specific node at the host level. If an application developer needs a dedicated IP address for their application service, they can request one during the process they use to ask for firewall access. They can then deploy an egress router from the developer’s project, using a nodeSelector in the deployment configuration to ensure that the pod lands on the host with the pre-allocated static IP address.

The egress pod’s deployment declares one of the source IPs, the destination IP of the protected service, and a gateway IP to reach the destination. After the pod is deployed, you can create a service to access the egress router pod, then add that source IP to the corporate firewall. The developer then has access information to the egress router service that was created in their project, for example, service.project.cluster.domainname.com.

When the developer needs to access the external, firewalled service, they can call out to the egress router pod’s service (service.project.cluster.domainname.com) in their application (for example, the JDBC connection information) rather than the actual protected service URL.

You can also assign static IP addresses to projects, ensuring that all outgoing external connections from the specified project have recognizable origins. This is different from the default egress router, which is used to send traffic to specific destinations.

See the Enabling Fixed IPs for External Project Traffic section for more information.

As an OpenShift Container Platform cluster administrator, you can control egress traffic in these ways:

Firewall
Using an egress firewall allows you to enforce the acceptable outbound traffic policies, so that specific endpoints or IP ranges (subnets) are the only acceptable targets for the dynamic endpoints (pods within OpenShift Container Platform) to talk to.
Router
Using an egress router allows you to create identifiable services to send traffic to certain destinations, ensuring those external destinations treat traffic as though it were coming from a known source. This helps with security, because it allows you to secure an external database so that only specific pods in a namespace can talk to a service (the egress router), which proxies the traffic to your database.
iptables
In addition to the above OpenShift Container Platform-internal solutions, it is also possible to create iptables rules that will be applied to outgoing traffic. These rules allow for more possibilities than the egress firewall, but cannot be limited to particular projects.

8.6. Using an Egress Firewall to Limit Access to External Resources

As an OpenShift Container Platform cluster administrator, you can use egress firewall policy to limit the external IP addresses that some or all pods can access from within the cluster. Egress firewall policy supports the following scenarios:

  • A pod can only connect to internal hosts, and cannot initiate connections to the public Internet.
  • A pod can only connect to the public Internet, and cannot initiate connections to internal hosts that are outside the OpenShift Container Platform cluster.
  • A pod cannot reach specified internal subnets or hosts that should be unreachable.

Egress policies can be set by specifying an IP address range in CIDR format or by specifying a DNS name. For example, you can allow <project_A> access to a specified IP range but deny the same access to <project_B>. Alternatively, you can restrict application developers from updating from (Python) pip mirrors, and force updates to only come from approved sources.

Caution

You must have the ovs-multitenant or ovs-networkpolicy plug-in enabled in order to limit pod access via egress policy.

If you are using the ovs-multitenant plug-in, egress policy is compatible with only one policy per project, and will not work with projects that share a network, such as global projects.

Project administrators can neither create EgressNetworkPolicy objects, nor edit the ones you create in their project. There are also several other restrictions on where EgressNetworkPolicy can be created:

  • The default project (and any other project that has been made global via oc adm pod-network make-projects-global) cannot have egress policy.
  • If you merge two projects together (via oc adm pod-network join-projects), then you cannot use egress policy in any of the joined projects.
  • No project may have more than one egress policy object.

Violating any of these restrictions results in broken egress policy for the project, and may cause all external network traffic to be dropped.

Use the oc command or the REST API to configure egress policy. You can use oc [create|replace|delete] to manipulate EgressNetworkPolicy objects. The api/swagger-spec/oapi-v1.json file has API-level details on how the objects actually work.

To configure egress policy:

  1. Navigate to the project you want to affect.
  2. Create a JSON file with the policy configuration you want to use, as in the following example:

    {
        "kind": "EgressNetworkPolicy",
        "apiVersion": "v1",
        "metadata": {
            "name": "default"
        },
        "spec": {
            "egress": [
                {
                    "type": "Allow",
                    "to": {
                        "cidrSelector": "1.2.3.0/24"
                    }
                },
                {
                    "type": "Allow",
                    "to": {
                        "dnsName": "www.foo.com"
                    }
                },
                {
                    "type": "Deny",
                    "to": {
                        "cidrSelector": "0.0.0.0/0"
                    }
                }
            ]
        }
    }

    When the example above is added to a project, it allows traffic to IP range 1.2.3.0/24 and domain name www.foo.com, but denies access to all other external IP addresses. Traffic to other pods is not affected because the policy only applies to external traffic.

    The rules in an EgressNetworkPolicy are checked in order, and the first one that matches takes effect. If the three rules in the above example were reversed, then traffic would not be allowed to 1.2.3.0/24 and www.foo.com because the 0.0.0.0/0 rule would be checked first, and it would match and deny all traffic.

    Domain name updates are polled based on the TTL (time to live) value of the domain returned by the local non-authoritative servers. The pod should also resolve the domain from the same local nameservers when necessary, otherwise the IP addresses for the domain perceived by the egress network policy controller and the pod will be different, and the egress network policy may not be enforced as expected. Since egress network policy controller and pod are asynchronously polling the same local nameserver, there could be a race condition where pod may get the updated IP before the egress controller. Due to this current limitation, domain name usage in EgressNetworkPolicy is only recommended for domains with infrequent IP address changes.

    Note

    The egress firewall always allows pods access to the external interface of the node the pod is on for DNS resolution. If your DNS resolution is not handled by something on the local node, then you will need to add egress firewall rules allowing access to the DNS server’s IP addresses if you are using domain names in your pods.

  3. Use the JSON file to create an EgressNetworkPolicy object:

    $ oc create -f <policy>.json
Caution

Exposing services by creating routes will ignore EgressNetworkPolicy. Egress network policy service endpoint filtering is done at the node kubeproxy. When the router is involved, kubeproxy is bypassed and egress network policy enforcement is not applied. Administrators can prevent this bypass by limiting access to create routes.

8.6.1. Using an Egress Router to Allow External Resources to Recognize Pod Traffic

The OpenShift Container Platform egress router runs a service that redirects traffic to a specified remote server, using a private source IP address that is not used for anything else. The service allows pods to talk to servers that are set up to only allow access from whitelisted IP addresses.

Important

The egress router is not intended for every outgoing connection. Creating large numbers of egress routers can push the limits of your network hardware. For example, creating an egress router for every project or application could exceed the number of local MAC addresses that the network interface can handle before falling back to filtering MAC addresses in software.

Important

Currently, the egress router is not compatible with Amazon AWS, Azure Cloud, or any other cloud platform that does not support layer 2 manipulations due to their incompatibility with macvlan traffic.

Deployment Considerations

The Egress router adds a second IP address and MAC address to the node’s primary network interface. If you are not running OpenShift Container Platform on bare metal, you may need to configure your hypervisor or cloud provider to allow the additional address.

Red Hat OpenStack Platform

If you are deploying OpenShift Container Platform on Red Hat OpenStack Platform, you need to whitelist the IP and MAC addresses on your OpenStack environment, otherwise communication will fail:

neutron port-update $neutron_port_uuid \
  --allowed_address_pairs list=true \
  type=dict mac_address=<mac_address>,ip_address=<ip_address>
Red Hat Enterprise Virtualization
If you are using Red Hat Enterprise Virtualization, you should set EnableMACAntiSpoofingFilterRules to false.
VMware vSphere
If you are using VMware vSphere, see the VMWare documentation for securing vSphere standard switches. View and change VMWare vSphere default settings by selecting the host’s virtual switch from the vSphere Web Client.

Specifically, ensure that the following are enabled:

Egress Router Modes

The egress router can run in three different modes: redirect mode, HTTP proxy mode and DNS proxy mode. Redirect mode works for all services except for HTTP and HTTPS. For HTTP and HTTPS services, use HTTP proxy mode. For TCP-based services with IP addresses or domain names, use DNS proxy mode.

8.6.1.1. Deploying an Egress Router Pod in Redirect Mode

In redirect mode, the egress router sets up iptables rules to redirect traffic from its own IP address to one or more destination IP addresses. Client pods that want to make use of the reserved source IP address must be modified to connect to the egress router rather than connecting directly to the destination IP.

  1. Create a pod configuration using the following:

    apiVersion: v1
    kind: Pod
    metadata:
      name: egress-1
      labels:
        name: egress-1
      annotations:
        pod.network.openshift.io/assign-macvlan: "true" 1
    spec:
      initContainers:
      - name: egress-router
        image: registry.redhat.io/openshift3/ose-egress-router
        securityContext:
          privileged: true
        env:
        - name: EGRESS_SOURCE 2
          value: 192.168.12.99/24
        - name: EGRESS_GATEWAY 3
          value: 192.168.12.1
        - name: EGRESS_DESTINATION 4
          value: 203.0.113.25
        - name: EGRESS_ROUTER_MODE 5
          value: init
      containers:
      - name: egress-router-wait
        image: registry.redhat.io/openshift3/ose-pod
      nodeSelector:
        site: springfield-1 6
    1
    Creates a Macvlan network interface on the primary network interface, and moves it into the pod’s network project before starting the egress-router container. Preserve the quotation marks around "true". Omitting them results in errors. To create the Macvlan interface on a network interface other than the primary one, set the annotation value to the name of that interface. For example, eth1.
    2
    IP address from the physical network that the node is on and is reserved by the cluster administrator for use by this pod. Optionally, you can include the subnet length, the /24 suffix, so that a proper route to the local subnet can be set up. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
    3
    Same value as the default gateway used by the node.
    4
    The external server to direct traffic to. Using this example, connections to the pod are redirected to 203.0.113.25, with a source IP address of 192.168.12.99.
    5
    This tells the egress router image that it is being deployed as an "init container". Previous versions of OpenShift Container Platform (and the egress router image) did not support this mode and had to be run as an ordinary container.
    6
    The pod is only deployed to nodes with the label site=springfield-1.
  2. Create the pod using the above definition:

    $ oc create -f <pod_name>.json

    To check to see if the pod has been created:

    $ oc get pod <pod_name>
  3. Ensure other pods can find the pod’s IP address by creating a service to point to the egress router:

    apiVersion: v1
    kind: Service
    metadata:
      name: egress-1
    spec:
      ports:
      - name: http
        port: 80
      - name: https
        port: 443
      type: ClusterIP
      selector:
        name: egress-1

    Your pods can now connect to this service. Their connections are redirected to the corresponding ports on the external server, using the reserved egress IP address.

The egress router setup is performed by an "init container" created from the openshift3/ose-egress-router image, and that container is run privileged so that it can configure the Macvlan interface and set up iptables rules. After it finishes setting up the iptables rules, it exits and the openshift3/ose-pod container will run (doing nothing) until the pod is killed.

The environment variables tell the egress-router image what addresses to use; it will configure the Macvlan interface to use EGRESS_SOURCE as its IP address, with EGRESS_GATEWAY as its gateway.

NAT rules are set up so that connections to any TCP or UDP port on the pod’s cluster IP address are redirected to the same port on EGRESS_DESTINATION.

If only some of the nodes in your cluster are capable of claiming the specified source IP address and using the specified gateway, you can specify a nodeName or nodeSelector indicating which nodes are acceptable.

8.6.1.2. Redirecting to Multiple Destinations

In the previous example, connections to the egress pod (or its corresponding service) on any port are redirected to a single destination IP. You can also configure different destination IPs depending on the port:

apiVersion: v1
kind: Pod
metadata:
  name: egress-multi
  labels:
    name: egress-multi
  annotations:
    pod.network.openshift.io/assign-macvlan: "true"
spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift3/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE 1
      value: 192.168.12.99/24
    - name: EGRESS_GATEWAY
      value: 192.168.12.1
    - name: EGRESS_DESTINATION 2
      value: |
        80   tcp 203.0.113.25
        8080 tcp 203.0.113.26 80
        8443 tcp 203.0.113.26 443
        203.0.113.27
    - name: EGRESS_ROUTER_MODE
      value: init
  containers:
  - name: egress-router-wait
    image: registry.redhat.io/openshift3/ose-pod
1
IP address from the physical network that the node is on and is reserved by the cluster administrator for use by this pod. Optionally, you can include the subnet length, the /24 suffix, so that a proper route to the local subnet can be set up. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
2
EGRESS_DESTINATION uses YAML syntax for its values, and can be a multi-line string. See the following for more information.

Each line of EGRESS_DESTINATION can be one of three types:

  • <port> <protocol> <IP_address> - This says that incoming connections to the given <port> should be redirected to the same port on the given <IP_address>. <protocol> is either tcp or udp. In the example above, the first line redirects traffic from local port 80 to port 80 on 203.0.113.25.
  • <port> <protocol> <IP_address> <remote_port> - As above, except that the connection is redirected to a different <remote_port> on <IP_address>. In the example above, the second and third lines redirect local ports 8080 and 8443 to remote ports 80 and 443 on 203.0.113.26.
  • <fallback_IP_address> - If the last line of EGRESS_DESTINATION is a single IP address, then any connections on any other port will be redirected to the corresponding port on that IP address (eg, 203.0.113.27 in the example above). If there is no fallback IP address then connections on other ports would simply be rejected.)
8.6.1.3. Using a ConfigMap to specify EGRESS_DESTINATION

For a large or frequently-changing set of destination mappings, you can use a ConfigMap to externally maintain the list, and have the egress router pod read it from there. This comes with the advantage of project administrators being able to edit the ConfigMap, whereas they may not be able to edit the Pod definition directly, because it contains a privileged container.

  1. Create a file containing the EGRESS_DESTINATION data:

    $ cat my-egress-destination.txt
    # Egress routes for Project "Test", version 3
    
    80   tcp 203.0.113.25
    
    8080 tcp 203.0.113.26 80
    8443 tcp 203.0.113.26 443
    
    # Fallback
    203.0.113.27

    Note that you can put blank lines and comments into this file

  2. Create a ConfigMap object from the file:

    $ oc delete configmap egress-routes --ignore-not-found
    $ oc create configmap egress-routes \
      --from-file=destination=my-egress-destination.txt

    Here egress-routes is the name of the ConfigMap object being created and my-egress-destination.txt is the name of the file the data is being read from.

  3. Create a egress router pod definition as above, but specifying the ConfigMap for EGRESS_DESTINATION in the environment section:

        ...
        env:
        - name: EGRESS_SOURCE 1
          value: 192.168.12.99/24
        - name: EGRESS_GATEWAY
          value: 192.168.12.1
        - name: EGRESS_DESTINATION
          valueFrom:
            configMapKeyRef:
              name: egress-routes
              key: destination
        - name: EGRESS_ROUTER_MODE
          value: init
        ...
    1
    IP address from the physical network that the node is on and is reserved by the cluster administrator for use by this pod. Optionally, you can include the subnet length, the /24 suffix, so that a proper route to the local subnet can be set up. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
Note

The egress router does not automatically update when the ConfigMap changes. Restart the pod to get updates.

8.6.1.4. Deploying an Egress Router HTTP Proxy Pod

In HTTP proxy mode, the egress router runs as an HTTP proxy on port 8080. This only works for clients talking to HTTP or HTTPS-based services, but usually requires fewer changes to the client pods to get them to work. Programs can be told to use an HTTP proxy by setting an environment variable.

  1. Create the pod using the following as an example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: egress-http-proxy
      labels:
        name: egress-http-proxy
      annotations:
        pod.network.openshift.io/assign-macvlan: "true" 1
    spec:
      initContainers:
      - name: egress-router-setup
        image: registry.redhat.io/openshift3/ose-egress-router
        securityContext:
          privileged: true
        env:
        - name: EGRESS_SOURCE 2
          value: 192.168.12.99/24
        - name: EGRESS_GATEWAY 3
          value: 192.168.12.1
        - name: EGRESS_ROUTER_MODE 4
          value: http-proxy
      containers:
      - name: egress-router-proxy
        image: registry.redhat.io/openshift3/ose-egress-http-proxy
        env:
        - name: EGRESS_HTTP_PROXY_DESTINATION 5
          value: |
            !*.example.com
            !192.168.1.0/24
            *
    1
    Creates a Macvlan network interface on the primary network interface, then moves it into the pod’s network project before starting the egress-router container. Preserve the quotation marks around "true". Omitting them results in errors.
    2
    IP address from the physical network that the node is on and is reserved by the cluster administrator for use by this pod. Optionally, you can include the subnet length, the /24 suffix, so that a proper route to the local subnet can be set up. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
    3
    Same value as the default gateway used by the node itself.
    4
    This tells the egress router image that it is being deployed as part of an HTTP proxy, and so it should not set up iptables redirecting rules.
    5
    A string or YAML multi-line string specifying how to configure the proxy. Note that this is specified as an environment variable in the HTTP proxy container, not with the other environment variables in the init container.

    You can specify any of the following for the EGRESS_HTTP_PROXY_DESTINATION value. You can also use *, meaning "allow connections to all remote destinations". Each line in the configuration specifies one group of connections to allow or deny:

    • An IP address (eg, 192.168.1.1) allows connections to that IP address.
    • A CIDR range (eg, 192.168.1.0/24) allows connections to that CIDR range.
    • A host name (eg, www.example.com) allows proxying to that host.
    • A domain name preceded by *. (eg, *.example.com) allows proxying to that domain and all of its subdomains.
    • A ! followed by any of the above denies connections rather than allowing them
    • If the last line is *, then anything that hasn’t been denied will be allowed. Otherwise, anything that hasn’t been allowed will be denied.
  2. Ensure other pods can find the pod’s IP address by creating a service to point to the egress router:

    apiVersion: v1
    kind: Service
    metadata:
      name: egress-1
    spec:
      ports:
      - name: http-proxy
        port: 8080 1
      type: ClusterIP
      selector:
        name: egress-1
    1
    Ensure the http port is always set to 8080.
  3. Configure the client pod (not the egress proxy pod) to use the HTTP proxy by setting the http_proxy or https_proxy variables:

        ...
        env:
        - name: http_proxy
          value: http://egress-1:8080/ 1
        - name: https_proxy
          value: http://egress-1:8080/
        ...
    1
    The service created in step 2.
    Note

    Using the http_proxy and https_proxy environment variables is not necessary for all setups. If the above does not create a working setup, then consult the documentation for the tool or software you are running in the pod.

You can also specify the EGRESS_HTTP_PROXY_DESTINATION using a ConfigMap, similarly to the redirecting egress router example above.

8.6.1.5. Deploying an Egress Router DNS Proxy Pod

In DNS proxy mode, the egress router runs as a DNS proxy for TCP-based services from its own IP address to one or more destination IP addresses. Client pods that want to make use of the reserved, source IP address must be modified to connect to the egress router rather than connecting directly to the destination IP. This ensures that external destinations treat traffic as though it were coming from a known source.

  1. Create the pod using the following as an example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: egress-dns-proxy
      labels:
        name: egress-dns-proxy
      annotations:
        pod.network.openshift.io/assign-macvlan: "true" 1
    spec:
      initContainers:
      - name: egress-router-setup
        image: registry.redhat.io/openshift3/ose-egress-router
        securityContext:
          privileged: true
        env:
        - name: EGRESS_SOURCE 2
          value: 192.168.12.99/24
        - name: EGRESS_GATEWAY 3
          value: 192.168.12.1
        - name: EGRESS_ROUTER_MODE 4
          value: dns-proxy
      containers:
      - name: egress-dns-proxy
        image: registry.redhat.io/openshift3/ose-egress-dns-proxy
        env:
        - name: EGRESS_DNS_PROXY_DEBUG 5
          value: "1"
        - name: EGRESS_DNS_PROXY_DESTINATION 6
          value: |
            # Egress routes for Project "Foo", version 5
    
            80  203.0.113.25
    
            100 example.com
    
            8080 203.0.113.26 80
    
            8443 foobar.com 443
    1
    Using pod.network.openshift.io/assign-macvlan annotation creates a Macvlan network interface on the primary network interface, then moves it into the pod’s network name space before starting the egress-router-setup container. Preserve the quotation marks around "true". Omitting them results in errors.
    2
    IP address from the physical network that the node is on and is reserved by the cluster administrator for use by this pod. Optionally, you can include the subnet length, the /24 suffix, so that a proper route to the local subnet can be set up. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
    3
    Same value as the default gateway used by the node itself.
    4
    This tells the egress router image that it is being deployed as part of a DNS proxy, and so it should not set up iptables redirecting rules.
    5
    Optional. Setting this variable will display DNS proxy log output on stdout.
    6
    This uses the YAML syntax for a multi-line string. See below for details.
    Note

    Each line of EGRESS_DNS_PROXY_DESTINATION can be set in one of two ways:

    • <port> <remote_address> - This says that incoming connections to the given <port> should be proxied to the same TCP port on the given <remote_address>. <remote_address> can be an IP address or DNS name. In case of DNS name, DNS resolution is done at runtime. In the example above, the first line proxies TCP traffic from local port 80 to port 80 on 203.0.113.25. The second line proxies TCP traffic from local port 100 to port 100 on example.com.
    • <port> <remote_address> <remote_port> - As above, except that the connection is proxied to a different <remote_port> on <remote_address>. In the example above, the third line proxies local port 8080 to remote port 80 on 203.0.113.26 and the fourth line proxies local port 8443 to remote port 443 on foobar.com.
  2. Ensure other pods can find the pod’s IP address by creating a service to point to the egress router:

    apiVersion: v1
    kind: Service
    metadata:
      name: egress-dns-svc
    spec:
      ports:
      - name: con1
        protocol: TCP
        port: 80
        targetPort: 80
      - name: con2
        protocol: TCP
        port: 100
        targetPort: 100
      - name: con3
        protocol: TCP
        port: 8080
        targetPort: 8080
      - name: con4
        protocol: TCP
        port: 8443
        targetPort: 8443
      type: ClusterIP
      selector:
        name: egress-dns-proxy

    Pods can now connect to this service. Their connections are proxied to the corresponding ports on the external server, using the reserved egress IP address.

You can also specify the EGRESS_DNS_PROXY_DESTINATION using a ConfigMap, similarly to the redirecting egress router example above.

8.6.1.6. Enabling Failover for Egress Router Pods

Using a replication controller, you can ensure that there is always one copy of the egress router pod in order to prevent downtime.

  1. Create a replication controller configuration file using the following:

    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: egress-demo-controller
    spec:
      replicas: 1 1
      selector:
        name: egress-demo
      template:
        metadata:
          name: egress-demo
          labels:
            name: egress-demo
          annotations:
            pod.network.openshift.io/assign-macvlan: "true"
        spec:
          initContainers:
          - name: egress-demo-init
            image: registry.redhat.io/openshift3/ose-egress-router
            env:
            - name: EGRESS_SOURCE 2
              value: 192.168.12.99/24
            - name: EGRESS_GATEWAY
              value: 192.168.12.1
            - name: EGRESS_DESTINATION
              value: 203.0.113.25
            - name: EGRESS_ROUTER_MODE
              value: init
            securityContext:
              privileged: true
          containers:
          - name: egress-demo-wait
            image: registry.redhat.io/openshift3/ose-pod
          nodeSelector:
            site: springfield-1
    1
    Ensure replicas is set to 1, because only one pod can be using a given EGRESS_SOURCE value at any time. This means that only a single copy of the router will be running, on a node with the label site=springfield-1.
    2
    IP address from the physical network that the node is on and is reserved by the cluster administrator for use by this pod. Optionally, you can include the subnet length, the /24 suffix, so that a proper route to the local subnet can be set up. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
  2. Create the pod using the definition:

    $ oc create -f <replication_controller>.json
  3. To verify, check to see if the replication controller pod has been created:

    $ oc describe rc <replication_controller>

8.6.2. Using iptables Rules to Limit Access to External Resources

Some cluster administrators may want to perform actions on outgoing traffic that do not fit within the model of EgressNetworkPolicy or the egress router. In some cases, this can be done by creating iptables rules directly.

For example, you could create rules that log traffic to particular destinations, or to prevent more than a certain number of outgoing connections per second.

OpenShift Container Platform does not provide a way to add custom iptables rules automatically, but it does provide a place where such rules can be added manually by the administrator. Each node, on startup, will create an empty chain called OPENSHIFT-ADMIN-OUTPUT-RULES in the filter table (assuming that the chain does not already exist). Any rules added to that chain by an administrator will be applied to all traffic going from a pod to a destination outside the cluster (and not to any other traffic).

There are a few things to watch out for when using this functionality:

  1. It is up to you to ensure that rules get created on each node; OpenShift Container Platform does not provide any way to make that happen automatically.
  2. The rules are not applied to traffic that exits the cluster via an egress router, and they run after EgressNetworkPolicy rules are applied (and so will not see traffic that is denied by an EgressNetworkPolicy).
  3. The handling of connections from pods to nodes or pods to the master is complicated, because nodes have both "external" IP addresses and "internal" SDN IP addresses. Thus, some pod-to-node/master traffic may pass through this chain, but other pod-to-node/master traffic may bypass it.

8.7. Enabling Static IPs for External Project Traffic

As a cluster administrator, you can assign specific, static IP addresses to projects, so that traffic is externally easily recognizable. This is different from the default egress router, which is used to send traffic to specific destinations.

Recognizable IP traffic increases cluster security by ensuring the origin is visible. Once enabled, all outgoing external connections from the specified project will share the same, fixed source IP, meaning that any external resources can recognize the traffic.

Unlike the egress router, this is subject to EgressNetworkPolicy firewall rules.

Note

Assigning static IPs addresses for projects in your cluster requires the SDN to use either the ovs-networkpolicy or ovs-multitenant network plug-ins.

Note

If you use OpenShift SDN in multitenant mode, you cannot use egress IP addresses with any namespace that is joined to another namespace by the projects that are associated with them. For example, if project1 and project2 are joined by running the oc adm pod-network join-projects --to=project1 project2 command, neither project can use an egress IP address. For more information, see BZ#1645577.

To enable static source IPs:

  1. Update the NetNamespace with the desired IP:

    $ oc patch netnamespace <project_name> -p '{"egressIPs": ["<IP_address>"]}'

    For example, to assign the MyProject project to an IP address of 192.168.1.100:

    $ oc patch netnamespace MyProject -p '{"egressIPs": ["192.168.1.100"]}'

    The egressIPs field is an array. You can set egressIPs to two or more IP addresses on different nodes to provide high availability. If multiple egress IP addresses are set, pods use the first IP in the list for egress, but if the node hosting that IP address fails, pods switch to using the next IP in the list after a short delay.

  2. Manually assign the egress IP to the desired node hosts. Set the egressIPs field on the HostSubnet object on the node host. Include as many IPs as you want to assign to that node host:

    $ oc patch hostsubnet <node_name> -p \
      '{"egressIPs": ["<IP_address_1>", "<IP_address_2>"]}'

    For example, to say that node1 should have the egress IPs 192.168.1.100, 192.168.1.101, and 192.168.1.102:

    $ oc patch hostsubnet node1 -p \
      '{"egressIPs": ["192.168.1.100", "192.168.1.101", "192.168.1.102"]}'
    Important

    Egress IPs are implemented as additional IP addresses on the primary network interface, and must be in the same subnet as the node’s primary IP. Additionally, any external IPs should not be configured in any Linux network configuration files, such as ifcfg-eth0.

    Allowing additional IP addresses on the primary network interface might require extra configuration when using some cloud or VM solutions.

If the above is enabled for a project, all egress traffic from that project will be routed to the node hosting that egress IP, then connected (using NAT) to that IP address. If egressIPs is set on a NetNamespace, but there is no node hosting that egress IP, then egress traffic from the namespace will be dropped.

8.8. Enabling Automatic Egress IPs

Similar to Enabling Static IPs for External Project Traffic, as a cluster administrator, you can assign egress IP addresses to namespaces by setting the egressIPs parameter to the NetNamespace resource. You can associate only a single IP address with a project.

Note

If you use OpenShift SDN in multitenant mode, you cannot use egress IP addresses with any namespace that is joined to another namespace by the projects that are associated with them. For example, if project1 and project2 are joined by running the oc adm pod-network join-projects --to=project1 project2 command, neither project can use an egress IP address. For more information, see BZ#1645577.

With fully automatic egress IPs, you can set the egressCIDRs parameter of each node’s HostSubnet resource to indicate the range of egress IP addresses that can be hosted. Namespaces that have requested egress IP addresses are matched with nodes that are able to host those egress IP addresses, then the egress IP addresses are assigned to those nodes.

High availability is automatic. If a node hosting egress IP addresses goes down and there are nodes that are able to host those egress IP addresses, based on the egressCIDR values of the HostSubnet resources, then the egress IP addresses will move to a new node. When the original egress IP address node comes back online, the egress IP addresses automatically move to balance egress IP addresses across nodes.

Important

You cannot use manually assigned and automatically assigned egress IP addresses on the same nodes or with the same IP address ranges.

  1. Update the NetNamespace with the egress IP address:

     $ oc patch netnamespace <project_name> -p '{"egressIPs": ["<IP_address>"]}'

    You can specify only a single IP address for the egressIPs parameter. Using multiple IP addresses is not supported.

    For example, to assign project1 to an IP address of 192.168.1.100 and project2 to an IP address of 192.168.1.101:

    $ oc patch netnamespace project1 -p '{"egressIPs": ["192.168.1.100"]}'
    $ oc patch netnamespace project2 -p '{"egressIPs": ["192.168.1.101"]}''
  2. Indicate which nodes can host egress IP addresses by setting their egressCIDRs fields:

    $ oc patch hostsubnet <node_name> -p \
      '{"egressCIDRs": ["<IP_address_range_1>", "<IP_address_range_2>"]}'

    For example, to set node1 and node2 to host egress IP addresses in the range 192.168.1.0 to 192.168.1.255:

    $ oc patch hostsubnet node1 -p '{"egressCIDRs": ["192.168.1.0/24"]}'
    $ oc patch hostsubnet node2 -p '{"egressCIDRs": ["192.168.1.0/24"]}'
  3. OpenShift Container Platform automatically assigns specific egress IP addresses to available nodes, in a balanced way. In this case, it assigns the egress IP address 192.168.1.100 to node1 and the egress IP address 192.168.1.101 to node2 or vice versa.

8.9. Enabling Multicast

Important

At this time, multicast is best used for low bandwidth coordination or service discovery and not a high-bandwidth solution.

Multicast traffic between OpenShift Container Platform pods is disabled by default. If you are using the ovs-multitenant or ovs-networkpolicy plugin, you can enable multicast on a per-project basis by setting an annotation on the project’s corresponding netnamespace object:

$ oc annotate netnamespace <namespace> \
    netnamespace.network.openshift.io/multicast-enabled=true

Disable multicast by removing the annotation:

$ oc annotate netnamespace <namespace> \
    netnamespace.network.openshift.io/multicast-enabled-

When using the ovs-multitenant plugin:

  1. In an isolated project, multicast packets sent by a pod will be delivered to all other pods in the project.
  2. If you have joined networks together, you will need to enable multicast in each project’s netnamespace in order for it to take effect in any of the projects. Multicast packets sent by a pod in a joined network will be delivered to all pods in all of the joined-together networks.
  3. To enable multicast in the default project, you must also enable it in the kube-service-catalog project and all other projects that have been made global. Global projects are not "global" for purposes of multicast; multicast packets sent by a pod in a global project will only be delivered to pods in other global projects, not to all pods in all projects. Likewise, pods in global projects will only receive multicast packets sent from pods in other global projects, not from all pods in all projects.

When using the ovs-networkpolicy plugin:

  1. Multicast packets sent by a pod will be delivered to all other pods in the project, regardless of NetworkPolicy objects. (Pods may be able to communicate over multicast even when they can’t communicate over unicast.)
  2. Multicast packets sent by a pod in one project will never be delivered to pods in any other project, even if there are NetworkPolicy objects allowing communication between the projects.

8.10. Enabling NetworkPolicy

The ovs-subnet and ovs-multitenant plug-ins have their own legacy models of network isolation and do not support Kubernetes NetworkPolicy. However, NetworkPolicy support is available by using the ovs-networkpolicy plug-in.

Note

The Egress policy type, the ipBlock parameter, and the ability to combine the podSelector and namespaceSelector parameters are not available in OpenShift Container Platform.

Note

Do not apply NetworkPolicy features on default OpenShift Container Platform projects, because they can disrupt communication with the cluster.

Warning

NetworkPolicy rules do not apply to the host network namespace. Pods with host networking enabled are unaffected by NetworkPolicy rules.

In a cluster configured to use the ovs-networkpolicy plug-in, network isolation is controlled entirely by NetworkPolicy objects. By default, all pods in a project are accessible from other pods and network endpoints. To isolate one or more pods in a project, you can create NetworkPolicy objects in that project to indicate the allowed incoming connections. Project administrators can create and delete NetworkPolicy objects within their own project.

Pods that do not have NetworkPolicy objects pointing to them are fully accessible, whereas, pods that have one or more NetworkPolicy objects pointing to them are isolated. These isolated pods only accept connections that are accepted by at least one of their NetworkPolicy objects.

Following are a few sample NetworkPolicy object definitions supporting different scenarios:

  • Deny All Traffic

    To make a project "deny by default" add a NetworkPolicy object that matches all pods but accepts no traffic.

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: deny-by-default
    spec:
      podSelector:
      ingress: []
  • Only Accept connections from pods within project

    To make pods accept connections from other pods in the same project, but reject all other connections from pods in other projects:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-same-namespace
    spec:
      podSelector:
      ingress:
      - from:
        - podSelector: {}
  • Only allow HTTP and HTTPS traffic based on pod labels

    To enable only HTTP and HTTPS access to the pods with a specific label (role=frontend in following example), add a NetworkPolicy object similar to:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-http-and-https
    spec:
      podSelector:
        matchLabels:
          role: frontend
      ingress:
      - ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443

NetworkPolicy objects are additive, which means you can combine multiple NetworkPolicy objects together to satisfy complex network requirements.

For example, for the NetworkPolicy objects defined in previous samples, you can define both allow-same-namespace and allow-http-and-https policies within the same project. Thus allowing the pods with the label role=frontend, to accept any connection allowed by each policy. That is, connections on any port from pods in the same namespace, and connections on ports 80 and 443 from pods in any namespace.

8.10.1. Using NetworkPolicy Efficiently

NetworkPolicy objects allow you to isolate pods that are differentiated from one another by labels, within a namespace.

It is inefficient to apply NetworkPolicy objects to large numbers of individual pods in a single namespace. Pod labels do not exist at the IP level, so NetworkPolicy objects generate a separate OVS flow rule for every single possible link between every pod selected with podSelector.

For example, if the spec podSelector and the ingress podSelector within a NetworkPolicy object each match 200 pods, then 40000 (200*200) OVS flow rules are generated. This might slow down the machine.

To reduce the amount of OVS flow rules, use namespaces to contain groups of pods that need to be isolated.

NetworkPolicy objects that select a whole namespace, by using namespaceSelectors or empty podSelectors, only generate a single OVS flow rule that matches the VXLAN VNID of the namespace.

Keep the pods that do not need to be isolated in their original namespace, and move the pods that require isolation into one or more different namespaces.

Create additional targeted cross-namespace policies to allow the specific traffic that you do want to allow from the isolated pods.

8.10.2. NetworkPolicy and Routers

When using the ovs-multitenant plug-in, traffic from the routers is automatically allowed into all namespaces. This is because the routers are usually in the default namespace, and all namespaces allow connections from pods in that namespace. With the ovs-networkpolicy plug-in, this does not happen automatically. Therefore, if you have a policy that isolates a namespace by default, you need to take additional steps to allow routers to access it.

One option is to create a policy for each service, allowing access from all sources. for example,

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-to-database-service
spec:
  podSelector:
    matchLabels:
      role: database
  ingress:
  - ports:
    - protocol: TCP
      port: 5432

This allows routers to access the service, but will also allow pods in other users' namespaces to access it as well. This should not cause any issues, as those pods can normally access the service by using the public router.

Alternatively, you can create a policy allowing full access from the default namespace, as in the ovs-multitenant plug-in:

  1. Add a label to the default namespace.

    Important

    If you labeled the default project with the default label in a previous procedure, then skip this step. The cluster administrator role is required to add labels to namespaces.

    $ oc label namespace default name=default
  2. Create policies allowing connections from that namespace.

    Note

    Perform this step for each namespace you want to allow connections into. Users with the Project Administrator role can create policies.

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-from-default-namespace
    spec:
      podSelector:
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              name: default

8.10.3. Setting a Default NetworkPolicy for New Projects

The cluster administrators can modify the default project template to enable automatic creation of default NetworkPolicy objects (one or more), whenever a new project is created. To do this:

  1. Create a custom project template and configure the master to use it.
  2. Label the default project with the default label:

    Important

    If you labeled the default project with the default label in a previous procedure, then skip this step. The cluster administrator role is required to add labels to namespaces.

    $ oc label namespace default name=default
  3. Edit the template to include the desired NetworkPolicy objects:

    $ oc edit template project-request -n default
    Note

    To include NetworkPolicy objects into existing template, use the oc edit command. Currently, it is not possible to use oc patch to add objects to a Template resource.

    1. Add each default policy as an element in the objects array:

      objects:
      ...
      - apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        metadata:
          name: allow-from-same-namespace
        spec:
          podSelector:
          ingress:
          - from:
            - podSelector: {}
      - apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        metadata:
          name: allow-from-default-namespace
        spec:
          podSelector:
          ingress:
          - from:
            - namespaceSelector:
                matchLabels:
                  name: default
      ...

8.11. Enabling HTTP Strict Transport Security

HTTP Strict Transport Security (HSTS) policy is a security enhancement, which ensures that only HTTPS traffic is allowed on the host. Any HTTP requests are dropped by default. This is useful for ensuring secure interactions with websites, or to offer a secure application for the user’s benefit.

When HSTS is enabled, HSTS adds a Strict Transport Security header to HTTPS responses from the site. You can use the insecureEdgeTerminationPolicy value in a route to redirect to send HTTP to HTTPS. However, when HSTS is enabled, the client changes all requests from the HTTP URL to HTTPS before the request is sent, eliminating the need for a redirect. This is not required to be supported by the client, and can be disabled by setting max-age=0.

Important

HSTS works only with secure routes (either edge terminated or re-encrypt). The configuration is ineffective on HTTP or passthrough routes.

To enable HSTS to a route, add the haproxy.router.openshift.io/hsts_header value to the edge terminated or re-encrypt route:

apiVersion: v1
kind: Route
metadata:
  annotations:
    haproxy.router.openshift.io/hsts_header: max-age=31536000;includeSubDomains;preload
Important

Ensure there are no spaces and no other values in the parameters in the haproxy.router.openshift.io/hsts_header value. Only max-age is required.

The required max-age parameter indicates the length of time, in seconds, the HSTS policy is in effect for. The client updates max-age whenever a response with a HSTS header is received from the host. When max-age times out, the client discards the policy.

The optional includeSubDomains parameter tells the client that all subdomains of the host are to be treated the same as the host.

If max-age is greater than 0, the optional preload parameter allows external services to include this site in their HSTS preload lists. For example, sites such as Google can construct a list of sites that have preload set. Browsers can then use these lists to determine which sites to only talk to over HTTPS, even before they have interacted with the site. Without preload set, they need to have talked to the site over HTTPS to get the header.

8.12. Troubleshooting Throughput Issues

Sometimes applications deployed through OpenShift Container Platform can cause network throughput issues such as unusually high latency between specific services.

Use the following methods to analyze performance issues if pod logs do not reveal any cause of the problem:

  • Use a packet analyzer, such as ping or tcpdump to analyze traffic between a pod and its node.

    For example, run the tcpdump tool on each pod while reproducing the behavior that led to the issue. Review the captures on both sides to compare send and receive timestamps to analyze the latency of traffic to/from a pod. Latency can occur in OpenShift Container Platform if a node interface is overloaded with traffic from other pods, storage devices, or the data plane.

    $ tcpdump -s 0 -i any -w /tmp/dump.pcap host <podip 1> && host <podip 2> 1
    1
    podip is the IP address for the pod. Run the following command to get the IP address of the pods:
    # oc get pod <podname> -o wide

    tcpdump generates a file at /tmp/dump.pcap containing all traffic between these two pods. Ideally, run the analyzer shortly before the issue is reproduced and stop the analyzer shortly after the issue is finished reproducing to minimize the size of the file. You can also run a packet analyzer between the nodes (eliminating the SDN from the equation) with:

    # tcpdump -s 0 -i any -w /tmp/dump.pcap port 4789
  • Use a bandwidth measuring tool, such as iperf, to measure streaming throughput and UDP throughput. Run the tool from the pods first, then from the nodes to attempt to locate any bottlenecks. The iperf3 tool is included as part of RHEL 7.

For information on installing and using iperf3, see this Red Hat Solution.

Chapter 9. Configuring Service Accounts

9.1. Overview

When a person uses the OpenShift Container Platform CLI or web console, their API token authenticates them to the OpenShift Container Platform API. However, when a regular user’s credentials are not available, it is common for components to make API calls independently. For example:

  • Replication controllers make API calls to create or delete pods.
  • Applications inside containers can make API calls for discovery purposes.
  • External applications can make API calls for monitoring or integration purposes.

Service accounts provide a flexible way to control API access without sharing a regular user’s credentials.

9.2. User Names and Groups

Every service account has an associated user name that can be granted roles, just like a regular user. The user name is derived from its project and name:

system:serviceaccount:<project>:<name>

For example, to add the view role to the robot service account in the top-secret project:

$ oc policy add-role-to-user view system:serviceaccount:top-secret:robot
Important

If you want to grant access to a specific service account in a project, you can use the -z flag. From the project to which the service account belongs, use the -z flag and specify the <serviceaccount_name>. This is highly recommended, as it helps prevent typos and ensures that access is granted only to the specified service account. For example:

 $ oc policy add-role-to-user <role_name> -z <serviceaccount_name>

If not in the project, use the -n option to indicate the project namespace it applies to, as shown in the examples below.

Every service account is also a member of two groups:

system:serviceaccounts
Includes all service accounts in the system.
system:serviceaccounts:<project>
Includes all service accounts in the specified project.

For example, to allow all service accounts in all projects to view resources in the top-secret project:

$ oc policy add-role-to-group view system:serviceaccounts -n top-secret

To allow all service accounts in the managers project to edit resources in the top-secret project:

$ oc policy add-role-to-group edit system:serviceaccounts:managers -n top-secret

9.3. Managing Service Accounts

Service accounts are API objects that exist within each project. To manage service accounts, you can use the oc command with the sa or serviceaccount object type or use the web console.

To get a list of existing service accounts in the current project:

$ oc get sa
NAME       SECRETS   AGE
builder    2         2d
default    2         2d
deployer   2         2d

To create a new service account:

$ oc create sa robot
serviceaccount "robot" created

As soon as a service account is created, two secrets are automatically added to it:

  • an API token
  • credentials for the OpenShift Container Registry

These can be seen by describing the service account:

$ oc describe sa robot
Name:		robot
Namespace:	project1
Labels:		<none>
Annotations:	<none>

Image pull secrets:	robot-dockercfg-qzbhb

Mountable secrets: 	robot-token-f4khf
                   	robot-dockercfg-qzbhb

Tokens:            	robot-token-f4khf
                   	robot-token-z8h44

The system ensures that service accounts always have an API token and registry credentials.

The generated API token and registry credentials do not expire, but they can be revoked by deleting the secret. When the secret is deleted, a new one is automatically generated to take its place.

9.4. Enabling Service Account Authentication

Service accounts authenticate to the API using tokens signed by a private RSA key. The authentication layer verifies the signature using a matching public RSA key.

To enable service account token generation, update the serviceAccountConfig stanza in the /etc/origin/master/master-config.yml file on the master to specify a privateKeyFile (for signing), and a matching public key file in the publicKeyFiles list:

serviceAccountConfig:
  ...
  masterCA: ca.crt 1
  privateKeyFile: serviceaccount.private.key 2
  publicKeyFiles:
  - serviceaccount.public.key 3
  - ...
1
CA file used to validate the API server’s serving certificate.
2
Private RSA key file (for token signing).
3
Public RSA key files (for token verification). If private key files are provided, then the public key component is used. Multiple public key files can be specified, and a token will be accepted if it can be validated by one of the public keys. This allows rotation of the signing key, while still accepting tokens generated by the previous signer.

9.5. Managed Service Accounts

Service accounts are required in each project to run builds, deployments, and other pods. The managedNames setting in the /etc/origin/master/master-config.yml file on the master controls which service accounts are automatically created in every project:

serviceAccountConfig:
  ...
  managedNames: 1
  - builder 2
  - deployer 3
  - default 4
  - ...
1
List of service accounts to automatically create in every project.
2
A builder service account in each project is required by build pods, and is given the system:image-builder role, which allows pushing images to any image stream in the project using the internal container image registry.
3
A deployer service account in each project is required by deployment pods, and is given the system:deployer role, which allows viewing and modifying replication controllers and pods in the project.
4
A default service account is used by all other pods unless they specify a different service account.

All service accounts in a project are given the system:image-puller role, which allows pulling images from any image stream in the project using the internal container image registry.

9.6. Infrastructure Service Accounts

Several infrastructure controllers run using service account credentials. The following service accounts are created in the OpenShift Container Platform infrastructure project (openshift-infra) at server start, and given the following roles cluster-wide:

Service AccountDescription

replication-controller

Assigned the system:replication-controller role

deployment-controller

Assigned the system:deployment-controller role

build-controller

Assigned the system:build-controller role. Additionally, the build-controller service account is included in the privileged security context constraint in order to create privileged build pods.

To configure the project where those service accounts are created, set the openshiftInfrastructureNamespace field in the /etc/origin/master/master-config.yml file on the master:

policyConfig:
  ...
  openshiftInfrastructureNamespace: openshift-infra

9.7. Service Accounts and Secrets

Set the limitSecretReferences field in the /etc/origin/master/master-config.yml file on the master to true to require pod secret references to be whitelisted by their service accounts. Set its value to false to allow pods to reference any secret in the project.

serviceAccountConfig:
  ...
  limitSecretReferences: false

Chapter 10. Managing Role-based Access Control (RBAC)

10.1. Overview

You can use the CLI to view RBAC resources and the administrator CLI to manage the roles and bindings.

10.2. Viewing roles and bindings

Roles can be used to grant various levels of access both cluster-wide as well as at the project-scope. Users and groups can be associated with, or bound to, multiple roles at the same time. You can view details about the roles and their bindings using the oc describe command.

Users with the cluster-admindefault cluster role bound cluster-wide can perform any action on any resource. Users with the admin default cluster role bound locally can manage roles and bindings in that project.

Note

Review a full list of verbs in the Evaluating Authorization section.

10.2.1. Viewing cluster roles

To view the cluster roles and their associated rule sets:

$ oc describe clusterrole.rbac
Name:		admin
Labels:		<none>
Annotations:	openshift.io/description=A user that has edit rights within the project and can change the project's membership.
		rbac.authorization.kubernetes.io/autoupdate=true
PolicyRule:
  Resources							Non-Resource URLs	Resource Names	Verbs
  ---------							-----------------	--------------	-----
  appliedclusterresourcequotas					[]			[]		[get list watch]
  appliedclusterresourcequotas.quota.openshift.io		[]			[]		[get list watch]
  bindings							[]			[]		[get list watch]
  buildconfigs							[]			[]		[create delete deletecollection get list patch update watch]
  buildconfigs.build.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  buildconfigs/instantiate					[]			[]		[create]
  buildconfigs.build.openshift.io/instantiate			[]			[]		[create]
  buildconfigs/instantiatebinary				[]			[]		[create]
  buildconfigs.build.openshift.io/instantiatebinary		[]			[]		[create]
  buildconfigs/webhooks						[]			[]		[create delete deletecollection get list patch update watch]
  buildconfigs.build.openshift.io/webhooks			[]			[]		[create delete deletecollection get list patch update watch]
  buildlogs							[]			[]		[create delete deletecollection get list patch update watch]
  buildlogs.build.openshift.io					[]			[]		[create delete deletecollection get list patch update watch]
  builds							[]			[]		[create delete deletecollection get list patch update watch]
  builds.build.openshift.io					[]			[]		[create delete deletecollection get list patch update watch]
  builds/clone							[]			[]		[create]
  builds.build.openshift.io/clone				[]			[]		[create]
  builds/details						[]			[]		[update]
  builds.build.openshift.io/details				[]			[]		[update]
  builds/log							[]			[]		[get list watch]
  builds.build.openshift.io/log					[]			[]		[get list watch]
  configmaps							[]			[]		[create delete deletecollection get list patch update watch]
  cronjobs.batch						[]			[]		[create delete deletecollection get list patch update watch]
  daemonsets.extensions						[]			[]		[get list watch]
  deploymentconfigrollbacks					[]			[]		[create]
  deploymentconfigrollbacks.apps.openshift.io			[]			[]		[create]
  deploymentconfigs						[]			[]		[create delete deletecollection get list patch update watch]
  deploymentconfigs.apps.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  deploymentconfigs/instantiate					[]			[]		[create]
  deploymentconfigs.apps.openshift.io/instantiate		[]			[]		[create]
  deploymentconfigs/log						[]			[]		[get list watch]
  deploymentconfigs.apps.openshift.io/log			[]			[]		[get list watch]
  deploymentconfigs/rollback					[]			[]		[create]
  deploymentconfigs.apps.openshift.io/rollback			[]			[]		[create]
  deploymentconfigs/scale					[]			[]		[create delete deletecollection get list patch update watch]
  deploymentconfigs.apps.openshift.io/scale			[]			[]		[create delete deletecollection get list patch update watch]
  deploymentconfigs/status					[]			[]		[get list watch]
  deploymentconfigs.apps.openshift.io/status			[]			[]		[get list watch]
  deployments.apps						[]			[]		[create delete deletecollection get list patch update watch]
  deployments.extensions					[]			[]		[create delete deletecollection get list patch update watch]
  deployments.extensions/rollback				[]			[]		[create delete deletecollection get list patch update watch]
  deployments.apps/scale					[]			[]		[create delete deletecollection get list patch update watch]
  deployments.extensions/scale					[]			[]		[create delete deletecollection get list patch update watch]
  deployments.apps/status					[]			[]		[create delete deletecollection get list patch update watch]
  endpoints							[]			[]		[create delete deletecollection get list patch update watch]
  events							[]			[]		[get list watch]
  horizontalpodautoscalers.autoscaling				[]			[]		[create delete deletecollection get list patch update watch]
  horizontalpodautoscalers.extensions				[]			[]		[create delete deletecollection get list patch update watch]
  imagestreamimages						[]			[]		[create delete deletecollection get list patch update watch]
  imagestreamimages.image.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  imagestreamimports						[]			[]		[create]
  imagestreamimports.image.openshift.io				[]			[]		[create]
  imagestreammappings						[]			[]		[create delete deletecollection get list patch update watch]
  imagestreammappings.image.openshift.io			[]			[]		[create delete deletecollection get list patch update watch]
  imagestreams							[]			[]		[create delete deletecollection get list patch update watch]
  imagestreams.image.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  imagestreams/layers						[]			[]		[get update]
  imagestreams.image.openshift.io/layers			[]			[]		[get update]
  imagestreams/secrets						[]			[]		[create delete deletecollection get list patch update watch]
  imagestreams.image.openshift.io/secrets			[]			[]		[create delete deletecollection get list patch update watch]
  imagestreams/status						[]			[]		[get list watch]
  imagestreams.image.openshift.io/status			[]			[]		[get list watch]
  imagestreamtags						[]			[]		[create delete deletecollection get list patch update watch]
  imagestreamtags.image.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  jenkins.build.openshift.io					[]			[]		[admin edit view]
  jobs.batch							[]			[]		[create delete deletecollection get list patch update watch]
  limitranges							[]			[]		[get list watch]
  localresourceaccessreviews					[]			[]		[create]
  localresourceaccessreviews.authorization.openshift.io		[]			[]		[create]
  localsubjectaccessreviews					[]			[]		[create]
  localsubjectaccessreviews.authorization.k8s.io		[]			[]		[create]
  localsubjectaccessreviews.authorization.openshift.io		[]			[]		[create]
  namespaces							[]			[]		[get list watch]
  namespaces/status						[]			[]		[get list watch]
  networkpolicies.extensions					[]			[]		[create delete deletecollection get list patch update watch]
  persistentvolumeclaims					[]			[]		[create delete deletecollection get list patch update watch]
  pods								[]			[]		[create delete deletecollection get list patch update watch]
  pods/attach							[]			[]		[create delete deletecollection get list patch update watch]
  pods/exec							[]			[]		[create delete deletecollection get list patch update watch]
  pods/log							[]			[]		[get list watch]
  pods/portforward						[]			[]		[create delete deletecollection get list patch update watch]
  pods/proxy							[]			[]		[create delete deletecollection get list patch update watch]
  pods/status							[]			[]		[get list watch]
  podsecuritypolicyreviews					[]			[]		[create]
  podsecuritypolicyreviews.security.openshift.io		[]			[]		[create]
  podsecuritypolicyselfsubjectreviews				[]			[]		[create]
  podsecuritypolicyselfsubjectreviews.security.openshift.io	[]			[]		[create]
  podsecuritypolicysubjectreviews				[]			[]		[create]
  podsecuritypolicysubjectreviews.security.openshift.io		[]			[]		[create]
  processedtemplates						[]			[]		[create delete deletecollection get list patch update watch]
  processedtemplates.template.openshift.io			[]			[]		[create delete deletecollection get list patch update watch]
  projects							[]			[]		[delete get patch update]
  projects.project.openshift.io					[]			[]		[delete get patch update]
  replicasets.extensions					[]			[]		[create delete deletecollection get list patch update watch]
  replicasets.extensions/scale					[]			[]		[create delete deletecollection get list patch update watch]
  replicationcontrollers					[]			[]		[create delete deletecollection get list patch update watch]
  replicationcontrollers/scale					[]			[]		[create delete deletecollection get list patch update watch]
  replicationcontrollers.extensions/scale			[]			[]		[create delete deletecollection get list patch update watch]
  replicationcontrollers/status					[]			[]		[get list watch]
  resourceaccessreviews						[]			[]		[create]
  resourceaccessreviews.authorization.openshift.io		[]			[]		[create]
  resourcequotas						[]			[]		[get list watch]
  resourcequotas/status						[]			[]		[get list watch]
  resourcequotausages						[]			[]		[get list watch]
  rolebindingrestrictions					[]			[]		[get list watch]
  rolebindingrestrictions.authorization.openshift.io		[]			[]		[get list watch]
  rolebindings							[]			[]		[create delete deletecollection get list patch update watch]
  rolebindings.authorization.openshift.io			[]			[]		[create delete deletecollection get list patch update watch]
  rolebindings.rbac.authorization.k8s.io			[]			[]		[create delete deletecollection get list patch update watch]
  roles								[]			[]		[create delete deletecollection get list patch update watch]
  roles.authorization.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  roles.rbac.authorization.k8s.io				[]			[]		[create delete deletecollection get list patch update watch]
  routes							[]			[]		[create delete deletecollection get list patch update watch]
  routes.route.openshift.io					[]			[]		[create delete deletecollection get list patch update watch]
  routes/custom-host						[]			[]		[create]
  routes.route.openshift.io/custom-host				[]			[]		[create]
  routes/status							[]			[]		[get list watch update]
  routes.route.openshift.io/status				[]			[]		[get list watch update]
  scheduledjobs.batch						[]			[]		[create delete deletecollection get list patch update watch]
  secrets							[]			[]		[create delete deletecollection get list patch update watch]
  serviceaccounts						[]			[]		[create delete deletecollection get list patch update watch impersonate]
  services							[]			[]		[create delete deletecollection get list patch update watch]
  services/proxy						[]			[]		[create delete deletecollection get list patch update watch]
  statefulsets.apps						[]			[]		[create delete deletecollection get list patch update watch]
  subjectaccessreviews						[]			[]		[create]
  subjectaccessreviews.authorization.openshift.io		[]			[]		[create]
  subjectrulesreviews						[]			[]		[create]
  subjectrulesreviews.authorization.openshift.io		[]			[]		[create]
  templateconfigs						[]			[]		[create delete deletecollection get list patch update watch]
  templateconfigs.template.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]
  templateinstances						[]			[]		[create delete deletecollection get list patch update watch]
  templateinstances.template.openshift.io			[]			[]		[create delete deletecollection get list patch update watch]
  templates							[]			[]		[create delete deletecollection get list patch update watch]
  templates.template.openshift.io				[]			[]		[create delete deletecollection get list patch update watch]


Name:		basic-user
Labels:		<none>
Annotations:	openshift.io/description=A user that can get basic information about projects.
		rbac.authorization.kubernetes.io/autoupdate=true
PolicyRule:
  Resources						Non-Resource URLs	Resource Names	Verbs
  ---------						-----------------	--------------	-----
  clusterroles						[]			[]		[get list]
  clusterroles.authorization.openshift.io		[]			[]		[get list]
  clusterroles.rbac.authorization.k8s.io		[]			[]		[get list watch]
  projectrequests					[]			[]		[list]
  projectrequests.project.openshift.io			[]			[]		[list]
  projects						[]			[]		[list watch]
  projects.project.openshift.io				[]			[]		[list watch]
  selfsubjectaccessreviews.authorization.k8s.io		[]			[]		[create]
  selfsubjectrulesreviews				[]			[]		[create]
  selfsubjectrulesreviews.authorization.openshift.io	[]			[]		[create]
  storageclasses.storage.k8s.io				[]			[]		[get list]
  users							[]			[~]		[get]
  users.user.openshift.io				[]			[~]		[get]


Name:		cluster-admin
Labels:		<none>
Annotations:	authorization.openshift.io/system-only=true
		openshift.io/description=A super-user that can perform any action in the cluster. When granted to a user within a project, they have full control over quota and membership and can perform every action...
		rbac.authorization.kubernetes.io/autoupdate=true
PolicyRule:
  Resources	Non-Resource URLs	Resource Names	Verbs
  ---------	-----------------	--------------	-----
  		[*]			[]		[*]
  *.*		[]			[]		[*]


Name:		cluster-debugger
Labels:		<none>
Annotations:	authorization.openshift.io/system-only=true
		rbac.authorization.kubernetes.io/autoupdate=true
PolicyRule:
  Resources	Non-Resource URLs	Resource Names	Verbs
  ---------	-----------------	--------------	-----
  		[/debug/pprof]		[]		[get]
  		[/debug/pprof/*]	[]		[get]
  		[/metrics]		[]		[get]


Name:		cluster-reader
Labels:		<none>
Annotations:	authorization.openshift.io/system-only=true
		rbac.authorization.kubernetes.io/autoupdate=true
PolicyRule:
  Resources							Non-Resource URLs	Resource Names	Verbs
  ---------							-----------------	--------------	-----
  								[*]			[]		[get]
  apiservices.apiregistration.k8s.io				[]			[]		[get list watch]
  apiservices.apiregistration.k8s.io/status			[]			[]		[get list watch]
  appliedclusterresourcequotas					[]			[]		[get list watch]

...

10.2.2. Viewing cluster role bindings

To view the current set of cluster role bindings, which show the users and groups that are bound to various roles:

$ oc describe clusterrolebinding.rbac
Name:		admin
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	admin
Subjects:
  Kind			Name				Namespace
  ----			----				---------
  ServiceAccount	template-instance-controller	openshift-infra


Name:		basic-users
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	basic-user
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated


Name:		cluster-admin
Labels:		kubernetes.io/bootstrapping=rbac-defaults
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	cluster-admin
Subjects:
  Kind			Name		Namespace
  ----			----		---------
  ServiceAccount	pvinstaller	default
  Group			system:masters


Name:		cluster-admins
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	cluster-admin
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:cluster-admins
  User	system:admin


Name:		cluster-readers
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	cluster-reader
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:cluster-readers


Name:		cluster-status-binding
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	cluster-status
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated
  Group	system:unauthenticated


Name:		registry-registry-role
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:registry
Subjects:
  Kind			Name		Namespace
  ----			----		---------
  ServiceAccount	registry	default


Name:		router-router-role
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:router
Subjects:
  Kind			Name	Namespace
  ----			----	---------
  ServiceAccount	router	default


Name:		self-access-reviewers
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	self-access-reviewer
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated
  Group	system:unauthenticated


Name:		self-provisioners
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	self-provisioner
Subjects:
  Kind	Name				Namespace
  ----	----				---------
  Group	system:authenticated:oauth


Name:		system:basic-user
Labels:		kubernetes.io/bootstrapping=rbac-defaults
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	system:basic-user
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated
  Group	system:unauthenticated


Name:		system:build-strategy-docker-binding
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	system:build-strategy-docker
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated


Name:		system:build-strategy-jenkinspipeline-binding
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	system:build-strategy-jenkinspipeline
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated


Name:		system:build-strategy-source-binding
Labels:		<none>
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	system:build-strategy-source
Subjects:
  Kind	Name			Namespace
  ----	----			---------
  Group	system:authenticated


Name:		system:controller:attachdetach-controller
Labels:		kubernetes.io/bootstrapping=rbac-defaults
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	system:controller:attachdetach-controller
Subjects:
  Kind			Name			Namespace
  ----			----			---------
  ServiceAccount	attachdetach-controller	kube-system


Name:		system:controller:certificate-controller
Labels:		kubernetes.io/bootstrapping=rbac-defaults
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true
Role:
  Kind:	ClusterRole
  Name:	system:controller:certificate-controller
Subjects:
  Kind			Name			Namespace
  ----			----			---------
  ServiceAccount	certificate-controller	kube-system


Name:		system:controller:cronjob-controller
Labels:		kubernetes.io/bootstrapping=rbac-defaults
Annotations:	rbac.authorization.kubernetes.io/autoupdate=true

...

10.2.3. Viewing local roles and bindings

All of the default cluster roles can be bound locally to users or groups.

Custom local roles can be created.

The local role bindings are also viewable.

To view the current set of local role bindings, which show the users and groups that are bound to various roles:

$ oc describe rolebinding.rbac

By default, the current project is used when viewing local role bindings. Alternatively, a project can be specified with the -n flag. This is useful for viewing the local role bindings of another project, if the user already has the admindefault cluster role in it.

$ oc describe rolebinding.rbac -n joe-project
Name:		admin
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	admin
Subjects:
  Kind	Name	Namespace
  ----	----	---------
  User	joe


Name:		system:deployers
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:deployer
Subjects:
  Kind			Name		Namespace
  ----			----		---------
  ServiceAccount	deployer	joe-project


Name:		system:image-builders
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:image-builder
Subjects:
  Kind			Name	Namespace
  ----			----	---------
  ServiceAccount	builder	joe-project


Name:		system:image-pullers
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:image-puller
Subjects:
  Kind	Name					Namespace
  ----	----					---------
  Group	system:serviceaccounts:joe-project

10.3. Managing role bindings

Adding, or binding, a role to users or groups gives the user or group the relevant access granted by the role. You can add and remove roles to and from users and groups using oc adm policy commands.

When managing a user or group’s associated roles for local role bindings using the following operations, a project may be specified with the -n flag. If it is not specified, then the current project is used.

Table 10.1. Local role binding operations
CommandDescription

$ oc adm policy who-can <verb> <resource>

Indicates which users can perform an action on a resource.

$ oc adm policy add-role-to-user <role> <username>

Binds a given role to specified users in the current project.

$ oc adm policy remove-role-from-user <role> <username>

Removes a given role from specified users in the current project.

$ oc adm policy remove-user <username>

Removes specified users and all of their roles in the current project.

$ oc adm policy add-role-to-group <role> <groupname>

Binds a given role to specified groups in the current project.

$ oc adm policy remove-role-from-group <role> <groupname>

Removes a given role from specified groups in the current project.

$ oc adm policy remove-group <groupname>

Removes specified groups and all of their roles in the current project.

--rolebinding-name=

Can be used with oc adm policy commands to retain the rolebinding name assigned to a user or group.

You can also manage cluster role bindings using the following operations. The -n flag is not used for these operations because cluster role bindings use non-namespaced resources.

Table 10.2. Cluster role binding operations
CommandDescription

$ oc adm policy add-cluster-role-to-user <role> <username>

Binds a given role to specified users for all projects in the cluster.

$ oc adm policy remove-cluster-role-from-user <role> <username>

Removes a given role from specified users for all projects in the cluster.

$ oc adm policy add-cluster-role-to-group <role> <groupname>

Binds a given role to specified groups for all projects in the cluster.

$ oc adm policy remove-cluster-role-from-group <role> <groupname>

Removes a given role from specified groups for all projects in the cluster.

--rolebinding-name=

Can be used with oc adm policy commands to retain the rolebinding name assigned to a user or group.

For example, you can add the admin role to the alice user in joe-project by running:

$ oc adm policy add-role-to-user admin alice -n joe-project

You can then view the local role bindings and verify the addition in the output:

$ oc describe rolebinding.rbac -n joe-project
Name:		admin
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	admin
Subjects:
  Kind	Name	Namespace
  ----	----	---------
  User	joe


Name:		admin-0 1
Labels:		<none>
Annotations:	<none>

Role:
  Kind:  ClusterRole
  Name:  admin
Subjects:
  Kind  Name   Namespace
  ----  ----   ---------
  User  alice 2


Name:		system:deployers
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:deployer
Subjects:
  Kind			Name		Namespace
  ----			----		---------
  ServiceAccount	deployer	joe-project


Name:		system:image-builders
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:image-builder
Subjects:
  Kind			Name	Namespace
  ----			----	---------
  ServiceAccount	builder	joe-project


Name:		system:image-pullers
Labels:		<none>
Annotations:	<none>
Role:
  Kind:	ClusterRole
  Name:	system:image-puller
Subjects:
  Kind	Name					Namespace
  ----	----					---------
  Group	system:serviceaccounts:joe-project
1
A new role binding is created with a default name, incremented as necessary. To specify an existing role binding to modify, use the --rolebinding-name option when adding the role to the user.
2
The user alice is added.

10.4. Creating a local role

You can create a local role for a project and then bind it to a user.

  1. To create a local role for a project, run the following command:

    $ oc create role <name> --verb=<verb> --resource=<resource> -n <project>

    In this command, specify: * <name>, the local role’s name * <verb>, a comma-separated list of the verbs to apply to the role * <resource>, the resources that the role applies to * <project>, the project name

    + For example, to create a local role that allows a user to view pods in the blue project, run the following command:

    +

    $ oc create role podview --verb=get --resource=pod -n blue
  2. To bind the new role to a user, run the following command:
$ oc adm policy add-role-to-user podview user2 --role-namespace=blue -n blue

10.5. Creating a cluster role

To create a cluster role, run the following command:

$ oc create clusterrole <name> --verb=<verb> --resource=<resource>

In this command, specify:

  • <name>, the local role’s name
  • <verb>, a comma-separated list of the verbs to apply to the role
  • <resource>, the resources that the role applies to

For example, to create a cluster role that allows a user to view pods, run the following command:

$ oc create clusterrole podviewonly --verb=get --resource=pod

10.6. Cluster and local role bindings

A cluster role binding is a binding that exists at the cluster level. A role binding exists at the project level. The cluster role view must be bound to a user using a local role binding for that user to view the project. Create local roles only if a cluster role does not provide the set of permissions needed for a particular situation.

Some cluster role names are initially confusing. You can bind the cluster-admin to a user, using a local role binding, making it appear that this user has the privileges of a cluster administrator. This is not the case. Binding the cluster-admin to a certain project is more like a super administrator for that project, granting the permissions of the cluster role admin, plus a few additional permissions like the ability to edit rate limits. This can appear confusing especially via the web console UI, which does not list cluster role bindings that are bound to true cluster administrators. However, it does list local role bindings that you can use to locally bind cluster-admin.

10.7. Updating Policy Definitions

During a cluster upgrade, and on every restart of any master, the default cluster roles are automatically reconciled to restore any missing permissions.

If you customized default cluster roles and want to ensure a role reconciliation does not modify them:

  1. Protect each role from reconciliation:

    $ oc annotate clusterrole.rbac <role_name> --overwrite rbac.authorization.kubernetes.io/autoupdate=false
    Warning

    You must manually update the roles that contain this setting to include any new or required permissions after upgrading.

  2. Generate a default bootstrap policy template file:

    $ oc adm create-bootstrap-policy-file --filename=policy.json
    Note

    The contents of the file vary based on the OpenShift Container Platform version, but the file contains only the default policies.

  3. Update the policy.json file to include any cluster role customizations.
  4. Use the policy file to automatically reconcile roles and role bindings that are not reconcile protected:

    $ oc auth reconcile -f policy.json
  5. Reconcile security context constraints:

    # oc adm policy reconcile-sccs \
        --additive-only=true \
        --confirm

Chapter 11. Image Policy

11.1. Overview

You can control which images can be imported, tagged, and run in a cluster. There are two facilities for this purpose.

Allowed Registries for import is an image policy configuration that allows you to restrict image origins to particular set of external registries. This set of rules is applied to any image being imported or tagged into any image stream. Therefore any image referencing registry not matched by the rule set will be rejected.

ImagePolicy admission plug-in lets you specify which images are allowed to be run on your cluster. This is currently considered beta. It allows you to control:

  • Image sources: which registries can be used to pull images
  • Image resolution: force pods to run with immutable digests to ensure the image does not change due to a re-tag
  • Container image label restrictions: limits or requires labels on an image
  • Image annotation restrictions: limits or requires the annotations on an image in the integrated container image registry

11.2. Configuring Registries Allowed for Import

You can configure registries allowed for import in master-config.yaml under imagePolicyConfig:allowedRegistriesForImport section as demonstrated in the following example. If the setting is not present, all images are allowed, which is the default.

Example 11.1. Example Configuration of Registries Allowed for Import

imagePolicyConfig:
  allowedRegistriesForImport:
  -
    domainName: registry.redhat.io 1
  -
    domainName: *.mydomain.com
    insecure: true 2
  -
    domainName: local.registry.corp:5000 3
1
Allow any image from the specified secure registry.
2
Allow any image from any insecure registry hosted on any sub-domain of mydomain.com. The mydomain.com is not whitelisted.
3
Allow any image from the given registry with port specified.

Each rule is composed of the following attributes:

  • domainName: is a hostname optionally terminated by :<port> suffix where special wildcard characters (?, *) are recognized. The former matches a sequence of characters of any length while the later matches exactly one character. The wildcard characters can be present both before and after : separator. The wildcards apply only to the part before or after the separator regardless of separator’s presence.
  • insecure: is a boolean used to decide which ports are matched if the :<port> part is missing from domainName. If true, the domainName will match registries with :80 suffix or unspecified port as long as the insecure flag is used during import. If false, registries with :443 suffix or unspecified port will be matched.

If a rule should match both secure and insecure ports of the same domain, the rule must be listed twice (once with insecure=true and once with insecure=false.

Unqualified images references are qualified to docker.io before any rule evaluation. To whitelist them, use domainName: docker.io.

domainName: * rule matches any registry hostname, but port is still restricted to 443. To match arbitrary registry serving on arbitrary port, use domainName: *:*.

Based on the rules established in Example Configuration of Registries Allowed for Import:

  • oc tag --insecure reg.mydomain.com/app:v1 app:v1 is whitelisted by the handling of the mydomain.com rule
  • oc import-image --from reg1.mydomain.com:80/foo foo:latest will be also whitelisted
  • oc tag local.registry.corp/bar bar:latest will be rejected because the port does not match 5000 in the third rule

Rejected image imports will generate error messages similar to the following text:

The ImageStream "bar" is invalid:
* spec.tags[latest].from.name: Forbidden: registry "local.registry.corp" not allowed by whitelist: "local.registry.corp:5000", "*.mydomain.com:80", "registry.redhat.io:443"
* status.tags[latest].items[0].dockerImageReference: Forbidden: registry "local.registry.corp" not allowed by whitelist: "local.registry.corp:5000", "*.mydomain.com:80", "registry.redhat.io:443"

11.3. Configuring the ImagePolicy Admission Plug-in

To configure which images can run on your cluster, configure the ImagePolicy Admission plug-in in the master-config.yaml file. You can set one or more rules as required.

  • Reject images with a particular annotation:

    Use this rule to reject all images that have a specific annotation set on them. The following rejects all images using the images.openshift.io/deny-execution annotation:

    - name: execution-denied
      onResources:
      - resource: pods
      - resource: builds
      reject: true
      matchImageAnnotations:
      - key: images.openshift.io/deny-execution 1
        value: "true"
      skipOnResolutionFailure: true
    1
    If a particular image has been deemed harmful, administrators can set this annotation to flag those images.
  • Enable user to run images from Docker Hub:

    Use this rule to allow users to use images from Docker Hub:

    - name: allow-images-from-dockerhub
      onResources:
        - resource: pods
        - resource: builds
        matchRegistries:
        - docker.io

Following is an example configuration for setting multiple ImagePolicy addmission plugin rules in the master-config.yaml file:

Annotated Example File

admissionConfig:
  pluginConfig:
    openshift.io/ImagePolicy:
      configuration:
        kind: ImagePolicyConfig
        apiVersion: v1
        resolveImages: AttemptRewrite 1
        executionRules: 2
        - name: execution-denied
          # Reject all images that have the annotation images.openshift.io/deny-execution set to true.
          # This annotation may be set by infrastructure that wishes to flag particular images as dangerous
          onResources: 3
          - resource: pods
          - resource: builds
          reject: true 4
          matchImageAnnotations: 5
          - key: images.openshift.io/deny-execution
            value: "true"
          skipOnResolutionFailure: true 6
        - name: allow-images-from-internal-registry
          # allows images from the internal registry and tries to resolve them
          onResources:
          - resource: pods
          - resource: builds
          matchIntegratedRegistry: true
        - name: allow-images-from-dockerhub
          onResources:
          - resource: pods
          - resource: builds
          matchRegistries:
          - docker.io
        resolutionRules: 7
        - targetResource:
            resource: pods
          localNames: true
          policy: AttemptRewrite
        - targetResource: 8
            group: batch
            resource: jobs
          localNames: true 9
          policy: AttemptRewrite

1
Try to resolve images to an immutable image digest and update the image pull specification in the pod.
2
Array of rules to evaluate against incoming resources. If you only have reject: true rules, the default is allow all. If you have any accept rule, that is reject: false in any of the rules, the default behaviour of the ImagePolicy switches to deny-all.
3
Indicates which resources to enforce rules upon. If nothing is specified, the default is pods.
4
Indicates that if this rule matches, the pod should be rejected.
5
List of annotations to match on the image object’s metadata.
6
If you are not able to resolve the image, do not fail the pod.
7
Array of rules allowing use of image streams in Kubernetes resources. The default configuration allows pods, replicationcontrollers, replicasets, statefulsets, daemonsets, deployments, and jobs to use same-project image stream tag references in their image fields.
8
Identifies the group and resource to which this rule applies. If resource is *, this rule will apply to all resources in that group.
9
LocalNames will allow single segment names (for example, ruby:2.5) to be interpreted as namespace-local image stream tags, but only if the resource or target image stream has local name resolution enabled.
Note

If you normally rely on infrastructure images being pulled using a default registry prefix (such as docker.io or registry.redhat.io), those images will not match to any matchRegistries value since they will have no registry prefix. To ensure infrastructure images have a registry prefix that can match your image policy, set the imageConfig.format value in your master-config.yaml file.

11.4. Using an Admission Controller to Always Pull Images

After an image is pulled to a node, any Pod on that node from any user can use the image without an authorization check against the image. To ensure that Pods do not use images for which they do not have credentials, use the AlwaysPullImages admission controller.

This admission controller modifies every new Pod to force the image pull policy to Always, ensuring that private images can only be used by those who have the credentials to pull them, even if the Pod specification uses an image pull policy of Never.

To enable the AlwaysPullImages admission controller:

  1. Add the following to the master-config.yaml:

    admissionConfig:
      pluginConfig:
        AlwaysPullImages: 1
          configuration:
            kind: DefaultAdmissionConfig
            apiVersion: v1
            disable: false 2
    1
    Admission plug-in name.
    2
    Specify false to indicate that the plug-in should be enabled.
  2. Restart master services running in control plane static Pods using the master-restart command:

    $ master-restart api
    $ master-restart controllers

11.5. Testing the ImagePolicy Admission Plug-in

  1. Use the openshift/image-policy-check to test your configuration.

    For example, use the information above, then test like this:

    $ oc import-image openshift/image-policy-check:latest --confirm
  2. Create a pod using this YAML. The pod should be created.

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: test-pod
    spec:
      containers:
      - image: docker.io/openshift/image-policy-check:latest
        name: first
  3. Create another pod pointing to a different registry. The pod should be rejected.

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: test-pod
    spec:
      containers:
      - image: different-registry/openshift/image-policy-check:latest
        name: first
  4. Create a pod pointing to the internal registry using the imported image. The pod should be created and if you look at the image specification, you should see a digest in place of the tag.

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: test-pod
    spec:
      containers:
      - image: <internal registry IP>:5000/<namespace>/image-policy-check:latest
        name: first
  5. Create a pod pointing to the internal registry using the imported image. The pod should be created and if you look at the image specification, you should see the tag unmodified.

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: test-pod
    spec:
      containers:
      - image: <internal registry IP>:5000/<namespace>/image-policy-check:v1
        name: first
  6. Get the digest from oc get istag/image-policy-check:latest and use it for oc annotate images/<digest> images.openshift.io/deny-execution=true. For example:

    $ oc annotate images/sha256:09ce3d8b5b63595ffca6636c7daefb1a615a7c0e3f8ea68e5db044a9340d6ba8 images.openshift.io/deny-execution=true
  7. Create this pod again, and you should see the pod rejected:

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: test-pod
    spec:
      containers:
      - image: <internal registry IP>:5000/<namespace>/image-policy-check:latest
        name: first

Chapter 12. Image Signatures

12.1. Overview

Container image signing on Red Hat Enterprise Linux (RHEL) systems provides a means of:

  • Validating where a container image came from,
  • Checking that the image has not been tampered with, and
  • Setting policies to determine which validated images can be pulled to a host.

For a more complete understanding of the architecture of container image signing on RHEL systems, see the Container Image Signing Integration Guide.

The OpenShift Container Registry allows the ability to store signatures via REST API. The oc CLI can be used to verify image signatures, with their validated displayed in the web console or CLI.

12.2. Signing Images Using Atomic CLI

OpenShift Container Platform does not automate image signing. Signing requires a developer’s private GPG key, typically stored securely on a workstation. This document describes that workflow.

The atomic command line interface (CLI), version 1.12.5 or greater, provides commands for signing container images, which can be pushed to an OpenShift Container Registry. The atomic CLI is available on Red Hat-based distributions: RHEL, Centos, and Fedora. The atomic CLI is pre-installed on RHEL Atomic Host systems. For information on installing the atomic package on a RHEL host, see Enabling Image Signature Support.

Important

The atomic CLI uses the authenticated credentials from oc login. Be sure to use the same user on the same host for both atomic and oc commands. For example, if you execute atomic CLI as sudo, be sure to log in to OpenShift Container Platform using sudo oc login.

In order to attach the signature to the image, the user must have the image-signer cluster role. Cluster administrators can add this using:

$ oc adm policy add-cluster-role-to-user system:image-signer <user_name>

Images may be signed at push time:

$ atomic push [--sign-by <gpg_key_id>] --type atomic <image>

Signatures are stored in OpenShift Container Platform when the atomic transport type argument is specified. See Signature Transports for more information.

For full details on how to set up and perform image signing using the atomic CLI, see the RHEL Atomic Host Managing Containers: Signing Container Images documentation or the atomic push --help output for argument details.

A specific example workflow of working with the atomic CLI and an OpenShift Container Registry is documented in the Container Image Signing Integration Guide.

12.3. Verifying Image Signatures Using OpenShift CLI

You can verify the signatures of an image imported to an OpenShift Container Registry using the oc adm verify-image-signature command. This command verifies if the image identity contained in the image signature can be trusted by using the public GPG key to verify the signature itself then match the provided expected identity with the identity (the pull spec) of the given image.

By default, this command uses the public GPG keyring located in $GNUPGHOME/pubring.gpg, typically in path ~/.gnupg. By default, this command does not save the result of the verification back to the image object. To do so, you must specify the --save flag, as shown below.

Note

In order to verify the signature of an image, the user must have the image-auditor cluster role. Cluster administrators can add this using:

$ oc adm policy add-cluster-role-to-user system:image-auditor <user_name>
Important

Using the --save flag on already verified image together with invalid GPG key or invalid expected identity causes the saved verification status and all signatures to be removed, and the image will become unverified.

In order to avoid deleting all signatures by mistake, you can run the command without the --save flag first and check the logs for potential issues.

To verify an image signature use the following format:

$ oc adm verify-image-signature <image> --expected-identity=<pull_spec> [--save] [options]

The <pull_spec> can be found by describing the image stream. The <image> may be found by describing the image stream tag. See the following example command output.

Example Image Signature Verification

$ oc describe is nodejs -n openshift
Name:             nodejs
Namespace:        openshift
Created:          2 weeks ago
Labels:           <none>
Annotations:      openshift.io/display-name=Node.js
                  openshift.io/image.dockerRepositoryCheck=2017-07-05T18:24:01Z
Docker Pull Spec: 172.30.1.1:5000/openshift/nodejs
...

$ oc describe istag nodejs:latest -n openshift
Image Name:	sha256:2bba968aedb7dd2aafe5fa8c7453f5ac36a0b9639f1bf5b03f95de325238b288
...

$ oc adm verify-image-signature \
    sha256:2bba968aedb7dd2aafe5fa8c7453f5ac36a0b9639f1bf5b03f95de325238b288 \
    --expected-identity 172.30.1.1:5000/openshift/nodejs:latest \
    --public-key /etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release \
    --save

Note

If the oc adm verify-image-signature command returns an x509: certificate signed by unknown authority error, you might need to add the registry’s certificate authority (CA) to the list of CAs trusted on the system. You can do this by performing the following steps:

  1. Transfer the CA certificate from the cluster to the client machine.

    For example, to add the CA for docker-registry.default.svc, transfer the file located at /etc/docker/certs.d/docker-registry.default.svc\:5000/node-client-ca.crt.

  2. Copy the CA certificate to the /etc/pki/ca-trust/source/anchors/ directory. For example:

    # cp </path_to_file>/node-client-ca.crt \
        /etc/pki/ca-trust/source/anchors/
  3. Execute update-ca-trust to update the list of trusted CAs:

    # update-ca-trust

12.4. Accessing Image Signatures Using Registry API

The OpenShift Container Registry provides an extensions endpoint that allows you to write and read image signatures. The image signatures are stored in the OpenShift Container Platform key-value store via the container image registry API.

Note

This endpoint is experimental and not supported by the upstream container image registry project. See the upstream API documentation for general information about the container image registry API.

12.4.1. Writing Image Signatures via API

In order to add a new signature to the image, you can use the HTTP PUT method to send a JSON payload to the extensions endpoint:

PUT /extensions/v2/<namespace>/<name>/signatures/<digest>
$ curl -X PUT --data @signature.json http://<user>:<token>@<registry_endpoint>:5000/extensions/v2/<namespace>/<name>/signatures/sha256:<digest>

The JSON payload with the signature content should have the following structure:

{
  "version": 2,
  "type":    "atomic",
  "name":    "sha256:4028782c08eae4a8c9a28bf661c0a8d1c2fc8e19dbaae2b018b21011197e1484@cddeb7006d914716e2728000746a0b23",
  "content": "<cryptographic_signature>"
}

The name field contains the name of the image signature, which must be unique and in the format <digest>@<name>. The <digest> represents an image name and the <name> is the name of the signature. The signature name must be 32 characters long. The <cryptographic_signature> must follow the specification documented in the containers/image library.

12.4.2. Reading Image Signatures via API

Assuming a signed image has already been pushed into the OpenShift Container Registry, you can read the signatures using the following command:

GET /extensions/v2/<namespace>/<name>/signatures/<digest>
$ curl http://<user>:<token>@<registry_endpoint>:5000/extensions/v2/<namespace>/<name>/signatures/sha256:<digest>

The <namespace> represents the OpenShift Container Platform project name or registry repository name and the <name> refers to the name of the image repository. The digest represents the SHA-256 checksum of the image.

If the given image contains the signature data, the output of the command above should produce following JSON response:

{
  "signatures": [
  {
    "version": 2,
    "type":    "atomic",
    "name":    "sha256:4028782c08eae4a8c9a28bf661c0a8d1c2fc8e19dbaae2b018b21011197e1484@cddeb7006d914716e2728000746a0b23",
    "content": "<cryptographic_signature>"
  }
  ]
}

The name field contains the name of the image signature, which must be unique and in the format <digest>@<name>. The <digest> represents an image name and the <name> is the name of the signature. The signature name must be 32 characters long. The <cryptographic_signature> must follow the specification documented in the containers/image library.

12.4.3. Importing Image Signatures Automatically from Signature Stores

OpenShift Container Platform can automatically import image signatures if a signature store is configured on all OpenShift Container Platform master nodes through the registries configuration directory.

The registries configuration directory contains the configuration for various registries (servers storing remote container images) and for the content stored in them. The single directory ensures that the configuration does not have to be provided in command-line options for each command, so that it can be shared by all the users of the containers/image.

The default registries configuration directory is located in the /etc/containers/registries.d/default.yaml file.

A sample configuration that will cause image signatures to be imported automatically for all Red Hat images:

docker:
  registry.redhat.io:
    sigstore: https://registry.redhat.io/containers/sigstore 1
1
Defines the URL of a signature store. This URL is used for reading existing signatures.
Note

Signatures imported automatically by OpenShift Container Platform will be unverified by default and will have to be verified by image administrators.

For more details about the registries configuration directory, see Registries Configuration Directory.

Chapter 13. Scoped Tokens

13.1. Overview

A user may want to give another entity the power to act as they have, but only in a limited way. For example, a project administrator may want to delegate the power to create pods. One way to do this is to create a scoped token.

A scoped token is a token that identifies as a given user, but is limited to certain actions by its scope. Right now, only a cluster-admin can create scoped tokens.

13.2. Evaluation

Scopes are evaluated by converting the set of scopes for a token into a set of PolicyRules. Then, the request is matched against those rules. The request attributes must match at least one of the scope rules to be passed to the "normal" authorizer for further authorization checks.

13.3. User Scopes

User scopes are focused on getting information about a given user. They are intent-based, so the rules are automatically created for you:

  • user:full - Allows full read/write access to the API with all of the user’s permissions.
  • user:info - Allows read-only access to information about the user: name, groups, and so on.
  • user:check-access - Allows access to self-localsubjectaccessreviews and self-subjectaccessreviews. These are the variables where you pass an empty user and groups in your request object.
  • user:list-projects - Allows read-only access to list the projects the user has access to.

13.4. Role Scope

The role scope allows you to have the same level of access as a given role filtered by namespace.

  • role:<cluster-role name>:<namespace or * for all> - Limits the scope to the rules specified by the cluster-role, but only in the specified namespace .

    Note

    Caveat: This prevents escalating access. Even if the role allows access to resources like secrets, rolebindings, and roles, this scope will deny access to those resources. This helps prevent unexpected escalations. Many people do not think of a role like edit as being an escalating role, but with access to a secret it is.

  • role:<cluster-role name>:<namespace or * for all>:! - This is similar to the example above, except that including the bang causes this scope to allow escalating access.

Chapter 14. Monitoring Images

14.1. Overview

You can monitor images and nodes in your instance using the CLI.

14.2. Viewing Images Statistics

You can display usage statistics about all of the images that OpenShift Container Platform manages. In other words, all the images pushed to the internal registry either directly or through a build.

To view the usage statistics:

$ oc adm top images
NAME                 IMAGESTREAMTAG            PARENTS                   USAGE                         METADATA    STORAGE
sha256:80c985739a78b openshift/python (3.5)                                                            yes         303.12MiB
sha256:64461b5111fc7 openshift/ruby (2.2)                                                              yes         234.33MiB
sha256:0e19a0290ddc1 test/ruby-ex (latest)     sha256:64461b5111fc71ec   Deployment: ruby-ex-1/test    yes         150.65MiB
sha256:a968c61adad58 test/django-ex (latest)   sha256:80c985739a78b760   Deployment: django-ex-1/test  yes         186.07MiB

The command displays the following information:

  • Image ID
  • Project, name, and tag of the accompanying ImageStreamTag
  • Potential parents of the image, listed by their IDs
  • Information about where the image is used
  • Flag informing whether the image contains proper Docker metadata information
  • Size of the image

14.3. Viewing ImageStreams Statistics

You can display usage statistics about ImageStreams.

To view the usage statistics:

$ oc adm top imagestreams
NAME                STORAGE     IMAGES  LAYERS
openshift/python    1.21GiB     4       36
openshift/ruby      717.76MiB   3       27
test/ruby-ex        150.65MiB   1       10
test/django-ex      186.07MiB   1       10

The command displays the following information:

  • Project and name of the ImageStream
  • Size of the entire ImageStream stored in the internal Red Hat Container Registry
  • Number of images this particular ImageStream is pointing to
  • Number of layers ImageStream consists of

14.4. Pruning Images

The information returned from the previous commands is helpful when performing image pruning.

Chapter 15. Managing Security Context Constraints

15.1. Overview

Security context constraints allow administrators to control permissions for pods. To learn more about this API type, see the security context constraints (SCCs) architecture documentation. You can manage SCCs in your instance as normal API objects using the CLI.

Note

You must have cluster-admin privileges to manage SCCs.

Important

Do not modify the default SCCs. Customizing the default SCCs can lead to issues when upgrading. Instead, create new SCCs.

15.2. Listing Security Context Constraints

To get a current list of SCCs:

$ oc get scc

NAME               PRIV      CAPS      SELINUX     RUNASUSER          FSGROUP     SUPGROUP    PRIORITY   READONLYROOTFS   VOLUMES
anyuid             false     []        MustRunAs   RunAsAny           RunAsAny    RunAsAny    10         false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
hostaccess         false     []        MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath persistentVolumeClaim secret]
hostmount-anyuid   false     []        MustRunAs   RunAsAny           RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath nfs persistentVolumeClaim secret]
hostnetwork        false     []        MustRunAs   MustRunAsRange     MustRunAs   MustRunAs   <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
nonroot            false     []        MustRunAs   MustRunAsNonRoot   RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
privileged         true      [*]       RunAsAny    RunAsAny           RunAsAny    RunAsAny    <none>     false            [*]
restricted         false     []        MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]

15.3. Examining a Security Context Constraints Object

You can view information about a particular SCC, including which users, service accounts, and groups the SCC is applied to.

For example, to examine the restricted SCC:

$ oc describe scc restricted
Name:					restricted
Priority:				<none>
Access:
  Users:				<none> 1
  Groups:				system:authenticated 2
Settings:
  Allow Privileged:			false
  Default Add Capabilities:		<none>
  Required Drop Capabilities:		KILL,MKNOD,SYS_CHROOT,SETUID,SETGID
  Allowed Capabilities:			<none>
  Allowed Seccomp Profiles:		<none>
  Allowed Volume Types:			configMap,downwardAPI,emptyDir,persistentVolumeClaim,projected,secret
  Allow Host Network:			false
  Allow Host Ports:			false
  Allow Host PID:			false
  Allow Host IPC:			false
  Read Only Root Filesystem:		false
  Run As User Strategy: MustRunAsRange
    UID:				<none>
    UID Range Min:			<none>
    UID Range Max:			<none>
  SELinux Context Strategy: MustRunAs
    User:				<none>
    Role:				<none>
    Type:				<none>
    Level:				<none>
  FSGroup Strategy: MustRunAs
    Ranges:				<none>
  Supplemental Groups Strategy: RunAsAny
    Ranges:				<none>
1
Lists which users and service accounts the SCC is applied to.
2
Lists which groups the SCC is applied to.
Note

In order to preserve customized SCCs during upgrades, do not edit settings on the default SCCs other than priority, users, groups, labels, and annotations.

15.4. Creating New Security Context Constraints

To create a new SCC:

  1. Define the SCC in a JSON or YAML file:

    Security Context Constraint Object Definition

    kind: SecurityContextConstraints
    apiVersion: v1
    metadata:
      name: scc-admin
    allowPrivilegedContainer: true
    runAsUser:
      type: RunAsAny
    seLinuxContext:
      type: RunAsAny
    fsGroup:
      type: RunAsAny
    supplementalGroups:
      type: RunAsAny
    users:
    - my-admin-user
    groups:
    - my-admin-group

    Optionally, you can add drop capabilities to an SCC by setting the requiredDropCapabilities field with the desired values. Any specified capabilities will be dropped from the container. For example, to create an SCC with the KILL, MKNOD, and SYS_CHROOT required drop capabilities, add the following to the SCC object:

    requiredDropCapabilities:
    - KILL
    - MKNOD
    - SYS_CHROOT

    You can see the list of possible values in the Docker documentation.

    Tip

    Because capabilities are passed to the Docker, you can use a special ALL value to drop all possible capabilities.

  2. Then, run oc create passing the file to create it:

    $ oc create -f scc_admin.yaml
    securitycontextconstraints "scc-admin" created
  3. Verify that the SCC was created:

    $ oc get scc scc-admin
    NAME        PRIV      CAPS      SELINUX    RUNASUSER   FSGROUP    SUPGROUP   PRIORITY   READONLYROOTFS   VOLUMES
    scc-admin   true      []        RunAsAny   RunAsAny    RunAsAny   RunAsAny   <none>     false            [awsElasticBlockStore azureDisk azureFile cephFS cinder configMap downwardAPI emptyDir fc flexVolume flocker gcePersistentDisk glusterfs iscsi nfs persistentVolumeClaim photonPersistentDisk quobyte rbd secret vsphere]

15.5. Deleting Security Context Constraints

To delete an SCC:

$ oc delete scc <scc_name>
Note

If you delete a default SCC, it will be regenerated upon restart.

15.6. Updating Security Context Constraints

To update an existing SCC:

$ oc edit scc <scc_name>
Note

In order to preserve customized SCCs during upgrades, do not edit settings on the default SCCs other than priority, users, and groups.

15.6.1. Example Security Context Constraints Settings

Without Explicit runAsUser Setting

apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext: 1
  containers:
  - name: sec-ctx-demo
    image: gcr.io/google-samples/node-hello:1.0

1
When a container or pod does not request a user ID under which it should be run, the effective UID depends on the SCC that emits this pod. Because restricted SCC is granted to all authenticated users by default, it will be available to all users and service accounts and used in most cases. The restricted SCC uses MustRunAsRange strategy for constraining and defaulting the possible values of the securityContext.runAsUser field. The admission plug-in will look for the openshift.io/sa.scc.uid-range annotation on the current project to populate range fields, as it does not provide this range. In the end, a container will have runAsUser equal to the first value of the range that is hard to predict because every project has different ranges. See Understanding Pre-allocated Values and Security Context Constraints for more information.

With Explicit runAsUser Setting

apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000 1
  containers:
    - name: sec-ctx-demo
      image: gcr.io/google-samples/node-hello:1.0

1
A container or pod that requests a specific user ID will be accepted by OpenShift Container Platform only when a service account or a user is granted access to a SCC that allows such a user ID. The SCC can allow arbitrary IDs, an ID that falls into a range, or the exact user ID specific to the request.

This works with SELinux, fsGroup, and Supplemental Groups. See Volume Security for more information.

15.7. Updating the Default Security Context Constraints

Default SCCs will be created when the master is started if they are missing. To reset SCCs to defaults, or update existing SCCs to new default definitions after an upgrade you may:

  1. Delete any SCC you would like to be reset and let it be recreated by restarting the master
  2. Use the oc adm policy reconcile-sccs command

The oc adm policy reconcile-sccs command will set all SCC policies to the default values but retain any additional users, groups, labels, and annotations as well as priorities you may have already set. To view which SCCs will be changed you may run the command with no options or by specifying your preferred output with the -o <format> option.

After reviewing it is recommended that you back up your existing SCCs and then use the --confirm option to persist the data.

Note

If you would like to reset priorities and grants, use the --additive-only=false option.

Note

If you have customized settings other than priority, users, groups, labels, or annotations in an SCC, you will lose those settings when you reconcile.

15.8. How Do I?

The following describe common scenarios and procedures using SCCs.

15.8.1. Grant Access to the Privileged SCC

In some cases, an administrator might want to allow users or groups outside the administrator group access to create more privileged pods. To do so, you can:

  1. Determine the user or group you would like to have access to the SCC.

    Warning

    Granting access to a user only works when the user directly creates a pod. For pods created on behalf of a user, in most cases by the system itself, access should be given to a service account under which related controller is operated upon. Examples of resources that create pods on behalf of a user are Deployments, StatefulSets, DaemonSets, etc.

  2. Run:

    $ oc adm policy add-scc-to-user <scc_name> <user_name>
    $ oc adm policy add-scc-to-group <scc_name> <group_name>

    For example, to allow the e2e-user access to the privileged SCC, run:

    $ oc adm policy add-scc-to-user privileged e2e-user
  3. Modify SecurityContext of a container to request a privileged mode.

15.8.2. Grant a Service Account Access to the Privileged SCC

First, create a service account. For example, to create service account mysvcacct in project myproject:

$ oc create serviceaccount mysvcacct -n myproject

Then, add the service account to the privileged SCC.

$ oc adm policy add-scc-to-user privileged system:serviceaccount:myproject:mysvcacct

Then, ensure that the resource is being created on behalf of the service account. To do so, set the spec.serviceAccountName field to a service account name. Leaving the service account name blank will result in the default service account being used.

Then, ensure that at least one of the pod’s containers is requesting a privileged mode in the security context.

15.8.3. Enable Images to Run with USER in the Dockerfile

To relax the security in your cluster so that images are not forced to run as a pre-allocated UID, without granting everyone access to the privileged SCC:

  1. Grant all authenticated users access to the anyuid SCC:

    $ oc adm policy add-scc-to-group anyuid system:authenticated
Warning

This allows images to run as the root UID if no USER is specified in the Dockerfile.

15.8.4. Enable Container Images that Require Root

Some container images (examples: postgres and redis) require root access and have certain expectations about how volumes are owned. For these images, add the service account to the anyuid SCC.

$ oc adm policy add-scc-to-user anyuid system:serviceaccount:myproject:mysvcacct

15.8.5. Use --mount-host on the Registry

It is recommended that persistent storage using PersistentVolume and PersistentVolumeClaim objects be used for registry deployments. If you are testing and would like to instead use the oc adm registry command with the --mount-host option, you must first create a new service account for the registry and add it to the privileged SCC. See the Administrator Guide for full instructions.

15.8.6. Provide Additional Capabilities

In some cases, an image may require capabilities that Docker does not provide out of the box. You can provide the ability to request additional capabilities in the pod specification which will be validated against an SCC.

Important

This allows images to run with elevated capabilities and should be used only if necessary. You should not edit the default restricted SCC to enable additional capabilities.

When used in conjunction with a non-root user, you must also ensure that the file that requires the additional capability is granted the capabilities using the setcap command. For example, in the Dockerfile of the image:

setcap cap_net_raw,cap_net_admin+p /usr/bin/ping

Further, if a capability is provided by default in Docker, you do not need to modify the pod specification to request it. For example, NET_RAW is provided by default and capabilities should already be set on ping, therefore no special steps should be required to run ping.

To provide additional capabilities:

  1. Create a new SCC
  2. Add the allowed capability using the allowedCapabilities field.
  3. When creating the pod, request the capability in the securityContext.capabilities.add field.

15.8.7. Modify Cluster Default Behavior

When you grant access to the anyuid SCC for everyone, your cluster:

  • Does not pre-allocate UIDs
  • Allows containers to run as any user
  • Prevents privileged containers
 $ oc adm policy add-scc-to-group anyuid system:authenticated

To modify your cluster so that it does not pre-allocate UIDs and does not allow containers to run as root, grant access to the nonroot SCC for everyone:

 $ oc adm policy add-scc-to-group nonroot system:authenticated
Warning

Be very careful with any modifications that have a cluster-wide impact. When you grant an SCC to all authenticated users, as in the previous example, or modify an SCC that applies to all users, such as the restricted SCC, it also affects Kubernetes and OpenShift Container Platform components, including the web console and integrated container image registry. Changes made with these SCCs can cause these components to stop functioning.

Instead, create a custom SCC and target it to only specific users or groups. This way potential issues are confined to the affected users or groups and do not impact critical cluster components.

15.8.8. Use the hostPath Volume Plug-in

To relax the security in your cluster so that pods are allowed to use the hostPath volume plug-in without granting everyone access to more privileged SCCs such as privileged, hostaccess, or hostmount-anyuid, perform the following actions:

  1. Create a new SCC named hostpath
  2. Set the allowHostDirVolumePlugin parameter to true for the new SCC:

    $ oc patch scc hostpath -p '{"allowHostDirVolumePlugin": true}'
  3. Grant access to this SCC to all users:

    $ oc adm policy add-scc-to-group hostpath system:authenticated

Now, all the pods that request hostPath volumes are admitted by the hostpath SCC.

15.8.9. Ensure That Admission Attempts to Use a Specific SCC First

You may control the sort ordering of SCCs in admission by setting the Priority field of the SCCs. See the SCC Prioritization section for more information on sorting.

15.8.10. Add an SCC to a User, Group, or Project

Before adding an SCC to a user or group, you can first use the scc-review option to check if the user or group can create a pod. See the Authorization topic for more information.

SCCs are not granted directly to a project. Instead, you add a service account to an SCC and either specify the service account name on your pod or, when unspecified, run as the default service account.

To add an SCC to a user:

$ oc adm policy add-scc-to-user <scc_name> <user_name>

To add an SCC to a service account:

$ oc adm policy add-scc-to-user <scc_name> \
    system:serviceaccount:<serviceaccount_namespace>:<serviceaccount_name>

If you are currently in the project to which the service account belongs, you can use the -z flag and just specify the <serviceaccount_name>.

$ oc adm policy add-scc-to-user <scc_name> -z <serviceaccount_name>
Important

Usage of the -z flag as described above is highly recommended, as it helps prevent typos and ensures that access is granted only to the specified service account. If not in the project, use the -n option to indicate the project namespace it applies to.

To add an SCC to a group:

$ oc adm policy add-scc-to-group <scc_name> <group_name>

To add an SCC to all service accounts in a namespace:

$ oc adm policy add-scc-to-group <scc_name> \
    system:serviceaccounts:<serviceaccount_namespace>

Chapter 16. Scheduling

16.1. Overview

16.1.1. Overview

Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.

The scheduler code has a clean separation that watches new pods as they get created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API.

16.1.2. Default scheduling

OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customizable tools to determine the best fit for a pod.

For information on how the default scheduler determines pod placement and available customizable parameters, see Default Scheduling.

16.1.3. Advanced scheduling

In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required to (or has a preference to) run on a particular node or alongside a specific pod. Advanced scheduling also allows you to prevent pods from being placed on a node or with another pod.

For information about advanced scheduling, see Advanced Scheduling.

16.1.4. Custom scheduling

OpenShift Container Platform also allows you to use your own or third-party schedulers by editing the pod specification.

For more information, see Custom Schedulers.

16.2. Default Scheduling

16.2.1. Overview

The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.

16.2.2. Generic Scheduler

The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:

16.2.3. Filter the Nodes

The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.

16.2.3.1. Prioritize the Filtered List of Nodes

This is achieved by passing each node through a series of priority functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (default weight for most priorities is 1) and then combined by adding the scores for each node provided by all the priorities. This weight attribute can be used by administrators to give higher importance to some priorities.

16.2.3.2. Select the Best Fit Node

The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.

16.2.4. Scheduler Policy

The selection of the predicate and priorities defines the policy for the scheduler.

The scheduler configuration file is a JSON file that specifies the predicates and priorities the scheduler will consider.

In the absence of the scheduler policy file, the default configuration file, /etc/origin/master/scheduler.json, gets applied.

Important

The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the scheduler configuration file.

Default scheduler configuration file

{
    "apiVersion": "v1",
    "kind": "Policy",
    "predicates": [
        {
            "name": "NoVolumeZoneConflict"
        },
        {
            "name": "MaxEBSVolumeCount"
        },
        {
            "name": "MaxGCEPDVolumeCount"
        },
        {
            "name": "MaxAzureDiskVolumeCount"
        },
        {
            "name": "MatchInterPodAffinity"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "GeneralPredicates"
        },
        {
            "name": "PodToleratesNodeTaints"
        },
        {
            "argument": {
                "serviceAffinity": {
                    "labels": [
                        "region"
                    ]
                }
            },
            "name": "Region"

         }
    ],
    "priorities": [
        {
            "name": "SelectorSpreadPriority",
            "weight": 1
        },
        {
            "name": "InterPodAffinityPriority",
            "weight": 1
        },
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        },
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "NodePreferAvoidPodsPriority",
            "weight": 10000
        },
        {
            "name": "NodeAffinityPriority",
            "weight": 1
        },
        {
            "name": "TaintTolerationPriority",
            "weight": 1
        },
        {
            "argument": {
                "serviceAntiAffinity": {
                    "label": "zone"
                }
            },
            "name": "Zone",
            "weight": 2
        }
    ]
}

16.2.4.1. Modifying Scheduler Policy

The scheduler policy is defined in a file on the master, named /etc/origin/master/scheduler.json by default, unless overridden by the kubernetesMasterConfig.schedulerConfigFile field in the master configuration file.

Sample modified scheduler configuration file

kind: "Policy"
version: "v1"
"predicates": [
        {
            "name": "PodFitsResources"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "MatchNodeSelector"
        },
        {
            "name": "HostName"
        },
        {
            "argument": {
                "serviceAffinity": {
                    "labels": [
                        "region"
                    ]
                }
            },
            "name": "Region"
        }
    ],
    "priorities": [
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        },
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "ServiceSpreadingPriority",
            "weight": 1
        },
        {
            "argument": {
                "serviceAntiAffinity": {
                    "label": "zone"
                }
            },
            "name": "Zone",
            "weight": 2
        }
    ]

To modify the scheduler policy:

  1. Edit the scheduler configuration file to configure the desired default predicates and priorities. You can create a custom configuration, or use and modify one of the sample policy configurations.
  2. Add any configurable predicates and configurable priorities you require.
  3. Restart the OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers

16.2.5. Available Predicates

Predicates are rules that filter out unqualified nodes.

There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.

16.2.5.1. Static Predicates

These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.

16.2.5.1.1. Default Predicates

The default scheduler policy includes the following predicates:

NoVolumeZoneConflict checks that the volumes a pod requests are available in the zone.

{"name" : "NoVolumeZoneConflict"}

MaxEBSVolumeCount checks the maximum number of volumes that can be attached to an AWS instance.

{"name" : "MaxEBSVolumeCount"}

MaxGCEPDVolumeCount checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).

{"name" : "MaxGCEPDVolumeCount"}

MatchInterPodAffinity checks if the pod affinity/antiaffinity rules permit the pod.

{"name" : "MatchInterPodAffinity"}

NoDiskConflict checks if the volume requested by a pod is available.

{"name" : "NoDiskConflict"}

PodToleratesNodeTaints checks if a pod can tolerate the node taints.

{"name" : "PodToleratesNodeTaints"}
16.2.5.1.2. Other Static Predicates

OpenShift Container Platform also supports the following predicates:

CheckVolumeBinding evaluates if a pod can fit based on the volumes, it requests, for both bound and unbound PVCs. * For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node. * For PVCs that are unbound, the predicate searched for available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.

The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.

{"name" : "CheckVolumeBinding"}

The CheckVolumeBinding predicate must be enabled in non-default schedulers.

CheckNodeCondition checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.

{"name" : "CheckNodeCondition"}

PodToleratesNodeNoExecuteTaints checks if a pod tolerations can tolerate a node NoExecute taints.

{"name" : "PodToleratesNodeNoExecuteTaints"}

CheckNodeLabelPresence checks if all of the specified labels exist on a node, regardless of their value.

{"name" : "CheckNodeLabelPresence"}

checkServiceAffinity checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.

{"name" : "checkServiceAffinity"}

MaxAzureDiskVolumeCount checks the maximum number of Azure Disk Volumes.

{"name" : "MaxAzureDiskVolumeCount"}
16.2.5.2. General Predicates

The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods need to pass and essential predicates are the predicates that all pods need to pass.

The default scheduler policy includes the general predicates.

Non-critical general predicates

PodFitsResources determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.

{"name" : "PodFitsResources"}
Essential general predicates

PodFitsHostPorts determines if a node has free ports for the requested pod ports (absence of port conflicts).

{"name" : "PodFitsHostPorts"}

HostName determines fit based on the presence of the Host parameter and a string match with the name of the host.

{"name" : "HostName"}

MatchNodeSelector determines fit based on node selector (nodeSelector) queries defined in the pod.

{"name" : "MatchNodeSelector"}
16.2.5.3. Configurable Predicates

You can configure these predicates in the scheduler configuration, by default /etc/origin/master/scheduler.json, to add labels to affect how the predicate functions.

Since these are configurable, multiple predicates of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.

For information on using these priorities, see Modifying Scheduler Policy.

ServiceAffinity places pods on nodes based on the service running on that pod. Placing pods of the same service on the same or co-located nodes can lead to higher efficiency.

This predicate attempts to place pods with specific labels in its node selector on nodes that have the same label.

If the pod does not specify the labels in its node selector, then the first pod is placed on any node based on availability and all subsequent pods of the service are scheduled on nodes that have the same label values as that node.

"predicates":[
      {
         "name":"<name>", 1
         "argument":{
            "serviceAffinity":{
               "labels":[
                  "<label>" 2
               ]
            }
         }
      }
   ],
1
Specify a name for the predicate.
2
Specify a label to match.

For example:

        "name":"ZoneAffinity",
        "argument":{
            "serviceAffinity":{
                "labels":[
                    "rack"
                ]
            }
        }

For example. if the first pod of a service had a node selector rack was scheduled to a node with label region=rack, all the other subsequent pods belonging to the same service will be scheduled on nodes with the same region=rack label. For more information, see Controlling Pod Placement.

Multiple-level labels are also supported. Users can also specify all pods for a service to be scheduled on nodes within the same region and within the same zone (under the region).

The labelsPresence parameter checks whether a particular node has a specific label. The labels create node groups that the LabelPreference priority uses. Matching by label can be useful, for example, where nodes have their physical location or status defined by labels.

"predicates":[
      {
         "name":"<name>", 1
         "argument":{
            "labelsPresence":{
               "labels":[
                  "<label>" 2
                ],
                "presence": true 3
            }
         }
      }
   ],
1
Specify a name for the predicate.
2
Specify a label to match.
3
Specify whether the labels are required, either true or false.
  • For presence:false, if any of the requested labels are present in the node labels, the pod cannot be scheduled. If the labels are not present, the pod can be scheduled.
  • For presence:true, if all of the requested labels are present in the node labels, the pod can be scheduled. If all of the labels are not present, the pod is not scheduled.

For example:

        "name":"RackPreferred",
        "argument":{
            "labelsPresence":{
                "labels":[
                    "rack",
                    "region"
                ],
                "presence": true
            }
        }

16.2.6. Available Priorities

Priorities are rules that rank remaining nodes according to preferences.

A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.

16.2.6.1. Static Priorities

Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.

These are specified in the scheduler configuration, by default /etc/origin/master/scheduler.json.

16.2.6.1.1. Default Priorities

The default scheduler policy includes the following priorities. Each of the priority function has a weight of 1 except NodePreferAvoidPodsPriority, which has a weight of 10000.

SelectorSpreadPriority looks for services, replication controllers (RC), replication sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods. Then, it schedules the pod on a node with the smallest number of pods that match those selectors as the pod being scheduled.

{"name" : "SelectorSpreadPriority", "weight" : 1}

InterPodAffinityPriority computes a sum by iterating through the elements of weightedPodAffinityTerm and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.

{"name" : "InterPodAffinityPriority", "weight" : 1}

LeastRequestedPriority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.

{"name" : "LeastRequestedPriority", "weight" : 1}

BalancedResourceAllocation favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority.

{"name" : "BalancedResourceAllocation", "weight" : 1}

NodePreferAvoidPodsPriority ignores pods that are owned by a controller other than a replication controller.

{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}

NodeAffinityPriority prioritizes nodes according to node affinity scheduling preferences

{"name" : "NodeAffinityPriority", "weight" : 1}

TaintTolerationPriority prioritizes nodes that have a fewer number of intolerable taints on them for a pod. An intolerable taint is one which has key PreferNoSchedule.

{"name" : "TaintTolerationPriority", "weight" : 1}
16.2.6.1.2. Other Static Priorities

OpenShift Container Platform also supports the following priorities:

EqualPriority gives an equal weight of 1 to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.

{"name" : "EqualPriority", "weight" : 1}

MostRequestedPriority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.

{"name" : "MostRequestedPriority", "weight" : 1}

ImageLocalityPriority prioritizes nodes that already have requested pod container’s images.

{"name" : "ImageLocalityPriority", "weight" : 1}

ServiceSpreadingPriority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.

{"name" : "ServiceSpreadingPriority", "weight" : 1}
16.2.6.2. Configurable Priorities

You can configure these priorities in the scheduler configuration, by default /etc/origin/master/scheduler.json, to add labels to affect how the priorities.

The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.

For information on using these priorities, see Modifying Scheduler Policy.

ServiceAntiAffinity takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.

"priorities":[
    {
        "name":"<name>", 1
        "weight" : 1 2
        "argument":{
            "serviceAntiAffinity":{
                "label":[
                    "<label>" 3
                ]
            }
        }
    }
]
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.

For example:

        "name":"RackSpread", 1
        "weight" : 1 2
        "argument":{
            "serviceAntiAffinity":{
                "label": "rack" 3
            }
        }
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.
Note

In some situations using ServiceAntiAffinity based on custom labels does not spread pod as expected. See this Red Hat Solution.

*The labelPreference parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label.

"priorities":[
    {
        "name":"<name>", 1
        "weight" : 1, 2
        "argument":{
            "labelPreference":{
                "label": "<label>", 3
                "presence": true 4
            }
        }
    }
]
1
Specify a name for the priority.
2
Specify a weight. Enter a non-zero positive value.
3
Specify a label to match.
4
Specify whether the label is required, either true or false.

16.2.7. Use Cases

One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.

16.2.7.1. Infrastructure Topological Levels

Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes (e.g., region=r1, zone=z1, rack=s1).

These label names have no particular meaning and administrators are free to name their infrastructure levels anything (eg, city/building/room). Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as: regionszonesracks). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.

16.2.7.2. Affinity

Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.

If you need greater control over where the pods are scheduled, see Using Node Affinity and Using Pod Affinity and Anti-affinity. These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

16.2.7.3. Anti Affinity

Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.

If you need greater control over where the pods are scheduled, see Using Node Affinity and Using Pod Affinity and Anti-affinity. These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

16.2.8. Sample Policy Configurations

The configuration below specifies the default scheduler configuration, if it were to be specified via the scheduler policy file.

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RegionZoneAffinity" 1
    argument:
      serviceAffinity: 2
        labels: 3
          - "region"
          - "zone"
priorities:
...
  - name: "RackSpread" 4
    weight: 1
    argument:
      serviceAntiAffinity: 5
        label: "rack" 6
1
The name for the predicate.
2
3
The labels for the predicate.
4
The name for the priority.
5
6
The labels for the priority.

In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.

The following example defines three topological levels, region (affinity) → zone (affinity) → rack (anti-affinity):

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RegionZoneAffinity"
    argument:
      serviceAffinity:
        labels:
          - "region"
          - "zone"
priorities:
...
  - name: "RackSpread"
    weight: 1
    argument:
      serviceAntiAffinity:
        label: "rack"

The following example defines three topological levels, city (affinity) → building (anti-affinity) → room (anti-affinity):

kind: "Policy"
version: "v1"
predicates:
...
  - name: "CityAffinity"
    argument:
      serviceAffinity:
        labels:
          - "city"
priorities:
...
  - name: "BuildingSpread"
    weight: 1
    argument:
      serviceAntiAffinity:
        label: "building"
  - name: "RoomSpread"
    weight: 1
    argument:
      serviceAntiAffinity:
        label: "room"

The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RequireRegion"
    argument:
      labelsPresence:
        labels:
          - "region"
        presence: true
priorities:
...
  - name: "ZonePreferred"
    weight: 1
    argument:
      labelPreference:
        label: "zone"
        presence: true

The following example combines both static and configurable predicates and also priorities:

kind: "Policy"
version: "v1"
predicates:
...
  - name: "RegionAffinity"
    argument:
      serviceAffinity:
        labels:
          - "region"
  - name: "RequireRegion"
    argument:
      labelsPresence:
        labels:
          - "region"
        presence: true
  - name: "BuildingNodesAvoid"
    argument:
      labelsPresence:
        labels:
          - "building"
        presence: false
  - name: "PodFitsPorts"
  - name: "MatchNodeSelector"
priorities:
...
  - name: "ZoneSpread"
    weight: 2
    argument:
      serviceAntiAffinity:
        label: "zone"
  - name: "ZonePreferred"
    weight: 1
    argument:
      labelPreference:
        label: "zone"
        presence: true
  - name: "ServiceSpreadingPriority"
    weight: 1

16.3. Descheduling

16.3.1. Overview

Descheduling involves evicting pods based on specific policies so that the pods can be rescheduled onto more appropriate nodes.

Your cluster can benefit from descheduling and rescheduling already-running pods for various reasons:

  • Nodes are under- or over-utilized.
  • Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
  • Node failure requires pods to be moved.
  • New nodes are added to clusters.

The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.

It is important to note that there are a number of core components, such as DNS, that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster may stop working properly if the component is evicted. To prevent the descheduler from removing these pods, configure the pod as a critical pod by adding the scheduler.alpha.kubernetes.io/critical-pod annotation to the pod specification.

Note

The descheduler job is considered a critical pod, which prevents the descheduler pod from being evicted by the descheduler.

The descheduler job and descheduler pod are created in the kube-system project, which is created by default.

Important

The descheduler is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend to use them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.

The descheduler does not evict the following types of pods:

  • Critical pods (with the scheduler.alpha.kubernetes.io/critical-pod annotation).
  • Pods (static and mirror pods or pods in standalone mode) not associated with a Replica Set, Replication Controller, Deployment, or Job (because these pods are not recreated).
  • Pods associated with DaemonSets.
  • Pods with local storage.
  • Pods subject to Pod Disruption Budget (PDB) are not evicted if descheduling violates the PDB. The pods can be evicted using an eviction policy.
Note

Best efforts pods are evicted before Burstable and Guaranteed pods.

The following sections describe the process to configure and run the descheduler:

16.3.2. Creating a Cluster Role

To configure the necessary permissions for the descheduler to work in a pod:

  1. Create a cluster role with the following rules:

    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: descheduler-cluster-role
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "watch", "list"] 1
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "watch", "list", "delete"] 2
    - apiGroups: [""]
      resources: ["pods/eviction"] 3
      verbs: ["create"]
    1
    Configures the role to allow viewing nodes.
    2
    Configures the role to allow viewing and deleting pods.
    3
    Allows a node to evict pods bound to itself.
  2. Create the service account which will be used to run the job:

    # oc create sa <file-name>.yaml -n kube-system

    For example:

    # oc create sa descheduler-sa.yaml -n kube-system
  3. Bind the cluster role to the service account:

    # oc create clusterrolebinding descheduler-cluster-role-binding \
        --clusterrole=<cluster-role-name> \
        --serviceaccount=kube-system:<service-account-name>

    For example:

    # oc create clusterrolebinding descheduler-cluster-role-binding \
        --clusterrole=descheduler-cluster-role \
        --serviceaccount=kube-system:descheduler-sa

16.3.3. Creating Descheduler Policies

You can configure the descheduler to remove pods from nodes that violate rules defined by strategies in a YAML policy file. You then create a configuration map that includes a path to the policy file and a job specification using that configuration map to apply the specific descheduling strategy.

Sample descheduler policy file

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: false
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds:
           "cpu" : 50
           "memory": 50
           "pods": 50
         numberOfNodes: 3
  "RemovePodsViolatingInterPodAntiAffinity":
     enabled: true
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"

There are three default strategies that can be used with the descheduler:

You can configure and disable parameters associated with strategies as needed.

16.3.3.1. Removing Duplicate Pods

The RemoveDuplicates strategy ensures that there is only one pod associated with a Replica Set, Replication Controller, Deployment Configuration, or Job running on same node. If there are other pods associated with those objects, the duplicate pods are evicted. Removing duplicate pods results in better spreading of pods in a cluster.

For example, duplicate pods could happen if a node fails and the pods on the node are moved to another node, leading to more than one pod associated with an Replica Set or Replication Controller, running on same node. After the failed node is ready again, this strategy could be used to evict those duplicate pods.

There are no parameters associated with this strategy.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: false 1
1
Set this value to enabled: true to use this policy. Set to false to disable this policy.
16.3.3.2. Creating a Low Node Utilization Policy

The LowNodeUtilization strategy finds nodes that are underutilized and evicts pods from other nodes so that the evicted pods can be scheduled on these underutilized nodes.

The underutilization of nodes is determined by a configurable threshold, thresholds, for CPU, memory, or number of pods (based on percentage). If a node usage is below all these thresholds, the node is considered underutilized and the descheduler can evict pods from other nodes. Pods request resource requirements are considered when computing node resource utilization.

A high threshold value, targetThresholds is used to determine properly utilized nodes. Any node that is between the thresholds and targetThresholds is considered properly utilized and is not considered for eviction. The threshold, targetThresholds, can be configured for CPU, memory, and number of pods (based on percentage).

These thresholds could be tuned for your cluster requirements.

The numberOfNodes parameter can be configured to activate the strategy only when number of underutilized nodes is above the configured value. Set this parameter if it is acceptable for a few nodes to go underutilized. By default, numberOfNodes is set to zero.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds: 1
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds: 2
           "cpu" : 50
           "memory": 50
           "pods": 50
         numberOfNodes: 3 3
1
Set the low-end threshold. If the node is below all three values, the descheduler considers the node underutilized.
2
Set the high-end threshold. If the node is below these values and above the threshold values, the descheduler considers the node properly utilized.
3
Set the number of nodes that can be underutilized before the descheduler will evict pods from underutilized nodes.
16.3.3.3. Remove Pods Violating Inter-Pod Anti-Affinity

The RemovePodsViolatingInterPodAntiAffinity strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.

For example, Node1 has podA, podB, and podC. podB and podC have anti-affinity rules that prohibit them from running on the same node as podA. podA will be evicted from the node so that podB and podC can run on that node. This situation could happen if the anti-affinity rule was applied when podB and podC were running on the node.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity": 1
     enabled: true
1
Set this value to enabled: true to use this policy. Set to false to disable this policy.
16.3.3.4. Remove Pods Violating Node Affinity

The RemovePodsViolatingNodeAffinity strategy ensures that all pods violating node affinity are removed from nodes. This situation could occur if a node no longer satisfies a pod’s affinity rule. If another node is available that satisfies the affinity rule, the pod is evicted.

For example, podA is scheduled on nodeA, because the node satisfies the requiredDuringSchedulingIgnoredDuringExecution node affinity rule at the time of scheduling.If nodeA stops satisfying the rule, and if there is another node available that satisfies the node affinity rule, the strategy evicts podA from nodeA and moves it to the other node.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity": 1
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution" 2
1
Set this value to enabled: true to use this policy. Set to false to disable this policy.
2
Specify the requiredDuringSchedulingIgnoredDuringExecution node affinity type.

16.3.4. Create a Configuration Map for the Descheduler Policy

Create a configuration map for the descheduler policy file in the kube-system project, so that it can be referenced by the descheduler job.

# oc create configmap descheduler-policy-configmap \
     -n kube-system --from-file=<path-to-policy-dir/policy.yaml> 1
1
The path to the policy file you created.

16.3.5. Create the Job Specification

Create a job configuration for the descheduler.

apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod 1
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: "true" 2
    spec:
        containers:
        - name: descheduler
          image: registry.access.redhat.com/openshift3/ose-descheduler
          volumeMounts: 3
          - mountPath: /policy-dir
            name: policy-volume
          command:
          - "/bin/sh"
          - "-ec"
          - |
            /bin/descheduler --policy-config-file /policy-dir/policy.yaml 4
        restartPolicy: "Never"
        serviceAccountName: descheduler-sa 5
        volumes:
        - name: policy-volume
          configMap:
            name: descheduler-policy-configmap
1
Specify a name for the job.
2
Configures the pod so that it will not be descheduled.
3
The volume name and mount path in the container where the job should be mounted.
4
Path in the container where the policy file you created will be stored.
5
Specify the name of the service account you created.

The policy file is mounted as a volume from the configuration map.

16.3.6. Run the Descheduler

To run the descheduler as a job in a pod:

# oc create -f <file-name>.yaml

For example:

# oc create -f descheduler-job.yaml

16.4. Custom Scheduling

16.4.1. Overview

You can run multiple, custom schedulers alongside the default scheduler and configure which scheduler to use for each pods.

To schedule a given pod using a specific scheduler, specify the name of the scheduler in that pod specification.

Note

Information on how to create the scheduler binary is outside the scope of this document. For an example, see Configure Multiple Schedulers in the Kubernetes documentation.

16.4.2. Package the Scheduler

The general process for including a custom scheduler in your cluster involves creating an image and including that image in a deployment.

  1. Package your scheduler binary into a container image.
  2. Create a container image containing the scheduler binary.

    For example:

    FROM <source-image>
    ADD <path-to-binary> /usr/local/bin/kube-scheduler
  3. Save the file as Dockerfile, build the image, and push it to a registry.

    For example:

    docker build -t <dest_env_registry_ip>:<port>/<namespace>/<image name>:<tag>
    docker push <dest_env_registry_ip>:<port>/<namespace>/<image name>:<tag>
  4. In OpenShift Container Platform, create a deployment for the custom scheduler.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: custom-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: custom-scheduler
    subjects:
    - kind: ServiceAccount
      name: custom-scheduler
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: custom-scheduler
      namespace: kube-system
      labels:
        app: custom-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: custom-scheduler
      template:
        metadata:
          labels:
            app: custom-scheduler
        spec:
          serviceAccount: custom-scheduler
          containers:
            - name: custom-scheduler
              image: "<namespace>/<image name>:<tag>" 1
              imagePullPolicy: Always
    1
    Specify the container image you created for the custom scheduler.

16.4.3. Deploying Pods using a Custom Scheduler

After your custom scheduler is deployed in your cluster, you can configure pods to use that scheduler instead of the default scheduler.

  1. Create or edit a pod configuration and specify the name of the scheduler with the schedulerName parameter. The name must be unique.

    Sample pod specification with scheduler

    apiVersion: v1
    kind: Pod
    metadata:
      name: custom-scheduler-example
      labels:
        name: custom-scheduler-example
    spec:
      schedulerName: custom-scheduler 1
      containers:
      - name: pod-with-second-annotation-container
        image: docker.io/ocpqe/hello-pod

    1
    The name of the scheduler to use. When no scheduler name is supplied, the pod is automatically scheduled using the default scheduler.
  2. Run the following command to create the pod:

    $ oc create -f <file-name>.yaml

    For example:

    $ oc create -f custom-scheduler-example.yaml
  3. Run the following command to check that the pod was created:

    $ oc get pod <file-name>

    For example:

    $ oc get pod custom-scheduler-example
    
    NAME                       READY     STATUS    RESTARTS   AGE
    custom-scheduler-example   1/1       Running   0          4m
  4. Run the following command to check that the custom scheduler scheduled the pod:

    $ oc describe pod <pod-name>

    For example:

    $ oc describe pod custom-scheduler-example

    The name of the scheduler is listed, as shown in the following truncated output:

    ...
    
    Events:
      FirstSeen  LastSeen  Count  From                SubObjectPath  Type       Reason Message
      ---------  --------  -----  ----                -------------  --------   ------ -------
      1m         1m        1      custom-scheduler    Normal         Scheduled  Successfully assigned custom-scheduler to <$node1>
    
    ...

16.5. Controlling Pod Placement

16.5.1. Overview

As a cluster administrator, you can set a policy to prevent application developers with certain roles from targeting specific nodes when scheduling pods.

The Pod Node Constraints admission controller ensures that pods are deployed onto only specified node hosts using labels and prevents users without a specific role from using the nodeSelector field to schedule pods.

16.5.2. Constraining Pod Placement Using Node Name

Use the Pod Node Constraints admission controller to ensure a pod is deployed onto only a specified node host by assigning it a label and specifying this in the nodeName setting in a pod configuration.

  1. Ensure you have the desired labels (see Updating Labels on Nodes for details) and node selector set up in your environment.

    For example, make sure that your pod configuration features the nodeName value indicating the desired label:

    apiVersion: v1
    kind: Pod
    spec:
      nodeName: <value>
  2. Modify the master configuration file, /etc/origin/master/master-config.yaml, to add PodNodeConstraints to the admissionConfig section:

    ...
    admissionConfig:
      pluginConfig:
        PodNodeConstraints:
          configuration:
            apiversion: v1
            kind: PodNodeConstraintsConfig
    ...
  3. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers

16.5.3. Constraining Pod Placement Using a Node Selector

Using node selectors, you can ensure that pods are only placed onto nodes with specific labels. As a cluster administrator, you can use the Pod Node Constraints admission controller to set a policy that prevents users without the pods/binding permission from using node selectors to schedule pods.

The nodeSelectorLabelBlacklist field of a master configuration file gives you control over the labels that certain roles can specify in a pod configuration’s nodeSelector field. Users, service accounts, and groups that have the pods/binding permission role can specify any node selector. Those without the pods/binding permission are prohibited from setting a nodeSelector for any label that appears in nodeSelectorLabelBlacklist.

For example, an OpenShift Container Platform cluster might consist of five data centers spread across two regions. In the U.S., us-east, us-central, and us-west; and in the Asia-Pacific region (APAC), apac-east and apac-west. Each node in each geographical region is labeled accordingly. For example, region: us-east.

Note

See Updating Labels on Nodes for details on assigning labels.

As a cluster administrator, you can create an infrastructure where application developers should be deploying pods only onto the nodes closest to their geographical location. You can create a node selector, grouping the U.S. data centers into superregion: us and the APAC data centers into superregion: apac.

To maintain an even loading of resources per data center, you can add the desired region to the nodeSelectorLabelBlacklist section of a master configuration. Then, whenever a developer located in the U.S. creates a pod, it is deployed onto a node in one of the regions with the superregion: us label. If the developer tries to target a specific region for their pod (for example, region: us-east), they receive an error. If they try again, without the node selector on their pod, it can still be deployed onto the region they tried to target, because superregion: us is set as the project-level node selector, and nodes labeled region: us-east are also labeled superregion: us.

  1. Ensure you have the desired labels (see Updating Labels on Nodes for details) and node selector set up in your environment.

    For example, make sure that your pod configuration features the nodeSelector value indicating the desired label:

    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        <key>: <value>
    ...
  2. Modify the master configuration file, /etc/origin/master/master-config.yaml, to add nodeSelectorLabelBlacklist to the admissionConfig section with the labels that are assigned to the node hosts you want to deny pod placement:

    ...
    admissionConfig:
      pluginConfig:
        PodNodeConstraints:
          configuration:
            apiversion: v1
            kind: PodNodeConstraintsConfig
            nodeSelectorLabelBlacklist:
              - kubernetes.io/hostname
              - <label>
    ...
  3. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers

16.5.4. Control Pod Placement to Projects

The Pod Node Selector admission controller allows you to force pods onto nodes associated with a specific project and prevent pods from being scheduled in those nodes.

The Pod Node Selector admission controller determines where a pod can be placed using labels on projects and node selectors specified in pods. A new pod will be placed on a node associated with a project only if the node selectors in the pod match the labels in the project.

After the pod is created, the node selectors are merged into the pod so that the pod specification includes the labels originally included in the specification and any new labels from the node selectors. The example below illustrates the merging effect.

The Pod Node Selector admission controller also allows you to create a list of labels that are permitted in a specific project. This list acts as a whitelist that lets developers know what labels are acceptable to use in a project and gives administrators greater control over labeling in a cluster.

To activate the Pod Node Selector admission controller:

  1. Configure the Pod Node Selector admission controller and whitelist, using one of the following methods:

    • Add the following to the master configuration file, /etc/origin/master/master-config.yaml:

      admissionConfig:
        pluginConfig:
          PodNodeSelector:
            configuration:
              podNodeSelectorPluginConfig: 1
                clusterDefaultNodeSelector: "k3=v3" 2
                ns1: region=west,env=test,infra=fedora,os=fedora 3
      1
      Adds the Pod Node Selector admission controller plug-in.
      2
      Creates default labels for all nodes.
      3
      Creates a whitelist of permitted labels in the specified project. Here, the project is ns1 and the labels are the key=value pairs that follow.
    • Create a file containing the admission controller information:

      podNodeSelectorPluginConfig:
          clusterDefaultNodeSelector: "k3=v3"
           ns1: region=west,env=test,infra=fedora,os=fedora

      Then, reference the file in the master configuration:

      admissionConfig:
        pluginConfig:
          PodNodeSelector:
            location: <path-to-file>
      Note

      If a project does not have node selectors specified, the pods associated with that project will be merged using the default node selector (clusterDefaultNodeSelector).

  2. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers
  3. Create a project object that includes the scheduler.alpha.kubernetes.io/node-selector annotation and labels.

    apiVersion: v1
    kind: Namespace
    metadata
      name: ns1
      annotations:
        scheduler.alpha.kubernetes.io/node-selector: env=test,infra=fedora 1
    spec: {},
    status: {}
    1
    Annotation to create the labels to match the project label selector. Here, the key/value labels are env=test and infra=fedora.
    Note

    When using the Pod Node Selector admission controller, you cannot use oc adm new-project <project-name> for setting project node selector. When you set the project node selector using the oc adm new-project myproject --node-selector='type=user-node,region=<region> command, OpenShift Container Platform sets the openshift.io/node-selector annotation, which is processed by NodeEnv admission plugin.

  4. Create a pod specification that includes the labels in the node selector, for example:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        name: hello-pod
      name: hello-pod
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod:latest"
          imagePullPolicy: IfNotPresent
          name: hello-pod
          ports:
            - containerPort: 8080
              protocol: TCP
          resources: {}
          securityContext:
            capabilities: {}
            privileged: false
          terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      nodeSelector: 1
        env: test
        os: fedora
      serviceAccount: ""
    status: {}
    1
    Node selectors to match project labels.
  5. Create the pod in the project:

    # oc create -f pod.yaml --namespace=ns1
  6. Check that the node selector labels were added to the pod configuration:

    get pod pod1 --namespace=ns1 -o json
    
    nodeSelector": {
     "env": "test",
     "infra": "fedora",
     "os": "fedora"
    }

    The node selectors are merged into the pod and the pod should be scheduled in the appropriate project.

If you create a pod with a label that is not specified in the project specification, the pod is not scheduled on the node.

For example, here the label env: production is not in any project specification:

nodeSelector:
 "env: production"
 "infra": "fedora",
 "os": "fedora"

If there is a node that does not have a node selector annotation, the pod will be scheduled there.

16.6. Pod Priority and Preemption

16.6.1. Applying pod priority and preemption

You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.

To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.

Preemption is controlled by the disablePreemption parameter in the scheduler configuration file, which is set to false by default.

16.6.2. About pod priority

When the Pod Priority and Preemption feature is enabled, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.

16.6.2.1. Pod priority classes

You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.

A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.

  • System-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs, sdn, and so forth.
  • System-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.
Note

If you upgrade your existing cluster, the priority of your existing pods is effectively zero. However, existing pods with the scheduler.alpha.kubernetes.io/critical-pod annotation are automatically converted to system-cluster-critical class.

16.6.2.2. Pod priority names

After you have one or more priority classes, you can create pods that specify a priority class name in a pod specification. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.

The following YAML is an example of a pod configuration that uses the priority class created in the preceding example. The priority admission controller checks the specification and resolves the priority of the pod to 1000000.

16.6.3. About pod preemption

When a developer creates a pod, the pod goes into a queue. When the Pod Priority and Preemption feature is enabled, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.

When the scheduler preempts one or more pods on a node, the nominatedNodeName field of higher-priority pod specification is set to the name of the node, along with the nodename field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.

After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and nodeName field of the pod specification might be different.

Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher-priority pod than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName of the pending pod, making the pod eligible for another node.

Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.

The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.

16.6.3.1. Pod preemption and other scheduler settings

If you enable pod priority and preemption, consider your other scheduler settings:

Pod priority and pod disruption budget
A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod priority and pod affinity
Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.

If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

16.6.3.2. Graceful termination of preempted pods

When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.

To minimize this gap, configure a small graceful termination period for lower-priority pods.

16.6.4. Pod priority example scenarios

Pod priority and preemption assigns a priority to pods for scheduling. The scheduler will preempt (evict) lower-priority pods in order to schedule higher-priority pods.

Typical preemption scenario

Pod P is a pending pod.

  1. The scheduler locates Node N, where the removal of one or more pods would enable Pod P to be scheduled on that node.
  2. The scheduler deletes the lower-priority pods from the Node N and schedules Pod P on the node.
  3. The nominatedNodeName field of Pod P is set to the name of Node N.
Note

Pod P is not necessarily scheduled to the nominated node.

Preemption and termination periods

The preempted pod has a long termination period.

  1. The scheduler preempts a lower-priority pod on Node N.
  2. The scheduler waits for the pod to gracefully terminate.
  3. For other scheduling reasons, Node M becomes available.
  4. The scheduler can then schedule Pod P on Node M.

16.6.5. Configuring priority and preemption

You apply pod priority and preemption by creating a priority class objects and associating pods to the priority using the priorityClassName in your pod specifications.

Sample priority class object

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority 1
value: 1000000 2
globalDefault: false 3
description: "This priority class should be used for XYZ service pods only." 4

1
The name of the priority class object.
2
The priority value of the object.
3
Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault:true, the priority of pods with no priority class name is zero. Adding a priority class with globalDefault:true affects only pods created after the priority class is added and does not change the priorities of existing pods.
4
Optional arbitrary text string that describes which pods developers should use with this priority class.

Sample pod specification with priority class name

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority 1

1
Specify the priority class to use with this pod.

To configure your cluster to use priority and preemption:

  1. Create one or more priority classes:

    1. Specify a name and value for the priority.
    2. Optionally specify the globalDefault field in the priority class and a description.
  2. Create pods or edit existing pods to include the name of a priority class. You can add the priority name directly to the pod configuration or to a pod template:

16.6.6. Disabling priority and preemption

You can disable the pod priority and preemption feature.

After the feature is disabled, the existing pods keep their priority fields, but preemption is disabled, and priority fields are ignored. If the feature is disabled, you cannot set a priority class name in new pods.

Important

Critical pods rely on scheduler preemption to be scheduled when a cluster is under resource pressure. For this reason, Red Hat recommends not disabling preemption. DaemonSet pods are scheduled by the DaemonSet controller and not affected by disabling preemption.

To disable the preemption for the cluster:

  1. Modify the master-config.yaml to set the disablePreemption parameter in the schedulerArgs section to false.

    disablePreemption=false
  2. Restart the OpenShift Container Platform master service and scheduler to apply the changes.

    # master-restart api
    # master-restart scheduler

16.7. Advanced Scheduling

16.7.1. Overview

Advanced scheduling involves configuring a pod so that the pod is required to run on particular nodes or has a preference to run on particular nodes.

Generally, advanced scheduling is not necessary, as the OpenShift Container Platform automatically places pods in a reasonable manner. For example, the default scheduler attempts to distribute pods across the nodes evenly and considers the available resources in a node. However, you might want more control over where a pod is placed.

If a pod needs to be on a machine with a faster disk speed (or prevented from being placed on that machine) or pods from two different services need to be located so they can communicate, you can use advanced scheduling to make that happen.

To ensure that appropriate new pods are scheduled on a dedicated group of nodes and prevent other new pods from being scheduled on those nodes, you can combine these methods as needed.

16.7.2. Using Advanced Scheduling

There are several ways to invoke advanced scheduling in your cluster:

Pod Affinity and Anti-affinity

Pod affinity allows a pod to specify an affinity (or anti-affinity) towards a group of pods (for an application’s latency requirements, due to security, and so forth) it can be placed with. The node does not have control over the placement.

Pod affinity uses labels on nodes and label selectors on pods to create rules for pod placement. Rules can be mandatory (required) or best-effort (preferred).

See Using Pod Affinity and Anti-affinity.

Node Affinity

Node affinity allows a pod to specify an affinity (or anti-affinity) towards a group of nodes (due to their special hardware, location, requirements for high availability, and so forth) it can be placed on. The node does not have control over the placement.

Node affinity uses labels on nodes and label selectors on pods to create rules for pod placement. Rules can be mandatory (required) or best-effort (preferred).

See Using Node Affinity.

Node Selectors

Node selectors are the simplest form of advanced scheduling. Like node affinity, node selectors also use labels on nodes and label selectors on pods to allow a pod to control the nodes on which it can be placed. However, node selectors do not have required and preferred rules that node affinities have.

See Using Node Selectors.

Taints and Tolerations

Taints/Tolerations allow the node to control which pods should (or should not) be scheduled on them. Taints are labels on a node and tolerations are labels on a pod. The labels on the pod must match (or tolerate) the label (taint) on the node in order to be scheduled.

Taints/tolerations have one advantage over affinities. For example, if you add to a cluster a new group of nodes with different labels, you would need to update affinities on each of the pods you want to access the node and on any other pods you do not want to use the new nodes. With taints/tolerations, you would only need to update those pods that are required to land on those new nodes, because other pods would be repelled.

See Using Taints and Tolerations.

16.8. Advanced Scheduling and Node Affinity

16.8.1. Overview

Node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on nodes and label selectors specified in pods. Node affinity allows a pod to specify an affinity (or anti-affinity) towards a group of nodes it can be placed on. The node does not have control over the placement.

For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.

There are two types of node affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

If labels on a node change at runtime that results in an node affinity rule on a pod no longer being met, the pod continues to run on the node.

16.8.2. Configuring Node Affinity

You configure node affinity through the pod specification file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.

The following example is a pod specification with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:

Sample pod configuration file with a node affinity required rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: 1
      requiredDuringSchedulingIgnoredDuringExecution: 2
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth 3
            operator: In 4
            values:
            - e2e-az-North 5
            - e2e-az-South 6
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod

1
The stanza to configure node affinity.
2
Defines a required rule.
3 5 6
The key/value pair (label) that must be matched to apply the rule.
4
The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the pod specification. This value can be In, NotIn, Exists, or DoesNotExist, Lt, or Gt.

The following example is a node specification with a preferred rule that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:

Sample pod configuration file with a node affinity preferred rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 1 3
        preference:
          matchExpressions:
          - key: e2e-az-EastWest 4
            operator: In 5
            values:
            - e2e-az-East 6
            - e2e-az-West 7
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod

1
The stanza to configure node affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with highest weight is preferred.
4 6 7
The key/value pair (label) that must be matched to apply the rule.
5
The operator represents the relationship between the label on the node and the set of values in the matchExpression parameters in the pod specification. This value can be In, NotIn, Exists, or DoesNotExist, Lt, or Gt.

There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.

Note

If you are using node affinity and node selectors in the same pod configuration, note the following:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
16.8.2.1. Configuring a Required Node Affinity Rule

Required rules must be met before a pod can be scheduled on a node.

The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler is required to place on the node.

  1. Add a label to a node by editing the node configuration or by using the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az1
    Note

    To modify a node in your cluster, update the node configuration maps as needed. Do not manually edit the node-config.yaml file.

  2. In the pod specification, use the nodeAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node.
    2. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node:

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: e2e-az-name
                  operator: In
                  values:
                  - e2e-az1
                  - e2e-az2
  3. Create the pod:

    $ oc create -f e2e-az2.yaml
16.8.2.2. Configuring a Preferred Node Affinity Rule

Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler tries to place on the node.

  1. Add a label to a node by editing the node configuration or by executing the oc label node command:

    $ oc label node node1 e2e-az-name=e2e-az3
    Note

    To modify a node in your cluster, update the node configuration maps as needed. Do not manually edit the node-config.yaml file.

  2. In the pod specification, use the nodeAffinity stanza to configure the preferredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify a weight for the node, as a number 1-100. The node with highest weight is preferred.
    2. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node:

            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: e2e-az-name
                  operator: In
                  values:
                  - e2e-az3
  3. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node.
  4. Create the pod.

    $ oc create -f e2e-az3.yaml

16.8.3. Examples

The following examples demonstrate node affinity.

16.8.3.1. Node Affinity with Matching Labels

The following example demonstrates node affinity for a node and pod with matching labels:

  • The Node1 node has the label zone:us:

    $ oc label node node1 zone=us
  • The pod pod-s1 has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us
  • Create the pod using the standard command:

    $ oc create -f pod-s1.yaml
    pod "pod-s1" created
  • The pod pod-s1 can be scheduled on Node1:

    $ oc get pod -o wide
    NAME     READY     STATUS       RESTARTS   AGE      IP      NODE
    pod-s1   1/1       Running      0          4m       IP1     node1
16.8.3.2. Node Affinity with No Matching Labels

The following example demonstrates node affinity for a node and pod without matching labels:

  • The Node1 node has the label zone:emea:

    $ oc label node node1 zone=emea
  • The pod pod-s1 has the zone and us key/value pair under a required node affinity rule:

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
    spec:
      containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us
  • The pod pod-s1 cannot be scheduled on Node1:

    $ oc describe pod pod-s1
    
    ...
    Events:
     FirstSeen LastSeen Count From              SubObjectPath  Type                Reason
     --------- -------- ----- ----              -------------  --------            ------
     1m        33s      8     default-scheduler Warning        FailedScheduling    No nodes are available that match all of the following predicates:: MatchNodeSelector (1).

16.9. Advanced Scheduling and Pod Affinity and Anti-affinity

16.9.1. Overview

Pod affinity and pod anti-affinity allow you to specify rules about how pods should be placed relative to other pods. The rules are defined using custom labels on nodes and label selectors specified in pods. Pod affinity/anti-affinity allows a pod to specify an affinity (or anti-affinity) towards a group of pods it can be placed with. The node does not have control over the placement.

For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes or availability zones to reduce correlated failures.

Pod affinity/anti-affinity allows you to constrain which nodes your pod is eligible to be scheduled on based on the labels on other pods. A label is a key/value pair.

  • Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
  • Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.

There are two types of pod affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

16.9.2. Configuring Pod Affinity and Anti-affinity

You configure pod affinity/anti-affinity through the pod specification files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.

The following example shows a pod specification configured for pod affinity and anti-affinity.

In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security and value S1. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security and value S2.

Sample pod config file with pod affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity: 1
      requiredDuringSchedulingIgnoredDuringExecution: 2
      - labelSelector:
          matchExpressions:
          - key: security 3
            operator: In 4
            values:
            - S1 5
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1
Stanza to configure pod affinity.
2
Defines a required rule.
3 5
The key and value (label) that must be matched to apply the rule.
4
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

Sample pod config file with pod anti-affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: 1
      preferredDuringSchedulingIgnoredDuringExecution: 2
      - weight: 100  3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security 4
              operator: In 5
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1
Stanza to configure pod anti-affinity.
2
Defines a preferred rule.
3
Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4
Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5
The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
Note

If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.

16.9.2.1. Configuring an Affinity Rule

The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses affinity to allow scheduling with that pod.

  1. Create a pod with a specific label in the pod specification:

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-s1
      labels:
        security: S1
    spec:
      containers:
      - name: security-s1
        image: docker.io/ocpqe/hello-pod
  2. When creating other pods, edit the pod specification as follows:

    1. Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:
    2. Specify the key and value that must be met. If you want the new pod to be scheduled with the other pod, use the same key and value parameters as the label on the first pod.

          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: security
                  operator: In
                  values:
                  - S1
              topologyKey: failure-domain.beta.kubernetes.io/zone
    3. Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be in the node.
    4. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
  3. Create the pod.

    $ oc create -f <pod-spec>.yaml
16.9.2.2. Configuring an Anti-affinity Rule

The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with that pod.

  1. Create a pod with a specific label in the pod specification:

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-s2
      labels:
        security: S2
    spec:
      containers:
      - name: security-s2
        image: docker.io/ocpqe/hello-pod
  2. When creating other pods, edit the pod specification to set the following parameters:
  3. Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:

    1. Specify a weight for the node, 1-100. The node that with highest weight is preferred.
    2. Specify the key and values that must be met. If you want the new pod to not be scheduled with the other pod, use the same key and value parameters as the label on the first pod.

          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: security
                    operator: In
                    values:
                    - S2
                topologyKey: kubernetes.io/hostname
    3. For a preferred rule, specify a weight, 1-100.
    4. Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require the label to be in the node.
  4. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
  5. Create the pod.

    $ oc create -f <pod-spec>.yaml

16.9.3. Examples

The following examples demonstrate pod affinity and pod anti-affinity.

16.9.3.1. Pod Affinity

The following example demonstrates pod affinity for pods with matching labels and label selectors.

  • The pod team4 has the label team:4.

    $ cat team4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: team4
      labels:
         team: "4"
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod team4a has the label selector team:4 under podAffinity.

    $ cat pod-team4a.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: team4a
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: team
                operator: In
                values:
                - "4"
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-affinity
        image: docker.io/ocpqe/hello-pod
  • The team4a pod is scheduled on the same node as the team4 pod.
16.9.3.2. Pod Anti-affinity

The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.

  • The pod pod-s1 has the label security:s1.

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
      labels:
        security: s1
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 has the label selector security:s1 under podAntiAffinity.

    $ cat pod-s2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s2
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - s1
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-antiaffinity
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 cannot be scheduled on the same node as pod-s1.
16.9.3.3. Pod Affinity with no Matching Labels

The following example demonstrates pod affinity for pods without matching labels and label selectors.

  • The pod pod-s1 has the label security:s1.

    $ cat pod-s1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s1
      labels:
        security: s1
    spec:
      containers:
      - name: ocp
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 has the label selector security:s2.

    $ cat pod-s2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-s2
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - s2
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pod-affinity
        image: docker.io/ocpqe/hello-pod
  • The pod pod-s2 is not scheduled unless there is a node with a pod that has the security:s2 label. If there is no other pod with that label, the new pod remains in a pending state:

    NAME      READY     STATUS    RESTARTS   AGE       IP        NODE
    pod-s2    0/1       Pending   0          32s       <none>

16.10. Advanced Scheduling and Node Selectors

16.10.1. Overview

A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods.

For the pod to be eligible to run on a node, the pod must have the indicated key-value pairs as the label on the node.

If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.

16.10.2. Configuring Node Selectors

Using nodeSelector in a pod configuration, you can ensure that pods are only placed onto nodes with specific labels.

  1. Ensure you have the desired labels (see Updating Labels on Nodes for details) and node selector set up in your environment.

    For example, make sure that your pod configuration features the nodeSelector value indicating the desired label:

    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        <key>: <value>
    ...
  2. Modify the master configuration file, /etc/origin/master/master-config.yaml, to add nodeSelectorLabelBlacklist to the admissionConfig section with the labels that are assigned to the node hosts you want to deny pod placement:

    ...
    admissionConfig:
      pluginConfig:
        PodNodeConstraints:
          configuration:
            apiversion: v1
            kind: PodNodeConstraintsConfig
            nodeSelectorLabelBlacklist:
              - kubernetes.io/hostname
              - <label>
    ...
  3. Restart OpenShift Container Platform for the changes to take effect.

    # master-restart api
    # master-restart controllers
Note

If you are using node selectors and node affinity in the same pod configuration, note the following:

  • If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
  • If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
  • If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.

16.11. Advanced Scheduling and Taints and Tolerations

16.11.1. Overview

Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.

16.11.2. Taints and Tolerations

A taint allows a node to refuse pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

Table 16.1. Taint and toleration components
ParameterDescription

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule

  • New pods that do not match the taint are not scheduled onto that node.
  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.
  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;
    • the value parameters are the same;
    • the effect parameters are the same.
  • If the operator parameter is set to Exists:

    • the key parameters are the same;
    • the effect parameters are the same.
16.11.2.1. Using Multiple Taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

  1. Process the taints for which the pod has a matching toleration.
  2. The remaining unmatched taints have the indicated effects on the pod:

    • If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.
    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
    • If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node (if it is already running on the node), or the pod is not scheduled onto the node (if it is not yet running on the node).

      • Pods that do not tolerate the taint are evicted immediately.
      • Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • The node has the following taints:

    $ oc adm taint nodes node1 key1=value1:NoSchedule
    $ oc adm taint nodes node1 key1=value1:NoExecute
    $ oc adm taint nodes node1 key2=value2:NoSchedule
  • The pod has the following tolerations:

    tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

16.11.3. Adding a Taint to an Existing Node

You add a taint to a node using the oc adm taint command with the parameters described in the Taint and toleration components table:

$ oc adm taint nodes <node-name> <key>=<value>:<effect>

For example:

$ oc adm taint nodes node1 key1=value1:NoExecute

The example places a taint on node1 that has key key1, value value1, and taint effect NoExecute.

16.11.4. Adding a Toleration to a Pod

To add a toleration to a pod, edit the pod specification to include a tolerations section:

Sample pod configuration file with Equal operator

tolerations:
- key: "key1" 1
  operator: "Equal" 2
  value: "value1" 3
  effect: "NoExecute" 4
  tolerationSeconds: 3600 5

1 2 3 4
The toleration parameters, as described in the Taint and toleration components table.
5
The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted. See Using Toleration Seconds to Delay Pod Evictions below.

Sample pod configuration file with Exists operator

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 3600

Both of these tolerations match the taint created by the oc adm taint command above. A pod with either toleration would be able to schedule onto node1.

16.11.4.1. Using Toleration Seconds to Delay Pod Evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the pod specification. If a taint with the NoExecute effect is added to a node, any pods that do not tolerate the taint are evicted immediately (pods that do tolerate the taint are not evicted). However, if a pod that to be evicted has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

For example:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

Here, if this pod is running but does not have a matching taint, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

16.11.4.1.1. Setting a Default Value for Toleration Seconds

This plug-in sets the default forgiveness toleration for pods, to tolerate the node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute taints for five minutes.

If the pod configuration provided by the user already has either toleration, the default is not added.

To enable Default Toleration Seconds:

  1. Modify the master configuration file (/etc/origin/master/master-config.yaml) to Add DefaultTolerationSeconds to the admissionConfig section:

    admissionConfig:
      pluginConfig:
        DefaultTolerationSeconds:
          configuration:
            kind: DefaultAdmissionConfig
            apiVersion: v1
            disable: false
  2. Restart OpenShift for the changes to take effect:

    # master-restart api
    # master-restart controllers
  3. Verify that the default was added:

    1. Create a pod:

      $ oc create -f </path/to/file>

      For example:

      $ oc create -f hello-pod.yaml
      pod "hello-pod" created
    2. Check the pod tolerations:

      $ oc describe pod <pod-name> |grep -i toleration

      For example:

      $ oc describe pod hello-pod |grep -i toleration
      Tolerations:    node.kubernetes.io/not-ready=:Exists:NoExecute for 300s

16.11.5. Pod Eviction for Node Problems

OpenShift Container Platform can be configured to represent node unreachable and node not ready conditions as taints. This allows per-pod specification of how long to remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes.

When the Taint Based Evictions feature is enabled, the taints are automatically added by the node controller and the normal logic for evicting pods from Ready nodes is disabled.

  • If a node enters a not ready state, the node.kubernetes.io/not-ready:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.
  • If a node enters a not reachable state, the node.kubernetes.io/unreachable:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

To enable Taint Based Evictions:

  1. Modify the master configuration file (/etc/origin/master/master-config.yaml) to add the following to the kubernetesMasterConfig section:

    kubernetesMasterConfig:
       controllerArguments:
         feature-gates:
         - TaintBasedEvictions=true
  2. Check that the taint is added to a node:

    $ oc describe node $node | grep -i taint
    
    Taints: node.kubernetes.io/not-ready:NoExecute
  3. Restart OpenShift for the changes to take effect:

    # master-restart api
    # master-restart controllers
  4. Add a toleration to pods:

    tolerations:
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000

    or

    tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000
Note

To maintain the existing rate limiting behavior of pod evictions due to node problems, the system adds the taints in a rate-limited way. This prevents massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

16.11.6. Daemonsets and Tolerations

DaemonSet pods are created with NoExecute tolerations for node.kubernetes.io/unreachable and node.kubernetes.io/not-ready with no tolerationSeconds to ensure that DaemonSet pods are never evicted due to these problems, even when the Default Toleration Seconds feature is disabled.

16.11.7. Examples

Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that should not be running on a node. A few of typical scenrios are:

16.11.7.1. Dedicating a Node for a User

You can specify a set of nodes for exclusive use by a particular set of users.

To specify dedicated nodes:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    Only the pods with the tolerations are allowed to use the dedicated nodes.

16.11.7.2. Binding a User to a Node

You can configure a node so that particular users can use only the dedicated nodes.

To configure a node so that users can use only that node:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).

  3. Add a label similar to the taint (such as the key:value label) to the dedicated nodes.
16.11.7.3. Nodes with Special Hardware

In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

To ensure pods are blocked from the specialized hardware:

  1. Taint the nodes that have the specialized hardware using one of the following commands:

    $ oc adm taint nodes <node-name> disktype=ssd:NoSchedule
    $ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
  2. Adding a corresponding toleration to pods that use the special hardware using an admission controller.

For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.

To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.

Chapter 17. Setting Quotas

17.1. Overview

A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per project. It can limit the quantity of objects that can be created in a project by type, as well as the total amount of compute resources and storage that may be consumed by resources in that project.

Note

See the Developer Guide for more on compute resources.

17.2. Resources Managed by Quota

The following describes the set of compute resources and object types that may be managed by a quota.

Note

A pod is in a terminal state if status.phase in (Failed, Succeeded) is true.

Table 17.1. Compute Resources Managed by Quota
Resource NameDescription

cpu

The sum of CPU requests across all pods in a non-terminal state cannot exceed this value. cpu and requests.cpu are the same value and can be used interchangeably.

memory

The sum of memory requests across all pods in a non-terminal state cannot exceed this value. memory and requests.memory are the same value and can be used interchangeably.

ephemeral-storage

The sum of local ephemeral storage requests across all pods in a non-terminal state cannot exceed this value. ephemeral-storage and requests.ephemeral-storage are the same value and can be used interchangeably. This resource is available only if you enabled the ephemeral storage technology preview. This feature is disabled by default.

requests.cpu

The sum of CPU requests across all pods in a non-terminal state cannot exceed this value. cpu and requests.cpu are the same value and can be used interchangeably.

requests.memory

The sum of memory requests across all pods in a non-terminal state cannot exceed this value. memory and requests.memory are the same value and can be used interchangeably.

requests.ephemeral-storage

The sum of ephemeral storage requests across all pods in a non-terminal state cannot exceed this value. ephemeral-storage and requests.ephemeral-storage are the same value and can be used interchangeably. This resource is available only if you enabled the ephemeral storage technology preview. This feature is disabled by default.

limits.cpu

The sum of CPU limits across all pods in a non-terminal state cannot exceed this value.

limits.memory

The sum of memory limits across all pods in a non-terminal state cannot exceed this value.

limits.ephemeral-storage

The sum of ephemeral storage limits across all pods in a non-terminal state cannot exceed this value. This resource is available only if you enabled the ephemeral storage technology preview. This feature is disabled by default.

Table 17.2. Storage Resources Managed by Quota
Resource NameDescription

requests.storage

The sum of storage requests across all persistent volume claims in any state cannot exceed this value.

persistentvolumeclaims

The total number of persistent volume claims that can exist in the project.

<storage-class-name>.storageclass.storage.k8s.io/requests.storage

The sum of storage requests across all persistent volume claims in any state that have a matching storage class, cannot exceed this value.

<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims

The total number of persistent volume claims with a matching storage class that can exist in the project.

Table 17.3. Object Counts Managed by Quota
Resource NameDescription

pods

The total number of pods in a non-terminal state that can exist in the project.

replicationcontrollers

The total number of replication controllers that can exist in the project.

resourcequotas

The total number of resource quotas that can exist in the project.

services

The total number of services that can exist in the project.

secrets

The total number of secrets that can exist in the project.

configmaps

The total number of ConfigMap objects that can exist in the project.

persistentvolumeclaims

The total number of persistent volume claims that can exist in the project.

openshift.io/imagestreams

The total number of image streams that can exist in the project.

You can configure an object count quota for these standard namespaced resource types using the count/<resource>.<group> syntax while creating a quota.

$ oc create quota <name> --hard=count/<resource>.<group>=<quota> 1
1
<resource> is the name of the resource, and <group> is the API group, if applicable. Use the kubectl api-resources command for a list of resources and their associated API groups.

17.2.1. Setting Resource Quota for Extended Resources

Overcommitment of resources is not allowed for extended resources, so you must specify requests and limits for the same extended resource in a quota. Currently, only quota items with the prefix requests. are allowed for extended resources. The following is an example scenario of how to set resource quota for the GPU resource nvidia.com/gpu.

Procedure

  1. Determine how many GPUs are available on a node in your cluster. For example:

    # oc describe node ip-172-31-27-209.us-west-2.compute.internal | egrep 'Capacity|Allocatable|gpu'
                        openshift.com/gpu-accelerator=true
    Capacity:
     nvidia.com/gpu:  2
    Allocatable:
     nvidia.com/gpu:  2
      nvidia.com/gpu  0           0

    In this example, 2 GPUs are available.

  2. Set a quota in the namespace nvidia. In this example, the quota is 1:

    # cat gpu-quota.yaml
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-quota
      namespace: nvidia
    spec:
      hard:
        requests.nvidia.com/gpu: 1
  3. Create the quota:

    # oc create -f gpu-quota.yaml
    resourcequota/gpu-quota created
  4. Verify that the namespace has the correct quota set:

    # oc describe quota gpu-quota -n nvidia
    Name:                    gpu-quota
    Namespace:               nvidia
    Resource                 Used  Hard
    --------                 ----  ----
    requests.nvidia.com/gpu  0     1
  5. Run a pod that asks for a single GPU:

    # oc create pod gpu-pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      generateName: gpu-pod-
      namespace: nvidia
    spec:
      restartPolicy: OnFailure
      containers:
      - name: rhel7-gpu-pod
        image: rhel7
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: "compute,utility"
          - name: NVIDIA_REQUIRE_CUDA
            value: "cuda>=5.0"
    
        command: ["sleep"]
        args: ["infinity"]
    
        resources:
          limits:
            nvidia.com/gpu: 1
  6. Verify that the pod is running:

    # oc get pods
    NAME              READY     STATUS      RESTARTS   AGE
    gpu-pod-s46h7     1/1       Running     0          1m
  7. Verify that the quota Used counter is correct:

    # oc describe quota gpu-quota -n nvidia
    Name:                    gpu-quota
    Namespace:               nvidia
    Resource                 Used  Hard
    --------                 ----  ----
    requests.nvidia.com/gpu  1     1
  8. Attempt to create a second GPU pod in the nvidia namespace. This is technically available on the node because it has 2 GPUs:

    # oc create -f gpu-pod.yaml
    Error from server (Forbidden): error when creating "gpu-pod.yaml": pods "gpu-pod-f7z2w" is forbidden: exceeded quota: gpu-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=1, limited: requests.nvidia.com/gpu=1

    This Forbidden error message is expected because you have a quota of 1 GPU and this pod tried to allocate a second GPU, which exceeds its quota.

17.3. Quota Scopes

Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches the intersection of enumerated scopes.

Adding a scope to a quota restricts the set of resources to which that quota can apply. Specifying a resource outside of the allowed set results in a validation error.

ScopeDescription

Terminating

Match pods where spec.activeDeadlineSeconds >= 0.

NotTerminating

Match pods where spec.activeDeadlineSeconds is nil.

BestEffort

Match pods that have best effort quality of service for either cpu or memory. See the Quality of Service Classes for more on committing compute resources.

NotBestEffort

Match pods that do not have best effort quality of service for cpu and memory.

A BestEffort scope restricts a quota to limiting the following resources:

  • pods

A Terminating, NotTerminating, and NotBestEffort scope restricts a quota to tracking the following resources:

  • pods
  • memory
  • requests.memory
  • limits.memory
  • cpu
  • requests.cpu
  • limits.cpu
  • ephemeral-storage
  • requests.ephemeral-storage
  • limits.ephemeral-storage
Note

Ephemeral storage requests and limits apply only if you enabled the ephemeral storage technology preview. This feature is disabled by default.

17.4. Quota Enforcement

After a resource quota for a project is first created, the project restricts the ability to create any new resources that may violate a quota constraint until it has calculated updated usage statistics.

After a quota is created and usage statistics are updated, the project accepts the creation of new content. When you create or modify resources, your quota usage is incremented immediately upon the request to create or modify the resource.

When you delete a resource, your quota use is decremented during the next full recalculation of quota statistics for the project. A configurable amount of time determines how long it takes to reduce quota usage statistics to their current observed system value.

If project modifications exceed a quota usage limit, the server denies the action, and an appropriate error message is returned to the user explaining the quota constraint violated, and what their currently observed usage stats are in the system.

17.5. Requests Versus Limits

When allocating compute resources, each container may specify a request and a limit value each for CPU, memory, and ephemeral storage. Quotas can restrict any of these values.

If the quota has a value specified for requests.cpu or requests.memory, then it requires that every incoming container make an explicit request for those resources. If the quota has a value specified for limits.cpu or limits.memory, then it requires that every incoming container specify an explicit limit for those resources.

17.6. Sample Resource Quota Definitions

core-object-counts.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: core-object-counts
spec:
  hard:
    configmaps: "10" 1
    persistentvolumeclaims: "4" 2
    replicationcontrollers: "20" 3
    secrets: "10" 4
    services: "10" 5

1
The total number of ConfigMap objects that can exist in the project.
2
The total number of persistent volume claims (PVCs) that can exist in the project.
3
The total number of replication controllers that can exist in the project.
4
The total number of secrets that can exist in the project.
5
The total number of services that can exist in the project.

openshift-object-counts.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: openshift-object-counts
spec:
  hard:
    openshift.io/imagestreams: "10" 1

1
The total number of image streams that can exist in the project.

compute-resources.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4" 1
    requests.cpu: "1" 2
    requests.memory: 1Gi 3
    requests.ephemeral-storage: 2Gi 4
    limits.cpu: "2" 5
    limits.memory: 2Gi 6
    limits.ephemeral-storage: 4Gi 7

1
The total number of pods in a non-terminal state that can exist in the project.
2
Across all pods in a non-terminal state, the sum of CPU requests cannot exceed 1 core.
3
Across all pods in a non-terminal state, the sum of memory requests cannot exceed 1Gi.
4
Across all pods in a non-terminal state, the sum of ephemeral storage requests cannot exceed 2Gi.
5
Across all pods in a non-terminal state, the sum of CPU limits cannot exceed 2 cores.
6
Across all pods in a non-terminal state, the sum of memory limits cannot exceed 2Gi.
7
Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed 4Gi.

besteffort.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort
spec:
  hard:
    pods: "1" 1
  scopes:
  - BestEffort 2

1
The total number of pods in a non-terminal state with BestEffort quality of service that can exist in the project.
2
Restricts the quota to only matching pods that have BestEffort quality of service for either memory or CPU.

compute-resources-long-running.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources-long-running
spec:
  hard:
    pods: "4" 1
    limits.cpu: "4" 2
    limits.memory: "2Gi" 3
    limits.ephemeral-storage: "4Gi" 4
  scopes:
  - NotTerminating 5

1
The total number of pods in a non-terminal state.
2
Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
3
Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
4
Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed this value.
5
Restricts the quota to only matching pods where spec.activeDeadlineSeconds is set to nil. Build pods will fall under NotTerminating unless the RestartNever policy is applied.

compute-resources-time-bound.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources-time-bound
spec:
  hard:
    pods: "2" 1
    limits.cpu: "1" 2
    limits.memory: "1Gi" 3
    limits.ephemeral-storage: "1Gi" 4
  scopes:
  - Terminating 5

1
The total number of pods in a non-terminal state.
2
Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
3
Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
4
Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed this value.
5
Restricts the quota to only matching pods where spec.activeDeadlineSeconds >=0. For example, this quota would charge for build or deployer pods, but not long running pods like a web server or database.

storage-consumption.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-consumption
spec:
  hard:
    persistentvolumeclaims: "10" 1
    requests.storage: "50Gi" 2
    gold.storageclass.storage.k8s.io/requests.storage: "10Gi" 3
    silver.storageclass.storage.k8s.io/requests.storage: "20Gi" 4
    silver.storageclass.storage.k8s.io/persistentvolumeclaims: "5" 5
    bronze.storageclass.storage.k8s.io/requests.storage: "0" 6
    bronze.storageclass.storage.k8s.io/persistentvolumeclaims: "0" 7

1
The total number of persistent volume claims in a project
2
Across all persistent volume claims in a project, the sum of storage requested cannot exceed this value.
3
Across all persistent volume claims in a project, the sum of storage requested in the gold storage class cannot exceed this value.
4
Across all persistent volume claims in a project, the sum of storage requested in the silver storage class cannot exceed this value.
5
Across all persistent volume claims in a project, the total number of claims in the silver storage class cannot exceed this value.
6
Across all persistent volume claims in a project, the sum of storage requested in the bronze storage class cannot exceed this value. When this is set to 0, it means bronze storage class cannot request storage.
7
Across all persistent volume claims in a project, the sum of storage requested in the bronze storage class cannot exceed this value. When this is set to 0, it means bronze storage class cannot create claims.

17.7. Creating a Quota

To create a quota, first define the quota in a file, such as the examples in Sample Resource Quota Definitions. Then, create using that file to apply it to a project:

$ oc create -f <resource_quota_definition> [-n <project_name>]

For example:

$ oc create -f core-object-counts.yaml -n demoproject

17.7.1. Creating Object Count Quotas

You can create an object count quota for all OpenShift Container Platform standard namespaced resource types, such as BuildConfig, and DeploymentConfig. An object quota count places a defined quota on all standard namespaced resource types.

When using a resource quota, an object is charged against the quota if it exists in server storage. These types of quotas are useful to protect against exhaustion of storage resources.

To configure an object count quota for a resource, run the following command:

$ oc create quota <name> --hard=count/<resource>.<group>=<quota>,count/<resource>.<group>=<quota>

For example:

$ oc create quota test --hard=count/deployments.extensions=2,count/replicasets.extensions=4,count/pods=3,count/secrets=4
resourcequota "test" created

$ oc describe quota test
Name:                         test
Namespace:                    quota
Resource                      Used  Hard
--------                      ----  ----
count/deployments.extensions  0     2
count/pods                    0     3
count/replicasets.extensions  0     4
count/secrets                 0     4

This example limits the listed resources to the hard limit in each project in the cluster.

17.8. Viewing a Quota

You can view usage statistics related to any hard limits defined in a project’s quota by navigating in the web console to the project’s Quota page.

You can also use the CLI to view quota details:

  1. First, get the list of quotas defined in the project. For example, for a project called demoproject:

    $ oc get quota -n demoproject
    NAME                AGE
    besteffort          11m
    compute-resources   2m
    core-object-counts  29m
  2. Then, describe the quota you are interested in, for example the core-object-counts quota:

    $ oc describe quota core-object-counts -n demoproject
    Name:			core-object-counts
    Namespace:		demoproject
    Resource		Used	Hard
    --------		----	----
    configmaps		3	10
    persistentvolumeclaims	0	4
    replicationcontrollers	3	20
    secrets			9	10
    services		2	10

17.9. Configuring Quota Synchronization Period

When a set of resources are deleted, the synchronization time frame of resources is determined by the resource-quota-sync-period setting in the /etc/origin/master/master-config.yaml file.

Before quota usage is restored, a user may encounter problems when attempting to reuse the resources. You can change the resource-quota-sync-period setting to have the set of resources regenerate at the desired amount of time (in seconds) and for the resources to be available again:

kubernetesMasterConfig:
  apiLevels:
  - v1beta3
  - v1
  apiServerArguments: null
  controllerArguments:
    resource-quota-sync-period:
      - "10s"

After making any changes, restart the master services to apply them.

# master-restart api
# master-restart controllers

Adjusting the regeneration time can be helpful for creating resources and determining resource usage when automation is used.

Note

The resource-quota-sync-period setting is designed to balance system performance. Reducing the sync period can result in a heavy load on the master.

17.10. Accounting for Quota in Deployment Configurations

If a quota has been defined for your project, see Deployment Resources for considerations on any deployment configurations.

17.11. Require Explicit Quota to Consume a Resource

If a resource is not managed by quota, a user has no restriction on the amount of resource that can be consumed. For example, if there is no quota on storage related to the gold storage class, the amount of gold storage a project can create is unbounded.

For high-cost compute or storage resources, administrators may want to require an explicit quota be granted in order to consume a resource. For example, if a project was not explicitly given quota for storage related to the gold storage class, users of that project would not be able to create any storage of that type.

In order to require explicit quota to consume a particular resource, the following stanza should be added to the master-config.yaml.

admissionConfig:
  pluginConfig:
    ResourceQuota:
      configuration:
        apiVersion: resourcequota.admission.k8s.io/v1alpha1
        kind: Configuration
        limitedResources:
        - resource: persistentvolumeclaims 1
        matchContains:
        - gold.storageclass.storage.k8s.io/requests.storage 2
1
The group/resource to whose consumption is limited by default.
2
The name of the resource tracked by quota associated with the group/resource to limit by default.

In the above example, the quota system will intercept every operation that creates or updates a PersistentVolumeClaim. It checks what resources understood by quota would be consumed, and if there is no covering quota for those resources in the project, the request is denied. In this example, if a user creates a PersistentVolumeClaim that uses storage associated with the gold storage class, and there is no matching quota in the project, the request is denied.

17.12. Known Issues

  • Invalid objects can cause quota resources for a project to become exhausted. Quota is incremented in admission prior to validation of the resource. As a result, quota can be incremented even if the pod is not ultimately persisted. This will be resolved in a future release. (BZ1485375)

Chapter 18. Setting Multi-Project Quotas

18.1. Overview

A multi-project quota, defined by a ClusterResourceQuota object, allows quotas to be shared across multiple projects. Resources used in each selected project will be aggregated and that aggregate will be used to limit resources across all the selected projects.

18.2. Selecting Projects

You can select projects based on annotation selection, label selection, or both. For example, to select projects based on annotations, run the following command:

$ oc create clusterquota for-user \
     --project-annotation-selector openshift.io/requester=<user-name> \
     --hard pods=10 \
     --hard secrets=20

It creates the following ClusterResourceQuota object:

apiVersion: v1
kind: ClusterResourceQuota
metadata:
  name: for-user
spec:
  quota: 1
    hard:
      pods: "10"
      secrets: "20"
  selector:
    annotations: 2
      openshift.io/requester: <user-name>
    labels: null 3
status:
  namespaces: 4
  - namespace: ns-one
    status:
      hard:
        pods: "10"
        secrets: "20"
      used:
        pods: "1"
        secrets: "9"
  total: 5
    hard:
      pods: "10"
      secrets: "20"
    used:
      pods: "1"
      secrets: "9"
1
The ResourceQuotaSpec object that will be enforced over the selected projects.
2
A simple key/value selector for annotations.
3
A label selector that can be used to select projects.
4
A per-namespace map that describes current quota usage in each selected project.
5
The aggregate usage across all selected projects.

This multi-project quota document controls all projects requested by <user-name> using the default project request endpoint. You are limited to 10 pods and 20 secrets.

Similarly, to select projects based on labels, run this command:

$ oc create clusterresourcequota for-name \ 1
    --project-label-selector=name=frontend \ 2
    --hard=pods=10 --hard=secrets=20
1
Both clusterresourcequota and clusterquota are aliases of the same command. for-name is the name of the clusterresourcequota object.
2
To select projects by label, provide a key-value pair by using the format --project-label-selector=key=value.

It creates the following ClusterResourceQuota object definition:

apiVersion: v1
kind: ClusterResourceQuota
metadata:
  creationTimestamp: null
  name: for-name
spec:
  quota:
    hard:
      pods: "10"
      secrets: "20"
  selector:
    annotations: null
    labels:
      matchLabels:
        name: frontend

18.3. Viewing Applicable ClusterResourceQuotas

A project administrator is not allowed to create or modify the multi-project quota that limits his or her project, but the administrator is allowed to view the multi-project quota documents that are applied to his or her project. The project administrator can do this via the AppliedClusterResourceQuota resource.

$ oc describe AppliedClusterResourceQuota

produces:

Name:   for-user
Namespace:  <none>
Created:  19 hours ago
Labels:   <none>
Annotations:  <none>
Label Selector: <null>
AnnotationSelector: map[openshift.io/requester:<user-name>]
Resource  Used  Hard
--------  ----  ----
pods        1     10
secrets     9     20

18.4. Selection Granularity

Because of the locking consideration when claiming quota allocations, the number of active projects selected by a multi-project quota is an important consideration. Selecting more than 100 projects under a single multi-project quota may have detrimental effects on API server responsiveness in those projects.

Chapter 19. Pruning objects

19.1. Overview

Over time, API objects created in OpenShift Container Platform can accumulate in the etcd data store through normal user operations, such as when building and deploying applications.

As an administrator, you can periodically prune older versions of objects from your OpenShift Container Platform instance that are no longer needed. For example, by pruning images you can delete older images and layers that are no longer in use, but are still taking up disk space.

19.2. Basic prune operations

The CLI groups prune operations under a common parent command.

$ oc adm prune <object_type> <options>

This specifies:

  • The <object_type> to perform the action on, such as groups, builds, deployments, or images.
  • The <options> supported to prune that object type.

19.3. Pruning groups

To prune groups records from an external provider, administrators can run the following command:

$ oc adm prune groups --sync-config=path/to/sync/config [<options>]
Table 19.1. Prune groups CLI configuration options
OptionsDescription

--confirm

Indicate that pruning should occur, instead of performing a dry-run.

--blacklist

Path to the group blacklist file. See Syncing Groups With LDAP for the blacklist file structure.

--whitelist

Path to the group whitelist file. See Syncing Groups With LDAP for the whitelist file structure.

--sync-config