Networking


OpenShift Container Platform 4.11

Configuring and managing cluster networking

Red Hat OpenShift Documentation Team

Abstract

This document provides instructions for configuring and managing your OpenShift Container Platform cluster network, including DNS, ingress, and the Pod network.

Chapter 1. Understanding networking

Cluster Administrators have several options for exposing applications that run inside a cluster to external traffic and securing network connections:

  • Service types, such as node ports or load balancers
  • API resources, such as Ingress and Route

By default, Kubernetes allocates each pod an internal IP address for applications running within the pod. Pods and their containers can network, but clients outside the cluster do not have networking access. When you expose your application to external traffic, giving each pod its own IP address means that pods can be treated like physical hosts or virtual machines in terms of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

Note

Some cloud platforms offer metadata APIs that listen on the 169.254.169.254 IP address, a link-local IP address in the IPv4 169.254.0.0/16 CIDR block.

This CIDR block is not reachable from the pod network. Pods that need access to these IP addresses must be given host network access by setting the spec.hostNetwork field in the pod spec to true.

If you allow a pod host network access, you grant the pod privileged access to the underlying network infrastructure.

1.1. OpenShift Container Platform DNS

If you are running multiple services, such as front-end and back-end services for use with multiple pods, environment variables are created for user names, service IPs, and more so the front-end pods can communicate with the back-end services. If the service is deleted and recreated, a new IP address can be assigned to the service, and requires the front-end pods to be recreated to pick up the updated values for the service IP environment variable. Additionally, the back-end service must be created before any of the front-end pods to ensure that the service IP is generated properly, and that it can be provided to the front-end pods as an environment variable.

For this reason, OpenShift Container Platform has a built-in DNS so that the services can be reached by the service DNS as well as the service IP/port.

1.2. OpenShift Container Platform Ingress Operator

When you create your OpenShift Container Platform cluster, pods and services running on the cluster are each allocated their own IP addresses. The IP addresses are accessible to other pods and services running nearby but are not accessible to outside clients. The Ingress Operator implements the IngressController API and is the component responsible for enabling external access to OpenShift Container Platform cluster services.

The Ingress Operator makes it possible for external clients to access your service by deploying and managing one or more HAProxy-based Ingress Controllers to handle routing. You can use the Ingress Operator to route traffic by specifying OpenShift Container Platform Route and Kubernetes Ingress resources. Configurations within the Ingress Controller, such as the ability to define endpointPublishingStrategy type and internal load balancing, provide ways to publish Ingress Controller endpoints.

1.2.1. Comparing routes and Ingress

The Kubernetes Ingress resource in OpenShift Container Platform implements the Ingress Controller with a shared router service that runs as a pod inside the cluster. The most common way to manage Ingress traffic is with the Ingress Controller. You can scale and replicate this pod like any other regular pod. This router service is based on HAProxy, which is an open source load balancer solution.

The OpenShift Container Platform route provides Ingress traffic to services in the cluster. Routes provide advanced features that might not be supported by standard Kubernetes Ingress Controllers, such as TLS re-encryption, TLS passthrough, and split traffic for blue-green deployments.

Ingress traffic accesses services in the cluster through a route. Routes and Ingress are the main resources for handling Ingress traffic. Ingress provides features similar to a route, such as accepting external requests and delegating them based on the route. However, with Ingress you can only allow certain types of connections: HTTP/2, HTTPS and server name identification (SNI), and TLS with certificate. In OpenShift Container Platform, routes are generated to meet the conditions specified by the Ingress resource.

This glossary defines common terms that are used in the networking content.

authentication
To control access to an OpenShift Container Platform cluster, a cluster administrator can configure user authentication and ensure only approved users access the cluster. To interact with an OpenShift Container Platform cluster, you must authenticate to the OpenShift Container Platform API. You can authenticate by providing an OAuth access token or an X.509 client certificate in your requests to the OpenShift Container Platform API.
AWS Load Balancer Operator
The AWS Load Balancer (ALB) Operator deploys and manages an instance of the aws-load-balancer-controller.
Cluster Network Operator
The Cluster Network Operator (CNO) deploys and manages the cluster network components in an OpenShift Container Platform cluster. This includes deployment of the Container Network Interface (CNI) default network provider plug-in selected for the cluster during installation.
config map
A config map provides a way to inject configuration data into pods. You can reference the data stored in a config map in a volume of type ConfigMap. Applications running in a pod can use this data.
custom resource (CR)
A CR is extension of the Kubernetes API. You can create custom resources.
DNS
Cluster DNS is a DNS server which serves DNS records for Kubernetes services. Containers started by Kubernetes automatically include this DNS server in their DNS searches.
DNS Operator
The DNS Operator deploys and manages CoreDNS to provide a name resolution service to pods. This enables DNS-based Kubernetes Service discovery in OpenShift Container Platform.
deployment
A Kubernetes resource object that maintains the life cycle of an application.
domain
Domain is a DNS name serviced by the Ingress Controller.
egress
The process of data sharing externally through a network’s outbound traffic from a pod.
External DNS Operator
The External DNS Operator deploys and manages ExternalDNS to provide the name resolution for services and routes from the external DNS provider to OpenShift Container Platform.
HTTP-based route
An HTTP-based route is an unsecured route that uses the basic HTTP routing protocol and exposes a service on an unsecured application port.
Ingress
The Kubernetes Ingress resource in OpenShift Container Platform implements the Ingress Controller with a shared router service that runs as a pod inside the cluster.
Ingress Controller
The Ingress Operator manages Ingress Controllers. Using an Ingress Controller is the most common way to allow external access to an OpenShift Container Platform cluster.
installer-provisioned infrastructure
The installation program deploys and configures the infrastructure that the cluster runs on.
kubelet
A primary node agent that runs on each node in the cluster to ensure that containers are running in a pod.
Kubernetes NMState Operator
The Kubernetes NMState Operator provides a Kubernetes API for performing state-driven network configuration across the OpenShift Container Platform cluster’s nodes with NMState.
kube-proxy
Kube-proxy is a proxy service which runs on each node and helps in making services available to the external host. It helps in forwarding the request to correct containers and is capable of performing primitive load balancing.
load balancers
OpenShift Container Platform uses load balancers for communicating from outside the cluster with services running in the cluster.
MetalLB Operator
As a cluster administrator, you can add the MetalLB Operator to your cluster so that when a service of type LoadBalancer is added to the cluster, MetalLB can add an external IP address for the service.
multicast
With IP multicast, data is broadcast to many IP addresses simultaneously.
namespaces
A namespace isolates specific system resources that are visible to all processes. Inside a namespace, only processes that are members of that namespace can see those resources.
networking
Network information of a OpenShift Container Platform cluster.
node
A worker machine in the OpenShift Container Platform cluster. A node is either a virtual machine (VM) or a physical machine.
OpenShift Container Platform Ingress Operator
The Ingress Operator implements the IngressController API and is the component responsible for enabling external access to OpenShift Container Platform services.
pod
One or more containers with shared resources, such as volume and IP addresses, running in your OpenShift Container Platform cluster. A pod is the smallest compute unit defined, deployed, and managed.
PTP Operator
The PTP Operator creates and manages the linuxptp services.
route
The OpenShift Container Platform route provides Ingress traffic to services in the cluster. Routes provide advanced features that might not be supported by standard Kubernetes Ingress Controllers, such as TLS re-encryption, TLS passthrough, and split traffic for blue-green deployments.
scaling
Increasing or decreasing the resource capacity.
service
Exposes a running application on a set of pods.
Single Root I/O Virtualization (SR-IOV) Network Operator
The Single Root I/O Virtualization (SR-IOV) Network Operator manages the SR-IOV network devices and network attachments in your cluster.
software-defined networking (SDN)
OpenShift Container Platform uses a software-defined networking (SDN) approach to provide a unified cluster network that enables communication between pods across the OpenShift Container Platform cluster.
Stream Control Transmission Protocol (SCTP)
SCTP is a reliable message based protocol that runs on top of an IP network.
taint
Taints and tolerations ensure that pods are scheduled onto appropriate nodes. You can apply one or more taints on a node.
toleration
You can apply tolerations to pods. Tolerations allow the scheduler to schedule pods with matching taints.
web console
A user interface (UI) to manage OpenShift Container Platform.

Chapter 2. Accessing hosts

Learn how to create a bastion host to access OpenShift Container Platform instances and access the control plane nodes with secure shell (SSH) access.

The OpenShift Container Platform installer does not create any public IP addresses for any of the Amazon Elastic Compute Cloud (Amazon EC2) instances that it provisions for your OpenShift Container Platform cluster. To be able to SSH to your OpenShift Container Platform hosts, you must follow this procedure.

Procedure

  1. Create a security group that allows SSH access into the virtual private cloud (VPC) created by the openshift-install command.
  2. Create an Amazon EC2 instance on one of the public subnets the installer created.
  3. Associate a public IP address with the Amazon EC2 instance that you created.

    Unlike with the OpenShift Container Platform installation, you should associate the Amazon EC2 instance you created with an SSH keypair. It does not matter what operating system you choose for this instance, as it will simply serve as an SSH bastion to bridge the internet into your OpenShift Container Platform cluster’s VPC. The Amazon Machine Image (AMI) you use does matter. With Red Hat Enterprise Linux CoreOS (RHCOS), for example, you can provide keys via Ignition, like the installer does.

  4. After you provisioned your Amazon EC2 instance and can SSH into it, you must add the SSH key that you associated with your OpenShift Container Platform installation. This key can be different from the key for the bastion instance, but does not have to be.

    Note

    Direct SSH access is only recommended for disaster recovery. When the Kubernetes API is responsive, run privileged pods instead.

  5. Run oc get nodes, inspect the output, and choose one of the nodes that is a master. The hostname looks similar to ip-10-0-1-163.ec2.internal.
  6. From the bastion SSH host you manually deployed into Amazon EC2, SSH into that control plane host. Ensure that you use the same SSH key you specified during the installation:

    $ ssh -i <ssh-key-path> core@<master-hostname>
    Copy to Clipboard Toggle word wrap

Chapter 3. Networking Operators overview

OpenShift Container Platform supports multiple types of networking Operators. You can manage the cluster networking using these networking Operators.

3.1. Cluster Network Operator

The Cluster Network Operator (CNO) deploys and manages the cluster network components in an OpenShift Container Platform cluster. This includes deployment of the Container Network Interface (CNI) default network provider plugin selected for the cluster during installation. For more information, see Cluster Network Operator in OpenShift Container Platform.

3.2. DNS Operator

The DNS Operator deploys and manages CoreDNS to provide a name resolution service to pods. This enables DNS-based Kubernetes Service discovery in OpenShift Container Platform. For more information, see DNS Operator in OpenShift Container Platform.

3.3. Ingress Operator

When you create your OpenShift Container Platform cluster, pods and services running on the cluster are each allocated IP addresses. The IP addresses are accessible to other pods and services running nearby but are not accessible to external clients. The Ingress Operator implements the Ingress Controller API and is responsible for enabling external access to OpenShift Container Platform cluster services. For more information, see Ingress Operator in OpenShift Container Platform.

3.4. External DNS Operator

The External DNS Operator deploys and manages ExternalDNS to provide the name resolution for services and routes from the external DNS provider to OpenShift Container Platform. For more information, see Understanding the External DNS Operator.

3.5. Network Observability Operator

The Network Observability Operator is an optional Operator that allows cluster administrators to observe the network traffic for OpenShift Container Platform clusters. The Network Observability Operator uses the eBPF technology to create network flows. The network flows are then enriched with OpenShift Container Platform information and stored in Loki. You can view and analyze the stored network flows information in the OpenShift Container Platform console for further insight and troubleshooting. For more information, see About Network Observability Operator.

The Cluster Network Operator (CNO) deploys and manages the cluster network components on an OpenShift Container Platform cluster, including the Container Network Interface (CNI) default network provider plugin selected for the cluster during installation.

4.1. Cluster Network Operator

The Cluster Network Operator implements the network API from the operator.openshift.io API group. The Operator deploys the OpenShift SDN default Container Network Interface (CNI) network provider plugin, or the default network provider plugin that you selected during cluster installation, by using a daemon set.

Procedure

The Cluster Network Operator is deployed during installation as a Kubernetes Deployment.

  1. Run the following command to view the Deployment status:

    $ oc get -n openshift-network-operator deployment/network-operator
    Copy to Clipboard Toggle word wrap

    Example output

    NAME               READY   UP-TO-DATE   AVAILABLE   AGE
    network-operator   1/1     1            1           56m
    Copy to Clipboard Toggle word wrap

  2. Run the following command to view the state of the Cluster Network Operator:

    $ oc get clusteroperator/network
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    network   4.5.4     True        False         False      50m
    Copy to Clipboard Toggle word wrap

    The following fields provide information about the status of the operator: AVAILABLE, PROGRESSING, and DEGRADED. The AVAILABLE field is True when the Cluster Network Operator reports an available status condition.

4.2. Viewing the cluster network configuration

Every new OpenShift Container Platform installation has a network.config object named cluster.

Procedure

  • Use the oc describe command to view the cluster network configuration:

    $ oc describe network.config/cluster
    Copy to Clipboard Toggle word wrap

    Example output

    Name:         cluster
    Namespace:
    Labels:       <none>
    Annotations:  <none>
    API Version:  config.openshift.io/v1
    Kind:         Network
    Metadata:
      Self Link:           /apis/config.openshift.io/v1/networks/cluster
    Spec: 
    1
    
      Cluster Network:
        Cidr:         10.128.0.0/14
        Host Prefix:  23
      Network Type:   OpenShiftSDN
      Service Network:
        172.30.0.0/16
    Status: 
    2
    
      Cluster Network:
        Cidr:               10.128.0.0/14
        Host Prefix:        23
      Cluster Network MTU:  8951
      Network Type:         OpenShiftSDN
      Service Network:
        172.30.0.0/16
    Events:  <none>
    Copy to Clipboard Toggle word wrap

    1
    The Spec field displays the configured state of the cluster network.
    2
    The Status field displays the current state of the cluster network configuration.

4.3. Viewing Cluster Network Operator status

You can inspect the status and view the details of the Cluster Network Operator using the oc describe command.

Procedure

  • Run the following command to view the status of the Cluster Network Operator:

    $ oc describe clusteroperators/network
    Copy to Clipboard Toggle word wrap

4.4. Viewing Cluster Network Operator logs

You can view Cluster Network Operator logs by using the oc logs command.

Procedure

  • Run the following command to view the logs of the Cluster Network Operator:

    $ oc logs --namespace=openshift-network-operator deployment/network-operator
    Copy to Clipboard Toggle word wrap

4.5. Cluster Network Operator configuration

The configuration for the cluster network is specified as part of the Cluster Network Operator (CNO) configuration and stored in a custom resource (CR) object that is named cluster. The CR specifies the fields for the Network API in the operator.openshift.io API group.

The CNO configuration inherits the following fields during cluster installation from the Network API in the Network.config.openshift.io API group and these fields cannot be changed:

clusterNetwork
IP address pools from which pod IP addresses are allocated.
serviceNetwork
IP address pool for services.
defaultNetwork.type
Cluster network provider, such as OpenShift SDN or OVN-Kubernetes.
Note

After cluster installation, you cannot modify the fields listed in the previous section.

You can specify the cluster network provider configuration for your cluster by setting the fields for the defaultNetwork object in the CNO object named cluster.

The fields for the Cluster Network Operator (CNO) are described in the following table:

Expand
Table 4.1. Cluster Network Operator configuration object
FieldTypeDescription

metadata.name

string

The name of the CNO object. This name is always cluster.

spec.clusterNetwork

array

A list specifying the blocks of IP addresses from which pod IP addresses are allocated and the subnet prefix length assigned to each individual node in the cluster. For example:

spec:
  clusterNetwork:
  - cidr: 10.128.0.0/19
    hostPrefix: 23
  - cidr: 10.128.32.0/19
    hostPrefix: 23
Copy to Clipboard Toggle word wrap

This value is ready-only and inherited from the Network.config.openshift.io object named cluster during cluster installation.

spec.serviceNetwork

array

A block of IP addresses for services. The OpenShift SDN and OVN-Kubernetes Container Network Interface (CNI) network providers support only a single IP address block for the service network. For example:

spec:
  serviceNetwork:
  - 172.30.0.0/14
Copy to Clipboard Toggle word wrap

This value is ready-only and inherited from the Network.config.openshift.io object named cluster during cluster installation.

spec.defaultNetwork

object

Configures the Container Network Interface (CNI) cluster network provider for the cluster network.

spec.kubeProxyConfig

object

The fields for this object specify the kube-proxy configuration. If you are using the OVN-Kubernetes cluster network provider, the kube-proxy configuration has no effect.

defaultNetwork object configuration

The values for the defaultNetwork object are defined in the following table:

Expand
Table 4.2. defaultNetwork object
FieldTypeDescription

type

string

Either OpenShiftSDN or OVNKubernetes. The cluster network provider is selected during installation. This value cannot be changed after cluster installation.

Note

OpenShift Container Platform uses the OpenShift SDN Container Network Interface (CNI) cluster network provider by default.

openshiftSDNConfig

object

This object is only valid for the OpenShift SDN cluster network provider.

ovnKubernetesConfig

object

This object is only valid for the OVN-Kubernetes cluster network provider.

Configuration for the OpenShift SDN CNI cluster network provider

The following table describes the configuration fields for the OpenShift SDN Container Network Interface (CNI) cluster network provider.

Expand
Table 4.3. openshiftSDNConfig object
FieldTypeDescription

mode

string

The network isolation mode for OpenShift SDN.

mtu

integer

The maximum transmission unit (MTU) for the VXLAN overlay network. This value is normally configured automatically.

vxlanPort

integer

The port to use for all VXLAN packets. The default value is 4789.

Note

You can only change the configuration for your cluster network provider during cluster installation.

Example OpenShift SDN configuration

defaultNetwork:
  type: OpenShiftSDN
  openshiftSDNConfig:
    mode: NetworkPolicy
    mtu: 1450
    vxlanPort: 4789
Copy to Clipboard Toggle word wrap

Configuration for the OVN-Kubernetes CNI cluster network provider

The following table describes the configuration fields for the OVN-Kubernetes CNI cluster network provider.

Expand
Table 4.4. ovnKubernetesConfig object
FieldTypeDescription

mtu

integer

The maximum transmission unit (MTU) for the Geneve (Generic Network Virtualization Encapsulation) overlay network. This value is normally configured automatically.

genevePort

integer

The UDP port for the Geneve overlay network.

ipsecConfig

object

If the field is present, IPsec is enabled for the cluster.

policyAuditConfig

object

Specify a configuration object for customizing network policy audit logging. If unset, the defaults audit log settings are used.

gatewayConfig

object

Optional: Specify a configuration object for customizing how egress traffic is sent to the node gateway.

Note

While migrating egress traffic, you can expect some disruption to workloads and service traffic until the Cluster Network Operator (CNO) successfully rolls out the changes.

Expand
Table 4.5. policyAuditConfig object
FieldTypeDescription

rateLimit

integer

The maximum number of messages to generate every second per node. The default value is 20 messages per second.

maxFileSize

integer

The maximum size for the audit log in bytes. The default value is 50000000 or 50 MB.

destination

string

One of the following additional audit log targets:

libc
The libc syslog() function of the journald process on the host.
udp:<host>:<port>
A syslog server. Replace <host>:<port> with the host and port of the syslog server.
unix:<file>
A Unix Domain Socket file specified by <file>.
null
Do not send the audit logs to any additional target.

syslogFacility

string

The syslog facility, such as kern, as defined by RFC5424. The default value is local0.

Expand
Table 4.6. gatewayConfig object
FieldTypeDescription

routingViaHost

boolean

Set this field to true to send egress traffic from pods to the host networking stack. For highly-specialized installations and applications that rely on manually configured routes in the kernel routing table, you might want to route egress traffic to the host networking stack. By default, egress traffic is processed in OVN to exit the cluster and is not affected by specialized routes in the kernel routing table. The default value is false.

This field has an interaction with the Open vSwitch hardware offloading feature. If you set this field to true, you do not receive the performance benefits of the offloading because egress traffic is processed by the host networking stack.

Note

You can only change the configuration for your cluster network provider during cluster installation, except for the gatewayConfig field that can be changed at runtime as a post-installation activity.

Example OVN-Kubernetes configuration with IPSec enabled

defaultNetwork:
  type: OVNKubernetes
  ovnKubernetesConfig:
    mtu: 1400
    genevePort: 6081
    ipsecConfig: {}
Copy to Clipboard Toggle word wrap

kubeProxyConfig object configuration

The values for the kubeProxyConfig object are defined in the following table:

Expand
Table 4.7. kubeProxyConfig object
FieldTypeDescription

iptablesSyncPeriod

string

The refresh period for iptables rules. The default value is 30s. Valid suffixes include s, m, and h and are described in the Go time package documentation.

Note

Because of performance improvements introduced in OpenShift Container Platform 4.3 and greater, adjusting the iptablesSyncPeriod parameter is no longer necessary.

proxyArguments.iptables-min-sync-period

array

The minimum duration before refreshing iptables rules. This field ensures that the refresh does not happen too frequently. Valid suffixes include s, m, and h and are described in the Go time package. The default value is:

kubeProxyConfig:
  proxyArguments:
    iptables-min-sync-period:
    - 0s
Copy to Clipboard Toggle word wrap

A complete CNO configuration is specified in the following example:

Example Cluster Network Operator object

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork: 
1

  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork: 
2

  - 172.30.0.0/16
  defaultNetwork: 
3

    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: NetworkPolicy
      mtu: 1450
      vxlanPort: 4789
  kubeProxyConfig:
    iptablesSyncPeriod: 30s
    proxyArguments:
      iptables-min-sync-period:
      - 0s
Copy to Clipboard Toggle word wrap

1 2 3
Configured only during cluster installation.

The DNS Operator deploys and manages CoreDNS to provide a name resolution service to pods, enabling DNS-based Kubernetes Service discovery in OpenShift Container Platform.

5.1. DNS Operator

The DNS Operator implements the dns API from the operator.openshift.io API group. The Operator deploys CoreDNS using a daemon set, creates a service for the daemon set, and configures the kubelet to instruct pods to use the CoreDNS service IP address for name resolution.

Procedure

The DNS Operator is deployed during installation with a Deployment object.

  1. Use the oc get command to view the deployment status:

    $ oc get -n openshift-dns-operator deployment/dns-operator
    Copy to Clipboard Toggle word wrap

    Example output

    NAME           READY     UP-TO-DATE   AVAILABLE   AGE
    dns-operator   1/1       1            1           23h
    Copy to Clipboard Toggle word wrap

  2. Use the oc get command to view the state of the DNS Operator:

    $ oc get clusteroperator/dns
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      VERSION     AVAILABLE   PROGRESSING   DEGRADED   SINCE
    dns       4.1.0-0.11  True        False         False      92m
    Copy to Clipboard Toggle word wrap

    AVAILABLE, PROGRESSING and DEGRADED provide information about the status of the operator. AVAILABLE is True when at least 1 pod from the CoreDNS daemon set reports an Available status condition.

5.2. Changing the DNS Operator managementState

DNS manages the CoreDNS component to provide a name resolution service for pods and services in the cluster. The managementState of the DNS Operator is set to Managed by default, which means that the DNS Operator is actively managing its resources. You can change it to Unmanaged, which means the DNS Operator is not managing its resources.

The following are use cases for changing the DNS Operator managementState:

  • You are a developer and want to test a configuration change to see if it fixes an issue in CoreDNS. You can stop the DNS Operator from overwriting the fix by setting the managementState to Unmanaged.
  • You are a cluster administrator and have reported an issue with CoreDNS, but need to apply a workaround until the issue is fixed. You can set the managementState field of the DNS Operator to Unmanaged to apply the workaround.

Procedure

  • Change managementState DNS Operator:

    oc patch dns.operator.openshift.io default --type merge --patch '{"spec":{"managementState":"Unmanaged"}}'
    Copy to Clipboard Toggle word wrap

5.3. Controlling DNS pod placement

The DNS Operator has two daemon sets: one for CoreDNS and one for managing the /etc/hosts file. The daemon set for /etc/hosts must run on every node host to add an entry for the cluster image registry to support pulling images. Security policies can prohibit communication between pairs of nodes, which prevents the daemon set for CoreDNS from running on every node.

As a cluster administrator, you can use a custom node selector to configure the daemon set for CoreDNS to run or not run on certain nodes.

Prerequisites

  • You installed the oc CLI.
  • You are logged in to the cluster with a user with cluster-admin privileges.

Procedure

  • To prevent communication between certain nodes, configure the spec.nodePlacement.nodeSelector API field:

    1. Modify the DNS Operator object named default:

      $ oc edit dns.operator/default
      Copy to Clipboard Toggle word wrap
    2. Specify a node selector that includes only control plane nodes in the spec.nodePlacement.nodeSelector API field:

       spec:
         nodePlacement:
           nodeSelector:
             node-role.kubernetes.io/worker: ""
      Copy to Clipboard Toggle word wrap
  • To allow the daemon set for CoreDNS to run on nodes, configure a taint and toleration:

    1. Modify the DNS Operator object named default:

      $ oc edit dns.operator/default
      Copy to Clipboard Toggle word wrap
    2. Specify a taint key and a toleration for the taint:

       spec:
         nodePlacement:
           tolerations:
           - effect: NoExecute
             key: "dns-only"
             operators: Equal
             value: abc
             tolerationSeconds: 3600 
      1
      Copy to Clipboard Toggle word wrap
      1
      If the taint is dns-only, it can be tolerated indefinitely. You can omit tolerationSeconds.

5.4. View the default DNS

Every new OpenShift Container Platform installation has a dns.operator named default.

Procedure

  1. Use the oc describe command to view the default dns:

    $ oc describe dns.operator/default
    Copy to Clipboard Toggle word wrap

    Example output

    Name:         default
    Namespace:
    Labels:       <none>
    Annotations:  <none>
    API Version:  operator.openshift.io/v1
    Kind:         DNS
    ...
    Status:
      Cluster Domain:  cluster.local 
    1
    
      Cluster IP:      172.30.0.10 
    2
    
    ...
    Copy to Clipboard Toggle word wrap

    1
    The Cluster Domain field is the base DNS domain used to construct fully qualified pod and service domain names.
    2
    The Cluster IP is the address pods query for name resolution. The IP is defined as the 10th address in the service CIDR range.
  2. To find the service CIDR of your cluster, use the oc get command:

    $ oc get networks.config/cluster -o jsonpath='{$.status.serviceNetwork}'
    Copy to Clipboard Toggle word wrap

Example output

[172.30.0.0/16]
Copy to Clipboard Toggle word wrap

5.5. Using DNS forwarding

You can use DNS forwarding to override the default forwarding configuration in the /etc/resolv.conf file in the following ways:

  • Specify name servers for every zone. If the forwarded zone is the Ingress domain managed by OpenShift Container Platform, then the upstream name server must be authorized for the domain.
  • Provide a list of upstream DNS servers.
  • Change the default forwarding policy.
Note

A DNS forwarding configuration for the default domain can have both the default servers specified in the /etc/resolv.conf file and the upstream DNS servers.

Procedure

  1. Modify the DNS Operator object named default:

    $ oc edit dns.operator/default
    Copy to Clipboard Toggle word wrap

    After you issue the previous command, the Operator creates and updates the config map named dns-default with additional server configuration blocks based on Server. If none of the servers have a zone that matches the query, then name resolution falls back to the upstream DNS servers.

    Configuring DNS forwarding

    apiVersion: operator.openshift.io/v1
    kind: DNS
    metadata:
      name: default
    spec:
      servers:
      - name: example-server 
    1
    
        zones: 
    2
    
        - example.com
        forwardPlugin:
          policy: Random 
    3
    
          upstreams: 
    4
    
          - 1.1.1.1
          - 2.2.2.2:5353
      upstreamResolvers: 
    5
    
        policy: Random 
    6
    
        upstreams: 
    7
    
        - type: SystemResolvConf 
    8
    
        - type: Network
          address: 1.2.3.4 
    9
    
          port: 53 
    10
    Copy to Clipboard Toggle word wrap

    1
    Must comply with the rfc6335 service name syntax.
    2
    Must conform to the definition of a subdomain in the rfc1123 service name syntax. The cluster domain, cluster.local, is an invalid subdomain for the zones field.
    3
    Defines the policy to select upstream resolvers. Default value is Random. You can also use the values RoundRobin, and Sequential.
    4
    A maximum of 15 upstreams is allowed per forwardPlugin.
    5
    Optional. You can use it to override the default policy and forward DNS resolution to the specified DNS resolvers (upstream resolvers) for the default domain. If you do not provide any upstream resolvers, the DNS name queries go to the servers in /etc/resolv.conf.
    6
    Determines the order in which upstream servers are selected for querying. You can specify one of these values: Random, RoundRobin, or Sequential. The default value is Sequential.
    7
    Optional. You can use it to provide upstream resolvers.
    8
    You can specify two types of upstreams - SystemResolvConf and Network. SystemResolvConf configures the upstream to use /etc/resolv.conf and Network defines a Networkresolver. You can specify one or both.
    9
    If the specified type is Network, you must provide an IP address. The address field must be a valid IPv4 or IPv6 address.
    10
    If the specified type is Network, you can optionally provide a port. The port field must have a value between 1 and 65535. If you do not specify a port for the upstream, by default port 853 is tried.
  2. Optional: When working in a highly regulated environment, you might need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that you can ensure additional DNS traffic and data privacy. Cluster administrators can configure transport layer security (TLS) for forwarded DNS queries.

    Configuring DNS forwarding with TLS

    apiVersion: operator.openshift.io/v1
    kind: DNS
    metadata:
      name: default
    spec:
      servers:
      - name: example-server 
    1
    
        zones: 
    2
    
        - example.com
        forwardPlugin:
          transportConfig:
            transport: TLS 
    3
    
            tls:
              caBundle:
                name: mycacert
              serverName: dnstls.example.com  
    4
    
          policy: Random 
    5
    
          upstreams: 
    6
    
          - 1.1.1.1
          - 2.2.2.2:5353
      upstreamResolvers: 
    7
    
        transportConfig:
          transport: TLS
          tls:
            caBundle:
              name: mycacert
            serverName: dnstls.example.com
        upstreams:
        - type: Network 
    8
    
          address: 1.2.3.4 
    9
    
          port: 53 
    10
    Copy to Clipboard Toggle word wrap

    1
    Must comply with the rfc6335 service name syntax.
    2
    Must conform to the definition of a subdomain in the rfc1123 service name syntax. The cluster domain, cluster.local, is an invalid subdomain for the zones field. The cluster domain, cluster.local, is an invalid subdomain for zones.
    3
    When configuring TLS for forwarded DNS queries, set the transport field to have the value TLS. By default, CoreDNS caches forwarded connections for 10 seconds. CoreDNS will hold a TCP connection open for those 10 seconds if no request is issued. With large clusters, ensure that your DNS server is aware that it might get many new connections to hold open because you can initiate a connection per node. Set up your DNS hierarchy accordingly to avoid performance issues.
    4
    When configuring TLS for forwarded DNS queries, this is a mandatory server name used as part of the server name indication (SNI) to validate the upstream TLS server certificate.
    5
    Defines the policy to select upstream resolvers. Default value is Random. You can also use the values RoundRobin, and Sequential.
    6
    Required. You can use it to provide upstream resolvers. A maximum of 15 upstreams entries are allowed per forwardPlugin entry.
    7
    Optional. You can use it to override the default policy and forward DNS resolution to the specified DNS resolvers (upstream resolvers) for the default domain. If you do not provide any upstream resolvers, the DNS name queries go to the servers in /etc/resolv.conf.
    8
    Network type indicates that this upstream resolver should handle forwarded requests separately from the upstream resolvers listed in /etc/resolv.conf. Only the Network type is allowed when using TLS and you must provide an IP address.
    9
    The address field must be a valid IPv4 or IPv6 address.
    10
    You can optionally provide a port. The port must have a value between 1 and 65535. If you do not specify a port for the upstream, by default port 853 is tried.
    Note

    If servers is undefined or invalid, the config map only contains the default server.

Verification

  1. View the config map:

    $ oc get configmap/dns-default -n openshift-dns -o yaml
    Copy to Clipboard Toggle word wrap

    Sample DNS ConfigMap based on previous sample DNS

    apiVersion: v1
    data:
      Corefile: |
        example.com:5353 {
            forward . 1.1.1.1 2.2.2.2:5353
        }
        bar.com:5353 example.com:5353 {
            forward . 3.3.3.3 4.4.4.4:5454 
    1
    
        }
        .:5353 {
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                upstream
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus :9153
            forward . /etc/resolv.conf 1.2.3.4:53 {
                policy Random
            }
            cache 30
            reload
        }
    kind: ConfigMap
    metadata:
      labels:
        dns.operator.openshift.io/owning-dns: default
      name: dns-default
      namespace: openshift-dns
    Copy to Clipboard Toggle word wrap

    1
    Changes to the forwardPlugin triggers a rolling update of the CoreDNS daemon set.

5.6. DNS Operator status

You can inspect the status and view the details of the DNS Operator using the oc describe command.

Procedure

View the status of the DNS Operator:

$ oc describe clusteroperators/dns
Copy to Clipboard Toggle word wrap

5.7. DNS Operator logs

You can view DNS Operator logs by using the oc logs command.

Procedure

View the logs of the DNS Operator:

$ oc logs -n openshift-dns-operator deployment/dns-operator -c dns-operator
Copy to Clipboard Toggle word wrap

5.8. Setting the CoreDNS log level

You can configure the CoreDNS log level to determine the amount of detail in logged error messages. The valid values for CoreDNS log level are Normal, Debug, and Trace. The default logLevel is Normal.

Note

The errors plugin is always enabled. The following logLevel settings report different error responses:

  • logLevel: Normal enables the "errors" class: log . { class error }.
  • logLevel: Debug enables the "denial" class: log . { class denial error }.
  • logLevel: Trace enables the "all" class: log . { class all }.

Procedure

  • To set logLevel to Debug, enter the following command:

    $ oc patch dnses.operator.openshift.io/default -p '{"spec":{"logLevel":"Debug"}}' --type=merge
    Copy to Clipboard Toggle word wrap
  • To set logLevel to Trace, enter the following command:

    $ oc patch dnses.operator.openshift.io/default -p '{"spec":{"logLevel":"Trace"}}' --type=merge
    Copy to Clipboard Toggle word wrap

Verification

  • To ensure the desired log level was set, check the config map:

    $ oc get configmap/dns-default -n openshift-dns -o yaml
    Copy to Clipboard Toggle word wrap

5.9. Setting the CoreDNS Operator log level

Cluster administrators can configure the Operator log level to more quickly track down OpenShift DNS issues. The valid values for operatorLogLevel are Normal, Debug, and Trace. Trace has the most detailed information. The default operatorlogLevel is Normal. There are seven logging levels for issues: Trace, Debug, Info, Warning, Error, Fatal and Panic. After the logging level is set, log entries with that severity or anything above it will be logged.

  • operatorLogLevel: "Normal" sets logrus.SetLogLevel("Info").
  • operatorLogLevel: "Debug" sets logrus.SetLogLevel("Debug").
  • operatorLogLevel: "Trace" sets logrus.SetLogLevel("Trace").

Procedure

  • To set operatorLogLevel to Debug, enter the following command:

    $ oc patch dnses.operator.openshift.io/default -p '{"spec":{"operatorLogLevel":"Debug"}}' --type=merge
    Copy to Clipboard Toggle word wrap
  • To set operatorLogLevel to Trace, enter the following command:

    $ oc patch dnses.operator.openshift.io/default -p '{"spec":{"operatorLogLevel":"Trace"}}' --type=merge
    Copy to Clipboard Toggle word wrap

6.1. OpenShift Container Platform Ingress Operator

When you create your OpenShift Container Platform cluster, pods and services running on the cluster are each allocated their own IP addresses. The IP addresses are accessible to other pods and services running nearby but are not accessible to outside clients. The Ingress Operator implements the IngressController API and is the component responsible for enabling external access to OpenShift Container Platform cluster services.

The Ingress Operator makes it possible for external clients to access your service by deploying and managing one or more HAProxy-based Ingress Controllers to handle routing. You can use the Ingress Operator to route traffic by specifying OpenShift Container Platform Route and Kubernetes Ingress resources. Configurations within the Ingress Controller, such as the ability to define endpointPublishingStrategy type and internal load balancing, provide ways to publish Ingress Controller endpoints.

6.2. The Ingress configuration asset

The installation program generates an asset with an Ingress resource in the config.openshift.io API group, cluster-ingress-02-config.yml.

YAML Definition of the Ingress resource

apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  name: cluster
spec:
  domain: apps.openshiftdemos.com
Copy to Clipboard Toggle word wrap

The installation program stores this asset in the cluster-ingress-02-config.yml file in the manifests/ directory. This Ingress resource defines the cluster-wide configuration for Ingress. This Ingress configuration is used as follows:

  • The Ingress Operator uses the domain from the cluster Ingress configuration as the domain for the default Ingress Controller.
  • The OpenShift API Server Operator uses the domain from the cluster Ingress configuration. This domain is also used when generating a default host for a Route resource that does not specify an explicit host.

6.3. Ingress Controller configuration parameters

The ingresscontrollers.operator.openshift.io resource offers the following configuration parameters.

Expand
ParameterDescription

domain

domain is a DNS name serviced by the Ingress Controller and is used to configure multiple features:

  • For the LoadBalancerService endpoint publishing strategy, domain is used to configure DNS records. See endpointPublishingStrategy.
  • When using a generated default certificate, the certificate is valid for domain and its subdomains. See defaultCertificate.
  • The value is published to individual Route statuses so that users know where to target external DNS records.

The domain value must be unique among all Ingress Controllers and cannot be updated.

If empty, the default value is ingress.config.openshift.io/cluster .spec.domain.

replicas

replicas is the desired number of Ingress Controller replicas. If not set, the default value is 2.

endpointPublishingStrategy

endpointPublishingStrategy is used to publish the Ingress Controller endpoints to other networks, enable load balancer integrations, and provide access to other systems.

If not set, the default value is based on infrastructure.config.openshift.io/cluster .status.platform:

  • Amazon Web Services (AWS): LoadBalancerService (with External scope)
  • Azure: LoadBalancerService (with External scope)
  • Google Cloud Platform (GCP): LoadBalancerService (with External scope)
  • Bare metal: NodePortService
  • Other: HostNetwork

    Note

    HostNetwork has a hostNetwork field with the following default values for the optional binding ports: httpPort: 80, httpsPort: 443, and statsPort: 1936. With the binding ports, you can deploy multiple Ingress Controllers on the same node for the HostNetwork strategy.

    Example

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: internal
      namespace: openshift-ingress-operator
    spec:
      domain: example.com
      endpointPublishingStrategy:
        type: HostNetwork
        hostNetwork:
          httpPort: 80
          httpsPort: 443
          statsPort: 1936
    Copy to Clipboard Toggle word wrap

    Note

    On Red Hat OpenStack Platform (RHOSP), the LoadBalancerService endpoint publishing strategy is only supported if a cloud provider is configured to create health monitors. For RHOSP 16.1 and 16.2, this strategy is only possible if you use the Amphora Octavia provider.

    For more information, see the "Setting cloud provider options" section of the RHOSP installation documentation.

For most platforms, the endpointPublishingStrategy value can be updated. On GCP, you can configure the following endpointPublishingStrategy fields:

  • loadBalancer.scope
  • loadbalancer.providerParameters.gcp.clientAccess
  • hostNetwork.protocol
  • nodePort.protocol

defaultCertificate

The defaultCertificate value is a reference to a secret that contains the default certificate that is served by the Ingress Controller. When Routes do not specify their own certificate, defaultCertificate is used.

The secret must contain the following keys and data: * tls.crt: certificate file contents * tls.key: key file contents

If not set, a wildcard certificate is automatically generated and used. The certificate is valid for the Ingress Controller domain and subdomains, and the generated certificate’s CA is automatically integrated with the cluster’s trust store.

The in-use certificate, whether generated or user-specified, is automatically integrated with OpenShift Container Platform built-in OAuth server.

namespaceSelector

namespaceSelector is used to filter the set of namespaces serviced by the Ingress Controller. This is useful for implementing shards.

routeSelector

routeSelector is used to filter the set of Routes serviced by the Ingress Controller. This is useful for implementing shards.

nodePlacement

nodePlacement enables explicit control over the scheduling of the Ingress Controller.

If not set, the defaults values are used.

Note

The nodePlacement parameter includes two parts, nodeSelector and tolerations. For example:

nodePlacement:
 nodeSelector:
   matchLabels:
     kubernetes.io/os: linux
 tolerations:
 - effect: NoSchedule
   operator: Exists
Copy to Clipboard Toggle word wrap

tlsSecurityProfile

tlsSecurityProfile specifies settings for TLS connections for Ingress Controllers.

If not set, the default value is based on the apiservers.config.openshift.io/cluster resource.

When using the Old, Intermediate, and Modern profile types, the effective profile configuration is subject to change between releases. For example, given a specification to use the Intermediate profile deployed on release X.Y.Z, an upgrade to release X.Y.Z+1 may cause a new profile configuration to be applied to the Ingress Controller, resulting in a rollout.

The minimum TLS version for Ingress Controllers is 1.1, and the maximum TLS version is 1.3.

Note

Ciphers and the minimum TLS version of the configured security profile are reflected in the TLSProfile status.

Important

The Ingress Operator converts the TLS 1.0 of an Old or Custom profile to 1.1.

clientTLS

clientTLS authenticates client access to the cluster and services; as a result, mutual TLS authentication is enabled. If not set, then client TLS is not enabled.

clientTLS has the required subfields, spec.clientTLS.clientCertificatePolicy and spec.clientTLS.ClientCA.

The ClientCertificatePolicy subfield accepts one of the two values: Required or Optional. The ClientCA subfield specifies a config map that is in the openshift-config namespace. The config map should contain a CA certificate bundle.

The AllowedSubjectPatterns is an optional value that specifies a list of regular expressions, which are matched against the distinguished name on a valid client certificate to filter requests. The regular expressions must use PCRE syntax. At least one pattern must match a client certificate’s distinguished name; otherwise, the Ingress Controller rejects the certificate and denies the connection. If not specified, the Ingress Controller does not reject certificates based on the distinguished name.

routeAdmission

routeAdmission defines a policy for handling new route claims, such as allowing or denying claims across namespaces.

namespaceOwnership describes how hostname claims across namespaces should be handled. The default is Strict.

  • Strict: does not allow routes to claim the same hostname across namespaces.
  • InterNamespaceAllowed: allows routes to claim different paths of the same hostname across namespaces.

wildcardPolicy describes how routes with wildcard policies are handled by the Ingress Controller.

  • WildcardsAllowed: Indicates routes with any wildcard policy are admitted by the Ingress Controller.
  • WildcardsDisallowed: Indicates only routes with a wildcard policy of None are admitted by the Ingress Controller. Updating wildcardPolicy from WildcardsAllowed to WildcardsDisallowed causes admitted routes with a wildcard policy of Subdomain to stop working. These routes must be recreated to a wildcard policy of None to be readmitted by the Ingress Controller. WildcardsDisallowed is the default setting.

IngressControllerLogging

logging defines parameters for what is logged where. If this field is empty, operational logs are enabled but access logs are disabled.

  • access describes how client requests are logged. If this field is empty, access logging is disabled.

    • destination describes a destination for log messages.

      • type is the type of destination for logs:

        • Container specifies that logs should go to a sidecar container. The Ingress Operator configures the container, named logs, on the Ingress Controller pod and configures the Ingress Controller to write logs to the container. The expectation is that the administrator configures a custom logging solution that reads logs from this container. Using container logs means that logs may be dropped if the rate of logs exceeds the container runtime capacity or the custom logging solution capacity.
        • Syslog specifies that logs are sent to a Syslog endpoint. The administrator must specify an endpoint that can receive Syslog messages. The expectation is that the administrator has configured a custom Syslog instance.
      • container describes parameters for the Container logging destination type. Currently there are no parameters for container logging, so this field must be empty.
      • syslog describes parameters for the Syslog logging destination type:

        • address is the IP address of the syslog endpoint that receives log messages.
        • port is the UDP port number of the syslog endpoint that receives log messages.
        • maxLength is the maximum length of the syslog message. It must be between 480 and 4096 bytes. If this field is empty, the maximum length is set to the default value of 1024 bytes.
        • facility specifies the syslog facility of log messages. If this field is empty, the facility is local1. Otherwise, it must specify a valid syslog facility: kern, user, mail, daemon, auth, syslog, lpr, news, uucp, cron, auth2, ftp, ntp, audit, alert, cron2, local0, local1, local2, local3. local4, local5, local6, or local7.
    • httpLogFormat specifies the format of the log message for an HTTP request. If this field is empty, log messages use the implementation’s default HTTP log format. For HAProxy’s default HTTP log format, see the HAProxy documentation.

httpHeaders

httpHeaders defines the policy for HTTP headers.

By setting the forwardedHeaderPolicy for the IngressControllerHTTPHeaders, you specify when and how the Ingress Controller sets the Forwarded, X-Forwarded-For, X-Forwarded-Host, X-Forwarded-Port, X-Forwarded-Proto, and X-Forwarded-Proto-Version HTTP headers.

By default, the policy is set to Append.

  • Append specifies that the Ingress Controller appends the headers, preserving any existing headers.
  • Replace specifies that the Ingress Controller sets the headers, removing any existing headers.
  • IfNone specifies that the Ingress Controller sets the headers if they are not already set.
  • Never specifies that the Ingress Controller never sets the headers, preserving any existing headers.

By setting headerNameCaseAdjustments, you can specify case adjustments that can be applied to HTTP header names. Each adjustment is specified as an HTTP header name with the desired capitalization. For example, specifying X-Forwarded-For indicates that the x-forwarded-for HTTP header should be adjusted to have the specified capitalization.

These adjustments are only applied to cleartext, edge-terminated, and re-encrypt routes, and only when using HTTP/1.

For request headers, these adjustments are applied only for routes that have the haproxy.router.openshift.io/h1-adjust-case=true annotation. For response headers, these adjustments are applied to all HTTP responses. If this field is empty, no request headers are adjusted.

httpCompression

httpCompression defines the policy for HTTP traffic compression.

  • mimeTypes defines a list of MIME types to which compression should be applied. For example, text/css; charset=utf-8, text/html, text/*, image/svg+xml, application/octet-stream, X-custom/customsub, using the format pattern, type/subtype; [;attribute=value]. The types are: application, image, message, multipart, text, video, or a custom type prefaced by X-; e.g. To see the full notation for MIME types and subtypes, see RFC1341

httpErrorCodePages

httpErrorCodePages specifies custom HTTP error code response pages. By default, an IngressController uses error pages built into the IngressController image.

httpCaptureCookies

httpCaptureCookies specifies HTTP cookies that you want to capture in access logs. If the httpCaptureCookies field is empty, the access logs do not capture the cookies.

For any cookie that you want to capture, the following parameters must be in your IngressController configuration:

  • name specifies the name of the cookie.
  • maxLength specifies tha maximum length of the cookie.
  • matchType specifies if the field name of the cookie exactly matches the capture cookie setting or is a prefix of the capture cookie setting. The matchType field uses the Exact and Prefix parameters.

For example:

  httpCaptureCookies:
  - matchType: Exact
    maxLength: 128
    name: MYCOOKIE
Copy to Clipboard Toggle word wrap

httpCaptureHeaders

httpCaptureHeaders specifies the HTTP headers that you want to capture in the access logs. If the httpCaptureHeaders field is empty, the access logs do not capture the headers.

httpCaptureHeaders contains two lists of headers to capture in the access logs. The two lists of header fields are request and response. In both lists, the name field must specify the header name and the maxlength field must specify the maximum length of the header. For example:

  httpCaptureHeaders:
    request:
    - maxLength: 256
      name: Connection
    - maxLength: 128
      name: User-Agent
    response:
    - maxLength: 256
      name: Content-Type
    - maxLength: 256
      name: Content-Length
Copy to Clipboard Toggle word wrap

tuningOptions

tuningOptions specifies options for tuning the performance of Ingress Controller pods.

  • clientFinTimeout specifies how long a connection is held open while waiting for the client response to the server closing the connection. The default timeout is 1s.
  • clientTimeout specifies how long a connection is held open while waiting for a client response. The default timeout is 30s.
  • headerBufferBytes specifies how much memory is reserved, in bytes, for Ingress Controller connection sessions. This value must be at least 16384 if HTTP/2 is enabled for the Ingress Controller. If not set, the default value is 32768 bytes. Setting this field not recommended because headerBufferBytes values that are too small can break the Ingress Controller, and headerBufferBytes values that are too large could cause the Ingress Controller to use significantly more memory than necessary.
  • headerBufferMaxRewriteBytes specifies how much memory should be reserved, in bytes, from headerBufferBytes for HTTP header rewriting and appending for Ingress Controller connection sessions. The minimum value for headerBufferMaxRewriteBytes is 4096. headerBufferBytes must be greater than headerBufferMaxRewriteBytes for incoming HTTP requests. If not set, the default value is 8192 bytes. Setting this field not recommended because headerBufferMaxRewriteBytes values that are too small can break the Ingress Controller and headerBufferMaxRewriteBytes values that are too large could cause the Ingress Controller to use significantly more memory than necessary.
  • healthCheckInterval specifies how long the router waits between health checks. The default is 5s.
  • serverFinTimeout specifies how long a connection is held open while waiting for the server response to the client that is closing the connection. The default timeout is 1s.
  • serverTimeout specifies how long a connection is held open while waiting for a server response. The default timeout is 30s.
  • threadCount specifies the number of threads to create per HAProxy process. Creating more threads allows each Ingress Controller pod to handle more connections, at the cost of more system resources being used. HAProxy supports up to 64 threads. If this field is empty, the Ingress Controller uses the default value of 4 threads. The default value can change in future releases. Setting this field is not recommended because increasing the number of HAProxy threads allows Ingress Controller pods to use more CPU time under load, and prevent other pods from receiving the CPU resources they need to perform. Reducing the number of threads can cause the Ingress Controller to perform poorly.
  • tlsInspectDelay specifies how long the router can hold data to find a matching route. Setting this value too short can cause the router to fall back to the default certificate for edge-terminated, reencrypted, or passthrough routes, even when using a better matched certificate. The default inspect delay is 5s.
  • tunnelTimeout specifies how long a tunnel connection, including websockets, remains open while the tunnel is idle. The default timeout is 1h.
  • maxConnections specifies the maximum number of simultaneous connections that can be established per HAProxy process. Increasing this value allows each ingress controller pod handle more connections at the cost of additional system resources. Permitted values are 0, -1, any value within the range 2000 and 2000000, or the field can be left empty.

    • If this field is left empty or has the value 0, the ingress controller will use the default value of 20000. This value is subject to change in future releases.
    • If the field has the value of -1, then HAProxy will dynamically compute a maximum value based on the available ulimits in the running container. This process results in a large computed value that will incur significant memory usage compared to the current default value of 20000.
    • If the field has a value that is greater than the current operating system limit, the HAProxy process will not start.
    • If you choose a discrete value and the router pod is migrated to a new node, it is possible the new node does not have an identical ulimit configured. In such cases, the pod fails to start.
    • If you have nodes with different ulimits configured, and you choose a discrete value, it is recommended to use the value of -1 for this field so that the maximum number of connections is calculated at runtime.

logEmptyRequests

logEmptyRequests specifies connections for which no request is received and logged. These empty requests come from load balancer health probes or web browser speculative connections (preconnect) and logging these requests can be undesirable. However, these requests can be caused by network errors, in which case logging empty requests can be useful for diagnosing the errors. These requests can be caused by port scans, and logging empty requests can aid in detecting intrusion attempts. Allowed values for this field are Log and Ignore. The default value is Log.

The LoggingPolicy type accepts either one of two values:

  • Log: Setting this value to Log indicates that an event should be logged.
  • Ignore: Setting this value to Ignore sets the dontlognull option in the HAproxy configuration.

HTTPEmptyRequestsPolicy

HTTPEmptyRequestsPolicy describes how HTTP connections are handled if the connection times out before a request is received. Allowed values for this field are Respond and Ignore. The default value is Respond.

The HTTPEmptyRequestsPolicy type accepts either one of two values:

  • Respond: If the field is set to Respond, the Ingress Controller sends an HTTP 400 or 408 response, logs the connection if access logging is enabled, and counts the connection in the appropriate metrics.
  • Ignore: Setting this option to Ignore adds the http-ignore-probes parameter in the HAproxy configuration. If the field is set to Ignore, the Ingress Controller closes the connection without sending a response, then logs the connection, or incrementing metrics.

These connections come from load balancer health probes or web browser speculative connections (preconnect) and can be safely ignored. However, these requests can be caused by network errors, so setting this field to Ignore can impede detection and diagnosis of problems. These requests can be caused by port scans, in which case logging empty requests can aid in detecting intrusion attempts.

Note

All parameters are optional.

6.3.1. Ingress Controller TLS security profiles

TLS security profiles provide a way for servers to regulate which ciphers a connecting client can use when connecting to the server.

6.3.1.1. Understanding TLS security profiles

You can use a TLS (Transport Layer Security) security profile to define which TLS ciphers are required by various OpenShift Container Platform components. The OpenShift Container Platform TLS security profiles are based on Mozilla recommended configurations.

You can specify one of the following TLS security profiles for each component:

Expand
Table 6.1. TLS security profiles
ProfileDescription

Old

This profile is intended for use with legacy clients or libraries. The profile is based on the Old backward compatibility recommended configuration.

The Old profile requires a minimum TLS version of 1.0.

Note

For the Ingress Controller, the minimum TLS version is converted from 1.0 to 1.1.

Intermediate

This profile is the recommended configuration for the majority of clients. It is the default TLS security profile for the Ingress Controller, kubelet, and control plane. The profile is based on the Intermediate compatibility recommended configuration.

The Intermediate profile requires a minimum TLS version of 1.2.

Modern

This profile is intended for use with modern clients that have no need for backwards compatibility. This profile is based on the Modern compatibility recommended configuration.

The Modern profile requires a minimum TLS version of 1.3.

Custom

This profile allows you to define the TLS version and ciphers to use.

Warning

Use caution when using a Custom profile, because invalid configurations can cause problems.

Note

When using one of the predefined profile types, the effective profile configuration is subject to change between releases. For example, given a specification to use the Intermediate profile deployed on release X.Y.Z, an upgrade to release X.Y.Z+1 might cause a new profile configuration to be applied, resulting in a rollout.

To configure a TLS security profile for an Ingress Controller, edit the IngressController custom resource (CR) to specify a predefined or custom TLS security profile. If a TLS security profile is not configured, the default value is based on the TLS security profile set for the API server.

Sample IngressController CR that configures the Old TLS security profile

apiVersion: operator.openshift.io/v1
kind: IngressController
 ...
spec:
  tlsSecurityProfile:
    old: {}
    type: Old
 ...
Copy to Clipboard Toggle word wrap

The TLS security profile defines the minimum TLS version and the TLS ciphers for TLS connections for Ingress Controllers.

You can see the ciphers and the minimum TLS version of the configured TLS security profile in the IngressController custom resource (CR) under Status.Tls Profile and the configured TLS security profile under Spec.Tls Security Profile. For the Custom TLS security profile, the specific ciphers and minimum TLS version are listed under both parameters.

Note

The HAProxy Ingress Controller image supports TLS 1.3 and the Modern profile.

The Ingress Operator also converts the TLS 1.0 of an Old or Custom profile to 1.1.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.

Procedure

  1. Edit the IngressController CR in the openshift-ingress-operator project to configure the TLS security profile:

    $ oc edit IngressController default -n openshift-ingress-operator
    Copy to Clipboard Toggle word wrap
  2. Add the spec.tlsSecurityProfile field:

    Sample IngressController CR for a Custom profile

    apiVersion: operator.openshift.io/v1
    kind: IngressController
     ...
    spec:
      tlsSecurityProfile:
        type: Custom 
    1
    
        custom: 
    2
    
          ciphers: 
    3
    
          - ECDHE-ECDSA-CHACHA20-POLY1305
          - ECDHE-RSA-CHACHA20-POLY1305
          - ECDHE-RSA-AES128-GCM-SHA256
          - ECDHE-ECDSA-AES128-GCM-SHA256
          minTLSVersion: VersionTLS11
     ...
    Copy to Clipboard Toggle word wrap

    1
    Specify the TLS security profile type (Old, Intermediate, or Custom). The default is Intermediate.
    2
    Specify the appropriate field for the selected type:
    • old: {}
    • intermediate: {}
    • custom:
    3
    For the custom type, specify a list of TLS ciphers and minimum accepted TLS version.
  3. Save the file to apply the changes.

Verification

  • Verify that the profile is set in the IngressController CR:

    $ oc describe IngressController default -n openshift-ingress-operator
    Copy to Clipboard Toggle word wrap

    Example output

    Name:         default
    Namespace:    openshift-ingress-operator
    Labels:       <none>
    Annotations:  <none>
    API Version:  operator.openshift.io/v1
    Kind:         IngressController
     ...
    Spec:
     ...
      Tls Security Profile:
        Custom:
          Ciphers:
            ECDHE-ECDSA-CHACHA20-POLY1305
            ECDHE-RSA-CHACHA20-POLY1305
            ECDHE-RSA-AES128-GCM-SHA256
            ECDHE-ECDSA-AES128-GCM-SHA256
          Min TLS Version:  VersionTLS11
        Type:               Custom
     ...
    Copy to Clipboard Toggle word wrap

6.3.1.3. Configuring mutual TLS authentication

You can configure the Ingress Controller to enable mutual TLS (mTLS) authentication by setting a spec.clientTLS value. The clientTLS value configures the Ingress Controller to verify client certificates. This configuration includes setting a clientCA value, which is a reference to a config map. The config map contains the PEM-encoded CA certificate bundle that is used to verify a client’s certificate. Optionally, you can also configure a list of certificate subject filters.

If the clientCA value specifies an X509v3 certificate revocation list (CRL) distribution point, the Ingress Operator downloads and manages a CRL config map based on the HTTP URI X509v3 CRL Distribution Point specified in each provided certificate. The Ingress Controller uses this config map during mTLS/TLS negotiation. Requests that do not provide valid certificates are rejected.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have a PEM-encoded CA certificate bundle.
  • If your CA bundle references a CRL distribution point, you must have also included the end-entity or leaf certificate to the client CA bundle. This certificate must have included an HTTP URI under CRL Distribution Points, as described in RFC 5280. For example:

     Issuer: C=US, O=Example Inc, CN=Example Global G2 TLS RSA SHA256 2020 CA1
             Subject: SOME SIGNED CERT            X509v3 CRL Distribution Points:
                    Full Name:
                      URI:http://crl.example.com/example.crl
    Copy to Clipboard Toggle word wrap

Procedure

  1. In the openshift-config namespace, create a config map from your CA bundle:

    $ oc create configmap \
       router-ca-certs-default \
       --from-file=ca-bundle.pem=client-ca.crt \
    1
    
       -n openshift-config
    Copy to Clipboard Toggle word wrap
    1
    The config map data key must be ca-bundle.pem, and the data value must be a CA certificate in PEM format.
  2. Edit the IngressController resource in the openshift-ingress-operator project:

    $ oc edit IngressController default -n openshift-ingress-operator
    Copy to Clipboard Toggle word wrap
  3. Add the spec.clientTLS field and subfields to configure mutual TLS:

    Sample IngressController CR for a clientTLS profile that specifies filtering patterns

      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        clientTLS:
          clientCertificatePolicy: Required
          clientCA:
            name: router-ca-certs-default
          allowedSubjectPatterns:
          - "^/CN=example.com/ST=NC/C=US/O=Security/OU=OpenShift$"
    Copy to Clipboard Toggle word wrap

6.4. View the default Ingress Controller

The Ingress Operator is a core feature of OpenShift Container Platform and is enabled out of the box.

Every new OpenShift Container Platform installation has an ingresscontroller named default. It can be supplemented with additional Ingress Controllers. If the default ingresscontroller is deleted, the Ingress Operator will automatically recreate it within a minute.

Procedure

  • View the default Ingress Controller:

    $ oc describe --namespace=openshift-ingress-operator ingresscontroller/default
    Copy to Clipboard Toggle word wrap

6.5. View Ingress Operator status

You can view and inspect the status of your Ingress Operator.

Procedure

  • View your Ingress Operator status:

    $ oc describe clusteroperators/ingress
    Copy to Clipboard Toggle word wrap

6.6. View Ingress Controller logs

You can view your Ingress Controller logs.

Procedure

  • View your Ingress Controller logs:

    $ oc logs --namespace=openshift-ingress-operator deployments/ingress-operator -c <container_name>
    Copy to Clipboard Toggle word wrap

6.7. View Ingress Controller status

Your can view the status of a particular Ingress Controller.

Procedure

  • View the status of an Ingress Controller:

    $ oc describe --namespace=openshift-ingress-operator ingresscontroller/<name>
    Copy to Clipboard Toggle word wrap

6.8. Configuring the Ingress Controller

6.8.1. Setting a custom default certificate

As an administrator, you can configure an Ingress Controller to use a custom certificate by creating a Secret resource and editing the IngressController custom resource (CR).

Prerequisites

  • You must have a certificate/key pair in PEM-encoded files, where the certificate is signed by a trusted certificate authority or by a private trusted certificate authority that you configured in a custom PKI.
  • Your certificate meets the following requirements:

    • The certificate is valid for the ingress domain.
    • The certificate uses the subjectAltName extension to specify a wildcard domain, such as *.apps.ocp4.example.com.
  • You must have an IngressController CR. You may use the default one:

    $ oc --namespace openshift-ingress-operator get ingresscontrollers
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      AGE
    default   10m
    Copy to Clipboard Toggle word wrap

Note

If you have intermediate certificates, they must be included in the tls.crt file of the secret containing a custom default certificate. Order matters when specifying a certificate; list your intermediate certificate(s) after any server certificate(s).

Procedure

The following assumes that the custom certificate and key pair are in the tls.crt and tls.key files in the current working directory. Substitute the actual path names for tls.crt and tls.key. You also may substitute another name for custom-certs-default when creating the Secret resource and referencing it in the IngressController CR.

Note

This action will cause the Ingress Controller to be redeployed, using a rolling deployment strategy.

  1. Create a Secret resource containing the custom certificate in the openshift-ingress namespace using the tls.crt and tls.key files.

    $ oc --namespace openshift-ingress create secret tls custom-certs-default --cert=tls.crt --key=tls.key
    Copy to Clipboard Toggle word wrap
  2. Update the IngressController CR to reference the new certificate secret:

    $ oc patch --type=merge --namespace openshift-ingress-operator ingresscontrollers/default \
      --patch '{"spec":{"defaultCertificate":{"name":"custom-certs-default"}}}'
    Copy to Clipboard Toggle word wrap
  3. Verify the update was effective:

    $ echo Q |\
      openssl s_client -connect console-openshift-console.apps.<domain>:443 -showcerts 2>/dev/null |\
      openssl x509 -noout -subject -issuer -enddate
    Copy to Clipboard Toggle word wrap

    where:

    <domain>
    Specifies the base domain name for your cluster.

    Example output

    subject=C = US, ST = NC, L = Raleigh, O = RH, OU = OCP4, CN = *.apps.example.com
    issuer=C = US, ST = NC, L = Raleigh, O = RH, OU = OCP4, CN = example.com
    notAfter=May 10 08:32:45 2022 GM
    Copy to Clipboard Toggle word wrap

    Tip

    You can alternatively apply the following YAML to set a custom default certificate:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      defaultCertificate:
        name: custom-certs-default
    Copy to Clipboard Toggle word wrap

    The certificate secret name should match the value used to update the CR.

Once the IngressController CR has been modified, the Ingress Operator updates the Ingress Controller’s deployment to use the custom certificate.

6.8.2. Removing a custom default certificate

As an administrator, you can remove a custom certificate that you configured an Ingress Controller to use.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the OpenShift CLI (oc).
  • You previously configured a custom default certificate for the Ingress Controller.

Procedure

  • To remove the custom certificate and restore the certificate that ships with OpenShift Container Platform, enter the following command:

    $ oc patch -n openshift-ingress-operator ingresscontrollers/default \
      --type json -p $'- op: remove\n  path: /spec/defaultCertificate'
    Copy to Clipboard Toggle word wrap

    There can be a delay while the cluster reconciles the new certificate configuration.

Verification

  • To confirm that the original cluster certificate is restored, enter the following command:

    $ echo Q | \
      openssl s_client -connect console-openshift-console.apps.<domain>:443 -showcerts 2>/dev/null | \
      openssl x509 -noout -subject -issuer -enddate
    Copy to Clipboard Toggle word wrap

    where:

    <domain>
    Specifies the base domain name for your cluster.

    Example output

    subject=CN = *.apps.<domain>
    issuer=CN = ingress-operator@1620633373
    notAfter=May 10 10:44:36 2023 GMT
    Copy to Clipboard Toggle word wrap

6.8.3. Scaling an Ingress Controller

Manually scale an Ingress Controller to meeting routing performance or availability requirements such as the requirement to increase throughput. oc commands are used to scale the IngressController resource. The following procedure provides an example for scaling up the default IngressController.

Note

Scaling is not an immediate action, as it takes time to create the desired number of replicas.

Procedure

  1. View the current number of available replicas for the default IngressController:

    $ oc get -n openshift-ingress-operator ingresscontrollers/default -o jsonpath='{$.status.availableReplicas}'
    Copy to Clipboard Toggle word wrap

    Example output

    2
    Copy to Clipboard Toggle word wrap

  2. Scale the default IngressController to the desired number of replicas using the oc patch command. The following example scales the default IngressController to 3 replicas:

    $ oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
    Copy to Clipboard Toggle word wrap

    Example output

    ingresscontroller.operator.openshift.io/default patched
    Copy to Clipboard Toggle word wrap

  3. Verify that the default IngressController scaled to the number of replicas that you specified:

    $ oc get -n openshift-ingress-operator ingresscontrollers/default -o jsonpath='{$.status.availableReplicas}'
    Copy to Clipboard Toggle word wrap

    Example output

    3
    Copy to Clipboard Toggle word wrap

    Tip

    You can alternatively apply the following YAML to scale an Ingress Controller to three replicas:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      replicas: 3               
    1
    Copy to Clipboard Toggle word wrap
    1
    If you need a different amount of replicas, change the replicas value.

6.8.4. Configuring Ingress access logging

You can configure the Ingress Controller to enable access logs. If you have clusters that do not receive much traffic, then you can log to a sidecar. If you have high traffic clusters, to avoid exceeding the capacity of the logging stack or to integrate with a logging infrastructure outside of OpenShift Container Platform, you can forward logs to a custom syslog endpoint. You can also specify the format for access logs.

Container logging is useful to enable access logs on low-traffic clusters when there is no existing Syslog logging infrastructure, or for short-term use while diagnosing problems with the Ingress Controller.

Syslog is needed for high-traffic clusters where access logs could exceed the OpenShift Logging stack’s capacity, or for environments where any logging solution needs to integrate with an existing Syslog logging infrastructure. The Syslog use-cases can overlap.

Prerequisites

  • Log in as a user with cluster-admin privileges.

Procedure

Configure Ingress access logging to a sidecar.

  • To configure Ingress access logging, you must specify a destination using spec.logging.access.destination. To specify logging to a sidecar container, you must specify Container spec.logging.access.destination.type. The following example is an Ingress Controller definition that logs to a Container destination:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      replicas: 2
      logging:
        access:
          destination:
            type: Container
    Copy to Clipboard Toggle word wrap
  • When you configure the Ingress Controller to log to a sidecar, the operator creates a container named logs inside the Ingress Controller Pod:

    $ oc -n openshift-ingress logs deployment.apps/router-default -c logs
    Copy to Clipboard Toggle word wrap

    Example output

    2020-05-11T19:11:50.135710+00:00 router-default-57dfc6cd95-bpmk6 router-default-57dfc6cd95-bpmk6 haproxy[108]: 174.19.21.82:39654 [11/May/2020:19:11:50.133] public be_http:hello-openshift:hello-openshift/pod:hello-openshift:hello-openshift:10.128.2.12:8080 0/0/1/0/1 200 142 - - --NI 1/1/0/0/0 0/0 "GET / HTTP/1.1"
    Copy to Clipboard Toggle word wrap

Configure Ingress access logging to a Syslog endpoint.

  • To configure Ingress access logging, you must specify a destination using spec.logging.access.destination. To specify logging to a Syslog endpoint destination, you must specify Syslog for spec.logging.access.destination.type. If the destination type is Syslog, you must also specify a destination endpoint using spec.logging.access.destination.syslog.endpoint and you can specify a facility using spec.logging.access.destination.syslog.facility. The following example is an Ingress Controller definition that logs to a Syslog destination:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      replicas: 2
      logging:
        access:
          destination:
            type: Syslog
            syslog:
              address: 1.2.3.4
              port: 10514
    Copy to Clipboard Toggle word wrap
    Note

    The syslog destination port must be UDP.

Configure Ingress access logging with a specific log format.

  • You can specify spec.logging.access.httpLogFormat to customize the log format. The following example is an Ingress Controller definition that logs to a syslog endpoint with IP address 1.2.3.4 and port 10514:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      replicas: 2
      logging:
        access:
          destination:
            type: Syslog
            syslog:
              address: 1.2.3.4
              port: 10514
          httpLogFormat: '%ci:%cp [%t] %ft %b/%s %B %bq %HM %HU %HV'
    Copy to Clipboard Toggle word wrap

Disable Ingress access logging.

  • To disable Ingress access logging, leave spec.logging or spec.logging.access empty:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      replicas: 2
      logging:
        access: null
    Copy to Clipboard Toggle word wrap

6.8.5. Setting Ingress Controller thread count

A cluster administrator can set the thread count to increase the amount of incoming connections a cluster can handle. You can patch an existing Ingress Controller to increase the amount of threads.

Prerequisites

  • The following assumes that you already created an Ingress Controller.

Procedure

  • Update the Ingress Controller to increase the number of threads:

    $ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"threadCount": 8}}}'
    Copy to Clipboard Toggle word wrap
    Note

    If you have a node that is capable of running large amounts of resources, you can configure spec.nodePlacement.nodeSelector with labels that match the capacity of the intended node, and configure spec.tuningOptions.threadCount to an appropriately high value.

When creating an Ingress Controller on cloud platforms, the Ingress Controller is published by a public cloud load balancer by default. As an administrator, you can create an Ingress Controller that uses an internal cloud load balancer.

Warning

If your cloud provider is Microsoft Azure, you must have at least one public load balancer that points to your nodes. If you do not, all of your nodes will lose egress connectivity to the internet.

Important

If you want to change the scope for an IngressController, you can change the .spec.endpointPublishingStrategy.loadBalancer.scope parameter after the custom resource (CR) is created.

Figure 6.1. Diagram of LoadBalancer

The preceding graphic shows the following concepts pertaining to OpenShift Container Platform Ingress LoadBalancerService endpoint publishing strategy:

  • You can load balance externally, using the cloud provider load balancer, or internally, using the OpenShift Ingress Controller Load Balancer.
  • You can use the single IP address of the load balancer and more familiar ports, such as 8080 and 4200 as shown on the cluster depicted in the graphic.
  • Traffic from the external load balancer is directed at the pods, and managed by the load balancer, as depicted in the instance of a down node. See the Kubernetes Services documentation for implementation details.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create an IngressController custom resource (CR) in a file named <name>-ingress-controller.yaml, such as in the following example:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      namespace: openshift-ingress-operator
      name: <name> 
    1
    
    spec:
      domain: <domain> 
    2
    
      endpointPublishingStrategy:
        type: LoadBalancerService
        loadBalancer:
          scope: Internal 
    3
    Copy to Clipboard Toggle word wrap
    1
    Replace <name> with a name for the IngressController object.
    2
    Specify the domain for the application published by the controller.
    3
    Specify a value of Internal to use an internal load balancer.
  2. Create the Ingress Controller defined in the previous step by running the following command:

    $ oc create -f <name>-ingress-controller.yaml 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <name> with the name of the IngressController object.
  3. Optional: Confirm that the Ingress Controller was created by running the following command:

    $ oc --all-namespaces=true get ingresscontrollers
    Copy to Clipboard Toggle word wrap

An Ingress Controller created on GCP with an internal load balancer generates an internal IP address for the service. A cluster administrator can specify the global access option, which enables clients in any region within the same VPC network and compute region as the load balancer, to reach the workloads running on your cluster.

For more information, see the GCP documentation for global access.

Prerequisites

  • You deployed an OpenShift Container Platform cluster on GCP infrastructure.
  • You configured an Ingress Controller to use an internal load balancer.
  • You installed the OpenShift CLI (oc).

Procedure

  1. Configure the Ingress Controller resource to allow global access.

    Note

    You can also create an Ingress Controller and specify the global access option.

    1. Configure the Ingress Controller resource:

      $ oc -n openshift-ingress-operator edit ingresscontroller/default
      Copy to Clipboard Toggle word wrap
    2. Edit the YAML file:

      Sample clientAccess configuration to Global

        spec:
          endpointPublishingStrategy:
            loadBalancer:
              providerParameters:
                gcp:
                  clientAccess: Global 
      1
      
                type: GCP
              scope: Internal
            type: LoadBalancerService
      Copy to Clipboard Toggle word wrap

      1
      Set gcp.clientAccess to Global.
    3. Save the file to apply the changes.
  2. Run the following command to verify that the service allows global access:

    $ oc -n openshift-ingress edit svc/router-default -o yaml
    Copy to Clipboard Toggle word wrap

    The output shows that global access is enabled for GCP with the annotation, networking.gke.io/internal-load-balancer-allow-global-access.

A cluster administrator can set the health check interval to define how long the router waits between two consecutive health checks. This value is applied globally as a default for all routes. The default value is 5 seconds.

Prerequisites

  • The following assumes that you already created an Ingress Controller.

Procedure

  • Update the Ingress Controller to change the interval between back end health checks:

    $ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"healthCheckInterval": "8s"}}}'
    Copy to Clipboard Toggle word wrap
    Note

    To override the healthCheckInterval for a single route, use the route annotation router.openshift.io/haproxy.health.check.interval

You can configure the default Ingress Controller for your cluster to be internal by deleting and recreating it.

Warning

If your cloud provider is Microsoft Azure, you must have at least one public load balancer that points to your nodes. If you do not, all of your nodes will lose egress connectivity to the internet.

Important

If you want to change the scope for an IngressController, you can change the .spec.endpointPublishingStrategy.loadBalancer.scope parameter after the custom resource (CR) is created.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Configure the default Ingress Controller for your cluster to be internal by deleting and recreating it.

    $ oc replace --force --wait --filename - <<EOF
    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      namespace: openshift-ingress-operator
      name: default
    spec:
      endpointPublishingStrategy:
        type: LoadBalancerService
        loadBalancer:
          scope: Internal
    EOF
    Copy to Clipboard Toggle word wrap

6.8.10. Configuring the route admission policy

Administrators and application developers can run applications in multiple namespaces with the same domain name. This is for organizations where multiple teams develop microservices that are exposed on the same hostname.

Warning

Allowing claims across namespaces should only be enabled for clusters with trust between namespaces, otherwise a malicious user could take over a hostname. For this reason, the default admission policy disallows hostname claims across namespaces.

Prerequisites

  • Cluster administrator privileges.

Procedure

  • Edit the .spec.routeAdmission field of the ingresscontroller resource variable using the following command:

    $ oc -n openshift-ingress-operator patch ingresscontroller/default --patch '{"spec":{"routeAdmission":{"namespaceOwnership":"InterNamespaceAllowed"}}}' --type=merge
    Copy to Clipboard Toggle word wrap

    Sample Ingress Controller configuration

    spec:
      routeAdmission:
        namespaceOwnership: InterNamespaceAllowed
    ...
    Copy to Clipboard Toggle word wrap

    Tip

    You can alternatively apply the following YAML to configure the route admission policy:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      routeAdmission:
        namespaceOwnership: InterNamespaceAllowed
    Copy to Clipboard Toggle word wrap

6.8.11. Using wildcard routes

The HAProxy Ingress Controller has support for wildcard routes. The Ingress Operator uses wildcardPolicy to configure the ROUTER_ALLOW_WILDCARD_ROUTES environment variable of the Ingress Controller.

The default behavior of the Ingress Controller is to admit routes with a wildcard policy of None, which is backwards compatible with existing IngressController resources.

Procedure

  1. Configure the wildcard policy.

    1. Use the following command to edit the IngressController resource:

      $ oc edit IngressController
      Copy to Clipboard Toggle word wrap
    2. Under spec, set the wildcardPolicy field to WildcardsDisallowed or WildcardsAllowed:

      spec:
        routeAdmission:
          wildcardPolicy: WildcardsDisallowed # or WildcardsAllowed
      Copy to Clipboard Toggle word wrap

6.8.12. Using X-Forwarded headers

You configure the HAProxy Ingress Controller to specify a policy for how to handle HTTP headers including Forwarded and X-Forwarded-For. The Ingress Operator uses the HTTPHeaders field to configure the ROUTER_SET_FORWARDED_HEADERS environment variable of the Ingress Controller.

Procedure

  1. Configure the HTTPHeaders field for the Ingress Controller.

    1. Use the following command to edit the IngressController resource:

      $ oc edit IngressController
      Copy to Clipboard Toggle word wrap
    2. Under spec, set the HTTPHeaders policy field to Append, Replace, IfNone, or Never:

      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        httpHeaders:
          forwardedHeaderPolicy: Append
      Copy to Clipboard Toggle word wrap
Example use cases

As a cluster administrator, you can:

  • Configure an external proxy that injects the X-Forwarded-For header into each request before forwarding it to an Ingress Controller.

    To configure the Ingress Controller to pass the header through unmodified, you specify the never policy. The Ingress Controller then never sets the headers, and applications receive only the headers that the external proxy provides.

  • Configure the Ingress Controller to pass the X-Forwarded-For header that your external proxy sets on external cluster requests through unmodified.

    To configure the Ingress Controller to set the X-Forwarded-For header on internal cluster requests, which do not go through the external proxy, specify the if-none policy. If an HTTP request already has the header set through the external proxy, then the Ingress Controller preserves it. If the header is absent because the request did not come through the proxy, then the Ingress Controller adds the header.

As an application developer, you can:

  • Configure an application-specific external proxy that injects the X-Forwarded-For header.

    To configure an Ingress Controller to pass the header through unmodified for an application’s Route, without affecting the policy for other Routes, add an annotation haproxy.router.openshift.io/set-forwarded-headers: if-none or haproxy.router.openshift.io/set-forwarded-headers: never on the Route for the application.

    Note

    You can set the haproxy.router.openshift.io/set-forwarded-headers annotation on a per route basis, independent from the globally set value for the Ingress Controller.

6.8.13. Enabling HTTP/2 Ingress connectivity

You can enable transparent end-to-end HTTP/2 connectivity in HAProxy. It allows application owners to make use of HTTP/2 protocol capabilities, including single connection, header compression, binary streams, and more.

You can enable HTTP/2 connectivity for an individual Ingress Controller or for the entire cluster.

To enable the use of HTTP/2 for the connection from the client to HAProxy, a route must specify a custom certificate. A route that uses the default certificate cannot use HTTP/2. This restriction is necessary to avoid problems from connection coalescing, where the client re-uses a connection for different routes that use the same certificate.

The connection from HAProxy to the application pod can use HTTP/2 only for re-encrypt routes and not for edge-terminated or insecure routes. This restriction is because HAProxy uses Application-Level Protocol Negotiation (ALPN), which is a TLS extension, to negotiate the use of HTTP/2 with the back-end. The implication is that end-to-end HTTP/2 is possible with passthrough and re-encrypt and not with insecure or edge-terminated routes.

Warning

Using WebSockets with a re-encrypt route and with HTTP/2 enabled on an Ingress Controller requires WebSocket support over HTTP/2. WebSockets over HTTP/2 is a feature of HAProxy 2.4, which is unsupported in OpenShift Container Platform at this time.

Important

For non-passthrough routes, the Ingress Controller negotiates its connection to the application independently of the connection from the client. This means a client may connect to the Ingress Controller and negotiate HTTP/1.1, and the Ingress Controller may then connect to the application, negotiate HTTP/2, and forward the request from the client HTTP/1.1 connection using the HTTP/2 connection to the application. This poses a problem if the client subsequently tries to upgrade its connection from HTTP/1.1 to the WebSocket protocol, because the Ingress Controller cannot forward WebSocket to HTTP/2 and cannot upgrade its HTTP/2 connection to WebSocket. Consequently, if you have an application that is intended to accept WebSocket connections, it must not allow negotiating the HTTP/2 protocol or else clients will fail to upgrade to the WebSocket protocol.

Procedure

Enable HTTP/2 on a single Ingress Controller.

  • To enable HTTP/2 on an Ingress Controller, enter the oc annotate command:

    $ oc -n openshift-ingress-operator annotate ingresscontrollers/<ingresscontroller_name> ingress.operator.openshift.io/default-enable-http2=true
    Copy to Clipboard Toggle word wrap

    Replace <ingresscontroller_name> with the name of the Ingress Controller to annotate.

Enable HTTP/2 on the entire cluster.

  • To enable HTTP/2 for the entire cluster, enter the oc annotate command:

    $ oc annotate ingresses.config/cluster ingress.operator.openshift.io/default-enable-http2=true
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively apply the following YAML to add the annotation:

    apiVersion: config.openshift.io/v1
    kind: Ingress
    metadata:
      name: cluster
      annotations:
        ingress.operator.openshift.io/default-enable-http2: "true"
    Copy to Clipboard Toggle word wrap

A cluster administrator can configure the PROXY protocol when an Ingress Controller uses either the HostNetwork or NodePortService endpoint publishing strategy types. The PROXY protocol enables the load balancer to preserve the original client addresses for connections that the Ingress Controller receives. The original client addresses are useful for logging, filtering, and injecting HTTP headers. In the default configuration, the connections that the Ingress Controller receives only contain the source address that is associated with the load balancer.

This feature is not supported in cloud deployments. This restriction is because when OpenShift Container Platform runs in a cloud platform, and an IngressController specifies that a service load balancer should be used, the Ingress Operator configures the load balancer service and enables the PROXY protocol based on the platform requirement for preserving source addresses.

Important

You must configure both OpenShift Container Platform and the external load balancer to either use the PROXY protocol or to use TCP.

Warning

The PROXY protocol is unsupported for the default Ingress Controller with installer-provisioned clusters on non-cloud platforms that use a Keepalived Ingress VIP.

Prerequisites

  • You created an Ingress Controller.

Procedure

  1. Edit the Ingress Controller resource:

    $ oc -n openshift-ingress-operator edit ingresscontroller/default
    Copy to Clipboard Toggle word wrap
  2. Set the PROXY configuration:

    • If your Ingress Controller uses the hostNetwork endpoint publishing strategy type, set the spec.endpointPublishingStrategy.hostNetwork.protocol subfield to PROXY:

      Sample hostNetwork configuration to PROXY

        spec:
          endpointPublishingStrategy:
            hostNetwork:
              protocol: PROXY
            type: HostNetwork
      Copy to Clipboard Toggle word wrap

    • If your Ingress Controller uses the NodePortService endpoint publishing strategy type, set the spec.endpointPublishingStrategy.nodePort.protocol subfield to PROXY:

      Sample nodePort configuration to PROXY

        spec:
          endpointPublishingStrategy:
            nodePort:
              protocol: PROXY
            type: NodePortService
      Copy to Clipboard Toggle word wrap

As a cluster administrator, you can specify an alternative to the default cluster domain for user-created routes by configuring the appsDomain field. The appsDomain field is an optional domain for OpenShift Container Platform to use instead of the default, which is specified in the domain field. If you specify an alternative domain, it overrides the default cluster domain for the purpose of determining the default host for a new route.

For example, you can use the DNS domain for your company as the default domain for routes and ingresses for applications running on your cluster.

Prerequisites

  • You deployed an OpenShift Container Platform cluster.
  • You installed the oc command line interface.

Procedure

  1. Configure the appsDomain field by specifying an alternative default domain for user-created routes.

    1. Edit the ingress cluster resource:

      $ oc edit ingresses.config/cluster -o yaml
      Copy to Clipboard Toggle word wrap
    2. Edit the YAML file:

      Sample appsDomain configuration to test.example.com

      apiVersion: config.openshift.io/v1
      kind: Ingress
      metadata:
        name: cluster
      spec:
        domain: apps.example.com            
      1
      
        appsDomain: <test.example.com>      
      2
      Copy to Clipboard Toggle word wrap

      1
      Specifies the default domain. You cannot modify the default domain after installation.
      2
      Optional: Domain for OpenShift Container Platform infrastructure to use for application routes. Instead of the default prefix, apps, you can use an alternative prefix like test.
  2. Verify that an existing route contains the domain name specified in the appsDomain field by exposing the route and verifying the route domain change:

    Note

    Wait for the openshift-apiserver finish rolling updates before exposing the route.

    1. Expose the route:

      $ oc expose service hello-openshift
      route.route.openshift.io/hello-openshift exposed
      Copy to Clipboard Toggle word wrap

      Example output:

      $ oc get routes
      NAME              HOST/PORT                                   PATH   SERVICES          PORT       TERMINATION   WILDCARD
      hello-openshift   hello_openshift-<my_project>.test.example.com
      hello-openshift   8080-tcp                 None
      Copy to Clipboard Toggle word wrap

6.8.16. Converting HTTP header case

HAProxy 2.2 lowercases HTTP header names by default, for example, changing Host: xyz.com to host: xyz.com. If legacy applications are sensitive to the capitalization of HTTP header names, use the Ingress Controller spec.httpHeaders.headerNameCaseAdjustments API field for a solution to accommodate legacy applications until they can be fixed.

Important

Because OpenShift Container Platform includes HAProxy 2.2, make sure to add the necessary configuration by using spec.httpHeaders.headerNameCaseAdjustments before upgrading.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have access to the cluster as a user with the cluster-admin role.

Procedure

As a cluster administrator, you can convert the HTTP header case by entering the oc patch command or by setting the HeaderNameCaseAdjustments field in the Ingress Controller YAML file.

  • Specify an HTTP header to be capitalized by entering the oc patch command.

    1. Enter the oc patch command to change the HTTP host header to Host:

      $ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"httpHeaders":{"headerNameCaseAdjustments":["Host"]}}}'
      Copy to Clipboard Toggle word wrap
    2. Annotate the route of the application:

      $ oc annotate routes/my-application haproxy.router.openshift.io/h1-adjust-case=true
      Copy to Clipboard Toggle word wrap

      The Ingress Controller then adjusts the host request header as specified.

  • Specify adjustments using the HeaderNameCaseAdjustments field by configuring the Ingress Controller YAML file.

    1. The following example Ingress Controller YAML adjusts the host header to Host for HTTP/1 requests to appropriately annotated routes:

      Example Ingress Controller YAML

      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        httpHeaders:
          headerNameCaseAdjustments:
          - Host
      Copy to Clipboard Toggle word wrap

    2. The following example route enables HTTP response header name case adjustments using the haproxy.router.openshift.io/h1-adjust-case annotation:

      Example route YAML

      apiVersion: route.openshift.io/v1
      kind: Route
      metadata:
        annotations:
          haproxy.router.openshift.io/h1-adjust-case: true 
      1
      
        name: my-application
        namespace: my-application
      spec:
        to:
          kind: Service
          name: my-application
      Copy to Clipboard Toggle word wrap

      1
      Set haproxy.router.openshift.io/h1-adjust-case to true.

6.8.17. Using router compression

You configure the HAProxy Ingress Controller to specify router compression globally for specific MIME types. You can use the mimeTypes variable to define the formats of MIME types to which compression is applied. The types are: application, image, message, multipart, text, video, or a custom type prefaced by "X-". To see the full notation for MIME types and subtypes, see RFC1341.

Note

Memory allocated for compression can affect the max connections. Additionally, compression of large buffers can cause latency, like heavy regex or long lists of regex.

Not all MIME types benefit from compression, but HAProxy still uses resources to try to compress if instructed to. Generally, text formats, such as html, css, and js, formats benefit from compression, but formats that are already compressed, such as image, audio, and video, benefit little in exchange for the time and resources spent on compression.

Procedure

  1. Configure the httpCompression field for the Ingress Controller.

    1. Use the following command to edit the IngressController resource:

      $ oc edit -n openshift-ingress-operator ingresscontrollers/default
      Copy to Clipboard Toggle word wrap
    2. Under spec, set the httpCompression policy field to mimeTypes and specify a list of MIME types that should have compression applied:

      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        httpCompression:
          mimeTypes:
          - "text/html"
          - "text/css; charset=utf-8"
          - "application/json"
         ...
      Copy to Clipboard Toggle word wrap

6.8.18. Exposing router metrics

You can expose the HAProxy router metrics by default in Prometheus format on the default stats port, 1936. The external metrics collection and aggregation systems such as Prometheus can access the HAProxy router metrics. You can view the HAProxy router metrics in a browser in the HTML and comma separated values (CSV) format.

Prerequisites

  • You configured your firewall to access the default stats port, 1936.

Procedure

  1. Get the router pod name by running the following command:

    $ oc get pods -n openshift-ingress
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                              READY   STATUS    RESTARTS   AGE
    router-default-76bfffb66c-46qwp   1/1     Running   0          11h
    Copy to Clipboard Toggle word wrap

  2. Get the router’s username and password, which the router pod stores in the /var/lib/haproxy/conf/metrics-auth/statsUsername and /var/lib/haproxy/conf/metrics-auth/statsPassword files:

    1. Get the username by running the following command:

      $ oc rsh <router_pod_name> cat metrics-auth/statsUsername
      Copy to Clipboard Toggle word wrap
    2. Get the password by running the following command:

      $ oc rsh <router_pod_name> cat metrics-auth/statsPassword
      Copy to Clipboard Toggle word wrap
  3. Get the router IP and metrics certificates by running the following command:

    $ oc describe pod <router_pod>
    Copy to Clipboard Toggle word wrap
  4. Get the raw statistics in Prometheus format by running the following command:

    $ curl -u <user>:<password> http://<router_IP>:<stats_port>/metrics
    Copy to Clipboard Toggle word wrap
  5. Access the metrics securely by running the following command:

    $ curl -u user:password https://<router_IP>:<stats_port>/metrics -k
    Copy to Clipboard Toggle word wrap
  6. Access the default stats port, 1936, by running the following command:

    $ curl -u <user>:<password> http://<router_IP>:<stats_port>/metrics
    Copy to Clipboard Toggle word wrap

    Example 6.1. Example output

    …​ # HELP haproxy_backend_connections_total Total number of connections. # TYPE haproxy_backend_connections_total gauge haproxy_backend_connections_total{backend="http",namespace="default",route="hello-route"} 0 haproxy_backend_connections_total{backend="http",namespace="default",route="hello-route-alt"} 0 haproxy_backend_connections_total{backend="http",namespace="default",route="hello-route01"} 0 …​ # HELP haproxy_exporter_server_threshold Number of servers tracked and the current threshold value. # TYPE haproxy_exporter_server_threshold gauge haproxy_exporter_server_threshold{type="current"} 11 haproxy_exporter_server_threshold{type="limit"} 500 …​ # HELP haproxy_frontend_bytes_in_total Current total of incoming bytes. # TYPE haproxy_frontend_bytes_in_total gauge haproxy_frontend_bytes_in_total{frontend="fe_no_sni"} 0 haproxy_frontend_bytes_in_total{frontend="fe_sni"} 0 haproxy_frontend_bytes_in_total{frontend="public"} 119070 …​ # HELP haproxy_server_bytes_in_total Current total of incoming bytes. # TYPE haproxy_server_bytes_in_total gauge haproxy_server_bytes_in_total{namespace="",pod="",route="",server="fe_no_sni",service=""} 0 haproxy_server_bytes_in_total{namespace="",pod="",route="",server="fe_sni",service=""} 0 haproxy_server_bytes_in_total{namespace="default",pod="docker-registry-5-nk5fz",route="docker-registry",server="10.130.0.89:5000",service="docker-registry"} 0 haproxy_server_bytes_in_total{namespace="default",pod="hello-rc-vkjqx",route="hello-route",server="10.130.0.90:8080",service="hello-svc-1"} 0 …​

  7. Launch the stats window by entering the following URL in a browser:

    http://<user>:<password>@<router_IP>:<stats_port>
    Copy to Clipboard Toggle word wrap
  8. Optional: Get the stats in CSV format by entering the following URL in a browser:

    http://<user>:<password>@<router_ip>:1936/metrics;csv
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can specify a custom error code response page for either 503, 404, or both error pages. The HAProxy router serves a 503 error page when the application pod is not running or a 404 error page when the requested URL does not exist. For example, if you customize the 503 error code response page, then the page is served when the application pod is not running, and the default 404 error code HTTP response page is served by the HAProxy router for an incorrect route or a non-existing route.

Custom error code response pages are specified in a config map then patched to the Ingress Controller. The config map keys have two available file names as follows: error-page-503.http and error-page-404.http.

Custom HTTP error code response pages must follow the HAProxy HTTP error page configuration guidelines. Here is an example of the default OpenShift Container Platform HAProxy router http 503 error code response page. You can use the default content as a template for creating your own custom page.

By default, the HAProxy router serves only a 503 error page when the application is not running or when the route is incorrect or non-existent. This default behavior is the same as the behavior on OpenShift Container Platform 4.8 and earlier. If a config map for the customization of an HTTP error code response is not provided, and you are using a custom HTTP error code response page, the router serves a default 404 or 503 error code response page.

Note

If you use the OpenShift Container Platform default 503 error code page as a template for your customizations, the headers in the file require an editor that can use CRLF line endings.

Procedure

  1. Create a config map named my-custom-error-code-pages in the openshift-config namespace:

    $ oc -n openshift-config create configmap my-custom-error-code-pages \
    --from-file=error-page-503.http \
    --from-file=error-page-404.http
    Copy to Clipboard Toggle word wrap
    Important

    If you do not specify the correct format for the custom error code response page, a router pod outage occurs. To resolve this outage, you must delete or correct the config map and delete the affected router pods so they can be recreated with the correct information.

  2. Patch the Ingress Controller to reference the my-custom-error-code-pages config map by name:

    $ oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"httpErrorCodePages":{"name":"my-custom-error-code-pages"}}}' --type=merge
    Copy to Clipboard Toggle word wrap

    The Ingress Operator copies the my-custom-error-code-pages config map from the openshift-config namespace to the openshift-ingress namespace. The Operator names the config map according to the pattern, <your_ingresscontroller_name>-errorpages, in the openshift-ingress namespace.

  3. Display the copy:

    $ oc get cm default-errorpages -n openshift-ingress
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                       DATA   AGE
    default-errorpages         2      25s  
    1
    Copy to Clipboard Toggle word wrap

    1
    The example config map name is default-errorpages because the default Ingress Controller custom resource (CR) was patched.
  4. Confirm that the config map containing the custom error response page mounts on the router volume where the config map key is the filename that has the custom HTTP error code response:

    • For 503 custom HTTP custom error code response:

      $ oc -n openshift-ingress rsh <router_pod> cat /var/lib/haproxy/conf/error_code_pages/error-page-503.http
      Copy to Clipboard Toggle word wrap
    • For 404 custom HTTP custom error code response:

      $ oc -n openshift-ingress rsh <router_pod> cat /var/lib/haproxy/conf/error_code_pages/error-page-404.http
      Copy to Clipboard Toggle word wrap

Verification

Verify your custom error code HTTP response:

  1. Create a test project and application:

     $ oc new-project test-ingress
    Copy to Clipboard Toggle word wrap
    $ oc new-app django-psql-example
    Copy to Clipboard Toggle word wrap
  2. For 503 custom http error code response:

    1. Stop all the pods for the application.
    2. Run the following curl command or visit the route hostname in the browser:

      $ curl -vk <route_hostname>
      Copy to Clipboard Toggle word wrap
  3. For 404 custom http error code response:

    1. Visit a non-existent route or an incorrect route.
    2. Run the following curl command or visit the route hostname in the browser:

      $ curl -vk <route_hostname>
      Copy to Clipboard Toggle word wrap
  4. Check if the errorfile attribute is properly in the haproxy.config file:

    $ oc -n openshift-ingress rsh <router> cat /var/lib/haproxy/conf/haproxy.config | grep errorfile
    Copy to Clipboard Toggle word wrap

A cluster administrator can set the maximum number of simultaneous connections for OpenShift router deployments. You can patch an existing Ingress Controller to increase the maximum number of connections.

Prerequisites

  • The following assumes that you already created an Ingress Controller

Procedure

  • Update the Ingress Controller to change the maximum number of connections for HAProxy:

    $ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"maxConnections": 7500}}}'
    Copy to Clipboard Toggle word wrap
    Warning

    If you set the spec.tuningOptions.maxConnections value greater than the current operating system limit, the HAProxy process will not start. See the table in the "Ingress Controller configuration parameters" section for more information about this parameter.

In OpenShift Container Platform, an Ingress Controller can serve all routes, or it can serve a subset of routes. By default, the Ingress Controller serves any route created in any namespace in the cluster. You can add additional Ingress Controllers to your cluster to optimize routing by creating shards, which are subsets of routes based on selected characteristics. To mark a route as a member of a shard, use labels in the route or namespace metadata field. The Ingress Controller uses selectors, also known as a selection expression, to select a subset of routes from the entire pool of routes to serve.

Ingress sharding is useful in cases where you want to load balance incoming traffic across multiple Ingress Controllers, when you want to isolate traffic to be routed to a specific Ingress Controller, or for a variety of other reasons described in the next section.

By default, each route uses the default domain of the cluster. However, routes can be configured to use the domain of the router instead. For more information, see Creating a route for Ingress Controller Sharding.

7.1. Ingress Controller sharding

You can use Ingress sharding, also known as router sharding, to distribute a set of routes across multiple routers by adding labels to routes, namespaces, or both. The Ingress Controller uses a corresponding set of selectors to admit only the routes that have a specified label. Each Ingress shard comprises the routes that are filtered using a given selection expression.

As the primary mechanism for traffic to enter the cluster, the demands on the Ingress Controller can be significant. As a cluster administrator, you can shard the routes to:

  • Balance Ingress Controllers, or routers, with several routes to speed up responses to changes.
  • Allocate certain routes to have different reliability guarantees than other routes.
  • Allow certain Ingress Controllers to have different policies defined.
  • Allow only specific routes to use additional features.
  • Expose different routes on different addresses so that internal and external users can see different routes, for example.
  • Transfer traffic from one version of an application to another during a blue green deployment.

When Ingress Controllers are sharded, a given route is admitted to zero or more Ingress Controllers in the group. A route’s status describes whether an Ingress Controller has admitted it or not. An Ingress Controller will only admit a route if it is unique to its shard.

An Ingress Controller can use three sharding methods:

  • Adding only a namespace selector to the Ingress Controller, so that all routes in a namespace with labels that match the namespace selector are in the Ingress shard.
  • Adding only a route selector to the Ingress Controller, so that all routes with labels that match the route selector are in the Ingress shard.
  • Adding both a namespace selector and route selector to the Ingress Controller, so that routes with labels that match the route selector in a namespace with labels that match the namespace selector are in the Ingress shard.

With sharding, you can distribute subsets of routes over multiple Ingress Controllers. These subsets can be non-overlapping, also called traditional sharding, or overlapping, otherwise known as overlapped sharding.

7.1.1. Traditional sharding example

An Ingress Controller finops-router is configured with the label selector spec.namespaceSelector.matchLabels.name set to finance and ops:

Example YAML definition for finops-router

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: finops-router
  namespace: openshift-ingress-operator
spec:
  namespaceSelector:
    matchLabels:
      name:
        - finance
        - ops
Copy to Clipboard Toggle word wrap

A second Ingress Controller dev-router is configured with the label selector spec.namespaceSelector.matchLabels.name set to dev:

Example YAML definition for dev-router

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: dev-router
  namespace: openshift-ingress-operator
spec:
  namespaceSelector:
    matchLabels:
      name: dev
Copy to Clipboard Toggle word wrap

If all application routes are in separate namespaces, each labeled with name:finance, name:ops, and name:dev respectively, this configuration effectively distributes your routes between the two Ingress Controllers. OpenShift Container Platform routes for console, authentication, and other purposes should not be handled.

In the above scenario, sharding becomes a special case of partitioning, with no overlapping subsets. Routes are divided between router shards.

Warning

The default Ingress Controller continues to serve all routes unless the namespaceSelector or routeSelector fields contain routes that are meant for exclusion. See this Red Hat Knowledgebase solution and the section "Sharding the default Ingress Controller" for more information on how to exclude routes from the default Ingress Controller.

7.1.2. Overlapped sharding example

In addition to finops-router and dev-router in the example above, you also have devops-router, which is configured with the label selector spec.namespaceSelector.matchLabels.name set to dev and ops:

Example YAML definition for devops-router

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: devops-router
  namespace: openshift-ingress-operator
spec:
  namespaceSelector:
    matchLabels:
      name:
        - dev
        - ops
Copy to Clipboard Toggle word wrap

The routes in the namespaces labeled name:dev and name:ops are now serviced by two different Ingress Controllers. With this configuration, you have overlapping subsets of routes.

With overlapping subsets of routes you can create more complex routing rules. For example, you can divert higher priority traffic to the dedicated finops-router while sending lower priority traffic to devops-router.

7.1.3. Sharding the default Ingress Controller

After creating a new Ingress shard, there might be routes that are admitted to your new Ingress shard that are also admitted by the default Ingress Controller. This is because the default Ingress Controller has no selectors and admits all routes by default.

You can restrict an Ingress Controller from servicing routes with specific labels using either namespace selectors or route selectors. The following procedure restricts the default Ingress Controller from servicing your newly sharded finance, ops, and dev, routes using a namespace selector. This adds further isolation to Ingress shards.

Important

You must keep all of OpenShift Container Platform’s administration routes on the same Ingress Controller. Therefore, avoid adding additional selectors to the default Ingress Controller that exclude these essential routes.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in as a project administrator.

Procedure

  1. Modify the default Ingress Controller by running the following command:

    $ oc edit ingresscontroller -n openshift-ingress-operator default
    Copy to Clipboard Toggle word wrap
  2. Edit the Ingress Controller to contain a namespaceSelector that excludes the routes with any of the finance, ops, and dev labels:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      namespaceSelector:
        matchExpressions:
          - key: type
            operator: NotIn
            values:
              - finance
              - ops
              - dev
    Copy to Clipboard Toggle word wrap

The default Ingress Controller will no longer serve the namespaces labeled name:finance, name:ops, and name:dev.

7.1.4. Ingress sharding and DNS

The cluster administrator is responsible for making a separate DNS entry for each router in a project. A router will not forward unknown routes to another router.

Consider the following example:

  • Router A lives on host 192.168.0.5 and has routes with *.foo.com.
  • Router B lives on host 192.168.1.9 and has routes with *.example.com.

Separate DNS entries must resolve *.foo.com to the node hosting Router A and *.example.com to the node hosting Router B:

  • *.foo.com A IN 192.168.0.5
  • *.example.com A IN 192.168.1.9

Ingress Controller sharding by using route labels means that the Ingress Controller serves any route in any namespace that is selected by the route selector.

Figure 7.1. Ingress sharding using route labels

Ingress Controller sharding is useful when balancing incoming traffic load among a set of Ingress Controllers and when isolating traffic to a specific Ingress Controller. For example, company A goes to one Ingress Controller and company B to another.

Procedure

  1. Edit the router-internal.yaml file:

    # cat router-internal.yaml
    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: sharded
      namespace: openshift-ingress-operator
    spec:
      domain: <apps-sharded.basedomain.example.net> 
    1
    
      nodePlacement:
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker: ""
      routeSelector:
        matchLabels:
          type: sharded
    Copy to Clipboard Toggle word wrap
    1
    Specify a domain to be used by the Ingress Controller. This domain must be different from the default Ingress Controller domain.
  2. Apply the Ingress Controller router-internal.yaml file:

    # oc apply -f router-internal.yaml
    Copy to Clipboard Toggle word wrap

    The Ingress Controller selects routes in any namespace that have the label type: sharded.

  3. Create a new route using the domain configured in the router-internal.yaml:

    $ oc expose svc <service-name> --hostname <route-name>.apps-sharded.basedomain.example.net
    Copy to Clipboard Toggle word wrap

Ingress Controller sharding by using namespace labels means that the Ingress Controller serves any route in any namespace that is selected by the namespace selector.

Figure 7.2. Ingress sharding using namespace labels

Ingress Controller sharding is useful when balancing incoming traffic load among a set of Ingress Controllers and when isolating traffic to a specific Ingress Controller. For example, company A goes to one Ingress Controller and company B to another.

Procedure

  1. Edit the router-internal.yaml file:

    # cat router-internal.yaml
    Copy to Clipboard Toggle word wrap

    Example output

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: sharded
      namespace: openshift-ingress-operator
    spec:
      domain: <apps-sharded.basedomain.example.net> 
    1
    
      nodePlacement:
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker: ""
      namespaceSelector:
        matchLabels:
          type: sharded
    Copy to Clipboard Toggle word wrap

    1
    Specify a domain to be used by the Ingress Controller. This domain must be different from the default Ingress Controller domain.
  2. Apply the Ingress Controller router-internal.yaml file:

    # oc apply -f router-internal.yaml
    Copy to Clipboard Toggle word wrap

    The Ingress Controller selects routes in any namespace that is selected by the namespace selector that have the label type: sharded.

  3. Create a new route using the domain configured in the router-internal.yaml:

    $ oc expose svc <service-name> --hostname <route-name>.apps-sharded.basedomain.example.net
    Copy to Clipboard Toggle word wrap

A route allows you to host your application at a URL. In this case, the hostname is not set and the route uses a subdomain instead. When you specify a subdomain, you automatically use the domain of the Ingress Controller that exposes the route. For situations where a route is exposed by multiple Ingress Controllers, the route is hosted at multiple URLs.

The following procedure describes how to create a route for Ingress Controller sharding, using the hello-openshift application as an example.

Ingress Controller sharding is useful when balancing incoming traffic load among a set of Ingress Controllers and when isolating traffic to a specific Ingress Controller. For example, company A goes to one Ingress Controller and company B to another.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in as a project administrator.
  • You have a web application that exposes a port and an HTTP or TLS endpoint listening for traffic on the port.
  • You have configured the Ingress Controller for sharding.

Procedure

  1. Create a project called hello-openshift by running the following command:

    $ oc new-project hello-openshift
    Copy to Clipboard Toggle word wrap
  2. Create a pod in the project by running the following command:

    $ oc create -f https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json
    Copy to Clipboard Toggle word wrap
  3. Create a service called hello-openshift by running the following command:

    $ oc expose pod/hello-openshift
    Copy to Clipboard Toggle word wrap
  4. Create a route definition called hello-openshift-route.yaml:

    YAML definition of the created route for sharding:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      labels:
        type: sharded 
    1
    
      name: hello-openshift-edge
      namespace: hello-openshift
    spec:
      subdomain: hello-openshift 
    2
    
      tls:
        termination: edge
      to:
        kind: Service
        name: hello-openshift
    Copy to Clipboard Toggle word wrap

    1
    Both the label key and its corresponding label value must match the ones specified in the Ingress Controller. In this example, the Ingress Controller has the label key and value type: sharded.
    2
    The route will be exposed using the value of the subdomain field. When you specify the subdomain field, you must leave the hostname unset. If you specify both the host and subdomain fields, then the route will use the value of the host field, and ignore the subdomain field.
  5. Use hello-openshift-route.yaml to create a route to the hello-openshift application by running the following command:

    $ oc -n hello-openshift create -f hello-openshift-route.yaml
    Copy to Clipboard Toggle word wrap

Verification

  • Get the status of the route with the following command:

    $ oc -n hello-openshift get routes/hello-openshift-edge -o yaml
    Copy to Clipboard Toggle word wrap

    The resulting Route resource should look similar to the following:

    Example output

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      labels:
        type: sharded
      name: hello-openshift-edge
      namespace: hello-openshift
    spec:
      subdomain: hello-openshift
      tls:
        termination: edge
      to:
        kind: Service
        name: hello-openshift
    status:
      ingress:
      - host: hello-openshift.<apps-sharded.basedomain.example.net> 
    1
    
        routerCanonicalHostname: router-sharded.<apps-sharded.basedomain.example.net> 
    2
    
        routerName: sharded 
    3
    Copy to Clipboard Toggle word wrap

    1
    The hostname the Ingress Controller, or router, uses to expose the route. The value of the host field is automatically determined by the Ingress Controller, and uses its domain. In this example, the domain of the Ingress Controller is <apps-sharded.basedomain.example.net>.
    2
    The hostname of the Ingress Controller.
    3
    The name of the Ingress Controller. In this example, the Ingress Controller has the name sharded.

Additional Resources

NodePortService endpoint publishing strategy

The NodePortService endpoint publishing strategy publishes the Ingress Controller using a Kubernetes NodePort service.

In this configuration, the Ingress Controller deployment uses container networking. A NodePortService is created to publish the deployment. The specific node ports are dynamically allocated by OpenShift Container Platform; however, to support static port allocations, your changes to the node port field of the managed NodePortService are preserved.

Figure 8.1. Diagram of NodePortService

The preceding graphic shows the following concepts pertaining to OpenShift Container Platform Ingress NodePort endpoint publishing strategy:

  • All the available nodes in the cluster have their own, externally accessible IP addresses. The service running in the cluster is bound to the unique NodePort for all the nodes.
  • When the client connects to a node that is down, for example, by connecting the 10.0.128.4 IP address in the graphic, the node port directly connects the client to an available node that is running the service. In this scenario, no load balancing is required. As the image shows, the 10.0.128.4 address is down and another IP address must be used instead.
Note

The Ingress Operator ignores any updates to .spec.ports[].nodePort fields of the service.

By default, ports are allocated automatically and you can access the port allocations for integrations. However, sometimes static port allocations are necessary to integrate with existing infrastructure which may not be easily reconfigured in response to dynamic ports. To achieve integrations with static node ports, you can update the managed service resource directly.

For more information, see the Kubernetes Services documentation on NodePort.

HostNetwork endpoint publishing strategy

The HostNetwork endpoint publishing strategy publishes the Ingress Controller on node ports where the Ingress Controller is deployed.

An Ingress Controller with the HostNetwork endpoint publishing strategy can have only one pod replica per node. If you want n replicas, you must use at least n nodes where those replicas can be scheduled. Because each pod replica requests ports 80 and 443 on the node host where it is scheduled, a replica cannot be scheduled to a node if another pod on the same node is using those ports.

When a cluster administrator installs a new cluster without specifying that the cluster is private, the default Ingress Controller is created with a scope set to External. Cluster administrators can change an External scoped Ingress Controller to Internal.

Prerequisites

  • You installed the oc CLI.

Procedure

  • To change an External scoped Ingress Controller to Internal, enter the following command:

    $ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"Internal"}}}}'
    Copy to Clipboard Toggle word wrap
  • To check the status of the Ingress Controller, enter the following command:

    $ oc -n openshift-ingress-operator get ingresscontrollers/default -o yaml
    Copy to Clipboard Toggle word wrap
    • The Progressing status condition indicates whether you must take further action. For example, the status condition can indicate that you need to delete the service by entering the following command:

      $ oc -n openshift-ingress delete services/router-default
      Copy to Clipboard Toggle word wrap

      If you delete the service, the Ingress Operator recreates it as Internal.

When a cluster administrator installs a new cluster without specifying that the cluster is private, the default Ingress Controller is created with a scope set to External.

The Ingress Controller’s scope can be configured to be Internal during installation or after, and cluster administrators can change an Internal Ingress Controller to External.

Important

On some platforms, it is necessary to delete and recreate the service.

Changing the scope can cause disruption to Ingress traffic, potentially for several minutes. This applies to platforms where it is necessary to delete and recreate the service, because the procedure can cause OpenShift Container Platform to deprovision the existing service load balancer, provision a new one, and update DNS.

Prerequisites

  • You installed the oc CLI.

Procedure

  • To change an Internal scoped Ingress Controller to External, enter the following command:

    $ oc -n openshift-ingress-operator patch ingresscontrollers/private --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"External"}}}}'
    Copy to Clipboard Toggle word wrap
  • To check the status of the Ingress Controller, enter the following command:

    $ oc -n openshift-ingress-operator get ingresscontrollers/default -o yaml
    Copy to Clipboard Toggle word wrap
    • The Progressing status condition indicates whether you must take further action. For example, the status condition can indicate that you need to delete the service by entering the following command:

      $ oc -n openshift-ingress delete services/router-default
      Copy to Clipboard Toggle word wrap

      If you delete the service, the Ingress Operator recreates it as External.

Chapter 9. Verifying connectivity to an endpoint

The Cluster Network Operator (CNO) runs a controller, the connectivity check controller, that performs a connection health check between resources within your cluster. By reviewing the results of the health checks, you can diagnose connection problems or eliminate network connectivity as the cause of an issue that you are investigating.

9.1. Connection health checks performed

To verify that cluster resources are reachable, a TCP connection is made to each of the following cluster API services:

  • Kubernetes API server service
  • Kubernetes API server endpoints
  • OpenShift API server service
  • OpenShift API server endpoints
  • Load balancers

To verify that services and service endpoints are reachable on every node in the cluster, a TCP connection is made to each of the following targets:

  • Health check target service
  • Health check target endpoints

9.2. Implementation of connection health checks

The connectivity check controller orchestrates connection verification checks in your cluster. The results for the connection tests are stored in PodNetworkConnectivity objects in the openshift-network-diagnostics namespace. Connection tests are performed every minute in parallel.

The Cluster Network Operator (CNO) deploys several resources to the cluster to send and receive connectivity health checks:

Health check source
This program deploys in a single pod replica set managed by a Deployment object. The program consumes PodNetworkConnectivity objects and connects to the spec.targetEndpoint specified in each object.
Health check target
A pod deployed as part of a daemon set on every node in the cluster. The pod listens for inbound health checks. The presence of this pod on every node allows for the testing of connectivity to each node.

9.3. PodNetworkConnectivityCheck object fields

The PodNetworkConnectivityCheck object fields are described in the following tables.

Expand
Table 9.1. PodNetworkConnectivityCheck object fields
FieldTypeDescription

metadata.name

string

The name of the object in the following format: <source>-to-<target>. The destination described by <target> includes one of following strings:

  • load-balancer-api-external
  • load-balancer-api-internal
  • kubernetes-apiserver-endpoint
  • kubernetes-apiserver-service-cluster
  • network-check-target
  • openshift-apiserver-endpoint
  • openshift-apiserver-service-cluster

metadata.namespace

string

The namespace that the object is associated with. This value is always openshift-network-diagnostics.

spec.sourcePod

string

The name of the pod where the connection check originates, such as network-check-source-596b4c6566-rgh92.

spec.targetEndpoint

string

The target of the connection check, such as api.devcluster.example.com:6443.

spec.tlsClientCert

object

Configuration for the TLS certificate to use.

spec.tlsClientCert.name

string

The name of the TLS certificate used, if any. The default value is an empty string.

status

object

An object representing the condition of the connection test and logs of recent connection successes and failures.

status.conditions

array

The latest status of the connection check and any previous statuses.

status.failures

array

Connection test logs from unsuccessful attempts.

status.outages

array

Connect test logs covering the time periods of any outages.

status.successes

array

Connection test logs from successful attempts.

The following table describes the fields for objects in the status.conditions array:

Expand
Table 9.2. status.conditions
FieldTypeDescription

lastTransitionTime

string

The time that the condition of the connection transitioned from one status to another.

message

string

The details about last transition in a human readable format.

reason

string

The last status of the transition in a machine readable format.

status

string

The status of the condition.

type

string

The type of the condition.

The following table describes the fields for objects in the status.conditions array:

Expand
Table 9.3. status.outages
FieldTypeDescription

end

string

The timestamp from when the connection failure is resolved.

endLogs

array

Connection log entries, including the log entry related to the successful end of the outage.

message

string

A summary of outage details in a human readable format.

start

string

The timestamp from when the connection failure is first detected.

startLogs

array

Connection log entries, including the original failure.

Connection log fields

The fields for a connection log entry are described in the following table. The object is used in the following fields:

  • status.failures[]
  • status.successes[]
  • status.outages[].startLogs[]
  • status.outages[].endLogs[]
Expand
Table 9.4. Connection log object
FieldTypeDescription

latency

string

Records the duration of the action.

message

string

Provides the status in a human readable format.

reason

string

Provides the reason for status in a machine readable format. The value is one of TCPConnect, TCPConnectError, DNSResolve, DNSError.

success

boolean

Indicates if the log entry is a success or failure.

time

string

The start time of connection check.

As a cluster administrator, you can verify the connectivity of an endpoint, such as an API server, load balancer, service, or pod.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Access to the cluster as a user with the cluster-admin role.

Procedure

  1. To list the current PodNetworkConnectivityCheck objects, enter the following command:

    $ oc get podnetworkconnectivitycheck -n openshift-network-diagnostics
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                                                                                                                                AGE
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0   75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-1   73m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-2   75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-kubernetes-apiserver-service-cluster                               75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-kubernetes-default-service-cluster                                 75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-load-balancer-api-external                                         75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-load-balancer-api-internal                                         75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-ci-ln-x5sv9rb-f76d1-4rzrp-master-0            75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-ci-ln-x5sv9rb-f76d1-4rzrp-master-1            75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-ci-ln-x5sv9rb-f76d1-4rzrp-master-2            75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh      74m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-ci-ln-x5sv9rb-f76d1-4rzrp-worker-c-n8mbf      74m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-ci-ln-x5sv9rb-f76d1-4rzrp-worker-d-4hnrz      74m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-network-check-target-service-cluster                               75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-openshift-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0    75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-openshift-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-1    75m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-openshift-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-2    74m
    network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-openshift-apiserver-service-cluster                                75m
    Copy to Clipboard Toggle word wrap

  2. View the connection test logs:

    1. From the output of the previous command, identify the endpoint that you want to review the connectivity logs for.
    2. To view the object, enter the following command:

      $ oc get podnetworkconnectivitycheck <name> \
        -n openshift-network-diagnostics -o yaml
      Copy to Clipboard Toggle word wrap

      where <name> specifies the name of the PodNetworkConnectivityCheck object.

      Example output

      apiVersion: controlplane.operator.openshift.io/v1alpha1
      kind: PodNetworkConnectivityCheck
      metadata:
        name: network-check-source-ci-ln-x5sv9rb-f76d1-4rzrp-worker-b-6xdmh-to-kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0
        namespace: openshift-network-diagnostics
        ...
      spec:
        sourcePod: network-check-source-7c88f6d9f-hmg2f
        targetEndpoint: 10.0.0.4:6443
        tlsClientCert:
          name: ""
      status:
        conditions:
        - lastTransitionTime: "2021-01-13T20:11:34Z"
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnectSuccess
          status: "True"
          type: Reachable
        failures:
        - latency: 2.241775ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: failed
            to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443: connect:
            connection refused'
          reason: TCPConnectError
          success: false
          time: "2021-01-13T20:10:34Z"
        - latency: 2.582129ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: failed
            to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443: connect:
            connection refused'
          reason: TCPConnectError
          success: false
          time: "2021-01-13T20:09:34Z"
        - latency: 3.483578ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: failed
            to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443: connect:
            connection refused'
          reason: TCPConnectError
          success: false
          time: "2021-01-13T20:08:34Z"
        outages:
        - end: "2021-01-13T20:11:34Z"
          endLogs:
          - latency: 2.032018ms
            message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0:
              tcp connection to 10.0.0.4:6443 succeeded'
            reason: TCPConnect
            success: true
            time: "2021-01-13T20:11:34Z"
          - latency: 2.241775ms
            message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0:
              failed to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443:
              connect: connection refused'
            reason: TCPConnectError
            success: false
            time: "2021-01-13T20:10:34Z"
          - latency: 2.582129ms
            message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0:
              failed to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443:
              connect: connection refused'
            reason: TCPConnectError
            success: false
            time: "2021-01-13T20:09:34Z"
          - latency: 3.483578ms
            message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0:
              failed to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443:
              connect: connection refused'
            reason: TCPConnectError
            success: false
            time: "2021-01-13T20:08:34Z"
          message: Connectivity restored after 2m59.999789186s
          start: "2021-01-13T20:08:34Z"
          startLogs:
          - latency: 3.483578ms
            message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0:
              failed to establish a TCP connection to 10.0.0.4:6443: dial tcp 10.0.0.4:6443:
              connect: connection refused'
            reason: TCPConnectError
            success: false
            time: "2021-01-13T20:08:34Z"
        successes:
        - latency: 2.845865ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:14:34Z"
        - latency: 2.926345ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:13:34Z"
        - latency: 2.895796ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:12:34Z"
        - latency: 2.696844ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:11:34Z"
        - latency: 1.502064ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:10:34Z"
        - latency: 1.388857ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:09:34Z"
        - latency: 1.906383ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:08:34Z"
        - latency: 2.089073ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:07:34Z"
        - latency: 2.156994ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:06:34Z"
        - latency: 1.777043ms
          message: 'kubernetes-apiserver-endpoint-ci-ln-x5sv9rb-f76d1-4rzrp-master-0: tcp
            connection to 10.0.0.4:6443 succeeded'
          reason: TCPConnect
          success: true
          time: "2021-01-13T21:05:34Z"
      Copy to Clipboard Toggle word wrap

As a cluster administrator, you can change the MTU for the cluster network after cluster installation. This change is disruptive as cluster nodes must be rebooted to finalize the MTU change. You can change the MTU only for clusters using the OVN-Kubernetes or OpenShift SDN cluster network providers.

10.1. About the cluster MTU

During installation the maximum transmission unit (MTU) for the cluster network is detected automatically based on the MTU of the primary network interface of nodes in the cluster. You do not normally need to override the detected MTU.

You might want to change the MTU of the cluster network for several reasons:

  • The MTU detected during cluster installation is not correct for your infrastructure
  • Your cluster infrastructure now requires a different MTU, such as from the addition of nodes that need a different MTU for optimal performance

You can change the cluster MTU for only the OVN-Kubernetes and OpenShift SDN cluster network providers.

10.1.1. Service interruption considerations

When you initiate an MTU change on your cluster the following effects might impact service availability:

  • At least two rolling reboots are required to complete the migration to a new MTU. During this time, some nodes are not available as they restart.
  • Specific applications deployed to the cluster with shorter timeout intervals than the absolute TCP timeout interval might experience disruption during the MTU change.

10.1.2. MTU value selection

When planning your MTU migration there are two related but distinct MTU values to consider.

  • Hardware MTU: This MTU value is set based on the specifics of your network infrastructure.
  • Cluster network MTU: This MTU value is always less than your hardware MTU to account for the cluster network overlay overhead. The specific overhead is determined by your cluster network provider:

    • OVN-Kubernetes: 100 bytes
    • OpenShift SDN: 50 bytes

If your cluster requires different MTU values for different nodes, you must subtract the overhead value for your cluster network provider from the lowest MTU value that is used by any node in your cluster. For example, if some nodes in your cluster have an MTU of 9001, and some have an MTU of 1500, you must set this value to 1400.

10.1.3. How the migration process works

The following table summarizes the migration process by segmenting between the user-initiated steps in the process and the actions that the migration performs in response.

Expand
Table 10.1. Live migration of the cluster MTU
User-initiated stepsOpenShift Container Platform activity

Set the following values in the Cluster Network Operator configuration:

  • spec.migration.mtu.machine.to
  • spec.migration.mtu.network.from
  • spec.migration.mtu.network.to

Cluster Network Operator (CNO): Confirms that each field is set to a valid value.

  • The mtu.machine.to must be set to either the new hardware MTU or to the current hardware MTU if the MTU for the hardware is not changing. This value is transient and is used as part of the migration process. Separately, if you specify a hardware MTU that is different from your existing hardware MTU value, you must manually configure the MTU to persist by other means, such as with a machine config, DHCP setting, or a Linux kernel command line.
  • The mtu.network.from field must equal the network.status.clusterNetworkMTU field, which is the current MTU of the cluster network.
  • The mtu.network.to field must be set to the target cluster network MTU and must be lower than the hardware MTU to allow for the overlay overhead of the cluster network provider. For OVN-Kubernetes, the overhead is 100 bytes and for OpenShift SDN the overhead is 50 bytes.

If the values provided are valid, the CNO writes out a new temporary configuration with the MTU for the cluster network set to the value of the mtu.network.to field.

Machine Config Operator (MCO): Performs a rolling reboot of each node in the cluster.

Reconfigure the MTU of the primary network interface for the nodes on the cluster. You can use a variety of methods to accomplish this, including:

  • Deploying a new NetworkManager connection profile with the MTU change
  • Changing the MTU through a DHCP server setting
  • Changing the MTU through boot parameters

N/A

Set the mtu value in the CNO configuration for the cluster network provider and set spec.migration to null.

Machine Config Operator (MCO): Performs a rolling reboot of each node in the cluster with the new MTU configuration.

10.2. Changing the cluster MTU

As a cluster administrator, you can change the maximum transmission unit (MTU) for your cluster. The migration is disruptive and nodes in your cluster might be temporarily unavailable as the MTU update rolls out.

The following procedure describes how to change the cluster MTU by using either machine configs, DHCP, or an ISO. If you use the DHCP or ISO approach, you must refer to configuration artifacts that you kept after installing your cluster to complete the procedure.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You identified the target MTU for your cluster. The correct MTU varies depending on the cluster network provider that your cluster uses:

    • OVN-Kubernetes: The cluster MTU must be set to 100 less than the lowest hardware MTU value in your cluster.
    • OpenShift SDN: The cluster MTU must be set to 50 less than the lowest hardware MTU value in your cluster.

Procedure

To increase or decrease the MTU for the cluster network complete the following procedure.

  1. To obtain the current MTU for the cluster network, enter the following command:

    $ oc describe network.config cluster
    Copy to Clipboard Toggle word wrap

    Example output

    ...
    Status:
      Cluster Network:
        Cidr:               10.217.0.0/22
        Host Prefix:        23
      Cluster Network MTU:  1400
      Network Type:         OpenShiftSDN
      Service Network:
        10.217.4.0/23
    ...
    Copy to Clipboard Toggle word wrap

  2. Prepare your configuration for the hardware MTU:

    • If your hardware MTU is specified with DHCP, update your DHCP configuration such as with the following dnsmasq configuration:

      dhcp-option-force=26,<mtu>
      Copy to Clipboard Toggle word wrap

      where:

      <mtu>
      Specifies the hardware MTU for the DHCP server to advertise.
    • If your hardware MTU is specified with a kernel command line with PXE, update that configuration accordingly.
    • If your hardware MTU is specified in a NetworkManager connection configuration, complete the following steps. This approach is the default for OpenShift Container Platform if you do not explicitly specify your network configuration with DHCP, a kernel command line, or some other method. Your cluster nodes must all use the same underlying network configuration for the following procedure to work unmodified.

      1. Find the primary network interface:

        • If you are using the OpenShift SDN cluster network provider, enter the following command:

          $ oc debug node/<node_name> -- chroot /host ip route list match 0.0.0.0/0 | awk '{print $5 }'
          Copy to Clipboard Toggle word wrap

          where:

          <node_name>
          Specifies the name of a node in your cluster.
        • If you are using the OVN-Kubernetes cluster network provider, enter the following command:

          $ oc debug node/<node_name> -- chroot /host nmcli -g connection.interface-name c show ovs-if-phys0
          Copy to Clipboard Toggle word wrap

          where:

          <node_name>
          Specifies the name of a node in your cluster.
      2. Create the following NetworkManager configuration in the <interface>-mtu.conf file:

        Example NetworkManager connection configuration

        [connection-<interface>-mtu]
        match-device=interface-name:<interface>
        ethernet.mtu=<mtu>
        Copy to Clipboard Toggle word wrap

        where:

        <mtu>
        Specifies the new hardware MTU value.
        <interface>
        Specifies the primary network interface name.
      3. Create two MachineConfig objects, one for the control plane nodes and another for the worker nodes in your cluster:

        1. Create the following Butane config in the control-plane-interface.bu file:

          variant: openshift
          version: 4.11.0
          metadata:
            name: 01-control-plane-interface
            labels:
              machineconfiguration.openshift.io/role: master
          storage:
            files:
              - path: /etc/NetworkManager/conf.d/99-<interface>-mtu.conf 
          1
          
                contents:
                  local: <interface>-mtu.conf 
          2
          
                mode: 0600
          Copy to Clipboard Toggle word wrap
          1
          Specify the NetworkManager connection name for the primary network interface.
          2
          Specify the local filename for the updated NetworkManager configuration file from the previous step.
        2. Create the following Butane config in the worker-interface.bu file:

          variant: openshift
          version: 4.11.0
          metadata:
            name: 01-worker-interface
            labels:
              machineconfiguration.openshift.io/role: worker
          storage:
            files:
              - path: /etc/NetworkManager/conf.d/99-<interface>-mtu.conf 
          1
          
                contents:
                  local: <interface>-mtu.conf 
          2
          
                mode: 0600
          Copy to Clipboard Toggle word wrap
          1
          Specify the NetworkManager connection name for the primary network interface.
          2
          Specify the local filename for the updated NetworkManager configuration file from the previous step.
        3. Create MachineConfig objects from the Butane configs by running the following command:

          $ for manifest in control-plane-interface worker-interface; do
              butane --files-dir . $manifest.bu > $manifest.yaml
            done
          Copy to Clipboard Toggle word wrap
  3. To begin the MTU migration, specify the migration configuration by entering the following command. The Machine Config Operator performs a rolling reboot of the nodes in the cluster in preparation for the MTU change.

    $ oc patch Network.operator.openshift.io cluster --type=merge --patch \
      '{"spec": { "migration": { "mtu": { "network": { "from": <overlay_from>, "to": <overlay_to> } , "machine": { "to" : <machine_to> } } } } }'
    Copy to Clipboard Toggle word wrap

    where:

    <overlay_from>
    Specifies the current cluster network MTU value.
    <overlay_to>
    Specifies the target MTU for the cluster network. This value is set relative to the value for <machine_to> and for OVN-Kubernetes must be 100 less and for OpenShift SDN must be 50 less.
    <machine_to>
    Specifies the MTU for the primary network interface on the underlying host network.

    Example that increases the cluster MTU

    $ oc patch Network.operator.openshift.io cluster --type=merge --patch \
      '{"spec": { "migration": { "mtu": { "network": { "from": 1400, "to": 9000 } , "machine": { "to" : 9100} } } } }'
    Copy to Clipboard Toggle word wrap

  4. As the MCO updates machines in each machine config pool, it reboots each node one by one. You must wait until all the nodes are updated. Check the machine config pool status by entering the following command:

    $ oc get mcp
    Copy to Clipboard Toggle word wrap

    A successfully updated node has the following status: UPDATED=true, UPDATING=false, DEGRADED=false.

    Note

    By default, the MCO updates one machine per pool at a time, causing the total time the migration takes to increase with the size of the cluster.

  5. Confirm the status of the new machine configuration on the hosts:

    1. To list the machine configuration state and the name of the applied machine configuration, enter the following command:

      $ oc describe node | egrep "hostname|machineconfig"
      Copy to Clipboard Toggle word wrap

      Example output

      kubernetes.io/hostname=master-0
      machineconfiguration.openshift.io/currentConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/desiredConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/reason:
      machineconfiguration.openshift.io/state: Done
      Copy to Clipboard Toggle word wrap

      Verify that the following statements are true:

      • The value of machineconfiguration.openshift.io/state field is Done.
      • The value of the machineconfiguration.openshift.io/currentConfig field is equal to the value of the machineconfiguration.openshift.io/desiredConfig field.
    2. To confirm that the machine config is correct, enter the following command:

      $ oc get machineconfig <config_name> -o yaml | grep ExecStart
      Copy to Clipboard Toggle word wrap

      where <config_name> is the name of the machine config from the machineconfiguration.openshift.io/currentConfig field.

      The machine config must include the following update to the systemd configuration:

      ExecStart=/usr/local/bin/mtu-migration.sh
      Copy to Clipboard Toggle word wrap
  6. Update the underlying network interface MTU value:

    • If you are specifying the new MTU with a NetworkManager connection configuration, enter the following command. The MachineConfig Operator automatically performs a rolling reboot of the nodes in your cluster.

      $ for manifest in control-plane-interface worker-interface; do
          oc create -f $manifest.yaml
        done
      Copy to Clipboard Toggle word wrap
    • If you are specifying the new MTU with a DHCP server option or a kernel command line and PXE, make the necessary changes for your infrastructure.
  7. As the MCO updates machines in each machine config pool, it reboots each node one by one. You must wait until all the nodes are updated. Check the machine config pool status by entering the following command:

    $ oc get mcp
    Copy to Clipboard Toggle word wrap

    A successfully updated node has the following status: UPDATED=true, UPDATING=false, DEGRADED=false.

    Note

    By default, the MCO updates one machine per pool at a time, causing the total time the migration takes to increase with the size of the cluster.

  8. Confirm the status of the new machine configuration on the hosts:

    1. To list the machine configuration state and the name of the applied machine configuration, enter the following command:

      $ oc describe node | egrep "hostname|machineconfig"
      Copy to Clipboard Toggle word wrap

      Example output

      kubernetes.io/hostname=master-0
      machineconfiguration.openshift.io/currentConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/desiredConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/reason:
      machineconfiguration.openshift.io/state: Done
      Copy to Clipboard Toggle word wrap

      Verify that the following statements are true:

      • The value of machineconfiguration.openshift.io/state field is Done.
      • The value of the machineconfiguration.openshift.io/currentConfig field is equal to the value of the machineconfiguration.openshift.io/desiredConfig field.
    2. To confirm that the machine config is correct, enter the following command:

      $ oc get machineconfig <config_name> -o yaml | grep path:
      Copy to Clipboard Toggle word wrap

      where <config_name> is the name of the machine config from the machineconfiguration.openshift.io/currentConfig field.

      If the machine config is successfully deployed, the previous output contains the /etc/NetworkManager/system-connections/<connection_name> file path.

      The machine config must not contain the ExecStart=/usr/local/bin/mtu-migration.sh line.

  9. To finalize the MTU migration, enter one of the following commands:

    • If you are using the OVN-Kubernetes cluster network provider:

      $ oc patch Network.operator.openshift.io cluster --type=merge --patch \
        '{"spec": { "migration": null, "defaultNetwork":{ "ovnKubernetesConfig": { "mtu": <mtu> }}}}'
      Copy to Clipboard Toggle word wrap

      where:

      <mtu>
      Specifies the new cluster network MTU that you specified with <overlay_to>.
    • If you are using the OpenShift SDN cluster network provider:

      $ oc patch Network.operator.openshift.io cluster --type=merge --patch \
        '{"spec": { "migration": null, "defaultNetwork":{ "openshiftSDNConfig": { "mtu": <mtu> }}}}'
      Copy to Clipboard Toggle word wrap

      where:

      <mtu>
      Specifies the new cluster network MTU that you specified with <overlay_to>.
  10. After finalizing the MTU migration, each MCP node is rebooted one by one. You must wait until all the nodes are updated. Check the machine config pool status by entering the following command:

    $ oc get mcp
    Copy to Clipboard Toggle word wrap

    A successfully updated node has the following status: UPDATED=true, UPDATING=false, DEGRADED=false.

Verification

You can verify that a node in your cluster uses an MTU that you specified in the previous procedure.

  1. To get the current MTU for the cluster network, enter the following command:

    $ oc describe network.config cluster
    Copy to Clipboard Toggle word wrap
  2. Get the current MTU for the primary network interface of a node.

    1. To list the nodes in your cluster, enter the following command:

      $ oc get nodes
      Copy to Clipboard Toggle word wrap
    2. To obtain the current MTU setting for the primary network interface on a node, enter the following command:

      $ oc debug node/<node> -- chroot /host ip address show <interface>
      Copy to Clipboard Toggle word wrap

      where:

      <node>
      Specifies a node from the output from the previous step.
      <interface>
      Specifies the primary network interface name for the node.

      Example output

      ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8051
      Copy to Clipboard Toggle word wrap

As a cluster administrator, you can expand the available node port range. If your cluster uses of a large number of node ports, you might need to increase the number of available ports.

The default port range is 30000-32767. You can never reduce the port range, even if you first expand it beyond the default range.

11.1. Prerequisites

  • Your cluster infrastructure must allow access to the ports that you specify within the expanded range. For example, if you expand the node port range to 30000-32900, the inclusive port range of 32768-32900 must be allowed by your firewall or packet filtering configuration.

11.2. Expanding the node port range

You can expand the node port range for the cluster.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.

Procedure

  1. To expand the node port range, enter the following command. Replace <port> with the largest port number in the new range.

    $ oc patch network.config.openshift.io cluster --type=merge -p \
      '{
        "spec":
          { "serviceNodePortRange": "30000-<port>" }
      }'
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively apply the following YAML to update the node port range:

    apiVersion: config.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      serviceNodePortRange: "30000-<port>"
    Copy to Clipboard Toggle word wrap

    Example output

    network.config.openshift.io/cluster patched
    Copy to Clipboard Toggle word wrap

  2. To confirm that the configuration is active, enter the following command. It can take several minutes for the update to apply.

    $ oc get configmaps -n openshift-kube-apiserver config \
      -o jsonpath="{.data['config\.yaml']}" | \
      grep -Eo '"service-node-port-range":["[[:digit:]]+-[[:digit:]]+"]'
    Copy to Clipboard Toggle word wrap

    Example output

    "service-node-port-range":["30000-33000"]
    Copy to Clipboard Toggle word wrap

Chapter 12. Configuring IP failover

This topic describes configuring IP failover for pods and services on your OpenShift Container Platform cluster.

IP failover manages a pool of Virtual IP (VIP) addresses on a set of nodes. Every VIP in the set is serviced by a node selected from the set. As long a single node is available, the VIPs are served. There is no way to explicitly distribute the VIPs over the nodes, so there can be nodes with no VIPs and other nodes with many VIPs. If there is only one node, all VIPs are on it.

Note

The VIPs must be routable from outside the cluster.

IP failover monitors a port on each VIP to determine whether the port is reachable on the node. If the port is not reachable, the VIP is not assigned to the node. If the port is set to 0, this check is suppressed. The check script does the needed testing.

IP failover uses Keepalived to host a set of externally accessible VIP addresses on a set of hosts. Each VIP is only serviced by a single host at a time. Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to determine which host, from the set of hosts, services which VIP. If a host becomes unavailable, or if the service that Keepalived is watching does not respond, the VIP is switched to another host from the set. This means a VIP is always serviced as long as a host is available.

When a node running Keepalived passes the check script, the VIP on that node can enter the master state based on its priority and the priority of the current master and as determined by the preemption strategy.

A cluster administrator can provide a script through the OPENSHIFT_HA_NOTIFY_SCRIPT variable, and this script is called whenever the state of the VIP on the node changes. Keepalived uses the master state when it is servicing the VIP, the backup state when another node is servicing the VIP, or in the fault state when the check script fails. The notify script is called with the new state whenever the state changes.

You can create an IP failover deployment configuration on OpenShift Container Platform. The IP failover deployment configuration specifies the set of VIP addresses, and the set of nodes on which to service them. A cluster can have multiple IP failover deployment configurations, with each managing its own set of unique VIP addresses. Each node in the IP failover configuration runs an IP failover pod, and this pod runs Keepalived.

When using VIPs to access a pod with host networking, the application pod runs on all nodes that are running the IP failover pods. This enables any of the IP failover nodes to become the master and service the VIPs when needed. If application pods are not running on all nodes with IP failover, either some IP failover nodes never service the VIPs or some application pods never receive any traffic. Use the same selector and replication count, for both IP failover and the application pods, to avoid this mismatch.

While using VIPs to access a service, any of the nodes can be in the IP failover set of nodes, since the service is reachable on all nodes, no matter where the application pod is running. Any of the IP failover nodes can become master at any time. The service can either use external IPs and a service port or it can use a NodePort.

When using external IPs in the service definition, the VIPs are set to the external IPs, and the IP failover monitoring port is set to the service port. When using a node port, the port is open on every node in the cluster, and the service load-balances traffic from whatever node currently services the VIP. In this case, the IP failover monitoring port is set to the NodePort in the service definition.

Important

Setting up a NodePort is a privileged operation.

Important

Even though a service VIP is highly available, performance can still be affected. Keepalived makes sure that each of the VIPs is serviced by some node in the configuration, and several VIPs can end up on the same node even when other nodes have none. Strategies that externally load-balance across a set of VIPs can be thwarted when IP failover puts multiple VIPs on the same node.

When you use ingressIP, you can set up IP failover to have the same VIP range as the ingressIP range. You can also disable the monitoring port. In this case, all the VIPs appear on same node in the cluster. Any user can set up a service with an ingressIP and have it highly available.

Important

There are a maximum of 254 VIPs in the cluster.

12.1. IP failover environment variables

The following table contains the variables used to configure IP failover.

Expand
Table 12.1. IP failover environment variables
Variable NameDefaultDescription

OPENSHIFT_HA_MONITOR_PORT

80

The IP failover pod tries to open a TCP connection to this port on each Virtual IP (VIP). If connection is established, the service is considered to be running. If this port is set to 0, the test always passes.

OPENSHIFT_HA_NETWORK_INTERFACE

 

The interface name that IP failover uses to send Virtual Router Redundancy Protocol (VRRP) traffic. The default value is eth0.

OPENSHIFT_HA_REPLICA_COUNT

2

The number of replicas to create. This must match spec.replicas value in IP failover deployment configuration.

OPENSHIFT_HA_VIRTUAL_IPS

 

The list of IP address ranges to replicate. This must be provided. For example, 1.2.3.4-6,1.2.3.9.

OPENSHIFT_HA_VRRP_ID_OFFSET

0

The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is 0, and the allowed range is 0 through 255.

OPENSHIFT_HA_VIP_GROUPS

 

The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the OPENSHIFT_HA_VIP_GROUPS variable.

OPENSHIFT_HA_IPTABLES_CHAIN

INPUT

The name of the iptables chain, to automatically add an iptables rule to allow the VRRP traffic on. If the value is not set, an iptables rule is not added. If the chain does not exist, it is not created.

OPENSHIFT_HA_CHECK_SCRIPT

 

The full path name in the pod file system of a script that is periodically run to verify the application is operating.

OPENSHIFT_HA_CHECK_INTERVAL

2

The period, in seconds, that the check script is run.

OPENSHIFT_HA_NOTIFY_SCRIPT

 

The full path name in the pod file system of a script that is run whenever the state changes.

OPENSHIFT_HA_PREEMPTION

preempt_nodelay 300

The strategy for handling a new higher priority host. The nopreempt strategy does not move master from the lower priority host to the higher priority host.

12.2. Configuring IP failover

As a cluster administrator, you can configure IP failover on an entire cluster, or on a subset of nodes, as defined by the label selector. You can also configure multiple IP failover deployment configurations in your cluster, where each one is independent of the others.

The IP failover deployment configuration ensures that a failover pod runs on each of the nodes matching the constraints or the label used.

This pod runs Keepalived, which can monitor an endpoint and use Virtual Router Redundancy Protocol (VRRP) to fail over the virtual IP (VIP) from one node to another if the first node cannot reach the service or endpoint.

For production use, set a selector that selects at least two nodes, and set replicas equal to the number of selected nodes.

Prerequisites

  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You created a pull secret.

Procedure

  1. Create an IP failover service account:

    $ oc create sa ipfailover
    Copy to Clipboard Toggle word wrap
  2. Update security context constraints (SCC) for hostNetwork:

    $ oc adm policy add-scc-to-user privileged -z ipfailover
    $ oc adm policy add-scc-to-user hostnetwork -z ipfailover
    Copy to Clipboard Toggle word wrap
  3. Create a deployment YAML file to configure IP failover:

    Example deployment YAML for IP failover configuration

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ipfailover-keepalived 
    1
    
      labels:
        ipfailover: hello-openshift
    spec:
      strategy:
        type: Recreate
      replicas: 2
      selector:
        matchLabels:
          ipfailover: hello-openshift
      template:
        metadata:
          labels:
            ipfailover: hello-openshift
        spec:
          serviceAccountName: ipfailover
          privileged: true
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/worker: ""
          containers:
          - name: openshift-ipfailover
            image: quay.io/openshift/origin-keepalived-ipfailover
            ports:
            - containerPort: 63000
              hostPort: 63000
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true
            volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
              readOnly: true
            - name: host-slash
              mountPath: /host
              readOnly: true
              mountPropagation: HostToContainer
            - name: etc-sysconfig
              mountPath: /etc/sysconfig
              readOnly: true
            - name: config-volume
              mountPath: /etc/keepalive
            env:
            - name: OPENSHIFT_HA_CONFIG_NAME
              value: "ipfailover"
            - name: OPENSHIFT_HA_VIRTUAL_IPS 
    2
    
              value: "1.1.1.1-2"
            - name: OPENSHIFT_HA_VIP_GROUPS 
    3
    
              value: "10"
            - name: OPENSHIFT_HA_NETWORK_INTERFACE 
    4
    
              value: "ens3" #The host interface to assign the VIPs
            - name: OPENSHIFT_HA_MONITOR_PORT 
    5
    
              value: "30060"
            - name: OPENSHIFT_HA_VRRP_ID_OFFSET 
    6
    
              value: "0"
            - name: OPENSHIFT_HA_REPLICA_COUNT 
    7
    
              value: "2" #Must match the number of replicas in the deployment
            - name: OPENSHIFT_HA_USE_UNICAST
              value: "false"
            #- name: OPENSHIFT_HA_UNICAST_PEERS
              #value: "10.0.148.40,10.0.160.234,10.0.199.110"
            - name: OPENSHIFT_HA_IPTABLES_CHAIN 
    8
    
              value: "INPUT"
            #- name: OPENSHIFT_HA_NOTIFY_SCRIPT 
    9
    
            #  value: /etc/keepalive/mynotifyscript.sh
            - name: OPENSHIFT_HA_CHECK_SCRIPT 
    10
    
              value: "/etc/keepalive/mycheckscript.sh"
            - name: OPENSHIFT_HA_PREEMPTION 
    11
    
              value: "preempt_delay 300"
            - name: OPENSHIFT_HA_CHECK_INTERVAL 
    12
    
              value: "2"
            livenessProbe:
              initialDelaySeconds: 10
              exec:
                command:
                - pgrep
                - keepalived
          volumes:
          - name: lib-modules
            hostPath:
              path: /lib/modules
          - name: host-slash
            hostPath:
              path: /
          - name: etc-sysconfig
            hostPath:
              path: /etc/sysconfig
          # config-volume contains the check script
          # created with `oc create configmap keepalived-checkscript --from-file=mycheckscript.sh`
          - configMap:
              defaultMode: 0755
              name: keepalived-checkscript
            name: config-volume
          imagePullSecrets:
            - name: openshift-pull-secret 
    13
    Copy to Clipboard Toggle word wrap

    1
    The name of the IP failover deployment.
    2
    The list of IP address ranges to replicate. This must be provided. For example, 1.2.3.4-6,1.2.3.9.
    3
    The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the OPENSHIFT_HA_VIP_GROUPS variable.
    4
    The interface name that IP failover uses to send VRRP traffic. By default, eth0 is used.
    5
    The IP failover pod tries to open a TCP connection to this port on each VIP. If connection is established, the service is considered to be running. If this port is set to 0, the test always passes. The default value is 80.
    6
    The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is 0, and the allowed range is 0 through 255.
    7
    The number of replicas to create. This must match spec.replicas value in IP failover deployment configuration. The default value is 2.
    8
    The name of the iptables chain to automatically add an iptables rule to allow the VRRP traffic on. If the value is not set, an iptables rule is not added. If the chain does not exist, it is not created, and Keepalived operates in unicast mode. The default is INPUT.
    9
    The full path name in the pod file system of a script that is run whenever the state changes.
    10
    The full path name in the pod file system of a script that is periodically run to verify the application is operating.
    11
    The strategy for handling a new higher priority host. The default value is preempt_delay 300, which causes a Keepalived instance to take over a VIP after 5 minutes if a lower-priority master is holding the VIP.
    12
    The period, in seconds, that the check script is run. The default value is 2.
    13
    Create the pull secret before creating the deployment, otherwise you will get an error when creating the deployment.

12.3. About virtual IP addresses

Keepalived manages a set of virtual IP addresses (VIP). The administrator must make sure that all of these addresses:

  • Are accessible on the configured hosts from outside the cluster.
  • Are not used for any other purpose within the cluster.

Keepalived on each node determines whether the needed service is running. If it is, VIPs are supported and Keepalived participates in the negotiation to determine which node serves the VIP. For a node to participate, the service must be listening on the watch port on a VIP or the check must be disabled.

Note

Each VIP in the set may end up being served by a different node.

12.4. Configuring check and notify scripts

Keepalived monitors the health of the application by periodically running an optional user supplied check script. For example, the script can test a web server by issuing a request and verifying the response.

When a check script is not provided, a simple default script is run that tests the TCP connection. This default test is suppressed when the monitor port is 0.

Each IP failover pod manages a Keepalived daemon that manages one or more virtual IPs (VIP) on the node where the pod is running. The Keepalived daemon keeps the state of each VIP for that node. A particular VIP on a particular node may be in master, backup, or fault state.

When the check script for that VIP on the node that is in master state fails, the VIP on that node enters the fault state, which triggers a renegotiation. During renegotiation, all VIPs on a node that are not in the fault state participate in deciding which node takes over the VIP. Ultimately, the VIP enters the master state on some node, and the VIP stays in the backup state on the other nodes.

When a node with a VIP in backup state fails, the VIP on that node enters the fault state. When the check script passes again for a VIP on a node in the fault state, the VIP on that node exits the fault state and negotiates to enter the master state. The VIP on that node may then enter either the master or the backup state.

As cluster administrator, you can provide an optional notify script, which is called whenever the state changes. Keepalived passes the following three parameters to the script:

  • $1 - group or instance
  • $2 - Name of the group or instance
  • $3 - The new state: master, backup, or fault

The check and notify scripts run in the IP failover pod and use the pod file system, not the host file system. However, the IP failover pod makes the host file system available under the /hosts mount path. When configuring a check or notify script, you must provide the full path to the script. The recommended approach for providing the scripts is to use a config map.

The full path names of the check and notify scripts are added to the Keepalived configuration file, _/etc/keepalived/keepalived.conf, which is loaded every time Keepalived starts. The scripts can be added to the pod with a config map as follows.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.

Procedure

  1. Create the desired script and create a config map to hold it. The script has no input arguments and must return 0 for OK and 1 for fail.

    The check script, mycheckscript.sh:

    #!/bin/bash
        # Whatever tests are needed
        # E.g., send request and verify response
    exit 0
    Copy to Clipboard Toggle word wrap
  2. Create the config map:

    $ oc create configmap mycustomcheck --from-file=mycheckscript.sh
    Copy to Clipboard Toggle word wrap
  3. Add the script to the pod. The defaultMode for the mounted config map files must able to run by using oc commands or by editing the deployment configuration. A value of 0755, 493 decimal, is typical:

    $ oc set env deploy/ipfailover-keepalived \
        OPENSHIFT_HA_CHECK_SCRIPT=/etc/keepalive/mycheckscript.sh
    Copy to Clipboard Toggle word wrap
    $ oc set volume deploy/ipfailover-keepalived --add --overwrite \
        --name=config-volume \
        --mount-path=/etc/keepalive \
        --source='{"configMap": { "name": "mycustomcheck", "defaultMode": 493}}'
    Copy to Clipboard Toggle word wrap
    Note

    The oc set env command is whitespace sensitive. There must be no whitespace on either side of the = sign.

    Tip

    You can alternatively edit the ipfailover-keepalived deployment configuration:

    $ oc edit deploy ipfailover-keepalived
    Copy to Clipboard Toggle word wrap
        spec:
          containers:
          - env:
            - name: OPENSHIFT_HA_CHECK_SCRIPT  
    1
    
              value: /etc/keepalive/mycheckscript.sh
    ...
            volumeMounts: 
    2
    
            - mountPath: /etc/keepalive
              name: config-volume
          dnsPolicy: ClusterFirst
    ...
          volumes: 
    3
    
          - configMap:
              defaultMode: 0755 
    4
    
              name: customrouter
            name: config-volume
    ...
    Copy to Clipboard Toggle word wrap
    1
    In the spec.container.env field, add the OPENSHIFT_HA_CHECK_SCRIPT environment variable to point to the mounted script file.
    2
    Add the spec.container.volumeMounts field to create the mount point.
    3
    Add a new spec.volumes field to mention the config map.
    4
    This sets run permission on the files. When read back, it is displayed in decimal, 493.

    Save the changes and exit the editor. This restarts ipfailover-keepalived.

12.5. Configuring VRRP preemption

When a Virtual IP (VIP) on a node leaves the fault state by passing the check script, the VIP on the node enters the backup state if it has lower priority than the VIP on the node that is currently in the master state. However, if the VIP on the node that is leaving fault state has a higher priority, the preemption strategy determines its role in the cluster.

The nopreempt strategy does not move master from the lower priority VIP on the host to the higher priority VIP on the host. With preempt_delay 300, the default, Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host.

Prerequisites

  • You installed the OpenShift CLI (oc).

Procedure

  • To specify preemption enter oc edit deploy ipfailover-keepalived to edit the router deployment configuration:

    $ oc edit deploy ipfailover-keepalived
    Copy to Clipboard Toggle word wrap
    ...
        spec:
          containers:
          - env:
            - name: OPENSHIFT_HA_PREEMPTION  
    1
    
              value: preempt_delay 300
    ...
    Copy to Clipboard Toggle word wrap
    1
    Set the OPENSHIFT_HA_PREEMPTION value:
    • preempt_delay 300: Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host. This is the default value.
    • nopreempt: does not move master from the lower priority VIP on the host to the higher priority VIP on the host.

12.6. About VRRP ID offset

Each IP failover pod managed by the IP failover deployment configuration, 1 pod per node or replica, runs a Keepalived daemon. As more IP failover deployment configurations are configured, more pods are created and more daemons join into the common Virtual Router Redundancy Protocol (VRRP) negotiation. This negotiation is done by all the Keepalived daemons and it determines which nodes service which virtual IPs (VIP).

Internally, Keepalived assigns a unique vrrp-id to each VIP. The negotiation uses this set of vrrp-ids, when a decision is made, the VIP corresponding to the winning vrrp-id is serviced on the winning node.

Therefore, for every VIP defined in the IP failover deployment configuration, the IP failover pod must assign a corresponding vrrp-id. This is done by starting at OPENSHIFT_HA_VRRP_ID_OFFSET and sequentially assigning the vrrp-ids to the list of VIPs. The vrrp-ids can have values in the range 1..255.

When there are multiple IP failover deployment configurations, you must specify OPENSHIFT_HA_VRRP_ID_OFFSET so that there is room to increase the number of VIPs in the deployment configuration and none of the vrrp-id ranges overlap.

IP failover management is limited to 254 groups of Virtual IP (VIP) addresses. By default OpenShift Container Platform assigns one IP address to each group. You can use the OPENSHIFT_HA_VIP_GROUPS variable to change this so multiple IP addresses are in each group and define the number of VIP groups available for each Virtual Router Redundancy Protocol (VRRP) instance when configuring IP failover.

Grouping VIPs creates a wider range of allocation of VIPs per VRRP in the case of VRRP failover events, and is useful when all hosts in the cluster have access to a service locally. For example, when a service is being exposed with an ExternalIP.

Note

As a rule for failover, do not limit services, such as the router, to one specific host. Instead, services should be replicated to each host so that in the case of IP failover, the services do not have to be recreated on the new host.

Note

If you are using OpenShift Container Platform health checks, the nature of IP failover and groups means that all instances in the group are not checked. For that reason, the Kubernetes health checks must be used to ensure that services are live.

Prerequisites

  • You are logged in to the cluster with a user with cluster-admin privileges.

Procedure

  • To change the number of IP addresses assigned to each group, change the value for the OPENSHIFT_HA_VIP_GROUPS variable, for example:

    Example Deployment YAML for IP failover configuration

    ...
        spec:
            env:
            - name: OPENSHIFT_HA_VIP_GROUPS 
    1
    
              value: "3"
    ...
    Copy to Clipboard Toggle word wrap

    1
    If OPENSHIFT_HA_VIP_GROUPS is set to 3 in an environment with seven VIPs, it creates three groups, assigning three VIPs to the first group, and two VIPs to the two remaining groups.
Note

If the number of groups set by OPENSHIFT_HA_VIP_GROUPS is fewer than the number of IP addresses set to fail over, the group contains more than one IP address, and all of the addresses move as a single unit.

12.8. High availability For ingressIP

In non-cloud clusters, IP failover and ingressIP to a service can be combined. The result is high availability services for users that create services using ingressIP.

The approach is to specify an ingressIPNetworkCIDR range and then use the same range in creating the ipfailover configuration.

Because IP failover can support up to a maximum of 255 VIPs for the entire cluster, the ingressIPNetworkCIDR needs to be /24 or smaller.

12.9. Removing IP failover

When IP failover is initially configured, the worker nodes in the cluster are modified with an iptables rule that explicitly allows multicast packets on 224.0.0.18 for Keepalived. Because of the change to the nodes, removing IP failover requires running a job to remove the iptables rule and removing the virtual IP addresses used by Keepalived.

Procedure

  1. Optional: Identify and delete any check and notify scripts that are stored as config maps:

    1. Identify whether any pods for IP failover use a config map as a volume:

      $ oc get pod -l ipfailover \
        -o jsonpath="\
      {range .items[?(@.spec.volumes[*].configMap)]}
      {'Namespace: '}{.metadata.namespace}
      {'Pod:       '}{.metadata.name}
      {'Volumes that use config maps:'}
      {range .spec.volumes[?(@.configMap)]}  {'volume:    '}{.name}
        {'configMap: '}{.configMap.name}{'\n'}{end}
      {end}"
      Copy to Clipboard Toggle word wrap

      Example output

      Namespace: default
      Pod:       keepalived-worker-59df45db9c-2x9mn
      Volumes that use config maps:
        volume:    config-volume
        configMap: mycustomcheck
      Copy to Clipboard Toggle word wrap

    2. If the preceding step provided the names of config maps that are used as volumes, delete the config maps:

      $ oc delete configmap <configmap_name>
      Copy to Clipboard Toggle word wrap
  2. Identify an existing deployment for IP failover:

    $ oc get deployment -l ipfailover
    Copy to Clipboard Toggle word wrap

    Example output

    NAMESPACE   NAME         READY   UP-TO-DATE   AVAILABLE   AGE
    default     ipfailover   2/2     2            2           105d
    Copy to Clipboard Toggle word wrap

  3. Delete the deployment:

    $ oc delete deployment <ipfailover_deployment_name>
    Copy to Clipboard Toggle word wrap
  4. Remove the ipfailover service account:

    $ oc delete sa ipfailover
    Copy to Clipboard Toggle word wrap
  5. Run a job that removes the IP tables rule that was added when IP failover was initially configured:

    1. Create a file such as remove-ipfailover-job.yaml with contents that are similar to the following example:

      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: remove-ipfailover-
        labels:
          app: remove-ipfailover
      spec:
        template:
          metadata:
            name: remove-ipfailover
          spec:
            containers:
            - name: remove-ipfailover
              image: quay.io/openshift/origin-keepalived-ipfailover:4.11
              command: ["/var/lib/ipfailover/keepalived/remove-failover.sh"]
            nodeSelector:
              kubernetes.io/hostname: <host_name>  <.>
            restartPolicy: Never
      Copy to Clipboard Toggle word wrap

      <.> Run the job for each node in your cluster that was configured for IP failover and replace the hostname each time.

    2. Run the job:

      $ oc create -f remove-ipfailover-job.yaml
      Copy to Clipboard Toggle word wrap

      Example output

      job.batch/remove-ipfailover-2h8dm created
      Copy to Clipboard Toggle word wrap

Verification

  • Confirm that the job removed the initial configuration for IP failover.

    $ oc logs job/remove-ipfailover-2h8dm
    Copy to Clipboard Toggle word wrap

    Example output

    remove-failover.sh: OpenShift IP Failover service terminating.
      - Removing ip_vs module ...
      - Cleaning up ...
      - Releasing VIPs  (interface eth0) ...
    Copy to Clipboard Toggle word wrap

In Linux, sysctl allows an administrator to modify kernel parameters at runtime. You can modify interface-level network sysctls using the tuning Container Network Interface (CNI) meta plugin. The tuning CNI meta plugin operates in a chain with a main CNI plugin as illustrated.

The main CNI plugin assigns the interface and passes this to the tuning CNI meta plugin at runtime. You can change some sysctls and several interface attributes (promiscuous mode, all-multicast mode, MTU, and MAC address) in the network namespace by using the tuning CNI meta plugin. In the tuning CNI meta plugin configuration, the interface name is represented by the IFNAME token, and is replaced with the actual name of the interface at runtime.

Note

In OpenShift Container Platform, the tuning CNI meta plugin only supports changing interface-level network sysctls.

13.1. Configuring the tuning CNI

The following procedure configures the tuning CNI to change the interface-level network net.ipv4.conf.IFNAME.accept_redirects sysctl. This example enables accepting and sending ICMP-redirected packets.

Procedure

  1. Create a network attachment definition, such as tuning-example.yaml, with the following content:

    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: <name> 
    1
    
      namespace: default 
    2
    
    spec:
      config: '{
        "cniVersion": "0.4.0", 
    3
    
        "name": "<name>", 
    4
    
        "plugins": [{
           "type": "<main_CNI_plugin>" 
    5
    
          },
          {
           "type": "tuning", 
    6
    
           "sysctl": {
                "net.ipv4.conf.IFNAME.accept_redirects": "1" 
    7
    
            }
          }
         ]
    }
    Copy to Clipboard Toggle word wrap
    1
    Specifies the name for the additional network attachment to create. The name must be unique within the specified namespace.
    2
    Specifies the namespace that the object is associated with.
    3
    Specifies the CNI specification version.
    4
    Specifies the name for the configuration. It is recommended to match the configuration name to the name value of the network attachment definition.
    5
    Specifies the name of the main CNI plugin to configure.
    6
    Specifies the name of the CNI meta plugin.
    7
    Specifies the sysctl to set.

    An example yaml file is shown here:

    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: tuningnad
      namespace: default
    spec:
      config: '{
        "cniVersion": "0.4.0",
        "name": "tuningnad",
        "plugins": [{
          "type": "bridge"
          },
          {
          "type": "tuning",
          "sysctl": {
             "net.ipv4.conf.IFNAME.accept_redirects": "1"
            }
        }
      ]
    }'
    Copy to Clipboard Toggle word wrap
  2. Apply the yaml by running the following command:

    $ oc apply -f tuning-example.yaml
    Copy to Clipboard Toggle word wrap

    Example output

    networkattachmentdefinition.k8.cni.cncf.io/tuningnad created
    Copy to Clipboard Toggle word wrap

  3. Create a pod such as examplepod.yaml with the network attachment definition similar to the following:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tunepod
      namespace: default
      annotations:
        k8s.v1.cni.cncf.io/networks: tuningnad 
    1
    
    spec:
      containers:
      - name: podexample
        image: centos
        command: ["/bin/bash", "-c", "sleep INF"]
        securityContext:
          runAsUser: 2000 
    2
    
          runAsGroup: 3000 
    3
    
          allowPrivilegeEscalation: false 
    4
    
          capabilities: 
    5
    
            drop: ["ALL"]
      securityContext:
        runAsNonRoot: true 
    6
    
        seccompProfile: 
    7
    
          type: RuntimeDefault
    Copy to Clipboard Toggle word wrap
    1
    Specify the name of the configured NetworkAttachmentDefinition.
    2
    runAsUser controls which user ID the container is run with.
    3
    runAsGroup controls which primary group ID the containers is run with.
    4
    allowPrivilegeEscalation determines if a pod can request to allow privilege escalation. If unspecified, it defaults to true. This boolean directly controls whether the no_new_privs flag gets set on the container process.
    5
    capabilities permit privileged actions without giving full root access. This policy ensures all capabilities are dropped from the pod.
    6
    runAsNonRoot: true requires that the container will run with a user with any UID other than 0.
    7
    RuntimeDefault enables the default seccomp profile for a pod or container workload.
  4. Apply the yaml by running the following command:

    $ oc apply -f examplepod.yaml
    Copy to Clipboard Toggle word wrap
  5. Verify that the pod is created by running the following command:

    $ oc get pod
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      READY   STATUS    RESTARTS   AGE
    tunepod   1/1     Running   0          47s
    Copy to Clipboard Toggle word wrap

  6. Log in to the pod by running the following command:

    $ oc rsh tunepod
    Copy to Clipboard Toggle word wrap
  7. Verify the values of the configured sysctl flags. For example, find the value net.ipv4.conf.net1.accept_redirects by running the following command:

    sh-4.4# sysctl net.ipv4.conf.net1.accept_redirects
    Copy to Clipboard Toggle word wrap

    Expected output

    net.ipv4.conf.net1.accept_redirects = 1
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can use the Stream Control Transmission Protocol (SCTP) on a cluster.

As a cluster administrator, you can enable SCTP on the hosts in the cluster. On Red Hat Enterprise Linux CoreOS (RHCOS), the SCTP module is disabled by default.

SCTP is a reliable message based protocol that runs on top of an IP network.

When enabled, you can use SCTP as a protocol with pods, services, and network policy. A Service object must be defined with the type parameter set to either the ClusterIP or NodePort value.

14.1.1. Example configurations using SCTP protocol

You can configure a pod or service to use SCTP by setting the protocol parameter to the SCTP value in the pod or service object.

In the following example, a pod is configured to use SCTP:

apiVersion: v1
kind: Pod
metadata:
  namespace: project1
  name: example-pod
spec:
  containers:
    - name: example-pod
...
      ports:
        - containerPort: 30100
          name: sctpserver
          protocol: SCTP
Copy to Clipboard Toggle word wrap

In the following example, a service is configured to use SCTP:

apiVersion: v1
kind: Service
metadata:
  namespace: project1
  name: sctpserver
spec:
...
  ports:
    - name: sctpserver
      protocol: SCTP
      port: 30100
      targetPort: 30100
  type: ClusterIP
Copy to Clipboard Toggle word wrap

In the following example, a NetworkPolicy object is configured to apply to SCTP network traffic on port 80 from any pods with a specific label:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-sctp-on-http
spec:
  podSelector:
    matchLabels:
      role: web
  ingress:
  - ports:
    - protocol: SCTP
      port: 80
Copy to Clipboard Toggle word wrap

As a cluster administrator, you can load and enable the blacklisted SCTP kernel module on worker nodes in your cluster.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Access to the cluster as a user with the cluster-admin role.

Procedure

  1. Create a file named load-sctp-module.yaml that contains the following YAML definition:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      name: load-sctp-module
      labels:
        machineconfiguration.openshift.io/role: worker
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
            - path: /etc/modprobe.d/sctp-blacklist.conf
              mode: 0644
              overwrite: true
              contents:
                source: data:,
            - path: /etc/modules-load.d/sctp-load.conf
              mode: 0644
              overwrite: true
              contents:
                source: data:,sctp
    Copy to Clipboard Toggle word wrap
  2. To create the MachineConfig object, enter the following command:

    $ oc create -f load-sctp-module.yaml
    Copy to Clipboard Toggle word wrap
  3. Optional: To watch the status of the nodes while the MachineConfig Operator applies the configuration change, enter the following command. When the status of a node transitions to Ready, the configuration update is applied.

    $ oc get nodes
    Copy to Clipboard Toggle word wrap

You can verify that SCTP is working on a cluster by creating a pod with an application that listens for SCTP traffic, associating it with a service, and then connecting to the exposed service.

Prerequisites

  • Access to the internet from the cluster to install the nc package.
  • Install the OpenShift CLI (oc).
  • Access to the cluster as a user with the cluster-admin role.

Procedure

  1. Create a pod starts an SCTP listener:

    1. Create a file named sctp-server.yaml that defines a pod with the following YAML:

      apiVersion: v1
      kind: Pod
      metadata:
        name: sctpserver
        labels:
          app: sctpserver
      spec:
        containers:
          - name: sctpserver
            image: registry.access.redhat.com/ubi8/ubi
            command: ["/bin/sh", "-c"]
            args:
              ["dnf install -y nc && sleep inf"]
            ports:
              - containerPort: 30102
                name: sctpserver
                protocol: SCTP
      Copy to Clipboard Toggle word wrap
    2. Create the pod by entering the following command:

      $ oc create -f sctp-server.yaml
      Copy to Clipboard Toggle word wrap
  2. Create a service for the SCTP listener pod.

    1. Create a file named sctp-service.yaml that defines a service with the following YAML:

      apiVersion: v1
      kind: Service
      metadata:
        name: sctpservice
        labels:
          app: sctpserver
      spec:
        type: NodePort
        selector:
          app: sctpserver
        ports:
          - name: sctpserver
            protocol: SCTP
            port: 30102
            targetPort: 30102
      Copy to Clipboard Toggle word wrap
    2. To create the service, enter the following command:

      $ oc create -f sctp-service.yaml
      Copy to Clipboard Toggle word wrap
  3. Create a pod for the SCTP client.

    1. Create a file named sctp-client.yaml with the following YAML:

      apiVersion: v1
      kind: Pod
      metadata:
        name: sctpclient
        labels:
          app: sctpclient
      spec:
        containers:
          - name: sctpclient
            image: registry.access.redhat.com/ubi8/ubi
            command: ["/bin/sh", "-c"]
            args:
              ["dnf install -y nc && sleep inf"]
      Copy to Clipboard Toggle word wrap
    2. To create the Pod object, enter the following command:

      $ oc apply -f sctp-client.yaml
      Copy to Clipboard Toggle word wrap
  4. Run an SCTP listener on the server.

    1. To connect to the server pod, enter the following command:

      $ oc rsh sctpserver
      Copy to Clipboard Toggle word wrap
    2. To start the SCTP listener, enter the following command:

      $ nc -l 30102 --sctp
      Copy to Clipboard Toggle word wrap
  5. Connect to the SCTP listener on the server.

    1. Open a new terminal window or tab in your terminal program.
    2. Obtain the IP address of the sctpservice service. Enter the following command:

      $ oc get services sctpservice -o go-template='{{.spec.clusterIP}}{{"\n"}}'
      Copy to Clipboard Toggle word wrap
    3. To connect to the client pod, enter the following command:

      $ oc rsh sctpclient
      Copy to Clipboard Toggle word wrap
    4. To start the SCTP client, enter the following command. Replace <cluster_IP> with the cluster IP address of the sctpservice service.

      # nc <cluster_IP> 30102 --sctp
      Copy to Clipboard Toggle word wrap

Chapter 15. Using PTP hardware

You can configure linuxptp services and use PTP-capable hardware in OpenShift Container Platform cluster nodes.

15.1. About PTP hardware

You can use the OpenShift Container Platform console or OpenShift CLI (oc) to install PTP by deploying the PTP Operator. The PTP Operator creates and manages the linuxptp services and provides the following features:

  • Discovery of the PTP-capable devices in the cluster.
  • Management of the configuration of linuxptp services.
  • Notification of PTP clock events that negatively affect the performance and reliability of your application with the PTP Operator cloud-event-proxy sidecar.
Note

The PTP Operator works with PTP-capable devices on clusters provisioned only on bare-metal infrastructure.

15.2. About PTP

Precision Time Protocol (PTP) is used to synchronize clocks in a network. When used in conjunction with hardware support, PTP is capable of sub-microsecond accuracy, and is more accurate than Network Time Protocol (NTP).

The linuxptp package includes the ptp4l and phc2sys programs for clock synchronization. ptp4l implements the PTP boundary clock and ordinary clock. ptp4l synchronizes the PTP hardware clock to the source clock with hardware time stamping and synchronizes the system clock to the source clock with software time stamping. phc2sys is used for hardware time stamping to synchronize the system clock to the PTP hardware clock on the network interface controller (NIC).

15.2.1. Elements of a PTP domain

PTP is used to synchronize multiple nodes connected in a network, with clocks for each node. The clocks synchronized by PTP are organized in a source-destination hierarchy. The hierarchy is created and updated automatically by the best master clock (BMC) algorithm, which runs on every clock. Destination clocks are synchronized to source clocks, and destination clocks can themselves be the source for other downstream clocks. The following types of clocks can be included in configurations:

Grandmaster clock
The grandmaster clock provides standard time information to other clocks across the network and ensures accurate and stable synchronisation. It writes time stamps and responds to time requests from other clocks. Grandmaster clocks can be synchronized to a Global Positioning System (GPS) time source.
Ordinary clock
The ordinary clock has a single port connection that can play the role of source or destination clock, depending on its position in the network. The ordinary clock can read and write time stamps.
Boundary clock
The boundary clock has ports in two or more communication paths and can be a source and a destination to other destination clocks at the same time. The boundary clock works as a destination clock upstream. The destination clock receives the timing message, adjusts for delay, and then creates a new source time signal to pass down the network. The boundary clock produces a new timing packet that is still correctly synced with the source clock and can reduce the number of connected devices reporting directly to the source clock.

15.2.2. Advantages of PTP over NTP

One of the main advantages that PTP has over NTP is the hardware support present in various network interface controllers (NIC) and network switches. The specialized hardware allows PTP to account for delays in message transfer and improves the accuracy of time synchronization. To achieve the best possible accuracy, it is recommended that all networking components between PTP clocks are PTP hardware enabled.

Hardware-based PTP provides optimal accuracy, since the NIC can time stamp the PTP packets at the exact moment they are sent and received. Compare this to software-based PTP, which requires additional processing of the PTP packets by the operating system.

Important

Before enabling PTP, ensure that NTP is disabled for the required nodes. You can disable the chrony time service (chronyd) using a MachineConfig custom resource. For more information, see Disabling chrony time service.

15.2.3. Using PTP with dual NIC hardware

OpenShift Container Platform supports single and dual NIC hardware for precision PTP timing in the cluster.

For 5G telco networks that deliver mid-band spectrum coverage, each virtual distributed unit (vDU) requires connections to 6 radio units (RUs). To make these connections, each vDU host requires 2 NICs configured as boundary clocks.

Dual NIC hardware allows you to connect each NIC to the same upstream leader clock with separate ptp4l instances for each NIC feeding the downstream clocks.

15.3. Installing the PTP Operator using the CLI

As a cluster administrator, you can install the Operator by using the CLI.

Prerequisites

  • A cluster installed on bare-metal hardware with nodes that have hardware that supports PTP.
  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a namespace for the PTP Operator.

    1. Save the following YAML in the ptp-namespace.yaml file:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-ptp
        annotations:
          workload.openshift.io/allowed: management
        labels:
          name: openshift-ptp
          openshift.io/cluster-monitoring: "true"
      Copy to Clipboard Toggle word wrap
    2. Create the Namespace CR:

      $ oc create -f ptp-namespace.yaml
      Copy to Clipboard Toggle word wrap
  2. Create an Operator group for the PTP Operator.

    1. Save the following YAML in the ptp-operatorgroup.yaml file:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: ptp-operators
        namespace: openshift-ptp
      spec:
        targetNamespaces:
        - openshift-ptp
      Copy to Clipboard Toggle word wrap
    2. Create the OperatorGroup CR:

      $ oc create -f ptp-operatorgroup.yaml
      Copy to Clipboard Toggle word wrap
  3. Subscribe to the PTP Operator.

    1. Save the following YAML in the ptp-sub.yaml file:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: ptp-operator-subscription
        namespace: openshift-ptp
      spec:
        channel: "stable"
        name: ptp-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      Copy to Clipboard Toggle word wrap
    2. Create the Subscription CR:

      $ oc create -f ptp-sub.yaml
      Copy to Clipboard Toggle word wrap
  4. To verify that the Operator is installed, enter the following command:

    $ oc get csv -n openshift-ptp -o custom-columns=Name:.metadata.name,Phase:.status.phase
    Copy to Clipboard Toggle word wrap

    Example output

    Name                         Phase
    4.12.0-202301261535          Succeeded
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can install the PTP Operator using the web console.

Note

You have to create the namespace and Operator group as mentioned in the previous section.

Procedure

  1. Install the PTP Operator using the OpenShift Container Platform web console:

    1. In the OpenShift Container Platform web console, click OperatorsOperatorHub.
    2. Choose PTP Operator from the list of available Operators, and then click Install.
    3. On the Install Operator page, under A specific namespace on the cluster select openshift-ptp. Then, click Install.
  2. Optional: Verify that the PTP Operator installed successfully:

    1. Switch to the OperatorsInstalled Operators page.
    2. Ensure that PTP Operator is listed in the openshift-ptp project with a Status of InstallSucceeded.

      Note

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, to troubleshoot further:

      • Go to the OperatorsInstalled Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
      • Go to the WorkloadsPods page and check the logs for pods in the openshift-ptp project.

15.5. Configuring PTP devices

The PTP Operator adds the NodePtpDevice.ptp.openshift.io custom resource definition (CRD) to OpenShift Container Platform.

When installed, the PTP Operator searches your cluster for PTP-capable network devices on each node. It creates and updates a NodePtpDevice custom resource (CR) object for each node that provides a compatible PTP-capable network device.

  • To return a complete list of PTP capable network devices in your cluster, run the following command:

    $ oc get NodePtpDevice -n openshift-ptp -o yaml
    Copy to Clipboard Toggle word wrap

    Example output

    apiVersion: v1
    items:
    - apiVersion: ptp.openshift.io/v1
      kind: NodePtpDevice
      metadata:
        creationTimestamp: "2022-01-27T15:16:28Z"
        generation: 1
        name: dev-worker-0 
    1
    
        namespace: openshift-ptp
        resourceVersion: "6538103"
        uid: d42fc9ad-bcbf-4590-b6d8-b676c642781a
      spec: {}
      status:
        devices: 
    2
    
        - name: eno1
        - name: eno2
        - name: eno3
        - name: eno4
        - name: enp5s0f0
        - name: enp5s0f1
    ...
    Copy to Clipboard Toggle word wrap

    1
    The value for the name parameter is the same as the name of the parent node.
    2
    The devices collection includes a list of the PTP capable devices that the PTP Operator discovers for the node.

You can configure the linuxptp services (ptp4l, phc2sys, ts2phc) as grandmaster clock by creating a PtpConfig custom resource (CR) that configures the host NIC.

The ts2phc utility allows you to synchronize the system clock with the PTP grandmaster clock so that the node can stream precision clock signal to downstream PTP ordinary clocks and boundary clocks.

Note

Use the following example PtpConfig CR as the basis to configure linuxptp services as the grandmaster clock for your particular hardware and environment. This example CR does not configure PTP fast events. To configure PTP fast events, set appropriate values for ptp4lOpts, ptp4lConf, and ptpClockThreshold. ptpClockThreshold is used only when events are enabled. See "Configuring the PTP fast event notifications publisher" for more information.

Prerequisites

  • Install an Intel Westport Channel network interface in the bare-metal cluster host.
  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the PTP Operator.

Procedure

  1. Create the PtpConfig resource. For example:

    1. Save the following YAML in the grandmaster-clock-ptp-config.yaml file:

      Example PTP grandmaster clock configuration

      apiVersion: ptp.openshift.io/v1
      kind: PtpConfig
      metadata:
        name: grandmaster-clock
        namespace: openshift-ptp
        annotations: {}
      spec:
        profile:
          - name: grandmaster-clock
            # The interface name is hardware-specific
            interface: $interface
            ptp4lOpts: "-2"
            phc2sysOpts: "-a -r -r -n 24"
            ptpSchedulingPolicy: SCHED_FIFO
            ptpSchedulingPriority: 10
            ptpSettings:
              logReduce: "true"
            ptp4lConf: |
              [global]
              #
              # Default Data Set
              #
              twoStepFlag 1
              slaveOnly 0
              priority1 128
              priority2 128
              domainNumber 24
              #utc_offset 37
              clockClass 255
              clockAccuracy 0xFE
              offsetScaledLogVariance 0xFFFF
              free_running 0
              freq_est_interval 1
              dscp_event 0
              dscp_general 0
              dataset_comparison G.8275.x
              G.8275.defaultDS.localPriority 128
              #
              # Port Data Set
              #
              logAnnounceInterval -3
              logSyncInterval -4
              logMinDelayReqInterval -4
              logMinPdelayReqInterval -4
              announceReceiptTimeout 3
              syncReceiptTimeout 0
              delayAsymmetry 0
              fault_reset_interval -4
              neighborPropDelayThresh 20000000
              masterOnly 0
              G.8275.portDS.localPriority 128
              #
              # Run time options
              #
              assume_two_step 0
              logging_level 6
              path_trace_enabled 0
              follow_up_info 0
              hybrid_e2e 0
              inhibit_multicast_service 0
              net_sync_monitor 0
              tc_spanning_tree 0
              tx_timestamp_timeout 50
              unicast_listen 0
              unicast_master_table 0
              unicast_req_duration 3600
              use_syslog 1
              verbose 0
              summary_interval 0
              kernel_leap 1
              check_fup_sync 0
              clock_class_threshold 7
              #
              # Servo Options
              #
              pi_proportional_const 0.0
              pi_integral_const 0.0
              pi_proportional_scale 0.0
              pi_proportional_exponent -0.3
              pi_proportional_norm_max 0.7
              pi_integral_scale 0.0
              pi_integral_exponent 0.4
              pi_integral_norm_max 0.3
              step_threshold 2.0
              first_step_threshold 0.00002
              max_frequency 900000000
              clock_servo pi
              sanity_freq_limit 200000000
              ntpshm_segment 0
              #
              # Transport options
              #
              transportSpecific 0x0
              ptp_dst_mac 01:1B:19:00:00:00
              p2p_dst_mac 01:80:C2:00:00:0E
              udp_ttl 1
              udp6_scope 0x0E
              uds_address /var/run/ptp4l
              #
              # Default interface options
              #
              clock_type OC
              network_transport L2
              delay_mechanism E2E
              time_stamping hardware
              tsproc_mode filter
              delay_filter moving_median
              delay_filter_length 10
              egressLatency 0
              ingressLatency 0
              boundary_clock_jbod 0
              #
              # Clock description
              #
              productDescription ;;
              revisionData ;;
              manufacturerIdentity 00:00:00
              userDescription ;
              timeSource 0xA0
        recommend:
          - profile: grandmaster-clock
            priority: 4
            match:
              - nodeLabel: "node-role.kubernetes.io/$mcp"
      Copy to Clipboard Toggle word wrap

    2. Create the CR by running the following command:

      $ oc create -f grandmaster-clock-ptp-config.yaml
      Copy to Clipboard Toggle word wrap

Verification

  1. Check that the PtpConfig profile is applied to the node.

    1. Get the list of pods in the openshift-ptp namespace by running the following command:

      $ oc get pods -n openshift-ptp -o wide
      Copy to Clipboard Toggle word wrap

      Example output

      NAME                          READY   STATUS    RESTARTS   AGE     IP             NODE
      linuxptp-daemon-74m2g         3/3     Running   3          4d15h   10.16.230.7    compute-1.example.com
      ptp-operator-5f4f48d7c-x7zkf  1/1     Running   1          4d15h   10.128.1.145   compute-1.example.com
      Copy to Clipboard Toggle word wrap

    2. Check that the profile is correct. Examine the logs of the linuxptp daemon that corresponds to the node you specified in the PtpConfig profile. Run the following command:

      $ oc logs linuxptp-daemon-74m2g -n openshift-ptp -c linuxptp-daemon-container
      Copy to Clipboard Toggle word wrap

      Example output

      ts2phc[94980.334]: [ts2phc.0.config] nmea delay: 98690975 ns
      ts2phc[94980.334]: [ts2phc.0.config] ens3f0 extts index 0 at 1676577329.999999999 corr 0 src 1676577330.901342528 diff -1
      ts2phc[94980.334]: [ts2phc.0.config] ens3f0 master offset         -1 s2 freq      -1
      ts2phc[94980.441]: [ts2phc.0.config] nmea sentence: GNRMC,195453.00,A,4233.24427,N,07126.64420,W,0.008,,160223,,,A,V
      phc2sys[94980.450]: [ptp4l.0.config] CLOCK_REALTIME phc offset       943 s2 freq  -89604 delay    504
      phc2sys[94980.512]: [ptp4l.0.config] CLOCK_REALTIME phc offset      1000 s2 freq  -89264 delay    474
      Copy to Clipboard Toggle word wrap

You can configure linuxptp services (ptp4l, phc2sys) as ordinary clock by creating a PtpConfig custom resource (CR) object.

Note

Use the following example PtpConfig CR as the basis to configure linuxptp services as an ordinary clock for your particular hardware and environment. This example CR does not configure PTP fast events. To configure PTP fast events, set appropriate values for ptp4lOpts, ptp4lConf, and ptpClockThreshold. ptpClockThreshold is required only when events are enabled. See "Configuring the PTP fast event notifications publisher" for more information.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the PTP Operator.

Procedure

  1. Create the following PtpConfig CR, and then save the YAML in the ordinary-clock-ptp-config.yaml file.

    Example PTP ordinary clock configuration

    apiVersion: ptp.openshift.io/v1
    kind: PtpConfig
    metadata:
      name: ordinary-clock
      namespace: openshift-ptp
      annotations: {}
    spec:
      profile:
        - name: ordinary-clock
          # The interface name is hardware-specific
          interface: $interface
          ptp4lOpts: "-2 -s"
          phc2sysOpts: "-a -r -n 24"
          ptpSchedulingPolicy: SCHED_FIFO
          ptpSchedulingPriority: 10
          ptpSettings:
            logReduce: "true"
          ptp4lConf: |
            [global]
            #
            # Default Data Set
            #
            twoStepFlag 1
            slaveOnly 1
            priority1 128
            priority2 128
            domainNumber 24
            #utc_offset 37
            clockClass 255
            clockAccuracy 0xFE
            offsetScaledLogVariance 0xFFFF
            free_running 0
            freq_est_interval 1
            dscp_event 0
            dscp_general 0
            dataset_comparison G.8275.x
            G.8275.defaultDS.localPriority 128
            #
            # Port Data Set
            #
            logAnnounceInterval -3
            logSyncInterval -4
            logMinDelayReqInterval -4
            logMinPdelayReqInterval -4
            announceReceiptTimeout 3
            syncReceiptTimeout 0
            delayAsymmetry 0
            fault_reset_interval -4
            neighborPropDelayThresh 20000000
            masterOnly 0
            G.8275.portDS.localPriority 128
            #
            # Run time options
            #
            assume_two_step 0
            logging_level 6
            path_trace_enabled 0
            follow_up_info 0
            hybrid_e2e 0
            inhibit_multicast_service 0
            net_sync_monitor 0
            tc_spanning_tree 0
            tx_timestamp_timeout 50
            unicast_listen 0
            unicast_master_table 0
            unicast_req_duration 3600
            use_syslog 1
            verbose 0
            summary_interval 0
            kernel_leap 1
            check_fup_sync 0
            clock_class_threshold 7
            #
            # Servo Options
            #
            pi_proportional_const 0.0
            pi_integral_const 0.0
            pi_proportional_scale 0.0
            pi_proportional_exponent -0.3
            pi_proportional_norm_max 0.7
            pi_integral_scale 0.0
            pi_integral_exponent 0.4
            pi_integral_norm_max 0.3
            step_threshold 2.0
            first_step_threshold 0.00002
            max_frequency 900000000
            clock_servo pi
            sanity_freq_limit 200000000
            ntpshm_segment 0
            #
            # Transport options
            #
            transportSpecific 0x0
            ptp_dst_mac 01:1B:19:00:00:00
            p2p_dst_mac 01:80:C2:00:00:0E
            udp_ttl 1
            udp6_scope 0x0E
            uds_address /var/run/ptp4l
            #
            # Default interface options
            #
            clock_type OC
            network_transport L2
            delay_mechanism E2E
            time_stamping hardware
            tsproc_mode filter
            delay_filter moving_median
            delay_filter_length 10
            egressLatency 0
            ingressLatency 0
            boundary_clock_jbod 0
            #
            # Clock description
            #
            productDescription ;;
            revisionData ;;
            manufacturerIdentity 00:00:00
            userDescription ;
            timeSource 0xA0
      recommend:
        - profile: ordinary-clock
          priority: 4
          match:
            - nodeLabel: "node-role.kubernetes.io/$mcp"
    Copy to Clipboard Toggle word wrap

    Expand
    Table 15.1. PTP ordinary clock CR configuration options
    Custom resource fieldDescription

    name

    The name of the PtpConfig CR.

    profile

    Specify an array of one or more profile objects. Each profile must be uniquely named.

    interface

    Specify the network interface to be used by the ptp4l service, for example ens787f1.

    ptp4lOpts

    Specify system config options for the ptp4l service, for example -2 to select the IEEE 802.3 network transport. The options should not include the network interface name -i <interface> and service config file -f /etc/ptp4l.conf because the network interface name and the service config file are automatically appended. Append --summary_interval -4 to use PTP fast events with this interface.

    phc2sysOpts

    Specify system config options for the phc2sys service. If this field is empty, the PTP Operator does not start the phc2sys service. For Intel Columbiaville 800 Series NICs, set phc2sysOpts options to -a -r -m -n 24 -N 8 -R 16. -m prints messages to stdout. The linuxptp-daemon DaemonSet parses the logs and generates Prometheus metrics.

    ptp4lConf

    Specify a string that contains the configuration to replace the default /etc/ptp4l.conf file. To use the default configuration, leave the field empty.

    tx_timestamp_timeout

    For Intel Columbiaville 800 Series NICs, set tx_timestamp_timeout to 50.

    boundary_clock_jbod

    For Intel Columbiaville 800 Series NICs, set boundary_clock_jbod to 0.

    ptpSchedulingPolicy

    Scheduling policy for ptp4l and phc2sys processes. Default value is SCHED_OTHER. Use SCHED_FIFO on systems that support FIFO scheduling.

    ptpSchedulingPriority

    Integer value from 1-65 used to set FIFO priority for ptp4l and phc2sys processes when ptpSchedulingPolicy is set to SCHED_FIFO. The ptpSchedulingPriority field is not used when ptpSchedulingPolicy is set to SCHED_OTHER.

    ptpClockThreshold

    Optional. If ptpClockThreshold is not present, default values are used for the ptpClockThreshold fields. ptpClockThreshold configures how long after the PTP master clock is disconnected before PTP events are triggered. holdOverTimeout is the time value in seconds before the PTP clock event state changes to FREERUN when the PTP master clock is disconnected. The maxOffsetThreshold and minOffsetThreshold settings configure offset values in nanoseconds that compare against the values for CLOCK_REALTIME (phc2sys) or master offset (ptp4l). When the ptp4l or phc2sys offset value is outside this range, the PTP clock state is set to FREERUN. When the offset value is within this range, the PTP clock state is set to LOCKED.

    recommend

    Specify an array of one or more recommend objects that define rules on how the profile should be applied to nodes.

    .recommend.profile

    Specify the .recommend.profile object name defined in the profile section.

    .recommend.priority

    Set .recommend.priority to 0 for ordinary clock.

    .recommend.match

    Specify .recommend.match rules with nodeLabel or nodeName values.

    .recommend.match.nodeLabel

    Set nodeLabel with the key of the node.Labels field from the node object by using the oc get nodes --show-labels command. For example, node-role.kubernetes.io/worker.

    .recommend.match.nodeName

    Set nodeName with the value of the node.Name field from the node object by using the oc get nodes command. For example, compute-1.example.com.

  2. Create the PtpConfig CR by running the following command:

    $ oc create -f ordinary-clock-ptp-config.yaml
    Copy to Clipboard Toggle word wrap

Verification

  1. Check that the PtpConfig profile is applied to the node.

    1. Get the list of pods in the openshift-ptp namespace by running the following command:

      $ oc get pods -n openshift-ptp -o wide
      Copy to Clipboard Toggle word wrap

      Example output

      NAME                            READY   STATUS    RESTARTS   AGE   IP               NODE
      linuxptp-daemon-4xkbb           1/1     Running   0          43m   10.1.196.24      compute-0.example.com
      linuxptp-daemon-tdspf           1/1     Running   0          43m   10.1.196.25      compute-1.example.com
      ptp-operator-657bbb64c8-2f8sj   1/1     Running   0          43m   10.129.0.61      control-plane-1.example.com
      Copy to Clipboard Toggle word wrap

    2. Check that the profile is correct. Examine the logs of the linuxptp daemon that corresponds to the node you specified in the PtpConfig profile. Run the following command:

      $ oc logs linuxptp-daemon-4xkbb -n openshift-ptp -c linuxptp-daemon-container
      Copy to Clipboard Toggle word wrap

      Example output

      I1115 09:41:17.117596 4143292 daemon.go:107] in applyNodePTPProfile
      I1115 09:41:17.117604 4143292 daemon.go:109] updating NodePTPProfile to:
      I1115 09:41:17.117607 4143292 daemon.go:110] ------------------------------------
      I1115 09:41:17.117612 4143292 daemon.go:102] Profile Name: profile1
      I1115 09:41:17.117616 4143292 daemon.go:102] Interface: ens787f1
      I1115 09:41:17.117620 4143292 daemon.go:102] Ptp4lOpts: -2 -s
      I1115 09:41:17.117623 4143292 daemon.go:102] Phc2sysOpts: -a -r -n 24
      I1115 09:41:17.117626 4143292 daemon.go:116] ------------------------------------
      Copy to Clipboard Toggle word wrap

You can configure the linuxptp services (ptp4l, phc2sys) as boundary clock by creating a PtpConfig custom resource (CR) object.

Note

Use the following example PtpConfig CR as the basis to configure linuxptp services as the boundary clock for your particular hardware and environment. This example CR does not configure PTP fast events. To configure PTP fast events, set appropriate values for ptp4lOpts, ptp4lConf, and ptpClockThreshold. ptpClockThreshold is used only when events are enabled. See "Configuring the PTP fast event notifications publisher" for more information.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the PTP Operator.

Procedure

  1. Create the following PtpConfig CR, and then save the YAML in the boundary-clock-ptp-config.yaml file.

    Example PTP boundary clock configuration

    apiVersion: ptp.openshift.io/v1
    kind: PtpConfig
    metadata:
      name: boundary-clock
      namespace: openshift-ptp
      annotations: {}
    spec:
      profile:
        - name: boundary-clock
          ptp4lOpts: "-2"
          phc2sysOpts: "-a -r -n 24"
          ptpSchedulingPolicy: SCHED_FIFO
          ptpSchedulingPriority: 10
          ptpSettings:
            logReduce: "true"
          ptp4lConf: |
            # The interface name is hardware-specific
            [$iface_slave]
            masterOnly 0
            [$iface_master_1]
            masterOnly 1
            [$iface_master_2]
            masterOnly 1
            [$iface_master_3]
            masterOnly 1
            [global]
            #
            # Default Data Set
            #
            twoStepFlag 1
            slaveOnly 0
            priority1 128
            priority2 128
            domainNumber 24
            #utc_offset 37
            clockClass 248
            clockAccuracy 0xFE
            offsetScaledLogVariance 0xFFFF
            free_running 0
            freq_est_interval 1
            dscp_event 0
            dscp_general 0
            dataset_comparison G.8275.x
            G.8275.defaultDS.localPriority 128
            #
            # Port Data Set
            #
            logAnnounceInterval -3
            logSyncInterval -4
            logMinDelayReqInterval -4
            logMinPdelayReqInterval -4
            announceReceiptTimeout 3
            syncReceiptTimeout 0
            delayAsymmetry 0
            fault_reset_interval -4
            neighborPropDelayThresh 20000000
            masterOnly 0
            G.8275.portDS.localPriority 128
            #
            # Run time options
            #
            assume_two_step 0
            logging_level 6
            path_trace_enabled 0
            follow_up_info 0
            hybrid_e2e 0
            inhibit_multicast_service 0
            net_sync_monitor 0
            tc_spanning_tree 0
            tx_timestamp_timeout 50
            unicast_listen 0
            unicast_master_table 0
            unicast_req_duration 3600
            use_syslog 1
            verbose 0
            summary_interval 0
            kernel_leap 1
            check_fup_sync 0
            clock_class_threshold 135
            #
            # Servo Options
            #
            pi_proportional_const 0.0
            pi_integral_const 0.0
            pi_proportional_scale 0.0
            pi_proportional_exponent -0.3
            pi_proportional_norm_max 0.7
            pi_integral_scale 0.0
            pi_integral_exponent 0.4
            pi_integral_norm_max 0.3
            step_threshold 2.0
            first_step_threshold 0.00002
            max_frequency 900000000
            clock_servo pi
            sanity_freq_limit 200000000
            ntpshm_segment 0
            #
            # Transport options
            #
            transportSpecific 0x0
            ptp_dst_mac 01:1B:19:00:00:00
            p2p_dst_mac 01:80:C2:00:00:0E
            udp_ttl 1
            udp6_scope 0x0E
            uds_address /var/run/ptp4l
            #
            # Default interface options
            #
            clock_type BC
            network_transport L2
            delay_mechanism E2E
            time_stamping hardware
            tsproc_mode filter
            delay_filter moving_median
            delay_filter_length 10
            egressLatency 0
            ingressLatency 0
            boundary_clock_jbod 0
            #
            # Clock description
            #
            productDescription ;;
            revisionData ;;
            manufacturerIdentity 00:00:00
            userDescription ;
            timeSource 0xA0
      recommend:
        - profile: boundary-clock
          priority: 4
          match:
            - nodeLabel: "node-role.kubernetes.io/$mcp"
    Copy to Clipboard Toggle word wrap

    Expand
    Table 15.2. PTP boundary clock CR configuration options
    Custom resource fieldDescription

    name

    The name of the PtpConfig CR.

    profile

    Specify an array of one or more profile objects.

    name

    Specify the name of a profile object which uniquely identifies a profile object.

    ptp4lOpts

    Specify system config options for the ptp4l service. The options should not include the network interface name -i <interface> and service config file -f /etc/ptp4l.conf because the network interface name and the service config file are automatically appended.

    ptp4lConf

    Specify the required configuration to start ptp4l as boundary clock. For example, ens1f0 synchronizes from a grandmaster clock and ens1f3 synchronizes connected devices.

    <interface_1>

    The interface that receives the synchronization clock.

    <interface_2>

    The interface that sends the synchronization clock.

    tx_timestamp_timeout

    For Intel Columbiaville 800 Series NICs, set tx_timestamp_timeout to 50.

    boundary_clock_jbod

    For Intel Columbiaville 800 Series NICs, ensure boundary_clock_jbod is set to 0. For Intel Fortville X710 Series NICs, ensure boundary_clock_jbod is set to 1.

    phc2sysOpts

    Specify system config options for the phc2sys service. If this field is empty, the PTP Operator does not start the phc2sys service.

    ptpSchedulingPolicy

    Scheduling policy for ptp4l and phc2sys processes. Default value is SCHED_OTHER. Use SCHED_FIFO on systems that support FIFO scheduling.

    ptpSchedulingPriority

    Integer value from 1-65 used to set FIFO priority for ptp4l and phc2sys processes when ptpSchedulingPolicy is set to SCHED_FIFO. The ptpSchedulingPriority field is not used when ptpSchedulingPolicy is set to SCHED_OTHER.

    ptpClockThreshold

    Optional. If ptpClockThreshold is not present, default values are used for the ptpClockThreshold fields. ptpClockThreshold configures how long after the PTP master clock is disconnected before PTP events are triggered. holdOverTimeout is the time value in seconds before the PTP clock event state changes to FREERUN when the PTP master clock is disconnected. The maxOffsetThreshold and minOffsetThreshold settings configure offset values in nanoseconds that compare against the values for CLOCK_REALTIME (phc2sys) or master offset (ptp4l). When the ptp4l or phc2sys offset value is outside this range, the PTP clock state is set to FREERUN. When the offset value is within this range, the PTP clock state is set to LOCKED.

    recommend

    Specify an array of one or more recommend objects that define rules on how the profile should be applied to nodes.

    .recommend.profile

    Specify the .recommend.profile object name defined in the profile section.

    .recommend.priority

    Specify the priority with an integer value between 0 and 99. A larger number gets lower priority, so a priority of 99 is lower than a priority of 10. If a node can be matched with multiple profiles according to rules defined in the match field, the profile with the higher priority is applied to that node.

    .recommend.match

    Specify .recommend.match rules with nodeLabel or nodeName values.

    .recommend.match.nodeLabel

    Set nodeLabel with the key of the node.Labels field from the node object by using the oc get nodes --show-labels command. For example, node-role.kubernetes.io/worker.

    .recommend.match.nodeName

    Set nodeName with the value of the node.Name field from the node object by using the oc get nodes command. For example, compute-1.example.com.

  2. Create the CR by running the following command:

    $ oc create -f boundary-clock-ptp-config.yaml
    Copy to Clipboard Toggle word wrap

Verification

  1. Check that the PtpConfig profile is applied to the node.

    1. Get the list of pods in the openshift-ptp namespace by running the following command:

      $ oc get pods -n openshift-ptp -o wide
      Copy to Clipboard Toggle word wrap

      Example output

      NAME                            READY   STATUS    RESTARTS   AGE   IP               NODE
      linuxptp-daemon-4xkbb           1/1     Running   0          43m   10.1.196.24      compute-0.example.com
      linuxptp-daemon-tdspf           1/1     Running   0          43m   10.1.196.25      compute-1.example.com
      ptp-operator-657bbb64c8-2f8sj   1/1     Running   0          43m   10.129.0.61      control-plane-1.example.com
      Copy to Clipboard Toggle word wrap

    2. Check that the profile is correct. Examine the logs of the linuxptp daemon that corresponds to the node you specified in the PtpConfig profile. Run the following command:

      $ oc logs linuxptp-daemon-4xkbb -n openshift-ptp -c linuxptp-daemon-container
      Copy to Clipboard Toggle word wrap

      Example output

      I1115 09:41:17.117596 4143292 daemon.go:107] in applyNodePTPProfile
      I1115 09:41:17.117604 4143292 daemon.go:109] updating NodePTPProfile to:
      I1115 09:41:17.117607 4143292 daemon.go:110] ------------------------------------
      I1115 09:41:17.117612 4143292 daemon.go:102] Profile Name: profile1
      I1115 09:41:17.117616 4143292 daemon.go:102] Interface:
      I1115 09:41:17.117620 4143292 daemon.go:102] Ptp4lOpts: -2
      I1115 09:41:17.117623 4143292 daemon.go:102] Phc2sysOpts: -a -r -n 24
      I1115 09:41:17.117626 4143292 daemon.go:116] ------------------------------------
      Copy to Clipboard Toggle word wrap

Important

Precision Time Protocol (PTP) hardware with dual NIC configured as boundary clocks is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can configure the linuxptp services (ptp4l, phc2sys) as boundary clocks for dual NIC hardware by creating a PtpConfig custom resource (CR) object for each NIC.

Dual NIC hardware allows you to connect each NIC to the same upstream leader clock with separate ptp4l instances for each NIC feeding the downstream clocks.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the PTP Operator.

Procedure

  1. Create two separate PtpConfig CRs, one for each NIC, using the reference CR in "Configuring linuxptp services as a boundary clock" as the basis for each CR. For example:

    1. Create boundary-clock-ptp-config-nic1.yaml, specifying values for phc2sysOpts:

      apiVersion: ptp.openshift.io/v1
      kind: PtpConfig
      metadata:
        name: boundary-clock-ptp-config-nic1
        namespace: openshift-ptp
      spec:
        profile:
        - name: "profile1"
          ptp4lOpts: "-2 --summary_interval -4"
          ptp4lConf: | 
      1
      
            [ens5f1]
            masterOnly 1
            [ens5f0]
            masterOnly 0
          ...
          phc2sysOpts: "-a -r -m -n 24 -N 8 -R 16" 
      2
      Copy to Clipboard Toggle word wrap
      1
      Specify the required interfaces to start ptp4l as a boundary clock. For example, ens5f0 synchronizes from a grandmaster clock and ens5f1 synchronizes connected devices.
      2
      Required phc2sysOpts values. -m prints messages to stdout. The linuxptp-daemon DaemonSet parses the logs and generates Prometheus metrics.
    2. Create boundary-clock-ptp-config-nic2.yaml, removing the phc2sysOpts field altogether to disable the phc2sys service for the second NIC:

      apiVersion: ptp.openshift.io/v1
      kind: PtpConfig
      metadata:
        name: boundary-clock-ptp-config-nic2
        namespace: openshift-ptp
      spec:
        profile:
        - name: "profile2"
          ptp4lOpts: "-2 --summary_interval -4"
          ptp4lConf: | 
      1
      
            [ens7f1]
            masterOnly 1
            [ens7f0]
            masterOnly 0
      ...
      Copy to Clipboard Toggle word wrap
      1
      Specify the required interfaces to start ptp4l as a boundary clock on the second NIC.
      Note

      You must completely remove the phc2sysOpts field from the second PtpConfig CR to disable the phc2sys service on the second NIC.

  2. Create the dual NIC PtpConfig CRs by running the following commands:

    1. Create the CR that configures PTP for the first NIC:

      $ oc create -f boundary-clock-ptp-config-nic1.yaml
      Copy to Clipboard Toggle word wrap
    2. Create the CR that configures PTP for the second NIC:

      $ oc create -f boundary-clock-ptp-config-nic2.yaml
      Copy to Clipboard Toggle word wrap

Verification

  • Check that the PTP Operator has applied the PtpConfig CRs for both NICs. Examine the logs for the linuxptp daemon corresponding to the node that has the dual NIC hardware installed. For example, run the following command:

    $ oc logs linuxptp-daemon-cvgr6 -n openshift-ptp -c linuxptp-daemon-container
    Copy to Clipboard Toggle word wrap

    Example output

    ptp4l[80828.335]: [ptp4l.1.config] master offset          5 s2 freq   -5727 path delay       519
    ptp4l[80828.343]: [ptp4l.0.config] master offset         -5 s2 freq  -10607 path delay       533
    phc2sys[80828.390]: [ptp4l.0.config] CLOCK_REALTIME phc offset         1 s2 freq  -87239 delay    539
    Copy to Clipboard Toggle word wrap

The following table describes the changes that you must make to the reference PTP configuration in order to use Intel Columbiaville E800 series NICs as ordinary clocks. Make the changes in a PtpConfig custom resource (CR) that you apply to the cluster.

Expand
Table 15.3. Recommended PTP settings for Intel Columbiaville NIC
PTP configurationRecommended setting

phc2sysOpts

-a -r -m -n 24 -N 8 -R 16

tx_timestamp_timeout

50

boundary_clock_jbod

0

Note

For phc2sysOpts, -m prints messages to stdout. The linuxptp-daemon DaemonSet parses the logs and generates Prometheus metrics.

In telco or other deployment configurations that require low latency performance, PTP daemon threads run in a constrained CPU footprint alongside the rest of the infrastructure components. By default, PTP threads run with the SCHED_OTHER policy. Under high load, these threads might not get the scheduling latency they require for error-free operation.

To mitigate against potential scheduling latency errors, you can configure the PTP Operator linuxptp services to allow threads to run with a SCHED_FIFO policy. If SCHED_FIFO is set for a PtpConfig CR, then ptp4l and phc2sys will run in the parent container under chrt with a priority set by the ptpSchedulingPriority field of the PtpConfig CR.

Note

Setting ptpSchedulingPolicy is optional, and is only required if you are experiencing latency errors.

Procedure

  1. Edit the PtpConfig CR profile:

    $ oc edit PtpConfig -n openshift-ptp
    Copy to Clipboard Toggle word wrap
  2. Change the ptpSchedulingPolicy and ptpSchedulingPriority fields:

    apiVersion: ptp.openshift.io/v1
    kind: PtpConfig
    metadata:
      name: <ptp_config_name>
      namespace: openshift-ptp
    ...
    spec:
      profile:
      - name: "profile1"
    ...
        ptpSchedulingPolicy: SCHED_FIFO 
    1
    
        ptpSchedulingPriority: 10 
    2
    Copy to Clipboard Toggle word wrap
    1
    Scheduling policy for ptp4l and phc2sys processes. Use SCHED_FIFO on systems that support FIFO scheduling.
    2
    Required. Sets the integer value 1-65 used to configure FIFO priority for ptp4l and phc2sys processes.
  3. Save and exit to apply the changes to the PtpConfig CR.

Verification

  1. Get the name of the linuxptp-daemon pod and corresponding node where the PtpConfig CR has been applied:

    $ oc get pods -n openshift-ptp -o wide
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE
    linuxptp-daemon-gmv2n           3/3     Running   0          1d17h   10.1.196.24   compute-0.example.com
    linuxptp-daemon-lgm55           3/3     Running   0          1d17h   10.1.196.25   compute-1.example.com
    ptp-operator-3r4dcvf7f4-zndk7   1/1     Running   0          1d7h    10.129.0.61   control-plane-1.example.com
    Copy to Clipboard Toggle word wrap

  2. Check that the ptp4l process is running with the updated chrt FIFO priority:

    $ oc -n openshift-ptp logs linuxptp-daemon-lgm55 -c linuxptp-daemon-container|grep chrt
    Copy to Clipboard Toggle word wrap

    Example output

    I1216 19:24:57.091872 1600715 daemon.go:285] /bin/chrt -f 65 /usr/sbin/ptp4l -f /var/run/ptp4l.0.config -2  --summary_interval -4 -m
    Copy to Clipboard Toggle word wrap

15.6. Troubleshooting common PTP Operator issues

Troubleshoot common problems with the PTP Operator by performing the following steps.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the PTP Operator on a bare-metal cluster with hosts that support PTP.

Procedure

  1. Check the Operator and operands are successfully deployed in the cluster for the configured nodes.

    $ oc get pods -n openshift-ptp -o wide
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE
    linuxptp-daemon-lmvgn           3/3     Running   0          4d17h   10.1.196.24   compute-0.example.com
    linuxptp-daemon-qhfg7           3/3     Running   0          4d17h   10.1.196.25   compute-1.example.com
    ptp-operator-6b8dcbf7f4-zndk7   1/1     Running   0          5d7h    10.129.0.61   control-plane-1.example.com
    Copy to Clipboard Toggle word wrap

    Note

    When the PTP fast event bus is enabled, the number of ready linuxptp-daemon pods is 3/3. If the PTP fast event bus is not enabled, 2/2 is displayed.

  2. Check that supported hardware is found in the cluster.

    $ oc -n openshift-ptp get nodeptpdevices.ptp.openshift.io
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                                  AGE
    control-plane-0.example.com           10d
    control-plane-1.example.com           10d
    compute-0.example.com                 10d
    compute-1.example.com                 10d
    compute-2.example.com                 10d
    Copy to Clipboard Toggle word wrap

  3. Check the available PTP network interfaces for a node:

    $ oc -n openshift-ptp get nodeptpdevices.ptp.openshift.io <node_name> -o yaml
    Copy to Clipboard Toggle word wrap

    where:

    <node_name>

    Specifies the node you want to query, for example, compute-0.example.com.

    Example output

    apiVersion: ptp.openshift.io/v1
    kind: NodePtpDevice
    metadata:
      creationTimestamp: "2021-09-14T16:52:33Z"
      generation: 1
      name: compute-0.example.com
      namespace: openshift-ptp
      resourceVersion: "177400"
      uid: 30413db0-4d8d-46da-9bef-737bacd548fd
    spec: {}
    status:
      devices:
      - name: eno1
      - name: eno2
      - name: eno3
      - name: eno4
      - name: enp5s0f0
      - name: enp5s0f1
    Copy to Clipboard Toggle word wrap

  4. Check that the PTP interface is successfully synchronized to the primary clock by accessing the linuxptp-daemon pod for the corresponding node.

    1. Get the name of the linuxptp-daemon pod and corresponding node you want to troubleshoot by running the following command:

      $ oc get pods -n openshift-ptp -o wide
      Copy to Clipboard Toggle word wrap

      Example output

      NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE
      linuxptp-daemon-lmvgn           3/3     Running   0          4d17h   10.1.196.24   compute-0.example.com
      linuxptp-daemon-qhfg7           3/3     Running   0          4d17h   10.1.196.25   compute-1.example.com
      ptp-operator-6b8dcbf7f4-zndk7   1/1     Running   0          5d7h    10.129.0.61   control-plane-1.example.com
      Copy to Clipboard Toggle word wrap

    2. Remote shell into the required linuxptp-daemon container:

      $ oc rsh -n openshift-ptp -c linuxptp-daemon-container <linux_daemon_container>
      Copy to Clipboard Toggle word wrap

      where:

      <linux_daemon_container>
      is the container you want to diagnose, for example linuxptp-daemon-lmvgn.
    3. In the remote shell connection to the linuxptp-daemon container, use the PTP Management Client (pmc) tool to diagnose the network interface. Run the following pmc command to check the sync status of the PTP device, for example ptp4l.

      # pmc -u -f /var/run/ptp4l.0.config -b 0 'GET PORT_DATA_SET'
      Copy to Clipboard Toggle word wrap

      Example output when the node is successfully synced to the primary clock

      sending: GET PORT_DATA_SET
          40a6b7.fffe.166ef0-1 seq 0 RESPONSE MANAGEMENT PORT_DATA_SET
              portIdentity            40a6b7.fffe.166ef0-1
              portState               SLAVE
              logMinDelayReqInterval  -4
              peerMeanPathDelay       0
              logAnnounceInterval     -3
              announceReceiptTimeout  3
              logSyncInterval         -4
              delayMechanism          1
              logMinPdelayReqInterval -4
              versionNumber           2
      Copy to Clipboard Toggle word wrap

You can use the oc adm must-gather CLI command to collect information about your cluster, including features and objects associated with Precision Time Protocol (PTP) Operator.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the OpenShift CLI (oc).
  • You have installed the PTP Operator.

Procedure

  • To collect PTP Operator data with must-gather, you must specify the PTP Operator must-gather image.

    $ oc adm must-gather --image=registry.redhat.io/openshift4/ptp-must-gather-rhel8:v4.11
    Copy to Clipboard Toggle word wrap

Cloud native applications such as virtual RAN require access to notifications about hardware timing events that are critical to the functioning of the overall network. Fast event notifications are early warning signals about impending and real-time Precision Time Protocol (PTP) clock synchronization events. PTP clock synchronization errors can negatively affect the performance and reliability of your low latency application, for example, a vRAN application running in a distributed unit (DU).

Loss of PTP synchronization is a critical error for a RAN network. If synchronization is lost on a node, the radio might be shut down and the network Over the Air (OTA) traffic might be shifted to another node in the wireless network. Fast event notifications mitigate against workload errors by allowing cluster nodes to communicate PTP clock sync status to the vRAN application running in the DU.

Event notifications are available to RAN applications running on the same DU node. A publish/subscribe REST API passes events notifications to the messaging bus. Publish/subscribe messaging, or pub/sub messaging, is an asynchronous service to service communication architecture where any message published to a topic is immediately received by all the subscribers to the topic.

Fast event notifications are generated by the PTP Operator in OpenShift Container Platform for every PTP-capable network interface. The events are made available using a cloud-event-proxy sidecar container over an Advanced Message Queuing Protocol (AMQP) message bus. The AMQP message bus is provided by the AMQ Interconnect Operator.

Note

PTP fast event notifications are available for network interfaces configured to use PTP ordinary clocks or PTP boundary clocks.

You can subscribe distributed unit (DU) applications to Precision Time Protocol (PTP) fast events notifications that are generated by OpenShift Container Platform with the PTP Operator and cloud-event-proxy sidecar container. You enable the cloud-event-proxy sidecar container by setting the enableEventPublisher field to true in the ptpOperatorConfig custom resource (CR) and specifying an Advanced Message Queuing Protocol (AMQP) transportHost address. PTP fast events use an AMQP event notification bus provided by the AMQ Interconnect Operator. AMQ Interconnect is a component of Red Hat AMQ, a messaging router that provides flexible routing of messages between any AMQP-enabled endpoints. An overview of the PTP fast events framework is below:

Figure 15.1. Overview of PTP fast events

The cloud-event-proxy sidecar container can access the same resources as the primary vRAN application without using any of the resources of the primary application and with no significant latency.

The fast events notifications framework uses a REST API for communication and is based on the O-RAN REST API specification. The framework consists of a publisher, subscriber, and an AMQ messaging bus to handle communications between the publisher and subscriber applications. The cloud-event-proxy sidecar is a utility container that runs in a pod that is loosely coupled to the main DU application container on the DU node. It provides an event publishing framework that allows you to subscribe DU applications to published PTP events.

DU applications run the cloud-event-proxy container in a sidecar pattern to subscribe to PTP events. The following workflow describes how a DU application uses PTP fast events:

  1. DU application requests a subscription: The DU sends an API request to the cloud-event-proxy sidecar to create a PTP events subscription. The cloud-event-proxy sidecar creates a subscription resource.
  2. cloud-event-proxy sidecar creates the subscription: The event resource is persisted by the cloud-event-proxy sidecar. The cloud-event-proxy sidecar container sends an acknowledgment with an ID and URL location to access the stored subscription resource. The sidecar creates an AMQ messaging listener protocol for the resource specified in the subscription.
  3. DU application receives the PTP event notification: The cloud-event-proxy sidecar container listens to the address specified in the resource qualifier. The DU events consumer processes the message and passes it to the return URL specified in the subscription.
  4. cloud-event-proxy sidecar validates the PTP event and posts it to the DU application: The cloud-event-proxy sidecar receives the event, unwraps the cloud events object to retrieve the data, and fetches the return URL to post the event back to the DU consumer application.
  5. DU application uses the PTP event: The DU application events consumer receives and processes the PTP event.

15.7.3. Installing the AMQ messaging bus

To pass PTP fast event notifications between publisher and subscriber on a node, you must install and configure an AMQ messaging bus to run locally on the node. You do this by installing the AMQ Interconnect Operator for use in the cluster.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

Verification

  1. Check that the AMQ Interconnect Operator is available and the required pods are running:

    $ oc get pods -n amq-interconnect
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                                    READY   STATUS    RESTARTS   AGE
    amq-interconnect-645db76c76-k8ghs       1/1     Running   0          23h
    interconnect-operator-5cb5fc7cc-4v7qm   1/1     Running   0          23h
    Copy to Clipboard Toggle word wrap

  2. Check that the required linuxptp-daemon PTP event producer pods are running in the openshift-ptp namespace.

    $ oc get pods -n openshift-ptp
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                     READY   STATUS    RESTARTS       AGE
    linuxptp-daemon-2t78p    3/3     Running   0              12h
    linuxptp-daemon-k8n88    3/3     Running   0              12h
    Copy to Clipboard Toggle word wrap

To start using PTP fast event notifications for a network interface in your cluster, you must enable the fast event publisher in the PTP Operator PtpOperatorConfig custom resource (CR) and configure ptpClockThreshold values in a PtpConfig CR that you create.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the PTP Operator and AMQ Interconnect Operator.

Procedure

  1. Modify the default PTP Operator config to enable PTP fast events.

    1. Save the following YAML in the ptp-operatorconfig.yaml file:

      apiVersion: ptp.openshift.io/v1
      kind: PtpOperatorConfig
      metadata:
        name: default
        namespace: openshift-ptp
      spec:
        daemonNodeSelector:
          node-role.kubernetes.io/worker: ""
        ptpEventConfig:
          enableEventPublisher: true 
      1
      
          transportHost: amqp://<instance_name>.<namespace>.svc.cluster.local 
      2
      Copy to Clipboard Toggle word wrap
      1
      Set enableEventPublisher to true to enable PTP fast event notifications.
      2
      Set transportHost to the AMQ router that you configured where <instance_name> and <namespace> correspond to the AMQ Interconnect router instance name and namespace, for example, amqp://amq-interconnect.amq-interconnect.svc.cluster.local
    2. Update the PtpOperatorConfig CR:

      $ oc apply -f ptp-operatorconfig.yaml
      Copy to Clipboard Toggle word wrap
  2. Create a PtpConfig custom resource (CR) for the PTP enabled interface, and set the required values for ptpClockThreshold and ptp4lOpts. The following YAML illustrates the required values that you must set in the PtpConfig CR:

    spec:
      profile:
      - name: "profile1"
        interface: "enp5s0f0"
        ptp4lOpts: "-2 -s --summary_interval -4" 
    1
    
        phc2sysOpts: "-a -r -m -n 24 -N 8 -R 16" 
    2
    
        ptp4lConf: "" 
    3
    
        ptpClockThreshold: 
    4
    
          holdOverTimeout: 5
          maxOffsetThreshold: 100
          minOffsetThreshold: -100
    Copy to Clipboard Toggle word wrap
    1
    Append --summary_interval -4 to use PTP fast events.
    2
    Required phc2sysOpts values. -m prints messages to stdout. The linuxptp-daemon DaemonSet parses the logs and generates Prometheus metrics.
    3
    Specify a string that contains the configuration to replace the default /etc/ptp4l.conf file. To use the default configuration, leave the field empty.
    4
    Optional. If the ptpClockThreshold stanza is not present, default values are used for the ptpClockThreshold fields. The stanza shows default ptpClockThreshold values. The ptpClockThreshold values configure how long after the PTP master clock is disconnected before PTP events are triggered. holdOverTimeout is the time value in seconds before the PTP clock event state changes to FREERUN when the PTP master clock is disconnected. The maxOffsetThreshold and minOffsetThreshold settings configure offset values in nanoseconds that compare against the values for CLOCK_REALTIME (phc2sys) or master offset (ptp4l). When the ptp4l or phc2sys offset value is outside this range, the PTP clock state is set to FREERUN. When the offset value is within this range, the PTP clock state is set to LOCKED.

Use the PTP event notifications REST API to subscribe a distributed unit (DU) application to the PTP events that are generated on the parent node.

Subscribe applications to PTP events by using the resource address /cluster/node/<node_name>/ptp, where <node_name> is the cluster node running the DU application.

Deploy your cloud-event-consumer DU application container and cloud-event-proxy sidecar container in a separate DU application pod. The cloud-event-consumer DU application subscribes to the cloud-event-proxy container in the application pod.

Use the following API endpoints to subscribe the cloud-event-consumer DU application to PTP events posted by the cloud-event-proxy container at http://localhost:8089/api/ocloudNotifications/v1/ in the DU application pod:

  • /api/ocloudNotifications/v1/subscriptions

    • POST: Creates a new subscription
    • GET: Retrieves a list of subscriptions
  • /api/ocloudNotifications/v1/subscriptions/<subscription_id>

    • GET: Returns details for the specified subscription ID
  • /api/ocloudNotifications/v1/health

    • GET: Returns the health status of ocloudNotifications API
  • api/ocloudNotifications/v1/publishers

    • GET: Returns an array of os-clock-sync-state, ptp-clock-class-change, and lock-state messages for the cluster node
  • /api/ocloudnotifications/v1/<resource_address>/CurrentState

    • GET: Returns the current state of one the following event types: os-clock-sync-state, ptp-clock-class-change, or lock-state events
Note

9089 is the default port for the cloud-event-consumer container deployed in the application pod. You can configure a different port for your DU application as required.

15.7.5.1. api/ocloudNotifications/v1/subscriptions
HTTP method

GET api/ocloudNotifications/v1/subscriptions

Description

Returns a list of subscriptions. If subscriptions exist, a 200 OK status code is returned along with the list of subscriptions.

Example API response

[
 {
  "id": "75b1ad8f-c807-4c23-acf5-56f4b7ee3826",
  "endpointUri": "http://localhost:9089/event",
  "uriLocation": "http://localhost:8089/api/ocloudNotifications/v1/subscriptions/75b1ad8f-c807-4c23-acf5-56f4b7ee3826",
  "resource": "/cluster/node/compute-1.example.com/ptp"
 }
]
Copy to Clipboard Toggle word wrap

HTTP method

POST api/ocloudNotifications/v1/subscriptions

Description

Creates a new subscription. If a subscription is successfully created, or if it already exists, a 201 Created status code is returned.

Expand
Table 15.4. Query parameters
ParameterType

subscription

data

Example payload

{
  "uriLocation": "http://localhost:8089/api/ocloudNotifications/v1/subscriptions",
  "resource": "/cluster/node/compute-1.example.com/ptp"
}
Copy to Clipboard Toggle word wrap

HTTP method

GET api/ocloudNotifications/v1/subscriptions/<subscription_id>

Description

Returns details for the subscription with ID <subscription_id>

Expand
Table 15.5. Query parameters
ParameterType

<subscription_id>

string

Example API response

{
  "id":"48210fb3-45be-4ce0-aa9b-41a0e58730ab",
  "endpointUri": "http://localhost:9089/event",
  "uriLocation":"http://localhost:8089/api/ocloudNotifications/v1/subscriptions/48210fb3-45be-4ce0-aa9b-41a0e58730ab",
  "resource":"/cluster/node/compute-1.example.com/ptp"
}
Copy to Clipboard Toggle word wrap

15.7.5.3. api/ocloudNotifications/v1/health/
HTTP method

GET api/ocloudNotifications/v1/health/

Description

Returns the health status for the ocloudNotifications REST API.

Example API response

OK
Copy to Clipboard Toggle word wrap

15.7.5.4. api/ocloudNotifications/v1/publishers
HTTP method

GET api/ocloudNotifications/v1/publishers

Description

Returns an array of os-clock-sync-state, ptp-clock-class-change, and lock-state details for the cluster node. The system generates notifications when the relevant equipment state changes.

  • os-clock-sync-state notifications describe the host operating system clock synchronization state. Can be in LOCKED or FREERUN state.
  • ptp-clock-class-change notifications describe the current state of the PTP clock class.
  • lock-state notifications describe the current status of the PTP equipment lock state. Can be in LOCKED, HOLDOVER or FREERUN state.

Example API response

[
  {
    "id": "0fa415ae-a3cf-4299-876a-589438bacf75",
    "endpointUri": "http://localhost:9085/api/ocloudNotifications/v1/dummy",
    "uriLocation": "http://localhost:9085/api/ocloudNotifications/v1/publishers/0fa415ae-a3cf-4299-876a-589438bacf75",
    "resource": "/cluster/node/compute-1.example.com/sync/sync-status/os-clock-sync-state"
  },
  {
    "id": "28cd82df-8436-4f50-bbd9-7a9742828a71",
    "endpointUri": "http://localhost:9085/api/ocloudNotifications/v1/dummy",
    "uriLocation": "http://localhost:9085/api/ocloudNotifications/v1/publishers/28cd82df-8436-4f50-bbd9-7a9742828a71",
    "resource": "/cluster/node/compute-1.example.com/sync/ptp-status/ptp-clock-class-change"
  },
  {
    "id": "44aa480d-7347-48b0-a5b0-e0af01fa9677",
    "endpointUri": "http://localhost:9085/api/ocloudNotifications/v1/dummy",
    "uriLocation": "http://localhost:9085/api/ocloudNotifications/v1/publishers/44aa480d-7347-48b0-a5b0-e0af01fa9677",
    "resource": "/cluster/node/compute-1.example.com/sync/ptp-status/lock-state"
  }
]
Copy to Clipboard Toggle word wrap

You can find os-clock-sync-state, ptp-clock-class-change and lock-state events in the logs for the cloud-event-proxy container. For example:

$ oc logs -f linuxptp-daemon-cvgr6 -n openshift-ptp -c cloud-event-proxy
Copy to Clipboard Toggle word wrap

Example os-clock-sync-state event

{
   "id":"c8a784d1-5f4a-4c16-9a81-a3b4313affe5",
   "type":"event.sync.sync-status.os-clock-sync-state-change",
   "source":"/cluster/compute-1.example.com/ptp/CLOCK_REALTIME",
   "dataContentType":"application/json",
   "time":"2022-05-06T15:31:23.906277159Z",
   "data":{
      "version":"v1",
      "values":[
         {
            "resource":"/sync/sync-status/os-clock-sync-state",
            "dataType":"notification",
            "valueType":"enumeration",
            "value":"LOCKED"
         },
         {
            "resource":"/sync/sync-status/os-clock-sync-state",
            "dataType":"metric",
            "valueType":"decimal64.3",
            "value":"-53"
         }
      ]
   }
}
Copy to Clipboard Toggle word wrap

Example ptp-clock-class-change event

{
   "id":"69eddb52-1650-4e56-b325-86d44688d02b",
   "type":"event.sync.ptp-status.ptp-clock-class-change",
   "source":"/cluster/compute-1.example.com/ptp/ens2fx/master",
   "dataContentType":"application/json",
   "time":"2022-05-06T15:31:23.147100033Z",
   "data":{
      "version":"v1",
      "values":[
         {
            "resource":"/sync/ptp-status/ptp-clock-class-change",
            "dataType":"metric",
            "valueType":"decimal64.3",
            "value":"135"
         }
      ]
   }
}
Copy to Clipboard Toggle word wrap

Example lock-state event

{
   "id":"305ec18b-1472-47b3-aadd-8f37933249a9",
   "type":"event.sync.ptp-status.ptp-state-change",
   "source":"/cluster/compute-1.example.com/ptp/ens2fx/master",
   "dataContentType":"application/json",
   "time":"2022-05-06T15:31:23.467684081Z",
   "data":{
      "version":"v1",
      "values":[
         {
            "resource":"/sync/ptp-status/lock-state",
            "dataType":"notification",
            "valueType":"enumeration",
            "value":"LOCKED"
         },
         {
            "resource":"/sync/ptp-status/lock-state",
            "dataType":"metric",
            "valueType":"decimal64.3",
            "value":"62"
         }
      ]
   }
}
Copy to Clipboard Toggle word wrap

HTTP method

GET api/ocloudNotifications/v1/cluster/node/<node_name>/sync/ptp-status/lock-state/CurrentState

GET api/ocloudNotifications/v1/cluster/node/<node_name>/sync/sync-status/os-clock-sync-state/CurrentState

GET api/ocloudNotifications/v1/cluster/node/<node_name>/sync/ptp-status/ptp-clock-class-change/CurrentState

Description

Configure the CurrentState API endpoint to return the current state of the os-clock-sync-state, ptp-clock-class-change, or lock-state events for the cluster node.

  • os-clock-sync-state notifications describe the host operating system clock synchronization state. Can be in LOCKED or FREERUN state.
  • ptp-clock-class-change notifications describe the current state of the PTP clock class.
  • lock-state notifications describe the current status of the PTP equipment lock state. Can be in LOCKED, HOLDOVER or FREERUN state.
Expand
Table 15.6. Query parameters
ParameterType

<resource_address>

string

Example lock-state API response

{
  "id": "c1ac3aa5-1195-4786-84f8-da0ea4462921",
  "type": "event.sync.ptp-status.ptp-state-change",
  "source": "/cluster/node/compute-1.example.com/sync/ptp-status/lock-state",
  "dataContentType": "application/json",
  "time": "2023-01-10T02:41:57.094981478Z",
  "data": {
    "version": "v1",
    "values": [
      {
        "resource": "/cluster/node/compute-1.example.com/ens5fx/master",
        "dataType": "notification",
        "valueType": "enumeration",
        "value": "LOCKED"
      },
      {
        "resource": "/cluster/node/compute-1.example.com/ens5fx/master",
        "dataType": "metric",
        "valueType": "decimal64.3",
        "value": "29"
      }
    ]
  }
}
Copy to Clipboard Toggle word wrap

Example os-clock-sync-state API response

{
  "specversion": "0.3",
  "id": "4f51fe99-feaa-4e66-9112-66c5c9b9afcb",
  "source": "/cluster/node/compute-1.example.com/sync/sync-status/os-clock-sync-state",
  "type": "event.sync.sync-status.os-clock-sync-state-change",
  "subject": "/cluster/node/compute-1.example.com/sync/sync-status/os-clock-sync-state",
  "datacontenttype": "application/json",
  "time": "2022-11-29T17:44:22.202Z",
  "data": {
    "version": "v1",
    "values": [
      {
        "resource": "/cluster/node/compute-1.example.com/CLOCK_REALTIME",
        "dataType": "notification",
        "valueType": "enumeration",
        "value": "LOCKED"
      },
      {
        "resource": "/cluster/node/compute-1.example.com/CLOCK_REALTIME",
        "dataType": "metric",
        "valueType": "decimal64.3",
        "value": "27"
      }
    ]
  }
}
Copy to Clipboard Toggle word wrap

Example ptp-clock-class-change API response

{
  "id": "064c9e67-5ad4-4afb-98ff-189c6aa9c205",
  "type": "event.sync.ptp-status.ptp-clock-class-change",
  "source": "/cluster/node/compute-1.example.com/sync/ptp-status/ptp-clock-class-change",
  "dataContentType": "application/json",
  "time": "2023-01-10T02:41:56.785673989Z",
  "data": {
    "version": "v1",
    "values": [
      {
        "resource": "/cluster/node/compute-1.example.com/ens5fx/master",
        "dataType": "metric",
        "valueType": "decimal64.3",
        "value": "165"
      }
    ]
  }
}
Copy to Clipboard Toggle word wrap

You can monitor fast events bus metrics directly from cloud-event-proxy containers using the oc CLI.

Note

PTP fast event notification metrics are also available in the OpenShift Container Platform web console.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install and configure the PTP Operator.

Procedure

  1. Get the list of active linuxptp-daemon pods.

    $ oc get pods -n openshift-ptp
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                    READY   STATUS    RESTARTS   AGE
    linuxptp-daemon-2t78p   3/3     Running   0          8h
    linuxptp-daemon-k8n88   3/3     Running   0          8h
    Copy to Clipboard Toggle word wrap

  2. Access the metrics for the required cloud-event-proxy container by running the following command:

    $ oc exec -it <linuxptp-daemon> -n openshift-ptp -c cloud-event-proxy -- curl 127.0.0.1:9091/metrics
    Copy to Clipboard Toggle word wrap

    where:

    <linuxptp-daemon>

    Specifies the pod you want to query, for example, linuxptp-daemon-2t78p.

    Example output

    # HELP cne_amqp_events_published Metric to get number of events published by the transport
    # TYPE cne_amqp_events_published gauge
    cne_amqp_events_published{address="/cluster/node/compute-1.example.com/ptp/status",status="success"} 1041
    # HELP cne_amqp_events_received Metric to get number of events received  by the transport
    # TYPE cne_amqp_events_received gauge
    cne_amqp_events_received{address="/cluster/node/compute-1.example.com/ptp",status="success"} 1019
    # HELP cne_amqp_receiver Metric to get number of receiver created
    # TYPE cne_amqp_receiver gauge
    cne_amqp_receiver{address="/cluster/node/mock",status="active"} 1
    cne_amqp_receiver{address="/cluster/node/compute-1.example.com/ptp",status="active"} 1
    cne_amqp_receiver{address="/cluster/node/compute-1.example.com/redfish/event",status="active"}
    ...
    Copy to Clipboard Toggle word wrap

You can monitor PTP fast event metrics in the OpenShift Container Platform web console by using the pre-configured and self-updating Prometheus monitoring stack.

Prerequisites

  • Install the OpenShift Container Platform CLI oc.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Enter the following command to return the list of available PTP metrics from the cloud-event-proxy sidecar container:

    $ oc exec -it <linuxptp_daemon_pod> -n openshift-ptp -c cloud-event-proxy -- curl 127.0.0.1:9091/metrics
    Copy to Clipboard Toggle word wrap

    where:

    <linuxptp_daemon_pod>
    Specifies the pod you want to query, for example, linuxptp-daemon-2t78p.
  2. Copy the name of the PTP metric you want to query from the list of returned metrics, for example, cne_amqp_events_received.
  3. In the OpenShift Container Platform web console, click ObserveMetrics.
  4. Paste the PTP metric into the Expression field, and click Run queries.

Chapter 16. External DNS Operator

The External DNS Operator deploys and manages ExternalDNS to provide the name resolution for services and routes from the external DNS provider to OpenShift Container Platform.

16.1.1. External DNS Operator

The External DNS Operator implements the External DNS API from the olm.openshift.io API group. The External DNS Operator deploys the ExternalDNS using a deployment resource. The ExternalDNS deployment watches the resources such as services and routes in the cluster and updates the external DNS providers.

Procedure

You can deploy the ExternalDNS Operator on demand from the OperatorHub, this creates a Subscription object.

  1. Check the name of an install plan:

    $ oc -n external-dns-operator get sub external-dns-operator -o yaml | yq '.status.installplan.name'
    Copy to Clipboard Toggle word wrap

    Example output

    install-zcvlr
    Copy to Clipboard Toggle word wrap

  2. Check the status of an install plan, the status of an install plan must be Complete:

    $ oc -n external-dns-operator get ip <install_plan_name> -o yaml | yq '.status.phase'
    Copy to Clipboard Toggle word wrap

    Example output

    Complete
    Copy to Clipboard Toggle word wrap

  3. Use the oc get command to view the Deployment status:

    $ oc get -n external-dns-operator deployment/external-dns-operator
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                    READY     UP-TO-DATE   AVAILABLE   AGE
    external-dns-operator   1/1       1            1           23h
    Copy to Clipboard Toggle word wrap

16.1.2. External DNS Operator logs

You can view External DNS Operator logs by using the oc logs command.

Procedure

  1. View the logs of the External DNS Operator:

    $ oc logs -n external-dns-operator deployment/external-dns-operator -c external-dns-operator
    Copy to Clipboard Toggle word wrap

External DNS Operator uses the TXT registry, which follows the new format and adds the prefix for the TXT records. This reduces the maximum length of the domain name for the TXT records. A DNS record cannot be present without a corresponding TXT record, so the domain name of the DNS record must follow the same limit as the TXT records. For example, DNS record is <domain-name-from-source> and the TXT record is external-dns-<record-type>-<domain-name-from-source>.

The domain name of the DNS records generated by External DNS Operator has the following limitations:

Expand
Record typeNumber of characters

CNAME

44

Wildcard CNAME records on AzureDNS

42

A

48

Wildcard A records on AzureDNS

46

If the domain name generated by External DNS exceeds the domain name limitation, the External DNS instance gives the following error:

$ oc -n external-dns-operator logs external-dns-aws-7ddbd9c7f8-2jqjh 
1
Copy to Clipboard Toggle word wrap
1
The external-dns-aws-7ddbd9c7f8-2jqjh parameter specifies the name of the External DNS pod.

Example output

time="2022-09-02T08:53:57Z" level=info msg="Desired change: CREATE external-dns-cname-hello-openshift-aaaaaaaaaa-bbbbbbbbbb-ccccccc.test.example.io TXT [Id: /hostedzone/Z06988883Q0H0RL6UMXXX]"
time="2022-09-02T08:53:57Z" level=info msg="Desired change: CREATE external-dns-hello-openshift-aaaaaaaaaa-bbbbbbbbbb-ccccccc.test.example.io TXT [Id: /hostedzone/Z06988883Q0H0RL6UMXXX]"
time="2022-09-02T08:53:57Z" level=info msg="Desired change: CREATE hello-openshift-aaaaaaaaaa-bbbbbbbbbb-ccccccc.test.example.io A [Id: /hostedzone/Z06988883Q0H0RL6UMXXX]"
time="2022-09-02T08:53:57Z" level=error msg="Failure in zone test.example.io. [Id: /hostedzone/Z06988883Q0H0RL6UMXXX]"
time="2022-09-02T08:53:57Z" level=error msg="InvalidChangeBatch: [FATAL problem: DomainLabelTooLong (Domain label is too long) encountered with 'external-dns-a-hello-openshift-aaaaaaaaaa-bbbbbbbbbb-ccccccc']\n\tstatus code: 400, request id: e54dfd5a-06c6-47b0-bcb9-a4f7c3a4e0c6"
Copy to Clipboard Toggle word wrap

You can install External DNS Operator on cloud providers such as AWS, Azure and GCP.

16.2.1. Installing the External DNS Operator

You can install the External DNS Operator using the OpenShift Container Platform OperatorHub.

Procedure

  1. Click OperatorsOperatorHub in the OpenShift Container Platform Web Console.
  2. Click External DNS Operator. You can use the Filter by keyword text box or the filter list to search for External DNS Operator from the list of Operators.
  3. Select the external-dns-operator namespace.
  4. On the External DNS Operator page, click Install.
  5. On the Install Operator page, ensure that you selected the following options:

    1. Update the channel as stable-v1.0.
    2. Installation mode as A specific name on the cluster.
    3. Installed namespace as external-dns-operator. If namespace external-dns-operator does not exist, it gets created during the Operator installation.
    4. Select Approval Strategy as Automatic or Manual. Approval Strategy is set to Automatic by default.
    5. Click Install.

If you select Automatic updates, the Operator Lifecycle Manager (OLM) automatically upgrades the running instance of your Operator without any intervention.

If you select Manual updates, the OLM creates an update request. As a cluster administrator, you must then manually approve that update request to have the Operator updated to the new version.

Verification

Verify that External DNS Operator shows the Status as Succeeded on the Installed Operators dashboard.

The External DNS Operators includes the following configuration parameters:

The External DNS Operator includes the following configuration parameters:

Expand
ParameterDescription

spec

Enables the type of a cloud provider.

spec:
  provider:
    type: AWS 
1

    aws:
      credentials:
        name: aws-access-key 
2
Copy to Clipboard Toggle word wrap
1
Defines available options such as AWS, GCP and Azure.
2
Defines a name of the secret which contains credentials for your cloud provider.

zones

Enables you to specify DNS zones by their domains. If you do not specify zones, ExternalDNS discovers all the zones present in your cloud provider account.

zones:
- "myzoneid" 
1
Copy to Clipboard Toggle word wrap
1
Specifies the IDs of DNS zones.

domains

Enables you to specify AWS zones by their domains. If you do not specify domains, ExternalDNS discovers all the zones present in your cloud provider account.

domains:
- filterType: Include 
1

  matchType: Exact 
2

  name: "myzonedomain1.com" 
3

- filterType: Include
  matchType: Pattern 
4

  pattern: ".*\\.otherzonedomain\\.com" 
5
Copy to Clipboard Toggle word wrap
1
Instructs ExternalDNS to include the domain specified.
2
Instructs ExtrnalDNS that the domain matching has to be exact as opposed to regular expression match.
3
Defines the exact domain name by which ExternalDNS filters.
4
Sets regex-domain-filter flag in ExternalDNS. You can limit possible domains by using a Regex filter.
5
Defines the regex pattern to be used by ExternalDNS to filter the domains of the target zones.

source

Enables you to specify the source for the DNS records, Service or Route.

source: 
1

  type: Service 
2

  service:
    serviceType:
3

      - LoadBalancer
      - ClusterIP
  labelFilter: 
4

    matchLabels:
      external-dns.mydomain.org/publish: "yes"
  hostnameAnnotation: "Allow" 
5

  fqdnTemplate:
  - "{{.Name}}.myzonedomain.com" 
6
Copy to Clipboard Toggle word wrap
1
Defines the settings for the source of DNS records.
2
The ExternalDNS uses Service type as source for creating dns records.
3
Sets service-type-filter flag in ExternalDNS. The serviceType contains the following fields:
  • default: LoadBalancer
  • expected: ClusterIP
  • NodePort
  • LoadBalancer
  • ExternalName
4
Ensures that the controller considers only those resources which matches with label filter.
5
The default value for hostnameAnnotation is Ignore which instructs ExternalDNS to generate DNS records using the templates specified in the field fqdnTemplates. When the value is Allow the DNS records get generated based on the value specified in the external-dns.alpha.kubernetes.io/hostname annotation.
6
External DNS Operator uses a string to generate DNS names from sources that don’t define a hostname, or to add a hostname suffix when paired with the fake source.
source:
  type: OpenShiftRoute 
1

  openshiftRouteOptions:
    routerName: default 
2

    labelFilter:
      matchLabels:
        external-dns.mydomain.org/publish: "yes"
Copy to Clipboard Toggle word wrap
1
ExternalDNS` uses type route as source for creating dns records.
2
If the source is OpenShiftRoute, then you can pass the Ingress Controller name. The ExternalDNS uses canonical name of Ingress Controller as the target for CNAME record.

16.4. Creating DNS records on AWS

You can create DNS records on AWS and AWS GovCloud by using External DNS Operator.

You can create DNS records on a public hosted zone for AWS by using the Red Hat External DNS Operator. You can use the same instructions to create DNS records on a hosted zone for AWS GovCloud.

Procedure

  1. Check the user. The user must have access to the kube-system namespace. If you don’t have the credentials, as you can fetch the credentials from the kube-system namespace to use the cloud provider client:

    $ oc whoami
    Copy to Clipboard Toggle word wrap

    Example output

    system:admin
    Copy to Clipboard Toggle word wrap

  2. Fetch the values from aws-creds secret present in kube-system namespace.

    $ export AWS_ACCESS_KEY_ID=$(oc get secrets aws-creds -n kube-system  --template={{.data.aws_access_key_id}} | base64 -d)
    $ export AWS_SECRET_ACCESS_KEY=$(oc get secrets aws-creds -n kube-system  --template={{.data.aws_secret_access_key}} | base64 -d)
    Copy to Clipboard Toggle word wrap
  3. Get the routes to check the domain:

    $ oc get routes --all-namespaces | grep console
    Copy to Clipboard Toggle word wrap

    Example output

    openshift-console          console             console-openshift-console.apps.testextdnsoperator.apacshift.support                       console             https   reencrypt/Redirect     None
    openshift-console          downloads           downloads-openshift-console.apps.testextdnsoperator.apacshift.support                     downloads           http    edge/Redirect          None
    Copy to Clipboard Toggle word wrap

  4. Get the list of dns zones to find the one which corresponds to the previously found route’s domain:

    $ aws route53 list-hosted-zones | grep testextdnsoperator.apacshift.support
    Copy to Clipboard Toggle word wrap

    Example output

    HOSTEDZONES	terraform	/hostedzone/Z02355203TNN1XXXX1J6O	testextdnsoperator.apacshift.support.	5
    Copy to Clipboard Toggle word wrap

  5. Create ExternalDNS resource for route source:

    $ cat <<EOF | oc create -f -
    apiVersion: externaldns.olm.openshift.io/v1beta1
    kind: ExternalDNS
    metadata:
      name: sample-aws 
    1
    
    spec:
      domains:
      - filterType: Include   
    2
    
        matchType: Exact   
    3
    
        name: testextdnsoperator.apacshift.support 
    4
    
      provider:
        type: AWS 
    5
    
      source:  
    6
    
        type: OpenShiftRoute 
    7
    
        openshiftRouteOptions:
          routerName: default 
    8
    
    EOF
    Copy to Clipboard Toggle word wrap
    1
    Defines the name of external DNS resource.
    2
    By default all hosted zones are selected as potential targets. You can include a hosted zone that you need.
    3
    The matching of the target zone’s domain has to be exact (as opposed to regular expression match).
    4
    Specify the exact domain of the zone you want to update. The hostname of the routes must be subdomains of the specified domain.
    5
    Defines the AWS Route53 DNS provider.
    6
    Defines options for the source of DNS records.
    7
    Defines OpenShift route resource as the source for the DNS records which gets created in the previously specified DNS provider.
    8
    If the source is OpenShiftRoute, then you can pass the OpenShift Ingress Controller name. External DNS Operator selects the canonical hostname of that router as the target while creating CNAME record.
  6. Check the records created for OCP routes using the following command:

    $ aws route53 list-resource-record-sets --hosted-zone-id Z02355203TNN1XXXX1J6O --query "ResourceRecordSets[?Type == 'CNAME']" | grep console
    Copy to Clipboard Toggle word wrap

16.5. Creating DNS records on Azure

You can create DNS records on Azure using External DNS Operator.

You can create DNS records on a public DNS zone for Azure by using Red Hat External DNS Operator.

Procedure

  1. Check the user. The user must have access to the kube-system namespace. If you don’t have the credentials, as you can fetch the credentials from the kube-system namespace to use the cloud provider client:

    $ oc whoami
    Copy to Clipboard Toggle word wrap

    Example output

    system:admin
    Copy to Clipboard Toggle word wrap

  2. Fetch the values from azure-credentials secret present in kube-system namespace.

    $ CLIENT_ID=$(oc get secrets azure-credentials  -n kube-system  --template={{.data.azure_client_id}} | base64 -d)
    $ CLIENT_SECRET=$(oc get secrets azure-credentials  -n kube-system  --template={{.data.azure_client_secret}} | base64 -d)
    $ RESOURCE_GROUP=$(oc get secrets azure-credentials  -n kube-system  --template={{.data.azure_resourcegroup}} | base64 -d)
    $ SUBSCRIPTION_ID=$(oc get secrets azure-credentials  -n kube-system  --template={{.data.azure_subscription_id}} | base64 -d)
    $ TENANT_ID=$(oc get secrets azure-credentials  -n kube-system  --template={{.data.azure_tenant_id}} | base64 -d)
    Copy to Clipboard Toggle word wrap
  3. Login to azure with base64 decoded values:

    $ az login --service-principal -u "${CLIENT_ID}" -p "${CLIENT_SECRET}" --tenant "${TENANT_ID}"
    Copy to Clipboard Toggle word wrap
  4. Get the routes to check the domain:

    $ oc get routes --all-namespaces | grep console
    Copy to Clipboard Toggle word wrap

    Example output

    openshift-console          console             console-openshift-console.apps.test.azure.example.com                       console             https   reencrypt/Redirect     None
    openshift-console          downloads           downloads-openshift-console.apps.test.azure.example.com                     downloads           http    edge/Redirect          None
    Copy to Clipboard Toggle word wrap

  5. Get the list of dns zones to find the one which corresponds to the previously found route’s domain:

    $ az network dns zone list --resource-group "${RESOURCE_GROUP}"
    Copy to Clipboard Toggle word wrap
  6. Create ExternalDNS resource for route source:

    apiVersion: externaldns.olm.openshift.io/v1beta1
    kind: ExternalDNS
    metadata:
      name: sample-azure 
    1
    
    spec:
      zones:
      - "/subscriptions/1234567890/resourceGroups/test-azure-xxxxx-rg/providers/Microsoft.Network/dnszones/test.azure.example.com" 
    2
    
      provider:
        type: Azure 
    3
    
      source:
        openshiftRouteOptions: 
    4
    
          routerName: default 
    5
    
        type: OpenShiftRoute 
    6
    
    EOF
    Copy to Clipboard Toggle word wrap
    1
    Specifies the name of External DNS CR.
    2
    Define the zone ID.
    3
    Defines the Azure DNS provider.
    4
    You can define options for the source of DNS records.
    5
    If the source is OpenShiftRoute then you can pass the OpenShift Ingress Controller name. External DNS selects the canonical hostname of that router as the target while creating CNAME record.
    6
    Defines OpenShift route resource as the source for the DNS records which gets created in the previously specified DNS provider.
  7. Check the records created for OCP routes using the following command:

    $ az network dns record-set list -g "${RESOURCE_GROUP}"  -z test.azure.example.com | grep console
    Copy to Clipboard Toggle word wrap
    Note

    To create records on private hosted zones on private Azure dns, you need to specify the private zone under zones which populates the provider type to azure-private-dns in the ExternalDNS container args.

16.6. Creating DNS records on GCP

You can create DNS records on GCP using External DNS Operator.

You can create DNS records on a public managed zone for GCP by using Red Hat External DNS Operator.

Procedure

  1. Check the user. The user must have access to the kube-system namespace. If you don’t have the credentials, as you can fetch the credentials from the kube-system namespace to use the cloud provider client:

    $ oc whoami
    Copy to Clipboard Toggle word wrap

    Example output

    system:admin
    Copy to Clipboard Toggle word wrap

  2. Copy the value of service_account.json in gcp-credentials secret in a file encoded-gcloud.json by running the following command:

    $ oc get secret gcp-credentials -n kube-system --template='{{$v := index .data "service_account.json"}}{{$v}}' | base64 -d - > decoded-gcloud.json
    Copy to Clipboard Toggle word wrap
  3. Export Google credentials:

    $ export GOOGLE_CREDENTIALS=decoded-gcloud.json
    Copy to Clipboard Toggle word wrap
  4. Activate your account by using the following command:

    $ gcloud auth activate-service-account  <client_email as per decoded-gcloud.json> --key-file=decoded-gcloud.json
    Copy to Clipboard Toggle word wrap
  5. Set your project:

    $ gcloud config set project <project_id as per decoded-gcloud.json>
    Copy to Clipboard Toggle word wrap
  6. Get the routes to check the domain:

    $ oc get routes --all-namespaces | grep console
    Copy to Clipboard Toggle word wrap

    Example output

    openshift-console          console             console-openshift-console.apps.test.gcp.example.com                       console             https   reencrypt/Redirect     None
    openshift-console          downloads           downloads-openshift-console.apps.test.gcp.example.com                     downloads           http    edge/Redirect          None
    Copy to Clipboard Toggle word wrap

  7. Get the list of managed zones to find the zone which corresponds to the previously found route’s domain:

    $ gcloud dns managed-zones list | grep test.gcp.example.com
    qe-cvs4g-private-zone test.gcp.example.com
    Copy to Clipboard Toggle word wrap
  8. Create ExternalDNS resource for route source:

    apiVersion: externaldns.olm.openshift.io/v1beta1
    kind: ExternalDNS
    metadata:
      name: sample-gcp 
    1
    
    spec:
      domains:
        - filterType: Include 
    2
    
          matchType: Exact 
    3
    
          name: test.gcp.example.com 
    4
    
      provider:
        type: GCP 
    5
    
      source:
        openshiftRouteOptions: 
    6
    
          routerName: default 
    7
    
        type: OpenShiftRoute 
    8
    
    EOF
    Copy to Clipboard Toggle word wrap
    1
    Specifies the name of External DNS CR.
    2
    By default all hosted zones are selected as potential targets. You can include a hosted zone that you need.
    3
    The matching of the target zone’s domain has to be exact (as opposed to regular expression match).
    4
    Specify the exact domain of the zone you want to update. The hostname of the routes must be subdomains of the specified domain.
    5
    Defines Google Cloud DNS provider.
    6
    You can define options for the source of DNS records.
    7
    If the source is OpenShiftRoute then you can pass the OpenShift Ingress Controller name. External DNS selects the canonical hostname of that router as the target while creating CNAME record.
    8
    Defines OpenShift route resource as the source for the DNS records which gets created in the previously specified DNS provider.
  9. Check the records created for OCP routes using the following command:

    $ gcloud dns record-sets list --zone=qe-cvs4g-private-zone | grep console
    Copy to Clipboard Toggle word wrap

16.7. Creating DNS records on Infoblox

You can create DNS records on Infoblox using the Red Hat External DNS Operator.

You can create DNS records on a public DNS zone on Infoblox by using the Red Hat External DNS Operator.

Prerequisites

  • You have access to the OpenShift CLI (oc).
  • You have access to the Infoblox UI.

Procedure

  1. Create a secret object with Infoblox credentials by running the following command:

    $ oc -n external-dns-operator create secret generic infoblox-credentials --from-literal=EXTERNAL_DNS_INFOBLOX_WAPI_USERNAME=<infoblox_username> --from-literal=EXTERNAL_DNS_INFOBLOX_WAPI_PASSWORD=<infoblox_password>
    Copy to Clipboard Toggle word wrap
  2. Get the routes objects to check your cluster domain by running the following command:

    $ oc get routes --all-namespaces | grep console
    Copy to Clipboard Toggle word wrap

    Example Output

    openshift-console          console             console-openshift-console.apps.test.example.com                       console             https   reencrypt/Redirect     None
    openshift-console          downloads           downloads-openshift-console.apps.test.example.com                     downloads           http    edge/Redirect          None
    Copy to Clipboard Toggle word wrap

  3. Create an ExternalDNS resource YAML file, for example, sample-infoblox.yaml, as follows:

    apiVersion: externaldns.olm.openshift.io/v1beta1
    kind: ExternalDNS
    metadata:
      name: sample-infoblox
    spec:
      provider:
        type: Infoblox
        infoblox:
          credentials:
            name: infoblox-credentials
          gridHost: ${INFOBLOX_GRID_PUBLIC_IP}
          wapiPort: 443
          wapiVersion: "2.3.1"
      domains:
      - filterType: Include
        matchType: Exact
        name: test.example.com
      source:
        type: OpenShiftRoute
        openshiftRouteOptions:
          routerName: default
    Copy to Clipboard Toggle word wrap
  4. Create an ExternalDNS resource on Infoblox by running the following command:

    $ oc create -f sample-infoblox.yaml
    Copy to Clipboard Toggle word wrap
  5. From the Infoblox UI, check the DNS records created for console routes:

    1. Click Data ManagementDNSZones.
    2. Select the zone name.

You can configure the cluster-wide proxy in the External DNS Operator. After configuring the cluster-wide proxy in the External DNS Operator, Operator Lifecycle Manager (OLM) automatically updates all the deployments of the Operators with the environment variables such as HTTP_PROXY, HTTPS_PROXY, and NO_PROXY.

You can configure the External DNS Operator to trust the certificate authority of the cluster-wide proxy.

Procedure

  1. Create the config map to contain the CA bundle in the external-dns-operator namespace by running the following command:

    $ oc -n external-dns-operator create configmap trusted-ca
    Copy to Clipboard Toggle word wrap
  2. To inject the trusted CA bundle into the config map, add the config.openshift.io/inject-trusted-cabundle=true label to the config map by running the following command:

    $ oc -n external-dns-operator label cm trusted-ca config.openshift.io/inject-trusted-cabundle=true
    Copy to Clipboard Toggle word wrap
  3. Update the subscription of the External DNS Operator by running the following command:

    $ oc -n external-dns-operator patch subscription external-dns-operator --type='json' -p='[{"op": "add", "path": "/spec/config", "value":{"env":[{"name":"TRUSTED_CA_CONFIGMAP_NAME","value":"trusted-ca"}]}}]'
    Copy to Clipboard Toggle word wrap

Verification

  • After the deployment of the External DNS Operator is completed, verify that the trusted CA environment variable is added to the external-dns-operator deployment by running the following command:

    $ oc -n external-dns-operator exec deploy/external-dns-operator -c external-dns-operator -- printenv TRUSTED_CA_CONFIGMAP_NAME
    Copy to Clipboard Toggle word wrap

    Example output

    trusted-ca
    Copy to Clipboard Toggle word wrap

Chapter 17. Network policy

17.1. About network policy

As a cluster administrator, you can define network policies that restrict traffic to pods in your cluster.

17.1.1. About network policy

In a cluster using a Kubernetes Container Network Interface (CNI) plugin that supports Kubernetes network policy, network isolation is controlled entirely by NetworkPolicy objects. In OpenShift Container Platform 4.11, OpenShift SDN supports using network policy in its default network isolation mode.

Warning

Network policy does not apply to the host network namespace. Pods with host networking enabled are unaffected by network policy rules. However, pods connecting to the host-networked pods might be affected by the network policy rules.

Network policies cannot block traffic from localhost or from their resident nodes.

By default, all pods in a project are accessible from other pods and network endpoints. To isolate one or more pods in a project, you can create NetworkPolicy objects in that project to indicate the allowed incoming connections. Project administrators can create and delete NetworkPolicy objects within their own project.

If a pod is matched by selectors in one or more NetworkPolicy objects, then the pod will accept only connections that are allowed by at least one of those NetworkPolicy objects. A pod that is not selected by any NetworkPolicy objects is fully accessible.

A network policy applies to only the TCP, UDP, ICMP, and SCTP protocols. Other protocols are not affected.

The following example NetworkPolicy objects demonstrate supporting different scenarios:

  • Deny all traffic:

    To make a project deny by default, add a NetworkPolicy object that matches all pods but accepts no traffic:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: deny-by-default
    spec:
      podSelector: {}
      ingress: []
    Copy to Clipboard Toggle word wrap
  • Only allow connections from the OpenShift Container Platform Ingress Controller:

    To make a project allow only connections from the OpenShift Container Platform Ingress Controller, add the following NetworkPolicy object.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-from-openshift-ingress
    spec:
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress
      podSelector: {}
      policyTypes:
      - Ingress
    Copy to Clipboard Toggle word wrap
  • Only accept connections from pods within a project:

    To make pods accept connections from other pods in the same project, but reject all other connections from pods in other projects, add the following NetworkPolicy object:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-same-namespace
    spec:
      podSelector: {}
      ingress:
      - from:
        - podSelector: {}
    Copy to Clipboard Toggle word wrap
  • Only allow HTTP and HTTPS traffic based on pod labels:

    To enable only HTTP and HTTPS access to the pods with a specific label (role=frontend in following example), add a NetworkPolicy object similar to the following:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-http-and-https
    spec:
      podSelector:
        matchLabels:
          role: frontend
      ingress:
      - ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443
    Copy to Clipboard Toggle word wrap
  • Accept connections by using both namespace and pod selectors:

    To match network traffic by combining namespace and pod selectors, you can use a NetworkPolicy object similar to the following:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-pod-and-namespace-both
    spec:
      podSelector:
        matchLabels:
          name: test-pods
      ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                project: project_name
            podSelector:
              matchLabels:
                name: test-pods
    Copy to Clipboard Toggle word wrap

NetworkPolicy objects are additive, which means you can combine multiple NetworkPolicy objects together to satisfy complex network requirements.

For example, for the NetworkPolicy objects defined in previous samples, you can define both allow-same-namespace and allow-http-and-https policies within the same project. Thus allowing the pods with the label role=frontend, to accept any connection allowed by each policy. That is, connections on any port from pods in the same namespace, and connections on ports 80 and 443 from pods in any namespace.

Use the following NetworkPolicy to allow external traffic regardless of the router configuration:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-router
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
1

  podSelector: {}
  policyTypes:
  - Ingress
Copy to Clipboard Toggle word wrap
1
policy-group.network.openshift.io/ingress:"" label supports both OpenShift-SDN and OVN-Kubernetes.

Add the following allow-from-hostnetwork NetworkPolicy object to direct traffic from the host network pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-hostnetwork
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/host-network: ""
  podSelector: {}
  policyTypes:
  - Ingress
Copy to Clipboard Toggle word wrap

17.1.2. Optimizations for network policy

Use a network policy to isolate pods that are differentiated from one another by labels within a namespace.

Note

The guidelines for efficient use of network policy rules applies to only the OpenShift SDN cluster network provider.

It is inefficient to apply NetworkPolicy objects to large numbers of individual pods in a single namespace. Pod labels do not exist at the IP address level, so a network policy generates a separate Open vSwitch (OVS) flow rule for every possible link between every pod selected with a podSelector.

For example, if the spec podSelector and the ingress podSelector within a NetworkPolicy object each match 200 pods, then 40,000 (200*200) OVS flow rules are generated. This might slow down a node.

When designing your network policy, refer to the following guidelines:

  • Reduce the number of OVS flow rules by using namespaces to contain groups of pods that need to be isolated.

    NetworkPolicy objects that select a whole namespace, by using the namespaceSelector or an empty podSelector, generate only a single OVS flow rule that matches the VXLAN virtual network ID (VNID) of the namespace.

  • Keep the pods that do not need to be isolated in their original namespace, and move the pods that require isolation into one or more different namespaces.
  • Create additional targeted cross-namespace network policies to allow the specific traffic that you do want to allow from the isolated pods.

17.1.3. Next steps

17.2. Logging network policy events

As a cluster administrator, you can configure network policy audit logging for your cluster and enable logging for one or more namespaces.

Note

Audit logging of network policies is available for only the OVN-Kubernetes cluster network provider.

17.2.1. Network policy audit logging

The OVN-Kubernetes cluster network provider uses Open Virtual Network (OVN) ACLs to manage network policy. Audit logging exposes allow and deny ACL events.

You can configure the destination for network policy audit logs, such as a syslog server or a UNIX domain socket. Regardless of any additional configuration, an audit log is always saved to /var/log/ovn/acl-audit-log.log on each OVN-Kubernetes pod in the cluster.

Network policy audit logging is enabled per namespace by annotating the namespace with the k8s.ovn.org/acl-logging key as in the following example:

Example namespace annotation

kind: Namespace
apiVersion: v1
metadata:
  name: example1
  annotations:
    k8s.ovn.org/acl-logging: |-
      {
        "deny": "info",
        "allow": "info"
      }
Copy to Clipboard Toggle word wrap

The logging format is compatible with syslog as defined by RFC5424. The syslog facility is configurable and defaults to local0. An example log entry might resemble the following:

Example ACL deny log entry

2021-06-13T19:33:11.590Z|00005|acl_log(ovn_pinctrl0)|INFO|name="verify-audit-logging_deny-all", verdict=drop, severity=alert: icmp,vlan_tci=0x0000,dl_src=0a:58:0a:80:02:39,dl_dst=0a:58:0a:80:02:37,nw_src=10.128.2.57,nw_dst=10.128.2.55,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
Copy to Clipboard Toggle word wrap

The following table describes namespace annotation values:

Expand
Table 17.1. Network policy audit logging namespace annotation
AnnotationValue

k8s.ovn.org/acl-logging

You must specify at least one of allow, deny, or both to enable network policy audit logging for a namespace.

deny
Optional: Specify alert, warning, notice, info, or debug.
allow
Optional: Specify alert, warning, notice, info, or debug.

17.2.2. Network policy audit configuration

The configuration for audit logging is specified as part of the OVN-Kubernetes cluster network provider configuration. The following YAML illustrates default values for network policy audit logging feature.

Audit logging configuration

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    ovnKubernetesConfig:
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        rateLimit: 20
        syslogFacility: local0
Copy to Clipboard Toggle word wrap

The following table describes the configuration fields for network policy audit logging.

Expand
Table 17.2. policyAuditConfig object
FieldTypeDescription

rateLimit

integer

The maximum number of messages to generate every second per node. The default value is 20 messages per second.

maxFileSize

integer

The maximum size for the audit log in bytes. The default value is 50000000 or 50 MB.

destination

string

One of the following additional audit log targets:

libc
The libc syslog() function of the journald process on the host.
udp:<host>:<port>
A syslog server. Replace <host>:<port> with the host and port of the syslog server.
unix:<file>
A Unix Domain Socket file specified by <file>.
null
Do not send the audit logs to any additional target.

syslogFacility

string

The syslog facility, such as kern, as defined by RFC5424. The default value is local0.

As a cluster administrator, you can customize network policy audit logging for your cluster.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.

Procedure

  • To customize the network policy audit logging configuration, enter the following command:

    $ oc edit network.operator.openshift.io/cluster
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively customize and apply the following YAML to configure audit logging:

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      defaultNetwork:
        ovnKubernetesConfig:
          policyAuditConfig:
            destination: "null"
            maxFileSize: 50
            rateLimit: 20
            syslogFacility: local0
    Copy to Clipboard Toggle word wrap

Verification

  1. To create a namespace with network policies complete the following steps:

    1. Create a namespace for verification:

      $ cat <<EOF| oc create -f -
      kind: Namespace
      apiVersion: v1
      metadata:
        name: verify-audit-logging
        annotations:
          k8s.ovn.org/acl-logging: '{ "deny": "alert", "allow": "alert" }'
      EOF
      Copy to Clipboard Toggle word wrap

      Example output

      namespace/verify-audit-logging created
      Copy to Clipboard Toggle word wrap

    2. Enable audit logging:

      $ oc annotate namespace verify-audit-logging k8s.ovn.org/acl-logging='{ "deny": "alert", "allow": "alert" }'
      Copy to Clipboard Toggle word wrap
      namespace/verify-audit-logging annotated
      Copy to Clipboard Toggle word wrap
    3. Create network policies for the namespace:

      $ cat <<EOF| oc create -n verify-audit-logging -f -
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: deny-all
      spec:
        podSelector:
          matchLabels:
        policyTypes:
        - Ingress
        - Egress
      ---
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-same-namespace
      spec:
        podSelector: {}
        policyTypes:
         - Ingress
         - Egress
        ingress:
          - from:
              - podSelector: {}
        egress:
          - to:
             - namespaceSelector:
                matchLabels:
                  namespace: verify-audit-logging
      EOF
      Copy to Clipboard Toggle word wrap

      Example output

      networkpolicy.networking.k8s.io/deny-all created
      networkpolicy.networking.k8s.io/allow-from-same-namespace created
      Copy to Clipboard Toggle word wrap

  2. Create a pod for source traffic in the default namespace:

    $ cat <<EOF| oc create -n default -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: client
    spec:
      containers:
        - name: client
          image: registry.access.redhat.com/rhel7/rhel-tools
          command: ["/bin/sh", "-c"]
          args:
            ["sleep inf"]
    EOF
    Copy to Clipboard Toggle word wrap
  3. Create two pods in the verify-audit-logging namespace:

    $ for name in client server; do
    cat <<EOF| oc create -n verify-audit-logging -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: ${name}
    spec:
      containers:
        - name: ${name}
          image: registry.access.redhat.com/rhel7/rhel-tools
          command: ["/bin/sh", "-c"]
          args:
            ["sleep inf"]
    EOF
    done
    Copy to Clipboard Toggle word wrap

    Example output

    pod/client created
    pod/server created
    Copy to Clipboard Toggle word wrap

  4. To generate traffic and produce network policy audit log entries, complete the following steps:

    1. Obtain the IP address for pod named server in the verify-audit-logging namespace:

      $ POD_IP=$(oc get pods server -n verify-audit-logging -o jsonpath='{.status.podIP}')
      Copy to Clipboard Toggle word wrap
    2. Ping the IP address from the previous command from the pod named client in the default namespace and confirm that all packets are dropped:

      $ oc exec -it client -n default -- /bin/ping -c 2 $POD_IP
      Copy to Clipboard Toggle word wrap

      Example output

      PING 10.128.2.55 (10.128.2.55) 56(84) bytes of data.
      
      --- 10.128.2.55 ping statistics ---
      2 packets transmitted, 0 received, 100% packet loss, time 2041ms
      Copy to Clipboard Toggle word wrap

    3. Ping the IP address saved in the POD_IP shell environment variable from the pod named client in the verify-audit-logging namespace and confirm that all packets are allowed:

      $ oc exec -it client -n verify-audit-logging -- /bin/ping -c 2 $POD_IP
      Copy to Clipboard Toggle word wrap

      Example output

      PING 10.128.0.86 (10.128.0.86) 56(84) bytes of data.
      64 bytes from 10.128.0.86: icmp_seq=1 ttl=64 time=2.21 ms
      64 bytes from 10.128.0.86: icmp_seq=2 ttl=64 time=0.440 ms
      
      --- 10.128.0.86 ping statistics ---
      2 packets transmitted, 2 received, 0% packet loss, time 1001ms
      rtt min/avg/max/mdev = 0.440/1.329/2.219/0.890 ms
      Copy to Clipboard Toggle word wrap

  5. Display the latest entries in the network policy audit log:

    $ for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node --no-headers=true | awk '{ print $1 }') ; do
        oc exec -it $pod -n openshift-ovn-kubernetes -- tail -4 /var/log/ovn/acl-audit-log.log
      done
    Copy to Clipboard Toggle word wrap

    Example output

    Defaulting container name to ovn-controller.
    Use 'oc describe pod/ovnkube-node-hdb8v -n openshift-ovn-kubernetes' to see all of the containers in this pod.
    2021-06-13T19:33:11.590Z|00005|acl_log(ovn_pinctrl0)|INFO|name="verify-audit-logging_deny-all", verdict=drop, severity=alert: icmp,vlan_tci=0x0000,dl_src=0a:58:0a:80:02:39,dl_dst=0a:58:0a:80:02:37,nw_src=10.128.2.57,nw_dst=10.128.2.55,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
    2021-06-13T19:33:12.614Z|00006|acl_log(ovn_pinctrl0)|INFO|name="verify-audit-logging_deny-all", verdict=drop, severity=alert: icmp,vlan_tci=0x0000,dl_src=0a:58:0a:80:02:39,dl_dst=0a:58:0a:80:02:37,nw_src=10.128.2.57,nw_dst=10.128.2.55,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
    2021-06-13T19:44:10.037Z|00007|acl_log(ovn_pinctrl0)|INFO|name="verify-audit-logging_allow-from-same-namespace_0", verdict=allow, severity=alert: icmp,vlan_tci=0x0000,dl_src=0a:58:0a:80:02:3b,dl_dst=0a:58:0a:80:02:3a,nw_src=10.128.2.59,nw_dst=10.128.2.58,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
    2021-06-13T19:44:11.037Z|00008|acl_log(ovn_pinctrl0)|INFO|name="verify-audit-logging_allow-from-same-namespace_0", verdict=allow, severity=alert: icmp,vlan_tci=0x0000,dl_src=0a:58:0a:80:02:3b,dl_dst=0a:58:0a:80:02:3a,nw_src=10.128.2.59,nw_dst=10.128.2.58,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can enable network policy audit logging for a namespace.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.

Procedure

  • To enable network policy audit logging for a namespace, enter the following command:

    $ oc annotate namespace <namespace> \
      k8s.ovn.org/acl-logging='{ "deny": "alert", "allow": "notice" }'
    Copy to Clipboard Toggle word wrap

    where:

    <namespace>
    Specifies the name of the namespace.
    Tip

    You can alternatively apply the following YAML to enable audit logging:

    kind: Namespace
    apiVersion: v1
    metadata:
      name: <namespace>
      annotations:
        k8s.ovn.org/acl-logging: |-
          {
            "deny": "alert",
            "allow": "notice"
          }
    Copy to Clipboard Toggle word wrap

    Example output

    namespace/verify-audit-logging annotated
    Copy to Clipboard Toggle word wrap

Verification

  • Display the latest entries in the network policy audit log:

    $ for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node --no-headers=true | awk '{ print $1 }') ; do
        oc exec -it $pod -n openshift-ovn-kubernetes -- tail -4 /var/log/ovn/acl-audit-log.log
      done
    Copy to Clipboard Toggle word wrap

    Example output

    2021-06-13T19:33:11.590Z|00005|acl_log(ovn_pinctrl0)|INFO|name="verify-audit-logging_deny-all", verdict=drop, severity=alert: icmp,vlan_tci=0x0000,dl_src=0a:58:0a:80:02:39,dl_dst=0a:58:0a:80:02:37,nw_src=10.128.2.57,nw_dst=10.128.2.55,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can disable network policy audit logging for a namespace.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.

Procedure

  • To disable network policy audit logging for a namespace, enter the following command:

    $ oc annotate --overwrite namespace <namespace> k8s.ovn.org/acl-logging-
    Copy to Clipboard Toggle word wrap

    where:

    <namespace>
    Specifies the name of the namespace.
    Tip

    You can alternatively apply the following YAML to disable audit logging:

    kind: Namespace
    apiVersion: v1
    metadata:
      name: <namespace>
      annotations:
        k8s.ovn.org/acl-logging: null
    Copy to Clipboard Toggle word wrap

    Example output

    namespace/verify-audit-logging annotated
    Copy to Clipboard Toggle word wrap

17.3. Creating a network policy

As a user with the admin role, you can create a network policy for a namespace.

17.3.1. Example NetworkPolicy object

The following annotates an example NetworkPolicy object:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-27107 
1

spec:
  podSelector: 
2

    matchLabels:
      app: mongodb
  ingress:
  - from:
    - podSelector: 
3

        matchLabels:
          app: app
    ports: 
4

    - protocol: TCP
      port: 27017
Copy to Clipboard Toggle word wrap
1
The name of the NetworkPolicy object.
2
A selector that describes the pods to which the policy applies. The policy object can only select pods in the project that defines the NetworkPolicy object.
3
A selector that matches the pods from which the policy object allows ingress traffic. The selector matches pods in the same namespace as the NetworkPolicy.
4
A list of one or more destination ports on which to accept traffic.

17.3.2. Creating a network policy using the CLI

To define granular rules describing ingress or egress network traffic allowed for namespaces in your cluster, you can create a network policy.

Note

If you log in with a user with the cluster-admin role, then you can create a network policy in any namespace in the cluster.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with admin privileges.
  • You are working in the namespace that the network policy applies to.

Procedure

  1. Create a policy rule:

    1. Create a <policy_name>.yaml file:

      $ touch <policy_name>.yaml
      Copy to Clipboard Toggle word wrap

      where:

      <policy_name>
      Specifies the network policy file name.
    2. Define a network policy in the file that you just created, such as in the following examples:

      Deny ingress from all pods in all namespaces

      kind: NetworkPolicy
      apiVersion: networking.k8s.io/v1
      metadata:
        name: deny-by-default
      spec:
        podSelector:
        ingress: []
      Copy to Clipboard Toggle word wrap

      Allow ingress from all pods in the same namespace

      kind: NetworkPolicy
      apiVersion: networking.k8s.io/v1
      metadata:
        name: allow-same-namespace
      spec:
        podSelector:
        ingress:
        - from:
          - podSelector: {}
      Copy to Clipboard Toggle word wrap

  2. To create the network policy object, enter the following command:

    $ oc apply -f <policy_name>.yaml -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <policy_name>
    Specifies the network policy file name.
    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.

    Example output

    networkpolicy.networking.k8s.io/default-deny created
    Copy to Clipboard Toggle word wrap

Note

If you log in to the web console with cluster-admin privileges, you have a choice of creating a network policy in any namespace in the cluster directly in YAML or from a form in the web console.

17.4. Viewing a network policy

As a user with the admin role, you can view a network policy for a namespace.

17.4.1. Example NetworkPolicy object

The following annotates an example NetworkPolicy object:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-27107 
1

spec:
  podSelector: 
2

    matchLabels:
      app: mongodb
  ingress:
  - from:
    - podSelector: 
3

        matchLabels:
          app: app
    ports: 
4

    - protocol: TCP
      port: 27017
Copy to Clipboard Toggle word wrap
1
The name of the NetworkPolicy object.
2
A selector that describes the pods to which the policy applies. The policy object can only select pods in the project that defines the NetworkPolicy object.
3
A selector that matches the pods from which the policy object allows ingress traffic. The selector matches pods in the same namespace as the NetworkPolicy.
4
A list of one or more destination ports on which to accept traffic.

17.4.2. Viewing network policies using the CLI

You can examine the network policies in a namespace.

Note

If you log in with a user with the cluster-admin role, then you can view any network policy in the cluster.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with admin privileges.
  • You are working in the namespace where the network policy exists.

Procedure

  • List network policies in a namespace:

    • To view network policy objects defined in a namespace, enter the following command:

      $ oc get networkpolicy
      Copy to Clipboard Toggle word wrap
    • Optional: To examine a specific network policy, enter the following command:

      $ oc describe networkpolicy <policy_name> -n <namespace>
      Copy to Clipboard Toggle word wrap

      where:

      <policy_name>
      Specifies the name of the network policy to inspect.
      <namespace>
      Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.

      For example:

      $ oc describe networkpolicy allow-same-namespace
      Copy to Clipboard Toggle word wrap

      Output for oc describe command

      Name:         allow-same-namespace
      Namespace:    ns1
      Created on:   2021-05-24 22:28:56 -0400 EDT
      Labels:       <none>
      Annotations:  <none>
      Spec:
        PodSelector:     <none> (Allowing the specific traffic to all pods in this namespace)
        Allowing ingress traffic:
          To Port: <any> (traffic allowed to all ports)
          From:
            PodSelector: <none>
        Not affecting egress traffic
        Policy Types: Ingress
      Copy to Clipboard Toggle word wrap

Note

If you log in to the web console with cluster-admin privileges, you have a choice of viewing a network policy in any namespace in the cluster directly in YAML or from a form in the web console.

17.5. Editing a network policy

As a user with the admin role, you can edit an existing network policy for a namespace.

17.5.1. Editing a network policy

You can edit a network policy in a namespace.

Note

If you log in with a user with the cluster-admin role, then you can edit a network policy in any namespace in the cluster.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with admin privileges.
  • You are working in the namespace where the network policy exists.

Procedure

  1. Optional: To list the network policy objects in a namespace, enter the following command:

    $ oc get networkpolicy
    Copy to Clipboard Toggle word wrap

    where:

    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
  2. Edit the network policy object.

    • If you saved the network policy definition in a file, edit the file and make any necessary changes, and then enter the following command.

      $ oc apply -n <namespace> -f <policy_file>.yaml
      Copy to Clipboard Toggle word wrap

      where:

      <namespace>
      Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
      <policy_file>
      Specifies the name of the file containing the network policy.
    • If you need to update the network policy object directly, enter the following command:

      $ oc edit networkpolicy <policy_name> -n <namespace>
      Copy to Clipboard Toggle word wrap

      where:

      <policy_name>
      Specifies the name of the network policy.
      <namespace>
      Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
  3. Confirm that the network policy object is updated.

    $ oc describe networkpolicy <policy_name> -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <policy_name>
    Specifies the name of the network policy.
    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
Note

If you log in to the web console with cluster-admin privileges, you have a choice of editing a network policy in any namespace in the cluster directly in YAML or from the policy in the web console through the Actions menu.

17.5.2. Example NetworkPolicy object

The following annotates an example NetworkPolicy object:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-27107 
1

spec:
  podSelector: 
2

    matchLabels:
      app: mongodb
  ingress:
  - from:
    - podSelector: 
3

        matchLabels:
          app: app
    ports: 
4

    - protocol: TCP
      port: 27017
Copy to Clipboard Toggle word wrap
1
The name of the NetworkPolicy object.
2
A selector that describes the pods to which the policy applies. The policy object can only select pods in the project that defines the NetworkPolicy object.
3
A selector that matches the pods from which the policy object allows ingress traffic. The selector matches pods in the same namespace as the NetworkPolicy.
4
A list of one or more destination ports on which to accept traffic.

17.6. Deleting a network policy

As a user with the admin role, you can delete a network policy from a namespace.

17.6.1. Deleting a network policy using the CLI

You can delete a network policy in a namespace.

Note

If you log in with a user with the cluster-admin role, then you can delete any network policy in the cluster.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with admin privileges.
  • You are working in the namespace where the network policy exists.

Procedure

  • To delete a network policy object, enter the following command:

    $ oc delete networkpolicy <policy_name> -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <policy_name>
    Specifies the name of the network policy.
    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.

    Example output

    networkpolicy.networking.k8s.io/default-deny deleted
    Copy to Clipboard Toggle word wrap

Note

If you log in to the web console with cluster-admin privileges, you have a choice of deleting a network policy in any namespace in the cluster directly in YAML or from the policy in the web console through the Actions menu.

As a cluster administrator, you can modify the new project template to automatically include network policies when you create a new project. If you do not yet have a customized template for new projects, you must first create one.

17.7.1. Modifying the template for new projects

As a cluster administrator, you can modify the default project template so that new projects are created using your custom requirements.

To create your own custom project template:

Procedure

  1. Log in as a user with cluster-admin privileges.
  2. Generate the default project template:

    $ oc adm create-bootstrap-project-template -o yaml > template.yaml
    Copy to Clipboard Toggle word wrap
  3. Use a text editor to modify the generated template.yaml file by adding objects or modifying existing objects.
  4. The project template must be created in the openshift-config namespace. Load your modified template:

    $ oc create -f template.yaml -n openshift-config
    Copy to Clipboard Toggle word wrap
  5. Edit the project configuration resource using the web console or CLI.

    • Using the web console:

      1. Navigate to the AdministrationCluster Settings page.
      2. Click Configuration to view all configuration resources.
      3. Find the entry for Project and click Edit YAML.
    • Using the CLI:

      1. Edit the project.config.openshift.io/cluster resource:

        $ oc edit project.config.openshift.io/cluster
        Copy to Clipboard Toggle word wrap
  6. Update the spec section to include the projectRequestTemplate and name parameters, and set the name of your uploaded project template. The default name is project-request.

    Project configuration resource with custom project template

    apiVersion: config.openshift.io/v1
    kind: Project
    metadata:
      ...
    spec:
      projectRequestTemplate:
        name: <template_name>
    Copy to Clipboard Toggle word wrap

  7. After you save your changes, create a new project to verify that your changes were successfully applied.

As a cluster administrator, you can add network policies to the default template for new projects. OpenShift Container Platform will automatically create all the NetworkPolicy objects specified in the template in the project.

Prerequisites

  • Your cluster uses a default CNI network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You must log in to the cluster with a user with cluster-admin privileges.
  • You must have created a custom default project template for new projects.

Procedure

  1. Edit the default template for a new project by running the following command:

    $ oc edit template <project_template> -n openshift-config
    Copy to Clipboard Toggle word wrap

    Replace <project_template> with the name of the default template that you configured for your cluster. The default template name is project-request.

  2. In the template, add each NetworkPolicy object as an element to the objects parameter. The objects parameter accepts a collection of one or more objects.

    In the following example, the objects parameter collection includes several NetworkPolicy objects.

    objects:
    - apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-same-namespace
      spec:
        podSelector: {}
        ingress:
        - from:
          - podSelector: {}
    - apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-openshift-ingress
      spec:
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                network.openshift.io/policy-group: ingress
        podSelector: {}
        policyTypes:
        - Ingress
    - apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-kube-apiserver-operator
      spec:
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                kubernetes.io/metadata.name: openshift-kube-apiserver-operator
            podSelector:
              matchLabels:
                app: kube-apiserver-operator
        policyTypes:
        - Ingress
    ...
    Copy to Clipboard Toggle word wrap
  3. Optional: Create a new project to confirm that your network policy objects are created successfully by running the following commands:

    1. Create a new project:

      $ oc new-project <project> 
      1
      Copy to Clipboard Toggle word wrap
      1
      Replace <project> with the name for the project you are creating.
    2. Confirm that the network policy objects in the new project template exist in the new project:

      $ oc get networkpolicy
      NAME                           POD-SELECTOR   AGE
      allow-from-openshift-ingress   <none>         7s
      allow-from-same-namespace      <none>         7s
      Copy to Clipboard Toggle word wrap

As a cluster administrator, you can configure your network policies to provide multitenant network isolation.

Note

If you are using the OpenShift SDN cluster network provider, configuring network policies as described in this section provides network isolation similar to multitenant mode but with network policy mode set.

You can configure your project to isolate it from pods and services in other project namespaces.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with admin privileges.

Procedure

  1. Create the following NetworkPolicy objects:

    1. A policy named allow-from-openshift-ingress.

      $ cat << EOF| oc create -f -
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-openshift-ingress
      spec:
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                policy-group.network.openshift.io/ingress: ""
        podSelector: {}
        policyTypes:
        - Ingress
      EOF
      Copy to Clipboard Toggle word wrap
      Note

      policy-group.network.openshift.io/ingress: "" is the preferred namespace selector label for OpenShift SDN. You can use the network.openshift.io/policy-group: ingress namespace selector label, but this is a legacy label.

    2. A policy named allow-from-openshift-monitoring:

      $ cat << EOF| oc create -f -
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-openshift-monitoring
      spec:
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                network.openshift.io/policy-group: monitoring
        podSelector: {}
        policyTypes:
        - Ingress
      EOF
      Copy to Clipboard Toggle word wrap
    3. A policy named allow-same-namespace:

      $ cat << EOF| oc create -f -
      kind: NetworkPolicy
      apiVersion: networking.k8s.io/v1
      metadata:
        name: allow-same-namespace
      spec:
        podSelector:
        ingress:
        - from:
          - podSelector: {}
      EOF
      Copy to Clipboard Toggle word wrap
    4. A policy named allow-from-kube-apiserver-operator:

      $ cat << EOF| oc create -f -
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-kube-apiserver-operator
      spec:
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                kubernetes.io/metadata.name: openshift-kube-apiserver-operator
            podSelector:
              matchLabels:
                app: kube-apiserver-operator
        policyTypes:
        - Ingress
      EOF
      Copy to Clipboard Toggle word wrap

      For more details, see New kube-apiserver-operator webhook controller validating health of webhook.

  2. Optional: To confirm that the network policies exist in your current project, enter the following command:

    $ oc describe networkpolicy
    Copy to Clipboard Toggle word wrap

    Example output

    Name:         allow-from-openshift-ingress
    Namespace:    example1
    Created on:   2020-06-09 00:28:17 -0400 EDT
    Labels:       <none>
    Annotations:  <none>
    Spec:
      PodSelector:     <none> (Allowing the specific traffic to all pods in this namespace)
      Allowing ingress traffic:
        To Port: <any> (traffic allowed to all ports)
        From:
          NamespaceSelector: network.openshift.io/policy-group: ingress
      Not affecting egress traffic
      Policy Types: Ingress
    
    
    Name:         allow-from-openshift-monitoring
    Namespace:    example1
    Created on:   2020-06-09 00:29:57 -0400 EDT
    Labels:       <none>
    Annotations:  <none>
    Spec:
      PodSelector:     <none> (Allowing the specific traffic to all pods in this namespace)
      Allowing ingress traffic:
        To Port: <any> (traffic allowed to all ports)
        From:
          NamespaceSelector: network.openshift.io/policy-group: monitoring
      Not affecting egress traffic
      Policy Types: Ingress
    Copy to Clipboard Toggle word wrap

17.8.2. Next steps

Chapter 18. CIDR range definitions

You must specify non-overlapping ranges for the following CIDR ranges.

Note

Machine CIDR ranges cannot be changed after creating your cluster.

Important

OVN-Kubernetes, the default network provider in OpenShift Container Platform 4.11 and later, uses the 100.64.0.0/16 IP address range internally. If your cluster uses OVN-Kubernetes, do not include the 100.64.0.0/16 IP address range in any other CIDR definitions in your cluster.

18.1. Machine CIDR

In the Machine CIDR field, you must specify the IP address range for machines or cluster nodes.

The default is 10.0.0.0/16. This range must not conflict with any connected networks.

18.2. Service CIDR

In the Service CIDR field, you must specify the IP address range for services. The range must be large enough to accommodate your workload. The address block must not overlap with any external service accessed from within the cluster. The default is 172.30.0.0/16.

18.3. Pod CIDR

In the pod CIDR field, you must specify the IP address range for pods.

The pod CIDR is the same as the clusterNetwork CIDR and the cluster CIDR. The range must be large enough to accommodate your workload. The address block must not overlap with any external service accessed from within the cluster. The default is 10.128.0.0/14. You can expand the range after cluster installation.

18.4. Host Prefix

In the Host Prefix field, you must specify the subnet prefix length assigned to pods scheduled to individual machines. The host prefix determines the pod IP address pool for each machine.

For example, if the host prefix is set to /23, each machine is assigned a /23 subnet from the pod CIDR address range. The default is /23, allowing 510 cluster nodes, and 510 pod IP addresses per node.

Chapter 19. AWS Load Balancer Operator

The AWS Load Balancer (ALB) Operator deploys and manages an instance of the aws-load-balancer-controller. You can install the ALB Operator from the OperatorHub by using OpenShift Container Platform web console or CLI.

19.1.1. AWS Load Balancer Operator considerations

Review the following limitations before installing and using the AWS Load Balancer Operator.

  • The IP traffic mode only works on AWS Elastic Kubernetes Service (EKS). The AWS Load Balancer Operator disables the IP traffic mode for the AWS Load Balancer Controller. As a result of disabling the IP traffic mode, the AWS Load Balancer Controller cannot use the pod readiness gate.
  • The AWS Load Balancer Operator adds command-line flags such as --disable-ingress-class-annotation and --disable-ingress-group-name-annotation to the AWS Load Balancer Controller. Therefore, the AWS Load Balancer Operator does not allow using the kubernetes.io/ingress.class and alb.ingress.kubernetes.io/group.name annotations in the Ingress resource.

19.1.2. AWS Load Balancer Operator

The AWS Load Balancer Operator can tag the public subnets if the kubernetes.io/role/elb tag is missing. Also, the AWS Load Balancer Operator detects the following from the underlying AWS cloud:

  • The ID of the virtual private cloud (VPC) on which the cluster hosting the Operator is deployed in.
  • Public and private subnets of the discovered VPC.

Prerequisites

  • You must have the AWS credentials secret. The credentials are used to provide subnet tagging and VPC discovery.

Procedure

  1. You can deploy the AWS Load Balancer Operator on demand from the OperatorHub, by creating a Subscription object:

    $ oc -n aws-load-balancer-operator get sub aws-load-balancer-operator --template='{{.status.installplan.name}}{{"\n"}}'
    Copy to Clipboard Toggle word wrap

    Example output

    install-zlfbt
    Copy to Clipboard Toggle word wrap

  2. Check the status of an install plan. The status of an install plan must be Complete:

    $ oc -n aws-load-balancer-operator get ip <install_plan_name> --template='{{.status.phase}}{{"\n"}}'
    Copy to Clipboard Toggle word wrap

    Example output

    Complete
    Copy to Clipboard Toggle word wrap

  3. Use the oc get command to view the Deployment status:

    $ oc get -n aws-load-balancer-operator deployment/aws-load-balancer-operator-controller-manager
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                                           READY     UP-TO-DATE   AVAILABLE   AGE
    aws-load-balancer-operator-controller-manager  1/1       1            1           23h
    Copy to Clipboard Toggle word wrap

19.1.3. AWS Load Balancer Operator logs

Use the oc logs command to view the AWS Load Balancer Operator logs.

Procedure

  • View the logs of the AWS Load Balancer Operator:

    $ oc logs -n aws-load-balancer-operator deployment/aws-load-balancer-operator-controller-manager -c manager
    Copy to Clipboard Toggle word wrap

19.2. Understanding AWS Load Balancer Operator

The AWS Load Balancer (ALB) Operator deploys and manages an instance of the aws-load-balancer-controller resource. You can install the AWS Load Balancer Operator from the OperatorHub by using OpenShift Container Platform web console or CLI.

19.2.1. Installing the AWS Load Balancer Operator

You can install the AWS Load Balancer Operator from the OperatorHub by using the OpenShift Container Platform web console.

Prerequisites

  • You have logged in to the OpenShift Container Platform web console as a user with cluster-admin permissions.
  • Your cluster is configured with AWS as the platform type and cloud provider.

Procedure

  1. Navigate to OperatorsOperatorHub in the OpenShift Container Platform web console.
  2. Select the AWS Load Balancer Operator. You can use the Filter by keyword text box or use the filter list to search for the AWS Load Balancer Operator from the list of Operators.
  3. Select the aws-load-balancer-operator namespace.
  4. Follow the instructions to prepare the Operator for installation.
  5. On the AWS Load Balancer Operator page, click Install.
  6. On the Install Operator page, select the following options:

    1. Update the channel as stable-v0.1.
    2. Installation mode as A specific namespace on the cluster.
    3. Installed Namespace as aws-load-balancer-operator. If the aws-load-balancer-operator namespace does not exist, it gets created during the Operator installation.
    4. Select Update approval as Automatic or Manual. By default, the Update approval is set to Automatic. If you select automatic updates, the Operator Lifecycle Manager (OLM) automatically upgrades the running instance of your Operator without any intervention. If you select manual updates, the OLM creates an update request. As a cluster administrator, you must then manually approve that update request to update the Operator updated to the new version.
    5. Click Install.

Verification

  • Verify that the AWS Load Balancer Operator shows the Status as Succeeded on the Installed Operators dashboard.

After installing the Operator, you can create an instance of the AWS Load Balancer Controller.

You can install only a single instance of the aws-load-balancer-controller in a cluster. You can create the AWS Load Balancer Controller by using CLI. The AWS Load Balancer(ALB) Operator reconciles only the resource with the name cluster.

Prerequisites

  • You have created the echoserver namespace.
  • You have access to the OpenShift CLI (oc).

Procedure

  1. Create an aws-load-balancer-controller resource YAML file, for example, sample-aws-lb.yaml, as follows:

    apiVersion: networking.olm.openshift.io/v1alpha1
    kind: AWSLoadBalancerController 
    1
    
    metadata:
      name: cluster 
    2
    
    spec:
      subnetTagging: Auto 
    3
    
      additionalResourceTags: 
    4
    
        example.org/cost-center: 5113232
        example.org/security-scope: staging
      ingressClass: alb 
    5
    
      config:
        replicas: 2 
    6
    
      enabledAddons: 
    7
    
        - AWSWAFv2 
    8
    Copy to Clipboard Toggle word wrap
    1
    Defines the aws-load-balancer-controller resource.
    2
    Defines the AWS Load Balancer Controller instance name. This instance name gets added as a suffix to all related resources.
    3
    Valid options are Auto and Manual. When the value is set to Auto, the Operator attempts to determine the subnets that belong to the cluster and tags them appropriately. The Operator cannot determine the role correctly if the internal subnet tags are not present on internal subnet. If you installed your cluster on user-provided infrastructure, you can manually tag the subnets with the appropriate role tags and set the subnet tagging policy to Manual.
    4
    Defines the tags used by the controller when it provisions AWS resources.
    5
    The default value for this field is alb. The Operator provisions an IngressClass resource with the same name if it does not exist.
    6
    Specifies the number of replicas of the controller.
    7
    Specifies add-ons for AWS load balancers, which get specified through annotations.
    8
    Enables the alb.ingress.kubernetes.io/wafv2-acl-arn annotation.
  2. Create a aws-load-balancer-controller resource by running the following command:

    $ oc create -f sample-aws-lb.yaml
    Copy to Clipboard Toggle word wrap
  3. After the AWS Load Balancer Controller is running, create a deployment resource:

    apiVersion: apps/v1
    kind: Deployment 
    1
    
    metadata:
      name: <echoserver> 
    2
    
      namespace: echoserver
    spec:
      selector:
        matchLabels:
          app: echoserver
      replicas: 3 
    3
    
      template:
        metadata:
          labels:
            app: echoserver
        spec:
          containers:
            - image: openshift/origin-node
              args:
                - TCP4-LISTEN:8080,reuseaddr,fork
                - EXEC:'/bin/bash -c \"printf \\\"HTTP/1.0 200 OK\r\n\r\n\\\"; sed -e \\\"/^\r/q\\\"\"'
              imagePullPolicy: Always
              name: echoserver
              ports:
                - containerPort: 8080
    Copy to Clipboard Toggle word wrap
    1
    Defines the deployment resource.
    2
    Specifies the deployment name.
    3
    Specifies the number of replicas of the deployment.
  4. Create a service resource:

    apiVersion: v1
    kind: Service 
    1
    
    metadata:
      name: <echoserver> 
    2
    
      namespace: echoserver
    spec:
      ports:
        - port: 80
          targetPort: 8080
          protocol: TCP
      type: NodePort
      selector:
        app: echoserver
    Copy to Clipboard Toggle word wrap
    1
    Defines the service resource.
    2
    Specifies the name of the service.
  5. Deploy an ALB-backed Ingress resource:

    apiVersion: networking.k8s.io/v1
    kind: Ingress 
    1
    
    metadata:
      name: <echoserver> 
    2
    
      namespace: echoserver
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: instance
    spec:
      ingressClassName: alb
      rules:
        - http:
            paths:
              - path: /
                pathType: Exact
                backend:
                  service:
                    name: <echoserver> 
    3
    
                    port:
                      number: 80
    Copy to Clipboard Toggle word wrap
    1
    Defines the ingress resource.
    2
    Specifies the name of the ingress resource.
    3
    Specifies the name of the service resource.

Verification

  • Verify the status of the Ingress resource to show the host of the provisioned AWS Load Balancer (ALB) by running the following command:

    $ HOST=$(kubectl get ingress -n echoserver echoserver -o json | jq -r '.status.loadBalancer.ingress[0].hostname')
    Copy to Clipboard Toggle word wrap
  • Verify the status of the provisioned AWS Load Balancer (ALB) host by running the following command:

    $ curl $HOST
    Copy to Clipboard Toggle word wrap

19.4. Creating multiple ingresses

You can route the traffic to different services that are part of a single domain through a single AWS Load Balancer (ALB). Each Ingress resource provides different endpoints of the domain.

You can route the traffic to multiple Ingresses through a single AWS Load Balancer (ALB) by using the CLI.

Prerequisites

  • You have an access to the OpenShift CLI (oc).

Procedure

  1. Create an IngressClassParams resource YAML file, for example, sample-single-lb-params.yaml, as follows:

    apiVersion: elbv2.k8s.aws/v1beta1 
    1
    
    kind: IngressClassParams
    metadata:
      name: single-lb-params 
    2
    
    spec:
      group:
        name: single-lb 
    3
    Copy to Clipboard Toggle word wrap
    1
    Defines the API group and version of the IngressClassParams resource.
    2
    Specifies the name of the IngressClassParams resource.
    3
    Specifies the name of the IngressGroup. All Ingresses of this class belong to this IngressGroup.
  2. Create an IngressClassParams resource by running the following command:

    $ oc create -f sample-single-lb-params.yaml
    Copy to Clipboard Toggle word wrap
  3. Create an IngressClass resource YAML file, for example, sample-single-lb-class.yaml, as follows:

    apiVersion: networking.k8s.io/v1 
    1
    
    kind: IngressClass
    metadata:
      name: single-lb 
    2
    
    spec:
      controller: ingress.k8s.aws/alb 
    3
    
      parameters:
        apiGroup: elbv2.k8s.aws 
    4
    
        kind: IngressClassParams 
    5
    
        name: single-lb-params 
    6
    Copy to Clipboard Toggle word wrap
    1
    Defines the API group and version of the IngressClass resource.
    2
    Specifies the name of the IngressClass.
    3
    Defines the controller name. ingress.k8s.aws/alb denotes that all Ingresses of this class should be managed by the aws-load-balancer-controller.
    4
    Defines the API group of the IngressClassParams resource.
    5
    Defines the resource type of the IngressClassParams resource.
    6
    Defines the name of the IngressClassParams resource.
  4. Create an IngressClass resource by running the following command:

    $ oc create -f sample-single-lb-class.yaml
    Copy to Clipboard Toggle word wrap
  5. Create an AWSLoadBalancerController resource YAML file, for example, sample-single-lb.yaml, as follows:

    apiVersion: networking.olm.openshift.io/v1
    kind: AWSLoadBalancerController
    metadata:
      name: cluster
    spec:
      subnetTagging: Auto
      ingressClass: single-lb 
    1
    Copy to Clipboard Toggle word wrap
    1
    Defines the name of the IngressClass resource.
  6. Create an AWSLoadBalancerController resource by running the following command:

    $ oc create -f sample-single-lb.yaml
    Copy to Clipboard Toggle word wrap
  7. Create an Ingress resource YAML file, for example, sample-multiple-ingress.yaml, as follows:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-1 
    1
    
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing 
    2
    
        alb.ingress.kubernetes.io/group.order: "1" 
    3
    
        alb.ingress.kubernetes.io/target-type: instance 
    4
    
    spec:
      ingressClassName: single-lb 
    5
    
      rules:
      - host: example.com 
    6
    
        http:
            paths:
            - path: /blog 
    7
    
              pathType: Prefix
              backend:
                service:
                  name: example-1 
    8
    
                  port:
                    number: 80 
    9
    
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-2
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/group.order: "2"
        alb.ingress.kubernetes.io/target-type: instance
    spec:
      ingressClassName: single-lb
      rules:
      - host: example.com
        http:
            paths:
            - path: /store
              pathType: Prefix
              backend:
                service:
                  name: example-2
                  port:
                    number: 80
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-3
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/group.order: "3"
        alb.ingress.kubernetes.io/target-type: instance
    spec:
      ingressClassName: single-lb
      rules:
      - host: example.com
        http:
            paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: example-3
                  port:
                    number: 80
    Copy to Clipboard Toggle word wrap
    1
    Specifies the name of an ingress.
    2
    Indicates the load balancer to provision in the public subnet and makes it accessible over the internet.
    3
    Specifies the order in which the rules from the Ingresses are matched when the request is received at the load balancer.
    4
    Indicates the load balancer will target OpenShift nodes to reach the service.
    5
    Specifies the Ingress Class that belongs to this ingress.
    6
    Defines the name of a domain used for request routing.
    7
    Defines the path that must route to the service.
    8
    Defines the name of the service that serves the endpoint configured in the ingress.
    9
    Defines the port on the service that serves the endpoint.
  8. Create the Ingress resources by running the following command:

    $ oc create -f sample-multiple-ingress.yaml
    Copy to Clipboard Toggle word wrap

19.5. Adding TLS termination

You can add TLS termination on the AWS Load Balancer.

You can route the traffic for the domain to pods of a service and add TLS termination on the AWS Load Balancer.

Prerequisites

  • You have an access to the OpenShift CLI (oc).

Procedure

  1. Install the Operator and create an instance of the aws-load-balancer-controller resource:

    apiVersion: networking.k8s.io/v1
    kind: AWSLoadBalancerController
    group: networking.olm.openshift.io/v1alpha1 
    1
    
    metadata:
      name: cluster
    spec:
      subnetTagging: Auto
      ingressClass: tls-termination 
    2
    Copy to Clipboard Toggle word wrap
    1 2
    Defines the name of an ingressClass resource reconciled by the AWS Load Balancer Controller. This ingressClass resource gets created if it is not present. You can add additional ingressClass values. The controller reconciles the ingressClass values if the spec.controller is set to ingress.k8s.aws/alb.
  2. Create an Ingress resource:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: <example> 
    1
    
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing 
    2
    
        alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:xxxxx 
    3
    
    spec:
      ingressClassName: tls-termination 
    4
    
      rules:
      - host: <example.com> 
    5
    
        http:
            paths:
              - path: /
                pathType: Exact
                backend:
                  service:
                    name: <example-service> 
    6
    
                    port:
                      number: 80
    Copy to Clipboard Toggle word wrap
    1
    Specifies the name of an ingress.
    2
    The controller provisions the load balancer for this Ingress resource in a public subnet so that the load balancer is reachable over the internet.
    3
    The Amazon Resource Name of the certificate that you attach to the load balancer.
    4
    Defines the ingress class name.
    5
    Defines the domain for traffic routing.
    6
    Defines the service for traffic routing.

Chapter 20. Multiple networks

20.1. Understanding multiple networks

In Kubernetes, container networking is delegated to networking plugins that implement the Container Network Interface (CNI).

OpenShift Container Platform uses the Multus CNI plugin to allow chaining of CNI plugins. During cluster installation, you configure your default pod network. The default network handles all ordinary network traffic for the cluster. You can define an additional network based on the available CNI plugins and attach one or more of these networks to your pods. You can define more than one additional network for your cluster, depending on your needs. This gives you flexibility when you configure pods that deliver network functionality, such as switching or routing.

20.1.1. Usage scenarios for an additional network

You can use an additional network in situations where network isolation is needed, including data plane and control plane separation. Isolating network traffic is useful for the following performance and security reasons:

Performance
You can send traffic on two different planes to manage how much traffic is along each plane.
Security
You can send sensitive traffic onto a network plane that is managed specifically for security considerations, and you can separate private data that must not be shared between tenants or customers.

All of the pods in the cluster still use the cluster-wide default network to maintain connectivity across the cluster. Every pod has an eth0 interface that is attached to the cluster-wide pod network. You can view the interfaces for a pod by using the oc exec -it <pod_name> -- ip a command. If you add additional network interfaces that use Multus CNI, they are named net1, net2, …​, netN.

To attach additional network interfaces to a pod, you must create configurations that define how the interfaces are attached. You specify each interface by using a NetworkAttachmentDefinition custom resource (CR). A CNI configuration inside each of these CRs defines how that interface is created.

OpenShift Container Platform provides the following CNI plugins for creating additional networks in your cluster:

20.2. Configuring an additional network

As a cluster administrator, you can configure an additional network for your cluster. The following network types are supported:

You can manage the life cycle of an additional network by two approaches. Each approach is mutually exclusive and you can only use one approach for managing an additional network at a time. For either approach, the additional network is managed by a Container Network Interface (CNI) plugin that you configure.

For an additional network, IP addresses are provisioned through an IP Address Management (IPAM) CNI plugin that you configure as part of the additional network. The IPAM plugin supports a variety of IP address assignment approaches including DHCP and static assignment.

  • Modify the Cluster Network Operator (CNO) configuration: The CNO automatically creates and manages the NetworkAttachmentDefinition object. In addition to managing the object lifecycle the CNO ensures a DHCP is available for an additional network that uses a DHCP assigned IP address.
  • Applying a YAML manifest: You can manage the additional network directly by creating an NetworkAttachmentDefinition object. This approach allows for the chaining of CNI plugins.

An additional network is configured via the NetworkAttachmentDefinition API in the k8s.cni.cncf.io API group.

Important

Do not store any sensitive information or a secret in the NetworkAttachmentDefinition object because this information is accessible by the project administration user.

The configuration for the API is described in the following table:

Expand
Table 20.1. NetworkAttachmentDefinition API fields
FieldTypeDescription

metadata.name

string

The name for the additional network.

metadata.namespace

string

The namespace that the object is associated with.

spec.config

string

The CNI plugin configuration in JSON format.

The configuration for an additional network attachment is specified as part of the Cluster Network Operator (CNO) configuration.

The following YAML describes the configuration parameters for managing an additional network with the CNO:

Cluster Network Operator configuration

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  # ...
  additionalNetworks: 
1

  - name: <name> 
2

    namespace: <namespace> 
3

    rawCNIConfig: |- 
4

      {
        ...
      }
    type: Raw
Copy to Clipboard Toggle word wrap

1
An array of one or more additional network configurations.
2
The name for the additional network attachment that you are creating. The name must be unique within the specified namespace.
3
The namespace to create the network attachment in. If you do not specify a value, then the default namespace is used.
4
A CNI plugin configuration in JSON format.

The configuration for an additional network is specified from a YAML configuration file, such as in the following example:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: <name> 
1

spec:
  config: |- 
2

    {
      ...
    }
Copy to Clipboard Toggle word wrap
1
The name for the additional network attachment that you are creating.
2
A CNI plugin configuration in JSON format.

The specific configuration fields for additional networks is described in the following sections.

The following object describes the configuration parameters for the bridge CNI plugin:

Expand
Table 20.2. Bridge CNI plugin JSON configuration object
FieldTypeDescription

cniVersion

string

The CNI specification version. The 0.3.1 value is required.

name

string

The value for the name parameter you provided previously for the CNO configuration.

type

string

The name of the CNI plugin to configure: bridge.

ipam

object

The configuration object for the IPAM CNI plugin. The plugin manages IP address assignment for the attachment definition.

bridge

string

Optional: Specify the name of the virtual bridge to use. If the bridge interface does not exist on the host, it is created. The default value is cni0.

ipMasq

boolean

Optional: Set to true to enable IP masquerading for traffic that leaves the virtual network. The source IP address for all traffic is rewritten to the bridge’s IP address. If the bridge does not have an IP address, this setting has no effect. The default value is false.

isGateway

boolean

Optional: Set to true to assign an IP address to the bridge. The default value is false.

isDefaultGateway

boolean

Optional: Set to true to configure the bridge as the default gateway for the virtual network. The default value is false. If isDefaultGateway is set to true, then isGateway is also set to true automatically.

forceAddress

boolean

Optional: Set to true to allow assignment of a previously assigned IP address to the virtual bridge. When set to false, if an IPv4 address or an IPv6 address from overlapping subsets is assigned to the virtual bridge, an error occurs. The default value is false.

hairpinMode

boolean

Optional: Set to true to allow the virtual bridge to send an Ethernet frame back through the virtual port it was received on. This mode is also known as reflective relay. The default value is false.

promiscMode

boolean

Optional: Set to true to enable promiscuous mode on the bridge. The default value is false.

vlan

string

Optional: Specify a virtual LAN (VLAN) tag as an integer value. By default, no VLAN tag is assigned.

preserveDefaultVlan

string

Optional: Indicates whether the default vlan must be preserved on the veth end connected to the bridge. Defaults to true.

vlanTrunk

list

Optional: Assign a VLAN trunk tag. The default value is none.

mtu

string

Optional: Set the maximum transmission unit (MTU) to the specified value. The default value is automatically set by the kernel.

enabledad

boolean

Optional: Enables duplicate address detection for the container side veth. The default value is false.

macspoofchk

boolean

Optional: Enables mac spoof check, limiting the traffic originating from the container to the mac address of the interface. The default value is false.

Note

The VLAN parameter configures the VLAN tag on the host end of the veth and also enables the vlan_filtering feature on the bridge interface.

Note

To configure uplink for a L2 network you need to allow the vlan on the uplink interface by using the following command:

$  bridge vlan add vid VLAN_ID dev DEV
Copy to Clipboard Toggle word wrap
20.2.3.1.1. bridge configuration example

The following example configures an additional network named bridge-net:

{
  "cniVersion": "0.3.1",
  "name": "bridge-net",
  "type": "bridge",
  "isGateway": true,
  "vlan": 2,
  "ipam": {
    "type": "dhcp"
    }
}
Copy to Clipboard Toggle word wrap
Note

Specify your network device by setting only one of the following parameters: device,hwaddr, kernelpath, or pciBusID.

The following object describes the configuration parameters for the host-device CNI plugin:

Expand
Table 20.3. Host device CNI plugin JSON configuration object
FieldTypeDescription

cniVersion

string

The CNI specification version. The 0.3.1 value is required.

name

string

The value for the name parameter you provided previously for the CNO configuration.

type

string

The name of the CNI plugin to configure: host-device.

device

string

Optional: The name of the device, such as eth0.

hwaddr

string

Optional: The device hardware MAC address.

kernelpath

string

Optional: The Linux kernel device path, such as /sys/devices/pci0000:00/0000:00:1f.6.

pciBusID

string

Optional: The PCI address of the network device, such as 0000:00:1f.6.

20.2.3.2.1. host-device configuration example

The following example configures an additional network named hostdev-net:

{
  "cniVersion": "0.3.1",
  "name": "hostdev-net",
  "type": "host-device",
  "device": "eth1"
}
Copy to Clipboard Toggle word wrap

The following object describes the configuration parameters for the IPVLAN CNI plugin:

Expand
Table 20.4. IPVLAN CNI plugin JSON configuration object
FieldTypeDescription

cniVersion

string

The CNI specification version. The 0.3.1 value is required.

name

string

The value for the name parameter you provided previously for the CNO configuration.

type

string

The name of the CNI plugin to configure: ipvlan.

ipam

object

The configuration object for the IPAM CNI plugin. The plugin manages IP address assignment for the attachment definition. This is required unless the plugin is chained.

mode

string

Optional: The operating mode for the virtual network. The value must be l2, l3, or l3s. The default value is l2.

master

string

Optional: The Ethernet interface to associate with the network attachment. If a master is not specified, the interface for the default network route is used.

mtu

integer

Optional: Set the maximum transmission unit (MTU) to the specified value. The default value is automatically set by the kernel.

Note
  • The ipvlan object does not allow virtual interfaces to communicate with the master interface. Therefore the container will not be able to reach the host by using the ipvlan interface. Be sure that the container joins a network that provides connectivity to the host, such as a network supporting the Precision Time Protocol (PTP).
  • A single master interface cannot simultaneously be configured to use both macvlan and ipvlan.
  • For IP allocation schemes that cannot be interface agnostic, the ipvlan plugin can be chained with an earlier plugin that handles this logic. If the master is omitted, then the previous result must contain a single interface name for the ipvlan plugin to enslave. If ipam is omitted, then the previous result is used to configure the ipvlan interface.
20.2.3.3.1. ipvlan configuration example

The following example configures an additional network named ipvlan-net:

{
  "cniVersion": "0.3.1",
  "name": "ipvlan-net",
  "type": "ipvlan",
  "master": "eth1",
  "mode": "l3",
  "ipam": {
    "type": "static",
    "addresses": [
       {
         "address": "192.168.10.10/24"
       }
    ]
  }
}
Copy to Clipboard Toggle word wrap

The following object describes the configuration parameters for the macvlan CNI plugin:

Expand
Table 20.5. MACVLAN CNI plugin JSON configuration object
FieldTypeDescription

cniVersion

string

The CNI specification version. The 0.3.1 value is required.

name

string

The value for the name parameter you provided previously for the CNO configuration.

type

string

The name of the CNI plugin to configure: macvlan.

ipam

object

The configuration object for the IPAM CNI plugin. The plugin manages IP address assignment for the attachment definition.

mode

string

Optional: Configures traffic visibility on the virtual network. Must be either bridge, passthru, private, or vepa. If a value is not provided, the default value is bridge.

master

string

Optional: The host network interface to associate with the newly created macvlan interface. If a value is not specified, then the default route interface is used.

mtu

string

Optional: The maximum transmission unit (MTU) to the specified value. The default value is automatically set by the kernel.

Note

If you specify the master key for the plugin configuration, use a different physical network interface than the one that is associated with your primary network plugin to avoid possible conflicts.

20.2.3.4.1. macvlan configuration example

The following example configures an additional network named macvlan-net:

{
  "cniVersion": "0.3.1",
  "name": "macvlan-net",
  "type": "macvlan",
  "master": "eth1",
  "mode": "bridge",
  "ipam": {
    "type": "dhcp"
    }
}
Copy to Clipboard Toggle word wrap

The IP address management (IPAM) Container Network Interface (CNI) plugin provides IP addresses for other CNI plugins.

You can use the following IP address assignment types:

  • Static assignment.
  • Dynamic assignment through a DHCP server. The DHCP server you specify must be reachable from the additional network.
  • Dynamic assignment through the Whereabouts IPAM CNI plugin.

The following table describes the configuration for static IP address assignment:

Expand
Table 20.6. ipam static configuration object
FieldTypeDescription

type

string

The IPAM address type. The value static is required.

addresses

array

An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.

routes

array

An array of objects specifying routes to configure inside the pod.

dns

array

Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Expand
Table 20.7. ipam.addresses[] array
FieldTypeDescription

address

string

An IP address and network prefix that you specify. For example, if you specify 10.10.21.10/24, then the additional network is assigned an IP address of 10.10.21.10 and the netmask is 255.255.255.0.

gateway

string

The default gateway to route egress network traffic to.

Expand
Table 20.8. ipam.routes[] array
FieldTypeDescription

dst

string

The IP address range in CIDR format, such as 192.168.17.0/24 or 0.0.0.0/0 for the default route.

gw

string

The gateway where network traffic is routed.

Expand
Table 20.9. ipam.dns object
FieldTypeDescription

nameservers

array

An array of one or more IP addresses for to send DNS queries to.

domain

array

The default domain to append to a hostname. For example, if the domain is set to example.com, a DNS lookup query for example-host is rewritten as example-host.example.com.

search

array

An array of domain names to append to an unqualified hostname, such as example-host, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}
Copy to Clipboard Toggle word wrap

The following JSON describes the configuration for dynamic IP address address assignment with DHCP.

Renewal of DHCP leases

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...
Copy to Clipboard Toggle word wrap

Expand
Table 20.10. ipam DHCP configuration object
FieldTypeDescription

type

string

The IPAM address type. The value dhcp is required.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}
Copy to Clipboard Toggle word wrap

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The following table describes the configuration for dynamic IP address assignment with Whereabouts:

Expand
Table 20.11. ipam whereabouts configuration object
FieldTypeDescription

type

string

The IPAM address type. The value whereabouts is required.

range

string

An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.

exclude

array

Optional: A list of zero or more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.

Dynamic IP address assignment configuration example that uses Whereabouts

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}
Copy to Clipboard Toggle word wrap

The Whereabouts reconciler is responsible for managing dynamic IP address assignments for the pods within a cluster using the Whereabouts IP Address Management (IPAM) solution. It ensures that each pods gets a unique IP address from the specified IP address range. It also handles IP address releases when pods are deleted or scaled down.

Note

You can also use a NetworkAttachmentDefinition custom resource for dynamic IP address assignment.

The Whereabouts reconciler daemon set is automatically created when you configure an additional network through the Cluster Network Operator. It is not automatically created when you configure an additional network from a YAML manifest.

To trigger the deployment of the Whereabouts reconciler daemonset, you must manually create a whereabouts-shim network attachment by editing the Cluster Network Operator custom resource file.

Use the following procedure to deploy the Whereabouts reconciler daemonset.

Procedure

  1. Edit the Network.operator.openshift.io custom resource (CR) by running the following command:

    $ oc edit network.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  2. Modify the additionalNetworks parameter in the CR to add the whereabouts-shim network attachment definition. For example:

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      additionalNetworks:
      - name: whereabouts-shim
        namespace: default
        rawCNIConfig: |-
          {
           "name": "whereabouts-shim",
           "cniVersion": "0.3.1",
           "type": "bridge",
           "ipam": {
             "type": "whereabouts"
           }
          }
        type: Raw
    Copy to Clipboard Toggle word wrap
  3. Save the file and exit the text editor.
  4. Verify that the whereabouts-reconciler daemon set deployed successfully by running the following command:

    $ oc get all -n openshift-multus | grep whereabouts-reconciler
    Copy to Clipboard Toggle word wrap

    Example output

    pod/whereabouts-reconciler-jnp6g 1/1 Running 0 6s
    pod/whereabouts-reconciler-k76gg 1/1 Running 0 6s
    pod/whereabouts-reconciler-k86t9 1/1 Running 0 6s
    pod/whereabouts-reconciler-p4sxw 1/1 Running 0 6s
    pod/whereabouts-reconciler-rvfdv 1/1 Running 0 6s
    pod/whereabouts-reconciler-svzw9 1/1 Running 0 6s
    daemonset.apps/whereabouts-reconciler 6 6 6 6 6 kubernetes.io/os=linux 6s
    Copy to Clipboard Toggle word wrap

The Cluster Network Operator (CNO) manages additional network definitions. When you specify an additional network to create, the CNO creates the NetworkAttachmentDefinition object automatically.

Important

Do not edit the NetworkAttachmentDefinition objects that the Cluster Network Operator manages. Doing so might disrupt network traffic on your additional network.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Optional: Create the namespace for the additional networks:

    $ oc create namespace <namespace_name>
    Copy to Clipboard Toggle word wrap
  2. To edit the CNO configuration, enter the following command:

    $ oc edit networks.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  3. Modify the CR that you are creating by adding the configuration for the additional network that you are creating, as in the following example CR.

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      # ...
      additionalNetworks:
      - name: tertiary-net
        namespace: namespace2
        type: Raw
        rawCNIConfig: |-
          {
            "cniVersion": "0.3.1",
            "name": "tertiary-net",
            "type": "ipvlan",
            "master": "eth1",
            "mode": "l2",
            "ipam": {
              "type": "static",
              "addresses": [
                {
                  "address": "192.168.1.23/24"
                }
              ]
            }
          }
    Copy to Clipboard Toggle word wrap
  4. Save your changes and quit the text editor to commit your changes.

Verification

  • Confirm that the CNO created the NetworkAttachmentDefinition object by running the following command. There might be a delay before the CNO creates the object.

    $ oc get network-attachment-definitions -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <namespace>
    Specifies the namespace for the network attachment that you added to the CNO configuration.

    Example output

    NAME                 AGE
    test-network-1       14m
    Copy to Clipboard Toggle word wrap

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a YAML file with your additional network configuration, such as in the following example:

    apiVersion: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    metadata:
      name: next-net
    spec:
      config: |-
        {
          "cniVersion": "0.3.1",
          "name": "work-network",
          "type": "host-device",
          "device": "eth1",
          "ipam": {
            "type": "dhcp"
          }
        }
    Copy to Clipboard Toggle word wrap
  2. To create the additional network, enter the following command:

    $ oc apply -f <file>.yaml
    Copy to Clipboard Toggle word wrap

    where:

    <file>
    Specifies the name of the file contained the YAML manifest.

20.3. About virtual routing and forwarding

20.3.1. About virtual routing and forwarding

Virtual routing and forwarding (VRF) devices combined with IP rules provide the ability to create virtual routing and forwarding domains. VRF reduces the number of permissions needed by CNF, and provides increased visibility of the network topology of secondary networks. VRF is used to provide multi-tenancy functionality, for example, where each tenant has its own unique routing tables and requires different default gateways.

Processes can bind a socket to the VRF device. Packets through the binded socket use the routing table associated with the VRF device. An important feature of VRF is that it impacts only OSI model layer 3 traffic and above so L2 tools, such as LLDP, are not affected. This allows higher priority IP rules such as policy based routing to take precedence over the VRF device rules directing specific traffic.

In telecommunications use cases, each CNF can potentially be connected to multiple different networks sharing the same address space. These secondary networks can potentially conflict with the cluster’s main network CIDR. Using the CNI VRF plugin, network functions can be connected to different customers' infrastructure using the same IP address, keeping different customers isolated. IP addresses are overlapped with OpenShift Container Platform IP space. The CNI VRF plugin also reduces the number of permissions needed by CNF and increases the visibility of network topologies of secondary networks.

20.4. Configuring multi-network policy

As a cluster administrator, you can configure network policy for additional networks.

Note

You can specify multi-network policy for only macvlan additional networks. Other types of additional networks, such as ipvlan, are not supported.

Although the MultiNetworkPolicy API implements the NetworkPolicy API, there are several important differences:

  • You must use the MultiNetworkPolicy API:

    apiVersion: k8s.cni.cncf.io/v1beta1
    kind: MultiNetworkPolicy
    Copy to Clipboard Toggle word wrap
  • You must use the multi-networkpolicy resource name when using the CLI to interact with multi-network policies. For example, you can view a multi-network policy object with the oc get multi-networkpolicy <name> command where <name> is the name of a multi-network policy.
  • You must specify an annotation with the name of the network attachment definition that defines the macvlan additional network:

    apiVersion: k8s.cni.cncf.io/v1beta1
    kind: MultiNetworkPolicy
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/policy-for: <network_name>
    Copy to Clipboard Toggle word wrap

    where:

    <network_name>
    Specifies the name of a network attachment definition.

As a cluster administrator, you can enable multi-network policy support on your cluster.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.

Procedure

  1. Create the multinetwork-enable-patch.yaml file with the following YAML:

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      useMultiNetworkPolicy: true
    Copy to Clipboard Toggle word wrap
  2. Configure the cluster to enable multi-network policy:

    $ oc patch network.operator.openshift.io cluster --type=merge --patch-file=multinetwork-enable-patch.yaml
    Copy to Clipboard Toggle word wrap

    Example output

    network.operator.openshift.io/cluster patched
    Copy to Clipboard Toggle word wrap

20.4.3. Working with multi-network policy

As a cluster administrator, you can create, edit, view, and delete multi-network policies.

20.4.3.1. Prerequisites
  • You have enabled multi-network policy support for your cluster.

To define granular rules describing ingress or egress network traffic allowed for namespaces in your cluster, you can create a multi-network policy.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You are working in the namespace that the multi-network policy applies to.

Procedure

  1. Create a policy rule:

    1. Create a <policy_name>.yaml file:

      $ touch <policy_name>.yaml
      Copy to Clipboard Toggle word wrap

      where:

      <policy_name>
      Specifies the multi-network policy file name.
    2. Define a multi-network policy in the file that you just created, such as in the following examples:

      Deny ingress from all pods in all namespaces

      apiVersion: k8s.cni.cncf.io/v1beta1
      kind: MultiNetworkPolicy
      metadata:
        name: deny-by-default
        annotations:
          k8s.v1.cni.cncf.io/policy-for: <network_name>
      spec:
        podSelector:
        ingress: []
      Copy to Clipboard Toggle word wrap

      where

      <network_name>
      Specifies the name of a network attachment definition.

      Allow ingress from all pods in the same namespace

      apiVersion: k8s.cni.cncf.io/v1beta1
      kind: MultiNetworkPolicy
      metadata:
        name: allow-same-namespace
        annotations:
          k8s.v1.cni.cncf.io/policy-for: <network_name>
      spec:
        podSelector:
        ingress:
        - from:
          - podSelector: {}
      Copy to Clipboard Toggle word wrap

      where

      <network_name>
      Specifies the name of a network attachment definition.
  2. To create the multi-network policy object, enter the following command:

    $ oc apply -f <policy_name>.yaml -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <policy_name>
    Specifies the multi-network policy file name.
    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.

    Example output

    multinetworkpolicy.k8s.cni.cncf.io/default-deny created
    Copy to Clipboard Toggle word wrap

Note

If you log in to the web console with cluster-admin privileges, you have a choice of creating a network policy in any namespace in the cluster directly in YAML or from a form in the web console.

20.4.3.3. Editing a multi-network policy

You can edit a multi-network policy in a namespace.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You are working in the namespace where the multi-network policy exists.

Procedure

  1. Optional: To list the multi-network policy objects in a namespace, enter the following command:

    $ oc get multi-networkpolicy
    Copy to Clipboard Toggle word wrap

    where:

    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
  2. Edit the multi-network policy object.

    • If you saved the multi-network policy definition in a file, edit the file and make any necessary changes, and then enter the following command.

      $ oc apply -n <namespace> -f <policy_file>.yaml
      Copy to Clipboard Toggle word wrap

      where:

      <namespace>
      Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
      <policy_file>
      Specifies the name of the file containing the network policy.
    • If you need to update the multi-network policy object directly, enter the following command:

      $ oc edit multi-networkpolicy <policy_name> -n <namespace>
      Copy to Clipboard Toggle word wrap

      where:

      <policy_name>
      Specifies the name of the network policy.
      <namespace>
      Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
  3. Confirm that the multi-network policy object is updated.

    $ oc describe multi-networkpolicy <policy_name> -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <policy_name>
    Specifies the name of the multi-network policy.
    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
Note

If you log in to the web console with cluster-admin privileges, you have a choice of editing a network policy in any namespace in the cluster directly in YAML or from the policy in the web console through the Actions menu.

You can examine the multi-network policies in a namespace.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You are working in the namespace where the multi-network policy exists.

Procedure

  • List multi-network policies in a namespace:

    • To view multi-network policy objects defined in a namespace, enter the following command:

      $ oc get multi-networkpolicy
      Copy to Clipboard Toggle word wrap
    • Optional: To examine a specific multi-network policy, enter the following command:

      $ oc describe multi-networkpolicy <policy_name> -n <namespace>
      Copy to Clipboard Toggle word wrap

      where:

      <policy_name>
      Specifies the name of the multi-network policy to inspect.
      <namespace>
      Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.
Note

If you log in to the web console with cluster-admin privileges, you have a choice of viewing a network policy in any namespace in the cluster directly in YAML or from a form in the web console.

You can delete a multi-network policy in a namespace.

Prerequisites

  • Your cluster uses a cluster network provider that supports NetworkPolicy objects, such as the OpenShift SDN network provider with mode: NetworkPolicy set. This mode is the default for OpenShift SDN.
  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You are working in the namespace where the multi-network policy exists.

Procedure

  • To delete a multi-network policy object, enter the following command:

    $ oc delete multi-networkpolicy <policy_name> -n <namespace>
    Copy to Clipboard Toggle word wrap

    where:

    <policy_name>
    Specifies the name of the multi-network policy.
    <namespace>
    Optional: Specifies the namespace if the object is defined in a different namespace than the current namespace.

    Example output

    multinetworkpolicy.k8s.cni.cncf.io/default-deny deleted
    Copy to Clipboard Toggle word wrap

Note

If you log in to the web console with cluster-admin privileges, you have a choice of deleting a network policy in any namespace in the cluster directly in YAML or from the policy in the web console through the Actions menu.

20.5. Attaching a pod to an additional network

As a cluster user you can attach a pod to an additional network.

20.5.1. Adding a pod to an additional network

You can add a pod to an additional network. The pod continues to send normal cluster-related network traffic over the default network.

When a pod is created additional networks are attached to it. However, if a pod already exists, you cannot attach additional networks to it.

The pod must be in the same namespace as the additional network.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster.

Procedure

  1. Add an annotation to the Pod object. Only one of the following annotation formats can be used:

    1. To attach an additional network without any customization, add an annotation with the following format. Replace <network> with the name of the additional network to associate with the pod:

      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: <network>[,<network>,...] 
      1
      Copy to Clipboard Toggle word wrap
      1
      To specify more than one additional network, separate each network with a comma. Do not include whitespace between the comma. If you specify the same additional network multiple times, that pod will have multiple network interfaces attached to that network.
    2. To attach an additional network with customizations, add an annotation with the following format:

      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: |-
            [
              {
                "name": "<network>", 
      1
      
                "namespace": "<namespace>", 
      2
      
                "default-route": ["<default-route>"] 
      3
      
              }
            ]
      Copy to Clipboard Toggle word wrap
      1
      Specify the name of the additional network defined by a NetworkAttachmentDefinition object.
      2
      Specify the namespace where the NetworkAttachmentDefinition object is defined.
      3
      Optional: Specify an override for the default route, such as 192.168.17.1.
  2. To create the pod, enter the following command. Replace <name> with the name of the pod.

    $ oc create -f <name>.yaml
    Copy to Clipboard Toggle word wrap
  3. Optional: To Confirm that the annotation exists in the Pod CR, enter the following command, replacing <name> with the name of the pod.

    $ oc get pod <name> -o yaml
    Copy to Clipboard Toggle word wrap

    In the following example, the example-pod pod is attached to the net1 additional network:

    $ oc get pod example-pod -o yaml
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: macvlan-bridge
        k8s.v1.cni.cncf.io/networks-status: |- 
    1
    
          [{
              "name": "openshift-sdn",
              "interface": "eth0",
              "ips": [
                  "10.128.2.14"
              ],
              "default": true,
              "dns": {}
          },{
              "name": "macvlan-bridge",
              "interface": "net1",
              "ips": [
                  "20.2.2.100"
              ],
              "mac": "22:2f:60:a5:f8:00",
              "dns": {}
          }]
      name: example-pod
      namespace: default
    spec:
      ...
    status:
      ...
    Copy to Clipboard Toggle word wrap
    1
    The k8s.v1.cni.cncf.io/networks-status parameter is a JSON array of objects. Each object describes the status of an additional network attached to the pod. The annotation value is stored as a plain text value.

When attaching a pod to an additional network, you may want to specify further properties about that network in a particular pod. This allows you to change some aspects of routing, as well as specify static IP addresses and MAC addresses. To accomplish this, you can use the JSON formatted annotations.

Prerequisites

  • The pod must be in the same namespace as the additional network.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster.

Procedure

To add a pod to an additional network while specifying addressing and/or routing options, complete the following steps:

  1. Edit the Pod resource definition. If you are editing an existing Pod resource, run the following command to edit its definition in the default editor. Replace <name> with the name of the Pod resource to edit.

    $ oc edit pod <name>
    Copy to Clipboard Toggle word wrap
  2. In the Pod resource definition, add the k8s.v1.cni.cncf.io/networks parameter to the pod metadata mapping. The k8s.v1.cni.cncf.io/networks accepts a JSON string of a list of objects that reference the name of NetworkAttachmentDefinition custom resource (CR) names in addition to specifying additional properties.

    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: '[<network>[,<network>,...]]' 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <network> with a JSON object as shown in the following examples. The single quotes are required.
  3. In the following example the annotation specifies which network attachment will have the default route, using the default-route parameter.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: '[
        {
          "name": "net1"
        },
        {
          "name": "net2", 
    1
    
          "default-route": ["192.0.2.1"] 
    2
    
        }]'
    spec:
      containers:
      - name: example-pod
        command: ["/bin/bash", "-c", "sleep 2000000000000"]
        image: centos/tools
    Copy to Clipboard Toggle word wrap
    1
    The name key is the name of the additional network to associate with the pod.
    2
    The default-route key specifies a value of a gateway for traffic to be routed over if no other routing entry is present in the routing table. If more than one default-route key is specified, this will cause the pod to fail to become active.

The default route will cause any traffic that is not specified in other routes to be routed to the gateway.

Important

Setting the default route to an interface other than the default network interface for OpenShift Container Platform may cause traffic that is anticipated for pod-to-pod traffic to be routed over another interface.

To verify the routing properties of a pod, the oc command may be used to execute the ip command within a pod.

$ oc exec -it <pod_name> -- ip route
Copy to Clipboard Toggle word wrap
Note

You may also reference the pod’s k8s.v1.cni.cncf.io/networks-status to see which additional network has been assigned the default route, by the presence of the default-route key in the JSON-formatted list of objects.

To set a static IP address or MAC address for a pod you can use the JSON formatted annotations. This requires you create networks that specifically allow for this functionality. This can be specified in a rawCNIConfig for the CNO.

  1. Edit the CNO CR by running the following command:

    $ oc edit networks.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap

The following YAML describes the configuration parameters for the CNO:

Cluster Network Operator YAML configuration

name: <name> 
1

namespace: <namespace> 
2

rawCNIConfig: '{ 
3

  ...
}'
type: Raw
Copy to Clipboard Toggle word wrap

1
Specify a name for the additional network attachment that you are creating. The name must be unique within the specified namespace.
2
Specify the namespace to create the network attachment in. If you do not specify a value, then the default namespace is used.
3
Specify the CNI plugin configuration in JSON format, which is based on the following template.

The following object describes the configuration parameters for utilizing static MAC address and IP address using the macvlan CNI plugin:

macvlan CNI plugin JSON configuration object using static IP and MAC address

{
  "cniVersion": "0.3.1",
  "name": "<name>", 
1

  "plugins": [{ 
2

      "type": "macvlan",
      "capabilities": { "ips": true }, 
3

      "master": "eth0", 
4

      "mode": "bridge",
      "ipam": {
        "type": "static"
      }
    }, {
      "capabilities": { "mac": true }, 
5

      "type": "tuning"
    }]
}
Copy to Clipboard Toggle word wrap

1
Specifies the name for the additional network attachment to create. The name must be unique within the specified namespace.
2
Specifies an array of CNI plugin configurations. The first object specifies a macvlan plugin configuration and the second object specifies a tuning plugin configuration.
3
Specifies that a request is made to enable the static IP address functionality of the CNI plugin runtime configuration capabilities.
4
Specifies the interface that the macvlan plugin uses.
5
Specifies that a request is made to enable the static MAC address functionality of a CNI plugin.

The above network attachment can be referenced in a JSON formatted annotation, along with keys to specify which static IP and MAC address will be assigned to a given pod.

Edit the pod with:

$ oc edit pod <name>
Copy to Clipboard Toggle word wrap

macvlan CNI plugin JSON configuration object using static IP and MAC address

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "<name>", 
1

        "ips": [ "192.0.2.205/24" ], 
2

        "mac": "CA:FE:C0:FF:EE:00" 
3

      }
    ]'
Copy to Clipboard Toggle word wrap

1
Use the <name> as provided when creating the rawCNIConfig above.
2
Provide an IP address including the subnet mask.
3
Provide the MAC address.
Note

Static IP addresses and MAC addresses do not have to be used at the same time, you may use them individually, or together.

To verify the IP address and MAC properties of a pod with additional networks, use the oc command to execute the ip command within a pod.

$ oc exec -it <pod_name> -- ip a
Copy to Clipboard Toggle word wrap

20.6. Removing a pod from an additional network

As a cluster user you can remove a pod from an additional network.

20.6.1. Removing a pod from an additional network

You can remove a pod from an additional network only by deleting the pod.

Prerequisites

  • An additional network is attached to the pod.
  • Install the OpenShift CLI (oc).
  • Log in to the cluster.

Procedure

  • To delete the pod, enter the following command:

    $ oc delete pod <name> -n <namespace>
    Copy to Clipboard Toggle word wrap
    • <name> is the name of the pod.
    • <namespace> is the namespace that contains the pod.

20.7. Editing an additional network

As a cluster administrator you can modify the configuration for an existing additional network.

As a cluster administrator, you can make changes to an existing additional network. Any existing pods attached to the additional network will not be updated.

Prerequisites

  • You have configured an additional network for your cluster.
  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

To edit an additional network for your cluster, complete the following steps:

  1. Run the following command to edit the Cluster Network Operator (CNO) CR in your default text editor:

    $ oc edit networks.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  2. In the additionalNetworks collection, update the additional network with your changes.
  3. Save your changes and quit the text editor to commit your changes.
  4. Optional: Confirm that the CNO updated the NetworkAttachmentDefinition object by running the following command. Replace <network-name> with the name of the additional network to display. There might be a delay before the CNO updates the NetworkAttachmentDefinition object to reflect your changes.

    $ oc get network-attachment-definitions <network-name> -o yaml
    Copy to Clipboard Toggle word wrap

    For example, the following console output displays a NetworkAttachmentDefinition object that is named net1:

    $ oc get network-attachment-definitions net1 -o go-template='{{printf "%s\n" .spec.config}}'
    { "cniVersion": "0.3.1", "type": "macvlan",
    "master": "ens5",
    "mode": "bridge",
    "ipam":       {"type":"static","routes":[{"dst":"0.0.0.0/0","gw":"10.128.2.1"}],"addresses":[{"address":"10.128.2.100/23","gateway":"10.128.2.1"}],"dns":{"nameservers":["172.30.0.10"],"domain":"us-west-2.compute.internal","search":["us-west-2.compute.internal"]}} }
    Copy to Clipboard Toggle word wrap

20.8. Removing an additional network

As a cluster administrator you can remove an additional network attachment.

As a cluster administrator, you can remove an additional network from your OpenShift Container Platform cluster. The additional network is not removed from any pods it is attached to.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

To remove an additional network from your cluster, complete the following steps:

  1. Edit the Cluster Network Operator (CNO) in your default text editor by running the following command:

    $ oc edit networks.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  2. Modify the CR by removing the configuration from the additionalNetworks collection for the network attachment definition you are removing.

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      additionalNetworks: [] 
    1
    Copy to Clipboard Toggle word wrap
    1
    If you are removing the configuration mapping for the only additional network attachment definition in the additionalNetworks collection, you must specify an empty collection.
  3. Save your changes and quit the text editor to commit your changes.
  4. Optional: Confirm that the additional network CR was deleted by running the following command:

    $ oc get network-attachment-definition --all-namespaces
    Copy to Clipboard Toggle word wrap

20.9. Assigning a secondary network to a VRF

20.9.1. Assigning a secondary network to a VRF

As a cluster administrator, you can configure an additional network for your VRF domain by using the CNI VRF plugin. The virtual network created by this plugin is associated with a physical interface that you specify.

Note

Applications that use VRFs need to bind to a specific device. The common usage is to use the SO_BINDTODEVICE option for a socket. SO_BINDTODEVICE binds the socket to a device that is specified in the passed interface name, for example, eth1. To use SO_BINDTODEVICE, the application must have CAP_NET_RAW capabilities.

Using a VRF through the ip vrf exec command is not supported in OpenShift Container Platform pods. To use VRF, bind applications directly to the VRF interface.

The Cluster Network Operator (CNO) manages additional network definitions. When you specify an additional network to create, the CNO creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit the NetworkAttachmentDefinition CRs that the Cluster Network Operator manages. Doing so might disrupt network traffic on your additional network.

To create an additional network attachment with the CNI VRF plugin, perform the following procedure.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in to the OpenShift cluster as a user with cluster-admin privileges.

Procedure

  1. Create the Network custom resource (CR) for the additional network attachment and insert the rawCNIConfig configuration for the additional network, as in the following example CR. Save the YAML as the file additional-network-attachment.yaml.

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
      spec:
      additionalNetworks:
      - name: test-network-1
        namespace: additional-network-1
        type: Raw
        rawCNIConfig: '{
          "cniVersion": "0.3.1",
          "name": "macvlan-vrf",
          "plugins": [  
    1
    
          {
            "type": "macvlan",  
    2
    
            "master": "eth1",
            "ipam": {
                "type": "static",
                "addresses": [
                {
                    "address": "191.168.1.23/24"
                }
                ]
            }
          },
          {
            "type": "vrf",
            "vrfname": "example-vrf-name",  
    3
    
            "table": 1001   
    4
    
          }]
        }'
    Copy to Clipboard Toggle word wrap
    1
    plugins must be a list. The first item in the list must be the secondary network underpinning the VRF network. The second item in the list is the VRF plugin configuration.
    2
    type must be set to vrf.
    3
    vrfname is the name of the VRF that the interface is assigned to. If it does not exist in the pod, it is created.
    4
    Optional. table is the routing table ID. By default, the tableid parameter is used. If it is not specified, the CNI assigns a free routing table ID to the VRF.
    Note

    VRF functions correctly only when the resource is of type netdevice.

  2. Create the Network resource:

    $ oc create -f additional-network-attachment.yaml
    Copy to Clipboard Toggle word wrap
  3. Confirm that the CNO created the NetworkAttachmentDefinition CR by running the following command. Replace <namespace> with the namespace that you specified when configuring the network attachment, for example, additional-network-1.

    $ oc get network-attachment-definitions -n <namespace>
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                       AGE
    additional-network-1       14m
    Copy to Clipboard Toggle word wrap

    Note

    There might be a delay before the CNO creates the CR.

Verifying that the additional VRF network attachment is successful

To verify that the VRF CNI is correctly configured and the additional network attachment is attached, do the following:

  1. Create a network that uses the VRF CNI.
  2. Assign the network to a pod.
  3. Verify that the pod network attachment is connected to the VRF additional network. Remote shell into the pod and run the following command:

    $ ip vrf show
    Copy to Clipboard Toggle word wrap

    Example output

    Name              Table
    -----------------------
    red                 10
    Copy to Clipboard Toggle word wrap

  4. Confirm the VRF interface is master of the secondary interface:

    $ ip link
    Copy to Clipboard Toggle word wrap

    Example output

    5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master red state UP mode
    Copy to Clipboard Toggle word wrap

Chapter 21. Hardware networks

The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.

SR-IOV can segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV network device driver for the device determines how the VF is exposed in the container:

  • netdevice driver: A regular kernel network device in the netns of the container
  • vfio-pci driver: A character device mounted in the container

You can use SR-IOV network devices with additional networks on your OpenShift Container Platform cluster installed on bare metal or Red Hat OpenStack Platform (RHOSP) infrastructure for applications that require high bandwidth or low latency.

You can enable SR-IOV on a node by using the following command:

$ oc label node <node_name> feature.node.kubernetes.io/network-sriov.capable="true"
Copy to Clipboard Toggle word wrap

The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. It performs the following functions:

  • Orchestrates discovery and management of SR-IOV network devices
  • Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI)
  • Creates and updates the configuration of the SR-IOV network device plugin
  • Creates node specific SriovNetworkNodeState custom resources
  • Updates the spec.interfaces field in each SriovNetworkNodeState custom resource

The Operator provisions the following components:

SR-IOV network configuration daemon
A daemon set that is deployed on worker nodes when the SR-IOV Network Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.
SR-IOV Network Operator webhook
A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.
SR-IOV Network resources injector
A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs. The SR-IOV network resources injector adds the resource field to only the first container in a pod automatically.
SR-IOV network device plugin
A device plugin that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plugins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plugins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
SR-IOV CNI plugin
A CNI plugin that attaches VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
SR-IOV InfiniBand CNI plugin
A CNI plugin that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
Note

The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig CR. Use caution when disabling the SR-IOV Network Operator Admission Controller webhook. You can disable the webhook under specific circumstances, such as troubleshooting, or if you want to use unsupported devices.

21.1.1.1. Supported platforms

The SR-IOV Network Operator is supported on the following platforms:

  • Bare metal
  • Red Hat OpenStack Platform (RHOSP)
21.1.1.2. Supported devices

OpenShift Container Platform supports the following network interface controllers:

Expand
Table 21.1. Supported network interface controllers
ManufacturerModelVendor IDDevice ID

Broadcom

BCM57414

14e4

16d7

Broadcom

BCM57508

14e4

1750

Intel

X710

8086

1572

Intel

XL710

8086

1583

Intel

XXV710

8086

158b

Intel

E810-CQDA2

8086

1592

Intel

E810-2CQDA2

8086

1592

Intel

E810-XXVDA2

8086

159b

Intel

E810-XXVDA4

8086

1593

Mellanox

MT27700 Family [ConnectX‑4]

15b3

1013

Mellanox

MT27710 Family [ConnectX‑4 Lx]

15b3

1015

Mellanox

MT27800 Family [ConnectX‑5]

15b3

1017

Mellanox

MT28880 Family [ConnectX‑5 Ex]

15b3

1019

Mellanox

MT28908 Family [ConnectX‑6]

15b3

101b

Mellanox

MT2892 Family [ConnectX‑6 Dx]

15b3

101d

Mellanox

MT2894 Family [ConnectX‑6 Lx]

15b3

101f

Pensando [1]

DSC-25 dual-port 25G distributed services card for ionic driver

0x1dd8

0x1002

Pensando [1]

DSC-100 dual-port 100G distributed services card for ionic driver

0x1dd8

0x1003

  1. OpenShift SR-IOV is supported, but you must set a static, Virtual Function (VF) media access control (MAC) address using the SR-IOV CNI config file when using SR-IOV.
Note

For the most up-to-date list of supported cards and compatible OpenShift Container Platform versions available, see Openshift Single Root I/O Virtualization (SR-IOV) and PTP hardware networks Support Matrix.

The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.

The CR is assigned the same name as the worker node. The status.interfaces list provides information about the network devices on a node.

Important

Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.

21.1.1.3.1. Example SriovNetworkNodeState object

The following YAML is an example of a SriovNetworkNodeState object created by the SR-IOV Network Operator:

An SriovNetworkNodeState object

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: node-25 
1

  namespace: openshift-sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovNetworkNodePolicy
    name: default
spec:
  dpConfigVersion: "39824"
status:
  interfaces: 
2

  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f0
    pciAddress: "0000:18:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f1
    pciAddress: "0000:18:00.1"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f0
    pciAddress: 0000:81:00.0
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f1
    pciAddress: 0000:81:00.1
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens803f0
    pciAddress: 0000:86:00.0
    totalvfs: 64
    vendor: "8086"
  syncStatus: Succeeded
Copy to Clipboard Toggle word wrap

1
The value of the name field is the same as the name of the worker node.
2
The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node.

You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with SR-IOV VF attached.

This example shows a pod using a virtual function (VF) in RDMA mode:

Pod spec that uses RDMA mode

apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
  containers:
  - name: testpmd
    image: <RDMA_image>
    imagePullPolicy: IfNotPresent
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    command: ["sleep", "infinity"]
Copy to Clipboard Toggle word wrap

The following example shows a pod with a VF in DPDK mode:

Pod spec that uses DPDK mode

apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
Copy to Clipboard Toggle word wrap

An optional library, app-netutil, provides several API methods for gathering network information about a pod from within a container running within that pod.

This library can assist with integrating SR-IOV virtual functions (VFs) in Data Plane Development Kit (DPDK) mode into the container. The library provides both a Golang API and a C API.

Currently there are three API methods implemented:

GetCPUInfo()
This function determines which CPUs are available to the container and returns the list.
GetHugepages()
This function determines the amount of huge page memory requested in the Pod spec for each container and returns the values.
GetInterfaces()
This function determines the set of interfaces in the container and returns the list. The return value includes the interface type and type-specific data for each interface.

The repository for the library includes a sample Dockerfile to build a container image, dpdk-app-centos. The container image can run one of the following DPDK sample applications, depending on an environment variable in the pod specification: l2fwd, l3wd or testpmd. The container image provides an example of integrating the app-netutil library into the container image itself. The library can also integrate into an init container. The init container can collect the required data and pass the data to an existing DPDK workload.

When a pod specification includes a resource request or limit for huge pages, the Network Resources Injector automatically adds Downward API fields to the pod specification to provide the huge pages information to the container.

The Network Resources Injector adds a volume that is named podnetinfo and is mounted at /etc/podnetinfo for each container in the pod. The volume uses the Downward API and includes a file for huge pages requests and limits. The file naming convention is as follows:

  • /etc/podnetinfo/hugepages_1G_request_<container-name>
  • /etc/podnetinfo/hugepages_1G_limit_<container-name>
  • /etc/podnetinfo/hugepages_2M_request_<container-name>
  • /etc/podnetinfo/hugepages_2M_limit_<container-name>

The paths specified in the previous list are compatible with the app-netutil library. By default, the library is configured to search for resource information in the /etc/podnetinfo directory. If you choose to specify the Downward API path items yourself manually, the app-netutil library searches for the following paths in addition to the paths in the previous list.

  • /etc/podnetinfo/hugepages_request
  • /etc/podnetinfo/hugepages_limit
  • /etc/podnetinfo/hugepages_1G_request
  • /etc/podnetinfo/hugepages_1G_limit
  • /etc/podnetinfo/hugepages_2M_request
  • /etc/podnetinfo/hugepages_2M_limit

As with the paths that the Network Resources Injector can create, the paths in the preceding list can optionally end with a _<container-name> suffix.

21.1.2. Next steps

21.2. Installing the SR-IOV Network Operator

You can install the Single Root I/O Virtualization (SR-IOV) Network Operator on your cluster to manage SR-IOV network devices and network attachments.

21.2.1. Installing SR-IOV Network Operator

As a cluster administrator, you can install the SR-IOV Network Operator by using the OpenShift Container Platform CLI or the web console.

As a cluster administrator, you can install the Operator using the CLI.

Prerequisites

  • A cluster installed on bare-metal hardware with nodes that have hardware that supports SR-IOV.
  • Install the OpenShift CLI (oc).
  • An account with cluster-admin privileges.

Procedure

  1. To create the openshift-sriov-network-operator namespace, enter the following command:

    $ cat << EOF| oc create -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-sriov-network-operator
      annotations:
        workload.openshift.io/allowed: management
    EOF
    Copy to Clipboard Toggle word wrap
  2. To create an OperatorGroup CR, enter the following command:

    $ cat << EOF| oc create -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: sriov-network-operators
      namespace: openshift-sriov-network-operator
    spec:
      targetNamespaces:
      - openshift-sriov-network-operator
    EOF
    Copy to Clipboard Toggle word wrap
  3. Subscribe to the SR-IOV Network Operator.

    1. Run the following command to get the OpenShift Container Platform major and minor version. It is required for the channel value in the next step.

      $ OC_VERSION=$(oc version -o yaml | grep openshiftVersion | \
          grep -o '[0-9]*[.][0-9]*' | head -1)
      Copy to Clipboard Toggle word wrap
    2. To create a Subscription CR for the SR-IOV Network Operator, enter the following command:

      $ cat << EOF| oc create -f -
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: sriov-network-operator-subscription
        namespace: openshift-sriov-network-operator
      spec:
        channel: "${OC_VERSION}"
        name: sriov-network-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      EOF
      Copy to Clipboard Toggle word wrap
  4. To verify that the Operator is installed, enter the following command:

    $ oc get csv -n openshift-sriov-network-operator \
      -o custom-columns=Name:.metadata.name,Phase:.status.phase
    Copy to Clipboard Toggle word wrap

    Example output

    Name                                         Phase
    sriov-network-operator.4.12.0-202310121402   Succeeded
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can install the Operator using the web console.

Prerequisites

  • A cluster installed on bare-metal hardware with nodes that have hardware that supports SR-IOV.
  • Install the OpenShift CLI (oc).
  • An account with cluster-admin privileges.

Procedure

  1. Install the SR-IOV Network Operator:

    1. In the OpenShift Container Platform web console, click OperatorsOperatorHub.
    2. Select SR-IOV Network Operator from the list of available Operators, and then click Install.
    3. On the Install Operator page, under Installed Namespace, select Operator recommended Namespace.
    4. Click Install.
  2. Verify that the SR-IOV Network Operator is installed successfully:

    1. Navigate to the OperatorsInstalled Operators page.
    2. Ensure that SR-IOV Network Operator is listed in the openshift-sriov-network-operator project with a Status of InstallSucceeded.

      Note

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, to troubleshoot further:

      • Inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
      • Navigate to the WorkloadsPods page and check the logs for pods in the openshift-sriov-network-operator project.
      • Check the namespace of the YAML file. If the annotation is missing, you can add the annotation workload.openshift.io/allowed=management to the Operator namespace with the following command:

        $ oc annotate ns/openshift-sriov-network-operator workload.openshift.io/allowed=management
        Copy to Clipboard Toggle word wrap
        Note

        For single-node OpenShift clusters, the annotation workload.openshift.io/allowed=management is required for the namespace.

21.2.2. Next steps

21.3. Configuring the SR-IOV Network Operator

The Single Root I/O Virtualization (SR-IOV) Network Operator manages the SR-IOV network devices and network attachments in your cluster.

21.3.1. Configuring the SR-IOV Network Operator

Important

Modifying the SR-IOV Network Operator configuration is not normally necessary. The default configuration is recommended for most use cases. Complete the steps to modify the relevant configuration only if the default behavior of the Operator is not compatible with your use case.

The SR-IOV Network Operator adds the SriovOperatorConfig.sriovnetwork.openshift.io CustomResourceDefinition resource. The Operator automatically creates a SriovOperatorConfig custom resource (CR) named default in the openshift-sriov-network-operator namespace.

Note

The default CR contains the SR-IOV Network Operator configuration for your cluster. To change the Operator configuration, you must modify this CR.

The fields for the sriovoperatorconfig custom resource are described in the following table:

Expand
Table 21.2. SR-IOV Network Operator config custom resource
FieldTypeDescription

metadata.name

string

Specifies the name of the SR-IOV Network Operator instance. The default value is default. Do not set a different value.

metadata.namespace

string

Specifies the namespace of the SR-IOV Network Operator instance. The default value is openshift-sriov-network-operator. Do not set a different value.

spec.configDaemonNodeSelector

string

Specifies the node selection to control scheduling the SR-IOV Network Config Daemon on selected nodes. By default, this field is not set and the Operator deploys the SR-IOV Network Config daemon set on worker nodes.

spec.disableDrain

boolean

Specifies whether to disable the node draining process or enable the node draining process when you apply a new policy to configure the NIC on a node. Setting this field to true facilitates software development and installing OpenShift Container Platform on a single node. By default, this field is not set.

For single-node clusters, set this field to true after installing the Operator. This field must remain set to true.

spec.enableInjector

boolean

Specifies whether to enable or disable the Network Resources Injector daemon set. By default, this field is set to true.

spec.enableOperatorWebhook

boolean

Specifies whether to enable or disable the Operator Admission Controller webhook daemon set. By default, this field is set to true.

spec.logLevel

integer

Specifies the log verbosity level of the Operator. Set to 0 to show only the basic logs. Set to 2 to show all the available logs. By default, this field is set to 2.

21.3.1.2. About the Network Resources Injector

The Network Resources Injector is a Kubernetes Dynamic Admission Controller application. It provides the following capabilities:

  • Mutation of resource requests and limits in a pod specification to add an SR-IOV resource name according to an SR-IOV network attachment definition annotation.
  • Mutation of a pod specification with a Downward API volume to expose pod annotations, labels, and huge pages requests and limits. Containers that run in the pod can access the exposed information as files under the /etc/podnetinfo path.

By default, the Network Resources Injector is enabled by the SR-IOV Network Operator and runs as a daemon set on all control plane nodes. The following is an example of Network Resources Injector pods running in a cluster with three control plane nodes:

$ oc get pods -n openshift-sriov-network-operator
Copy to Clipboard Toggle word wrap

Example output

NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-5cz5p          1/1     Running   0          10m
network-resources-injector-dwqpx          1/1     Running   0          10m
network-resources-injector-lktz5          1/1     Running   0          10m
Copy to Clipboard Toggle word wrap

The SR-IOV Network Operator Admission Controller webhook is a Kubernetes Dynamic Admission Controller application. It provides the following capabilities:

  • Validation of the SriovNetworkNodePolicy CR when it is created or updated.
  • Mutation of the SriovNetworkNodePolicy CR by setting the default value for the priority and deviceType fields when the CR is created or updated.

By default the SR-IOV Network Operator Admission Controller webhook is enabled by the Operator and runs as a daemon set on all control plane nodes.

Note

Use caution when disabling the SR-IOV Network Operator Admission Controller webhook. You can disable the webhook under specific circumstances, such as troubleshooting, or if you want to use unsupported devices. For information about configuring unsupported devices, see Configuring the SR-IOV Network Operator to use an unsupported NIC.

The following is an example of the Operator Admission Controller webhook pods running in a cluster with three control plane nodes:

$ oc get pods -n openshift-sriov-network-operator
Copy to Clipboard Toggle word wrap

Example output

NAME                                      READY   STATUS    RESTARTS   AGE
operator-webhook-9jkw6                    1/1     Running   0          16m
operator-webhook-kbr5p                    1/1     Running   0          16m
operator-webhook-rpfrl                    1/1     Running   0          16m
Copy to Clipboard Toggle word wrap

21.3.1.4. About custom node selectors

The SR-IOV Network Config daemon discovers and configures the SR-IOV network devices on cluster nodes. By default, it is deployed to all the worker nodes in the cluster. You can use node labels to specify on which nodes the SR-IOV Network Config daemon runs.

To disable or enable the Network Resources Injector, which is enabled by default, complete the following procedure.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • You must have installed the SR-IOV Network Operator.

Procedure

  • Set the enableInjector field. Replace <value> with false to disable the feature or true to enable the feature.

    $ oc patch sriovoperatorconfig default \
      --type=merge -n openshift-sriov-network-operator \
      --patch '{ "spec": { "enableInjector": <value> } }'
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      enableInjector: <value>
    Copy to Clipboard Toggle word wrap

To disable or enable the admission controller webhook, which is enabled by default, complete the following procedure.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • You must have installed the SR-IOV Network Operator.

Procedure

  • Set the enableOperatorWebhook field. Replace <value> with false to disable the feature or true to enable it:

    $ oc patch sriovoperatorconfig default --type=merge \
      -n openshift-sriov-network-operator \
      --patch '{ "spec": { "enableOperatorWebhook": <value> } }'
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      enableOperatorWebhook: <value>
    Copy to Clipboard Toggle word wrap

The SR-IOV Network Config daemon discovers and configures the SR-IOV network devices on cluster nodes. By default, it is deployed to all the worker nodes in the cluster. You can use node labels to specify on which nodes the SR-IOV Network Config daemon runs.

To specify the nodes where the SR-IOV Network Config daemon is deployed, complete the following procedure.

Important

When you update the configDaemonNodeSelector field, the SR-IOV Network Config daemon is recreated on each selected node. While the daemon is recreated, cluster users are unable to apply any new SR-IOV Network node policy or create new SR-IOV pods.

Procedure

  • To update the node selector for the operator, enter the following command:

    $ oc patch sriovoperatorconfig default --type=json \
      -n openshift-sriov-network-operator \
      --patch '[{
          "op": "replace",
          "path": "/spec/configDaemonNodeSelector",
          "value": {<node_label>}
        }]'
    Copy to Clipboard Toggle word wrap

    Replace <node_label> with a label to apply as in the following example: "node-role.kubernetes.io/worker": "".

    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      configDaemonNodeSelector:
        <node_label>
    Copy to Clipboard Toggle word wrap

By default, the SR-IOV Network Operator drains workloads from a node before every policy change. The Operator performs this action to ensure that there no workloads using the virtual functions before the reconfiguration.

For installations on a single node, there are no other nodes to receive the workloads. As a result, the Operator must be configured not to drain the workloads from the single node.

Important

After performing the following procedure to disable draining workloads, you must remove any workload that uses an SR-IOV network interface before you change any SR-IOV network node policy.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • You must have installed the SR-IOV Network Operator.

Procedure

  • To set the disableDrain field to true, enter the following command:

    $ oc patch sriovoperatorconfig default --type=merge \
      -n openshift-sriov-network-operator \
      --patch '{ "spec": { "disableDrain": true } }'
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      disableDrain: true
    Copy to Clipboard Toggle word wrap

21.3.2. Next steps

21.4. Configuring an SR-IOV network device

You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster.

21.4.1. SR-IOV network node configuration object

You specify the SR-IOV network device configuration for a node by creating an SR-IOV network node policy. The API object for the policy is part of the sriovnetwork.openshift.io API group.

The following YAML describes an SR-IOV network node policy:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name> 
1

  namespace: openshift-sriov-network-operator 
2

spec:
  resourceName: <sriov_resource_name> 
3

  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true" 
4

  priority: <priority> 
5

  mtu: <mtu> 
6

  needVhostNet: false 
7

  numVfs: <num> 
8

  nicSelector: 
9

    vendor: "<vendor_code>" 
10

    deviceID: "<device_id>" 
11

    pfNames: ["<pf_name>", ...] 
12

    rootDevices: ["<pci_bus_id>", ...] 
13

    netFilter: "<filter_string>" 
14

  deviceType: <device_type> 
15

  isRdma: false 
16

    linkType: <link_type> 
17

  eSwitchMode: "switchdev" 
18
Copy to Clipboard Toggle word wrap
1
The name for the custom resource object.
2
The namespace where the SR-IOV Network Operator is installed.
3
The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.

When specifying a name, be sure to use the accepted syntax expression ^[a-zA-Z0-9_]+$ in the resourceName.

4
The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.
Important

The SR-IOV Network Operator applies node network configuration policies to nodes in sequence. Before applying node network configuration policies, the SR-IOV Network Operator checks if the machine config pool (MCP) for a node is in an unhealthy state such as Degraded or Updating. If a node is in an unhealthy MCP, the process of applying node network configuration policies to all targeted nodes in the cluster pauses until the MCP returns to a healthy state.

To avoid a node in an unhealthy MCP from blocking the application of node network configuration policies to other nodes, including nodes in other MCPs, you must create a separate node network configuration policy for each MCP.

5
Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
6
Optional: The maximum transmission unit (MTU) of the virtual function. The maximum MTU value can vary for different network interface controller (NIC) models.
Important

If you want to create virtual function on the default network interface, ensure that the MTU is set to a value that matches the cluster MTU.

7
Optional: Set needVhostNet to true to mount the /dev/vhost-net device in the pod. Use the mounted /dev/vhost-net device with Data Plane Development Kit (DPDK) to forward traffic to the kernel network stack.
8
The number of the virtual functions (VF) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128.
9
The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally.

If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.

10
Optional: The vendor hexadecimal code of the SR-IOV network device. The only allowed values are 8086 and 15b3.
11
Optional: The device hexadecimal code of the SR-IOV network device. For example, 101b is the device ID for a Mellanox ConnectX-6 device.
12
Optional: An array of one or more physical function (PF) names for the device.
13
Optional: An array of one or more PCI bus addresses for the PF of the device. Provide the address in the following format: 0000:02:00.1.
14
Optional: The platform-specific network filter. The only supported platform is Red Hat OpenStack Platform (RHOSP). Acceptable values use the following format: openstack/NetworkID:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with the value from the /var/config/openstack/latest/network_data.json metadata file.
15
Optional: The driver type for the virtual functions. The only allowed values are netdevice and vfio-pci. The default value is netdevice.

For a Mellanox NIC to work in DPDK mode on bare metal nodes, use the netdevice driver type and set isRdma to true.

16
Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false.

If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode.

Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.

17
Optional: The link type for the VFs. The default value is eth for Ethernet. Change this value to 'ib' for InfiniBand.

When linkType is set to ib, isRdma is automatically set to true by the SR-IOV Network Operator webhook. When linkType is set to ib, deviceType should not be set to vfio-pci.

Do not set linkType to 'eth' for SriovNetworkNodePolicy, because this can lead to an incorrect number of available devices reported by the device plugin.

18
Optional: The NIC device mode. The only allowed values are legacy or switchdev.

When eSwitchMode is set to legacy, the default SR-IOV behavior is enabled.

When eSwitchMode is set to switchdev, hardware offloading is enabled.

The following example describes the configuration for an InfiniBand device:

Example configuration for an InfiniBand device

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ib-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: ibnic1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    rootDevices:
      - "0000:19:00.0"
  linkType: ib
  isRdma: true
Copy to Clipboard Toggle word wrap

The following example describes the configuration for an SR-IOV network device in a RHOSP virtual machine:

Example configuration for an SR-IOV device in a virtual machine

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-sriov-net-openstack-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovnic1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1 
1

  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    netFilter: "openstack/NetworkID:ea24bd04-8674-4f69-b0ee-fa0b3bd20509" 
2
Copy to Clipboard Toggle word wrap

1
The numVfs field is always set to 1 when configuring the node network policy for a virtual machine.
2
The netFilter field must refer to a network ID when the virtual machine is deployed on RHOSP. Valid values for netFilter are available from an SriovNetworkNodeState object.

In some cases, you might want to split virtual functions (VFs) from the same physical function (PF) into multiple resource pools. For example, you might want some of the VFs to load with the default driver and the remaining VFs load with the vfio-pci driver. In such a deployment, the pfNames selector in your SriovNetworkNodePolicy custom resource (CR) can be used to specify a range of VFs for a pool using the following format: <pfname>#<first_vf>-<last_vf>.

For example, the following YAML shows the selector for an interface named netpf0 with VF 2 through 7:

pfNames: ["netpf0#2-7"]
Copy to Clipboard Toggle word wrap
  • netpf0 is the PF interface name.
  • 2 is the first VF index (0-based) that is included in the range.
  • 7 is the last VF index (0-based) that is included in the range.

You can select VFs from the same PF by using different policy CRs if the following requirements are met:

  • The numVfs value must be identical for policies that select the same PF.
  • The VF index must be in the range of 0 to <numVfs>-1. For example, if you have a policy with numVfs set to 8, then the <first_vf> value must not be smaller than 0, and the <last_vf> must not be larger than 7.
  • The VFs ranges in different policies must not overlap.
  • The <first_vf> must not be larger than the <last_vf>.

The following example illustrates NIC partitioning for an SR-IOV device.

The policy policy-net-1 defines a resource pool net-1 that contains the VF 0 of PF netpf0 with the default VF driver. The policy policy-net-1-dpdk defines a resource pool net-1-dpdk that contains the VF 8 to 15 of PF netpf0 with the vfio VF driver.

Policy policy-net-1:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#0-0"]
  deviceType: netdevice
Copy to Clipboard Toggle word wrap

Policy policy-net-1-dpdk:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1-dpdk
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1dpdk
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#8-15"]
  deviceType: vfio-pci
Copy to Clipboard Toggle word wrap

Verifying that the interface is successfully partitioned

Confirm that the interface partitioned to virtual functions (VFs) for the SR-IOV device by running the following command.

$ ip link show <interface> 
1
Copy to Clipboard Toggle word wrap
1
Replace <interface> with the interface that you specified when partitioning to VFs for the SR-IOV device, for example, ens3f1.

Example output

5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:d1:bc:01 brd ff:ff:ff:ff:ff:ff

vf 0     link/ether 5a:e7:88:25:ea:a0 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 1     link/ether 3e:1d:36:d7:3d:49 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 2     link/ether ce:09:56:97:df:f9 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3     link/ether 5e:91:cf:88:d1:38 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 4     link/ether e6:06:a1:96:2f:de brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
Copy to Clipboard Toggle word wrap

21.4.2. Configuring SR-IOV network devices

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io CustomResourceDefinition to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes.

It might take several minutes for a configuration change to apply.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the SR-IOV Network Operator.
  • You have enough available nodes in your cluster to handle the evicted workload from drained nodes.
  • You have not selected any control plane nodes for SR-IOV network device configuration.

Procedure

  1. Create an SriovNetworkNodePolicy object, and then save the YAML in the <name>-sriov-node-network.yaml file. Replace <name> with the name for this configuration.
  2. Optional: Label the SR-IOV capable cluster nodes with SriovNetworkNodePolicy.Spec.NodeSelector if they are not already labeled. For more information about labeling nodes, see "Understanding how to update labels on nodes".
  3. Create the SriovNetworkNodePolicy object:

    $ oc create -f <name>-sriov-node-network.yaml
    Copy to Clipboard Toggle word wrap

    where <name> specifies the name for this configuration.

    After applying the configuration update, all the pods in sriov-network-operator namespace transition to the Running status.

  4. To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.

    $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
    Copy to Clipboard Toggle word wrap

21.4.3. Troubleshooting SR-IOV configuration

After following the procedure to configure an SR-IOV network device, the following sections address some error conditions.

To display the state of nodes, run the following command:

$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name>
Copy to Clipboard Toggle word wrap

where: <node_name> specifies the name of a node with an SR-IOV network device.

Error output: Cannot allocate memory

"lastSyncError": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"
Copy to Clipboard Toggle word wrap

When a node indicates that it cannot allocate memory, check the following items:

  • Confirm that global SR-IOV settings are enabled in the BIOS for the node.
  • Confirm that VT-d is enabled in the BIOS for the node.

21.4.4. Assigning an SR-IOV network to a VRF

As a cluster administrator, you can assign an SR-IOV network interface to your VRF domain by using the CNI VRF plugin.

To do this, add the VRF configuration to the optional metaPlugins parameter of the SriovNetwork resource.

Note

Applications that use VRFs need to bind to a specific device. The common usage is to use the SO_BINDTODEVICE option for a socket. SO_BINDTODEVICE binds the socket to a device that is specified in the passed interface name, for example, eth1. To use SO_BINDTODEVICE, the application must have CAP_NET_RAW capabilities.

Using a VRF through the ip vrf exec command is not supported in OpenShift Container Platform pods. To use VRF, bind applications directly to the VRF interface.

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To create an additional SR-IOV network attachment with the CNI VRF plugin, perform the following procedure.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

  1. Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: example-network
      namespace: additional-sriov-network-1
    spec:
      ipam: |
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "routes": [{
            "dst": "0.0.0.0/0"
          }],
          "gateway": "10.56.217.1"
        }
      vlan: 0
      resourceName: intelnics
      metaPlugins : |
        {
          "type": "vrf", 
    1
    
          "vrfname": "example-vrf-name" 
    2
    
        }
    Copy to Clipboard Toggle word wrap
    1
    type must be set to vrf.
    2
    vrfname is the name of the VRF that the interface is assigned to. If it does not exist in the pod, it is created.
  2. Create the SriovNetwork resource:

    $ oc create -f sriov-network-attachment.yaml
    Copy to Clipboard Toggle word wrap

Verifying that the NetworkAttachmentDefinition CR is successfully created

  • Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command.

    $ oc get network-attachment-definitions -n <namespace> 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <namespace> with the namespace that you specified when configuring the network attachment, for example, additional-sriov-network-1.

    Example output

    NAME                            AGE
    additional-sriov-network-1      14m
    Copy to Clipboard Toggle word wrap

    Note

    There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network attachment is successful

To verify that the VRF CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

  1. Create an SR-IOV network that uses the VRF CNI.
  2. Assign the network to a pod.
  3. Verify that the pod network attachment is connected to the SR-IOV additional network. Remote shell into the pod and run the following command:

    $ ip vrf show
    Copy to Clipboard Toggle word wrap

    Example output

    Name              Table
    -----------------------
    red                 10
    Copy to Clipboard Toggle word wrap

  4. Confirm the VRF interface is master of the secondary interface:

    $ ip link
    Copy to Clipboard Toggle word wrap

    Example output

    ...
    5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master red state UP mode
    ...
    Copy to Clipboard Toggle word wrap

21.4.5. Next steps

You can configure an Ethernet network attachment for an Single Root I/O Virtualization (SR-IOV) device in the cluster.

21.5.1. Ethernet device configuration object

You can configure an Ethernet network device by defining an SriovNetwork object.

The following YAML describes an SriovNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: <name> 
1

  namespace: openshift-sriov-network-operator 
2

spec:
  resourceName: <sriov_resource_name> 
3

  networkNamespace: <target_namespace> 
4

  vlan: <vlan> 
5

  spoofChk: "<spoof_check>" 
6

  ipam: |- 
7

    {}
  linkState: <link_state> 
8

  maxTxRate: <max_tx_rate> 
9

  minTxRate: <min_tx_rate> 
10

  vlanQoS: <vlan_qos> 
11

  trust: "<trust_vf>" 
12

  capabilities: <capabilities> 
13
Copy to Clipboard Toggle word wrap
1
A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2
The namespace where the SR-IOV Network Operator is installed.
3
The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
5
Optional: A Virtual LAN (VLAN) ID for the additional network. The integer value must be from 0 to 4095. The default value is 0.
6
Optional: The spoof check mode of the VF. The allowed values are the strings "on" and "off".
Important

You must enclose the value you specify in quotes or the object is rejected by the SR-IOV Network Operator.

7
A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
8
Optional: The link state of virtual function (VF). Allowed value are enable, disable and auto.
9
Optional: A maximum transmission rate, in Mbps, for the VF.
10
Optional: A minimum transmission rate, in Mbps, for the VF. This value must be less than or equal to the maximum transmission rate.
Note

Intel NICs do not support the minTxRate parameter. For more information, see BZ#1772847.

11
Optional: An IEEE 802.1p priority level for the VF. The default value is 0.
12
Optional: The trust mode of the VF. The allowed values are the strings "on" and "off".
Important

You must enclose the value that you specify in quotes, or the SR-IOV Network Operator rejects the object.

13
Optional: The capabilities to configure for this additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.

The IP address management (IPAM) Container Network Interface (CNI) plugin provides IP addresses for other CNI plugins.

You can use the following IP address assignment types:

  • Static assignment.
  • Dynamic assignment through a DHCP server. The DHCP server you specify must be reachable from the additional network.
  • Dynamic assignment through the Whereabouts IPAM CNI plugin.

The following table describes the configuration for static IP address assignment:

Expand
Table 21.3. ipam static configuration object
FieldTypeDescription

type

string

The IPAM address type. The value static is required.

addresses

array

An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.

routes

array

An array of objects specifying routes to configure inside the pod.

dns

array

Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Expand
Table 21.4. ipam.addresses[] array
FieldTypeDescription

address

string

An IP address and network prefix that you specify. For example, if you specify 10.10.21.10/24, then the additional network is assigned an IP address of 10.10.21.10 and the netmask is 255.255.255.0.

gateway

string

The default gateway to route egress network traffic to.

Expand
Table 21.5. ipam.routes[] array
FieldTypeDescription

dst

string

The IP address range in CIDR format, such as 192.168.17.0/24 or 0.0.0.0/0 for the default route.

gw

string

The gateway where network traffic is routed.

Expand
Table 21.6. ipam.dns object
FieldTypeDescription

nameservers

array

An array of one or more IP addresses for to send DNS queries to.

domain

array

The default domain to append to a hostname. For example, if the domain is set to example.com, a DNS lookup query for example-host is rewritten as example-host.example.com.

search

array

An array of domain names to append to an unqualified hostname, such as example-host, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}
Copy to Clipboard Toggle word wrap

The following JSON describes the configuration for dynamic IP address address assignment with DHCP.

Renewal of DHCP leases

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

The SR-IOV Network Operator does not create a DHCP server deployment; The Cluster Network Operator is responsible for creating the minimal DHCP server deployment.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...
Copy to Clipboard Toggle word wrap

Expand
Table 21.7. ipam DHCP configuration object
FieldTypeDescription

type

string

The IPAM address type. The value dhcp is required.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}
Copy to Clipboard Toggle word wrap

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The following table describes the configuration for dynamic IP address assignment with Whereabouts:

Expand
Table 21.8. ipam whereabouts configuration object
FieldTypeDescription

type

string

The IPAM address type. The value whereabouts is required.

range

string

An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.

exclude

array

Optional: A list of zero or more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.

Dynamic IP address assignment configuration example that uses Whereabouts

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}
Copy to Clipboard Toggle word wrap

The Whereabouts reconciler is responsible for managing dynamic IP address assignments for the pods within a cluster using the Whereabouts IP Address Management (IPAM) solution. It ensures that each pods gets a unique IP address from the specified IP address range. It also handles IP address releases when pods are deleted or scaled down.

Note

You can also use a NetworkAttachmentDefinition custom resource for dynamic IP address assignment.

The Whereabouts reconciler daemon set is automatically created when you configure an additional network through the Cluster Network Operator. It is not automatically created when you configure an additional network from a YAML manifest.

To trigger the deployment of the Whereabouts reconciler daemonset, you must manually create a whereabouts-shim network attachment by editing the Cluster Network Operator custom resource file.

Use the following procedure to deploy the Whereabouts reconciler daemonset.

Procedure

  1. Edit the Network.operator.openshift.io custom resource (CR) by running the following command:

    $ oc edit network.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  2. Modify the additionalNetworks parameter in the CR to add the whereabouts-shim network attachment definition. For example:

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      additionalNetworks:
      - name: whereabouts-shim
        namespace: default
        rawCNIConfig: |-
          {
           "name": "whereabouts-shim",
           "cniVersion": "0.3.1",
           "type": "bridge",
           "ipam": {
             "type": "whereabouts"
           }
          }
        type: Raw
    Copy to Clipboard Toggle word wrap
  3. Save the file and exit the text editor.
  4. Verify that the whereabouts-reconciler daemon set deployed successfully by running the following command:

    $ oc get all -n openshift-multus | grep whereabouts-reconciler
    Copy to Clipboard Toggle word wrap

    Example output

    pod/whereabouts-reconciler-jnp6g 1/1 Running 0 6s
    pod/whereabouts-reconciler-k76gg 1/1 Running 0 6s
    pod/whereabouts-reconciler-k86t9 1/1 Running 0 6s
    pod/whereabouts-reconciler-p4sxw 1/1 Running 0 6s
    pod/whereabouts-reconciler-rvfdv 1/1 Running 0 6s
    pod/whereabouts-reconciler-svzw9 1/1 Running 0 6s
    daemonset.apps/whereabouts-reconciler 6 6 6 6 6 kubernetes.io/os=linux 6s
    Copy to Clipboard Toggle word wrap

21.5.2. Configuring SR-IOV additional network

You can configure an additional network that uses SR-IOV hardware by creating an SriovNetwork object. When you create an SriovNetwork object, the SR-IOV Network Operator automatically creates a NetworkAttachmentDefinition object.

Note

Do not modify or delete an SriovNetwork object if it is attached to any pods in a running state.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a SriovNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: attach1
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: net1
      networkNamespace: project2
      ipam: |-
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "gateway": "10.56.217.1"
        }
    Copy to Clipboard Toggle word wrap
  2. To create the object, enter the following command:

    $ oc create -f <name>.yaml
    Copy to Clipboard Toggle word wrap

    where <name> specifies the name of the additional network.

  3. Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovNetwork object.

    $ oc get net-attach-def -n <namespace>
    Copy to Clipboard Toggle word wrap

21.5.3. Next steps

You can configure an InfiniBand (IB) network attachment for an Single Root I/O Virtualization (SR-IOV) device in the cluster.

21.6.1. InfiniBand device configuration object

You can configure an InfiniBand (IB) network device by defining an SriovIBNetwork object.

The following YAML describes an SriovIBNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: <name> 
1

  namespace: openshift-sriov-network-operator 
2

spec:
  resourceName: <sriov_resource_name> 
3

  networkNamespace: <target_namespace> 
4

  ipam: |- 
5

    {}
  linkState: <link_state> 
6

  capabilities: <capabilities> 
7
Copy to Clipboard Toggle word wrap
1
A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2
The namespace where the SR-IOV Operator is installed.
3
The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
The target namespace for the SriovIBNetwork object. Only pods in the target namespace can attach to the network device.
5
Optional: A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
6
Optional: The link state of virtual function (VF). Allowed values are enable, disable and auto.
7
Optional: The capabilities to configure for this network. You can specify "{ "ips": true }" to enable IP address support or "{ "infinibandGUID": true }" to enable IB Global Unique Identifier (GUID) support.

The IP address management (IPAM) Container Network Interface (CNI) plugin provides IP addresses for other CNI plugins.

You can use the following IP address assignment types:

  • Static assignment.
  • Dynamic assignment through a DHCP server. The DHCP server you specify must be reachable from the additional network.
  • Dynamic assignment through the Whereabouts IPAM CNI plugin.

The following table describes the configuration for static IP address assignment:

Expand
Table 21.9. ipam static configuration object
FieldTypeDescription

type

string

The IPAM address type. The value static is required.

addresses

array

An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.

routes

array

An array of objects specifying routes to configure inside the pod.

dns

array

Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Expand
Table 21.10. ipam.addresses[] array
FieldTypeDescription

address

string

An IP address and network prefix that you specify. For example, if you specify 10.10.21.10/24, then the additional network is assigned an IP address of 10.10.21.10 and the netmask is 255.255.255.0.

gateway

string

The default gateway to route egress network traffic to.

Expand
Table 21.11. ipam.routes[] array
FieldTypeDescription

dst

string

The IP address range in CIDR format, such as 192.168.17.0/24 or 0.0.0.0/0 for the default route.

gw

string

The gateway where network traffic is routed.

Expand
Table 21.12. ipam.dns object
FieldTypeDescription

nameservers

array

An array of one or more IP addresses for to send DNS queries to.

domain

array

The default domain to append to a hostname. For example, if the domain is set to example.com, a DNS lookup query for example-host is rewritten as example-host.example.com.

search

array

An array of domain names to append to an unqualified hostname, such as example-host, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}
Copy to Clipboard Toggle word wrap

The following JSON describes the configuration for dynamic IP address address assignment with DHCP.

Renewal of DHCP leases

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...
Copy to Clipboard Toggle word wrap

Expand
Table 21.13. ipam DHCP configuration object
FieldTypeDescription

type

string

The IPAM address type. The value dhcp is required.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}
Copy to Clipboard Toggle word wrap

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The following table describes the configuration for dynamic IP address assignment with Whereabouts:

Expand
Table 21.14. ipam whereabouts configuration object
FieldTypeDescription

type

string

The IPAM address type. The value whereabouts is required.

range

string

An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.

exclude

array

Optional: A list of zero or more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.

Dynamic IP address assignment configuration example that uses Whereabouts

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}
Copy to Clipboard Toggle word wrap

The Whereabouts reconciler is responsible for managing dynamic IP address assignments for the pods within a cluster using the Whereabouts IP Address Management (IPAM) solution. It ensures that each pods gets a unique IP address from the specified IP address range. It also handles IP address releases when pods are deleted or scaled down.

Note

You can also use a NetworkAttachmentDefinition custom resource for dynamic IP address assignment.

The Whereabouts reconciler daemon set is automatically created when you configure an additional network through the Cluster Network Operator. It is not automatically created when you configure an additional network from a YAML manifest.

To trigger the deployment of the Whereabouts reconciler daemonset, you must manually create a whereabouts-shim network attachment by editing the Cluster Network Operator custom resource file.

Use the following procedure to deploy the Whereabouts reconciler daemonset.

Procedure

  1. Edit the Network.operator.openshift.io custom resource (CR) by running the following command:

    $ oc edit network.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  2. Modify the additionalNetworks parameter in the CR to add the whereabouts-shim network attachment definition. For example:

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      additionalNetworks:
      - name: whereabouts-shim
        namespace: default
        rawCNIConfig: |-
          {
           "name": "whereabouts-shim",
           "cniVersion": "0.3.1",
           "type": "bridge",
           "ipam": {
             "type": "whereabouts"
           }
          }
        type: Raw
    Copy to Clipboard Toggle word wrap
  3. Save the file and exit the text editor.
  4. Verify that the whereabouts-reconciler daemon set deployed successfully by running the following command:

    $ oc get all -n openshift-multus | grep whereabouts-reconciler
    Copy to Clipboard Toggle word wrap

    Example output

    pod/whereabouts-reconciler-jnp6g 1/1 Running 0 6s
    pod/whereabouts-reconciler-k76gg 1/1 Running 0 6s
    pod/whereabouts-reconciler-k86t9 1/1 Running 0 6s
    pod/whereabouts-reconciler-p4sxw 1/1 Running 0 6s
    pod/whereabouts-reconciler-rvfdv 1/1 Running 0 6s
    pod/whereabouts-reconciler-svzw9 1/1 Running 0 6s
    daemonset.apps/whereabouts-reconciler 6 6 6 6 6 kubernetes.io/os=linux 6s
    Copy to Clipboard Toggle word wrap

21.6.2. Configuring SR-IOV additional network

You can configure an additional network that uses SR-IOV hardware by creating an SriovIBNetwork object. When you create an SriovIBNetwork object, the SR-IOV Network Operator automatically creates a NetworkAttachmentDefinition object.

Note

Do not modify or delete an SriovIBNetwork object if it is attached to any pods in a running state.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a SriovIBNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: attach1
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: net1
      networkNamespace: project2
      ipam: |-
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "gateway": "10.56.217.1"
        }
    Copy to Clipboard Toggle word wrap
  2. To create the object, enter the following command:

    $ oc create -f <name>.yaml
    Copy to Clipboard Toggle word wrap

    where <name> specifies the name of the additional network.

  3. Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovIBNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovIBNetwork object.

    $ oc get net-attach-def -n <namespace>
    Copy to Clipboard Toggle word wrap

21.6.3. Next steps

21.7. Adding a pod to an SR-IOV additional network

You can add a pod to an existing Single Root I/O Virtualization (SR-IOV) network.

When attaching a pod to an additional network, you can specify a runtime configuration to make specific customizations for the pod. For example, you can request a specific MAC hardware address.

You specify the runtime configuration by setting an annotation in the pod specification. The annotation key is k8s.v1.cni.cncf.io/networks, and it accepts a JSON object that describes the runtime configuration.

The following JSON describes the runtime configuration options for an Ethernet-based SR-IOV network attachment.

[
  {
    "name": "<name>", 
1

    "mac": "<mac_address>", 
2

    "ips": ["<cidr_range>"] 
3

  }
]
Copy to Clipboard Toggle word wrap
1
The name of the SR-IOV network attachment definition CR.
2
Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
3
Optional: IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.

Example runtime configuration

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "net1",
          "mac": "20:04:0f:f1:88:01",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]
Copy to Clipboard Toggle word wrap

The following JSON describes the runtime configuration options for an InfiniBand-based SR-IOV network attachment.

[
  {
    "name": "<network_attachment>", 
1

    "infiniband-guid": "<guid>", 
2

    "ips": ["<cidr_range>"] 
3

  }
]
Copy to Clipboard Toggle word wrap
1
The name of the SR-IOV network attachment definition CR.
2
The InfiniBand GUID for the SR-IOV device. To use this feature, you also must specify { "infinibandGUID": true } in the SriovIBNetwork object.
3
The IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovIBNetwork object.

Example runtime configuration

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "ib1",
          "infiniband-guid": "c2:11:22:33:44:55:66:77",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]
Copy to Clipboard Toggle word wrap

21.7.2. Adding a pod to an additional network

You can add a pod to an additional network. The pod continues to send normal cluster-related network traffic over the default network.

When a pod is created additional networks are attached to it. However, if a pod already exists, you cannot attach additional networks to it.

The pod must be in the same namespace as the additional network.

Note

The SR-IOV Network Resource Injector adds the resource field to the first container in a pod automatically.

If you are using an Intel network interface controller (NIC) in Data Plane Development Kit (DPDK) mode, only the first container in your pod is configured to access the NIC. Your SR-IOV additional network is configured for DPDK mode if the deviceType is set to vfio-pci in the SriovNetworkNodePolicy object.

You can work around this issue by either ensuring that the container that needs access to the NIC is the first container defined in the Pod object or by disabling the Network Resource Injector. For more information, see BZ#1990953.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster.
  • Install the SR-IOV Operator.
  • Create either an SriovNetwork object or an SriovIBNetwork object to attach the pod to.

Procedure

  1. Add an annotation to the Pod object. Only one of the following annotation formats can be used:

    1. To attach an additional network without any customization, add an annotation with the following format. Replace <network> with the name of the additional network to associate with the pod:

      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: <network>[,<network>,...] 
      1
      Copy to Clipboard Toggle word wrap
      1
      To specify more than one additional network, separate each network with a comma. Do not include whitespace between the comma. If you specify the same additional network multiple times, that pod will have multiple network interfaces attached to that network.
    2. To attach an additional network with customizations, add an annotation with the following format:

      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: |-
            [
              {
                "name": "<network>", 
      1
      
                "namespace": "<namespace>", 
      2
      
                "default-route": ["<default-route>"] 
      3
      
              }
            ]
      Copy to Clipboard Toggle word wrap
      1
      Specify the name of the additional network defined by a NetworkAttachmentDefinition object.
      2
      Specify the namespace where the NetworkAttachmentDefinition object is defined.
      3
      Optional: Specify an override for the default route, such as 192.168.17.1.
  2. To create the pod, enter the following command. Replace <name> with the name of the pod.

    $ oc create -f <name>.yaml
    Copy to Clipboard Toggle word wrap
  3. Optional: To Confirm that the annotation exists in the Pod CR, enter the following command, replacing <name> with the name of the pod.

    $ oc get pod <name> -o yaml
    Copy to Clipboard Toggle word wrap

    In the following example, the example-pod pod is attached to the net1 additional network:

    $ oc get pod example-pod -o yaml
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: macvlan-bridge
        k8s.v1.cni.cncf.io/networks-status: |- 
    1
    
          [{
              "name": "openshift-sdn",
              "interface": "eth0",
              "ips": [
                  "10.128.2.14"
              ],
              "default": true,
              "dns": {}
          },{
              "name": "macvlan-bridge",
              "interface": "net1",
              "ips": [
                  "20.2.2.100"
              ],
              "mac": "22:2f:60:a5:f8:00",
              "dns": {}
          }]
      name: example-pod
      namespace: default
    spec:
      ...
    status:
      ...
    Copy to Clipboard Toggle word wrap
    1
    The k8s.v1.cni.cncf.io/networks-status parameter is a JSON array of objects. Each object describes the status of an additional network attached to the pod. The annotation value is stored as a plain text value.

You can create a NUMA aligned SR-IOV pod by restricting SR-IOV and the CPU resources allocated from the same NUMA node with restricted or single-numa-node Topology Manager polices.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have configured the CPU Manager policy to static. For more information on CPU Manager, see the "Additional resources" section.
  • You have configured the Topology Manager policy to single-numa-node.

    Note

    When single-numa-node is unable to satisfy the request, you can configure the Topology Manager policy to restricted.

Procedure

  1. Create the following SR-IOV pod spec, and then save the YAML in the <name>-sriov-pod.yaml file. Replace <name> with a name for this pod.

    The following example shows an SR-IOV pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sample-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: <name> 
    1
    
    spec:
      containers:
      - name: sample-container
        image: <image> 
    2
    
        command: ["sleep", "infinity"]
        resources:
          limits:
            memory: "1Gi" 
    3
    
            cpu: "2" 
    4
    
          requests:
            memory: "1Gi"
            cpu: "2"
    Copy to Clipboard Toggle word wrap
    1
    Replace <name> with the name of the SR-IOV network attachment definition CR.
    2
    Replace <image> with the name of the sample-pod image.
    3
    To create the SR-IOV pod with guaranteed QoS, set memory limits equal to memory requests.
    4
    To create the SR-IOV pod with guaranteed QoS, set cpu limits equals to cpu requests.
  2. Create the sample SR-IOV pod by running the following command:

    $ oc create -f <filename> 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <filename> with the name of the file you created in the previous step.
  3. Confirm that the sample-pod is configured with guaranteed QoS.

    $ oc describe pod sample-pod
    Copy to Clipboard Toggle word wrap
  4. Confirm that the sample-pod is allocated with exclusive CPUs.

    $ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
    Copy to Clipboard Toggle word wrap
  5. Confirm that the SR-IOV device and CPUs that are allocated for the sample-pod are on the same NUMA node.

    $ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
    Copy to Clipboard Toggle word wrap

The following testpmd pod demonstrates container creation with huge pages, reserved CPUs, and the SR-IOV port.

An example testpmd pod

apiVersion: v1
kind: Pod
metadata:
  name: testpmd-sriov
  namespace: mynamespace
  annotations:
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
# ...
spec:
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/sriov1: 1
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
        openshift.io/sriov1: 1
    volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
        readOnly: False
  runtimeClassName: performance-cnf-performanceprofile 
1

  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
Copy to Clipboard Toggle word wrap

1
This example assumes that the name of the performance profile is cnf-performance profile.

As a cluster administrator, you can modify interface-level network sysctls using the tuning Container Network Interface (CNI) meta plugin for a pod connected to a SR-IOV network device.

21.8.1. Labeling nodes with an SR-IOV enabled NIC

If you want to enable SR-IOV on only SR-IOV capable nodes there are a couple of ways to do this:

  1. Install the Node Feature Discovery (NFD) Operator. NFD detects the presence of SR-IOV enabled NICs and labels the nodes with node.alpha.kubernetes-incubator.io/nfd-network-sriov.capable = true.
  2. Examine the SriovNetworkNodeState CR for each node. The interfaces stanza includes a list of all of the SR-IOV devices discovered by the SR-IOV Network Operator on the worker node. Label each node with feature.node.kubernetes.io/network-sriov.capable: "true" by using the following command:

    $ oc label node <node_name> feature.node.kubernetes.io/network-sriov.capable="true"
    Copy to Clipboard Toggle word wrap
    Note

    You can label the nodes with whatever name you want.

21.8.2. Setting one sysctl flag

You can set interface-level network sysctl settings for a pod connected to a SR-IOV network device.

In this example, net.ipv4.conf.IFNAME.accept_redirects is set to 1 on the created virtual interfaces.

The sysctl-tuning-test is a namespace used in this example.

  • Use the following command to create the sysctl-tuning-test namespace:

    $ oc create namespace sysctl-tuning-test
    Copy to Clipboard Toggle word wrap

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io custom resource definition (CRD) to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain and reboot the nodes.

It can take several minutes for a configuration change to apply.

Follow this procedure to create a SriovNetworkNodePolicy custom resource (CR).

Procedure

  1. Create an SriovNetworkNodePolicy custom resource (CR). For example, save the following YAML as the file policyoneflag-sriov-node-network.yaml:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policyoneflag 
    1
    
      namespace: openshift-sriov-network-operator 
    2
    
    spec:
      resourceName: policyoneflag 
    3
    
      nodeSelector: 
    4
    
        feature.node.kubernetes.io/network-sriov.capable="true"
      priority: 10 
    5
    
      numVfs: 5 
    6
    
      nicSelector: 
    7
    
        pfNames: ["ens5"] 
    8
    
      deviceType: "netdevice" 
    9
    
      isRdma: false 
    10
    Copy to Clipboard Toggle word wrap
    1
    The name for the custom resource object.
    2
    The namespace where the SR-IOV Network Operator is installed.
    3
    The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.
    4
    The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.
    5
    Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
    6
    The number of the virtual functions (VFs) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128.
    7
    The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally. If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.
    8
    Optional: An array of one or more physical function (PF) names for the device.
    9
    Optional: The driver type for the virtual functions. The only allowed value is netdevice. For a Mellanox NIC to work in DPDK mode on bare metal nodes, set isRdma to true.
    10
    Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false. If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode. Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.
    Note

    The vfio-pci driver type is not supported.

  2. Create the SriovNetworkNodePolicy object:

    $ oc create -f policyoneflag-sriov-node-network.yaml
    Copy to Clipboard Toggle word wrap

    After applying the configuration update, all the pods in sriov-network-operator namespace change to the Running status.

  3. To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.

    $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
    Copy to Clipboard Toggle word wrap

    Example output

    Succeeded
    Copy to Clipboard Toggle word wrap

21.8.2.2. Configuring sysctl on a SR-IOV network

You can set interface specific sysctl settings on virtual interfaces created by SR-IOV by adding the tuning configuration to the optional metaPlugins parameter of the SriovNetwork resource.

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To change the interface-level network net.ipv4.conf.IFNAME.accept_redirects sysctl settings, create an additional SR-IOV network with the Container Network Interface (CNI) tuning plugin.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

  1. Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-interface-sysctl.yaml.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: onevalidflag 
    1
    
      namespace: openshift-sriov-network-operator 
    2
    
    spec:
      resourceName: policyoneflag 
    3
    
      networkNamespace: sysctl-tuning-test 
    4
    
      ipam: '{ "type": "static" }' 
    5
    
      capabilities: '{ "mac": true, "ips": true }' 
    6
    
      metaPlugins : | 
    7
    
        {
          "type": "tuning",
          "capabilities":{
            "mac":true
          },
          "sysctl":{
             "net.ipv4.conf.IFNAME.accept_redirects": "1"
          }
        }
    Copy to Clipboard Toggle word wrap
    1
    A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
    2
    The namespace where the SR-IOV Network Operator is installed.
    3
    The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
    4
    The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
    5
    A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    6
    Optional: Set capabilities for the additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.
    7
    Optional: The metaPlugins parameter is used to add additional capabilities to the device. In this use case set the type field to tuning. Specify the interface-level network sysctl you want to set in the sysctl field.
  2. Create the SriovNetwork resource:

    $ oc create -f sriov-network-interface-sysctl.yaml
    Copy to Clipboard Toggle word wrap

Verifying that the NetworkAttachmentDefinition CR is successfully created

  • Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:

    $ oc get network-attachment-definitions -n <namespace> 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <namespace> with the value for networkNamespace that you specified in the SriovNetwork object. For example, sysctl-tuning-test.

    Example output

    NAME                                  AGE
    onevalidflag                          14m
    Copy to Clipboard Toggle word wrap

    Note

    There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network attachment is successful

To verify that the tuning CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

  1. Create a Pod CR. Save the following YAML as the file examplepod.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tunepod
      namespace: sysctl-tuning-test
      annotations:
        k8s.v1.cni.cncf.io/networks: |-
          [
            {
              "name": "onevalidflag",  
    1
    
              "mac": "0a:56:0a:83:04:0c", 
    2
    
              "ips": ["10.100.100.200/24"] 
    3
    
           }
          ]
    spec:
      containers:
      - name: podexample
        image: centos
        command: ["/bin/bash", "-c", "sleep INF"]
        securityContext:
          runAsUser: 2000
          runAsGroup: 3000
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
    Copy to Clipboard Toggle word wrap
    1
    The name of the SR-IOV network attachment definition CR.
    2
    Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
    3
    Optional: IP addresses for the SR-IOV device that are allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.
  2. Create the Pod CR:

    $ oc apply -f examplepod.yaml
    Copy to Clipboard Toggle word wrap
  3. Verify that the pod is created by running the following command:

    $ oc get pod -n sysctl-tuning-test
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      READY   STATUS    RESTARTS   AGE
    tunepod   1/1     Running   0          47s
    Copy to Clipboard Toggle word wrap

  4. Log in to the pod by running the following command:

    $ oc rsh -n sysctl-tuning-test tunepod
    Copy to Clipboard Toggle word wrap
  5. Verify the values of the configured sysctl flag. Find the value net.ipv4.conf.IFNAME.accept_redirects by running the following command::

    $ sysctl net.ipv4.conf.net1.accept_redirects
    Copy to Clipboard Toggle word wrap

    Example output

    net.ipv4.conf.net1.accept_redirects = 1
    Copy to Clipboard Toggle word wrap

You can set interface-level network sysctl settings for a pod connected to a bonded SR-IOV network device.

In this example, the specific network interface-level sysctl settings that can be configured are set on the bonded interface.

The sysctl-tuning-test is a namespace used in this example.

  • Use the following command to create the sysctl-tuning-test namespace:

    $ oc create namespace sysctl-tuning-test
    Copy to Clipboard Toggle word wrap

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io custom resource definition (CRD) to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes.

It might take several minutes for a configuration change to apply.

Follow this procedure to create a SriovNetworkNodePolicy custom resource (CR).

Procedure

  1. Create an SriovNetworkNodePolicy custom resource (CR). Save the following YAML as the file policyallflags-sriov-node-network.yaml. Replace policyallflags with the name for the configuration.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policyallflags 
    1
    
      namespace: openshift-sriov-network-operator 
    2
    
    spec:
      resourceName: policyallflags 
    3
    
      nodeSelector: 
    4
    
        node.alpha.kubernetes-incubator.io/nfd-network-sriov.capable = `true`
      priority: 10 
    5
    
      numVfs: 5 
    6
    
      nicSelector: 
    7
    
        pfNames: ["ens1f0"]  
    8
    
      deviceType: "netdevice" 
    9
    
      isRdma: false 
    10
    Copy to Clipboard Toggle word wrap
    1
    The name for the custom resource object.
    2
    The namespace where the SR-IOV Network Operator is installed.
    3
    The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.
    4
    The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.
    5
    Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
    6
    The number of virtual functions (VFs) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128.
    7
    The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally. If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.
    8
    Optional: An array of one or more physical function (PF) names for the device.
    9
    Optional: The driver type for the virtual functions. The only allowed value is netdevice. For a Mellanox NIC to work in DPDK mode on bare metal nodes, set isRdma to true.
    10
    Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false. If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode. Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.
    Note

    The vfio-pci driver type is not supported.

  2. Create the SriovNetworkNodePolicy object:

    $ oc create -f policyallflags-sriov-node-network.yaml
    Copy to Clipboard Toggle word wrap

    After applying the configuration update, all the pods in sriov-network-operator namespace change to the Running status.

  3. To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.

    $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
    Copy to Clipboard Toggle word wrap

    Example output

    Succeeded
    Copy to Clipboard Toggle word wrap

You can set interface specific sysctl settings on a bonded interface created from two SR-IOV interfaces. Do this by adding the tuning configuration to the optional Plugins parameter of the bond network attachment definition.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To change specific interface-level network sysctl settings create the SriovNetwork custom resource (CR) with the Container Network Interface (CNI) tuning plugin by using the following procedure.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

  1. Create the SriovNetwork custom resource (CR) for the bonded interface as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: allvalidflags 
    1
    
      namespace: openshift-sriov-network-operator 
    2
    
    spec:
      resourceName: policyallflags 
    3
    
      networkNamespace: sysctl-tuning-test 
    4
    
      capabilities: '{ "mac": true, "ips": true }' 
    5
    Copy to Clipboard Toggle word wrap
    1
    A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
    2
    The namespace where the SR-IOV Network Operator is installed.
    3
    The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
    4
    The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
    5
    Optional: The capabilities to configure for this additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.
  2. Create the SriovNetwork resource:

    $ oc create -f sriov-network-attachment.yaml
    Copy to Clipboard Toggle word wrap
  3. Create a bond network attachment definition as in the following example CR. Save the YAML as the file sriov-bond-network-interface.yaml.

    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: bond-sysctl-network
      namespace: sysctl-tuning-test
    spec:
      config: '{
      "cniVersion":"0.4.0",
      "name":"bound-net",
      "plugins":[
        {
          "type":"bond", 
    1
    
          "mode": "active-backup", 
    2
    
          "failOverMac": 1, 
    3
    
          "linksInContainer": true, 
    4
    
          "miimon": "100",
          "links": [ 
    5
    
            {"name": "net1"},
            {"name": "net2"}
          ],
          "ipam":{ 
    6
    
            "type":"static"
          }
        },
        {
          "type":"tuning", 
    7
    
          "capabilities":{
            "mac":true
          },
          "sysctl":{
            "net.ipv4.conf.IFNAME.accept_redirects": "0",
            "net.ipv4.conf.IFNAME.accept_source_route": "0",
            "net.ipv4.conf.IFNAME.disable_policy": "1",
            "net.ipv4.conf.IFNAME.secure_redirects": "0",
            "net.ipv4.conf.IFNAME.send_redirects": "0",
            "net.ipv6.conf.IFNAME.accept_redirects": "0",
            "net.ipv6.conf.IFNAME.accept_source_route": "1",
            "net.ipv6.neigh.IFNAME.base_reachable_time_ms": "20000",
            "net.ipv6.neigh.IFNAME.retrans_time_ms": "2000"
          }
        }
      ]
    }'
    Copy to Clipboard Toggle word wrap
    1
    The type is bond.
    2
    The mode attribute specifies the bonding mode. The bonding modes supported are:
    • balance-rr - 0
    • active-backup - 1
    • balance-xor - 2

      For balance-rr or balance-xor modes, you must set the trust mode to on for the SR-IOV virtual function.

    3
    The failover attribute is mandatory for active-backup mode.
    4
    The linksInContainer=true flag informs the Bond CNI that the required interfaces are to be found inside the container. By default, Bond CNI looks for these interfaces on the host which does not work for integration with SRIOV and Multus.
    5
    The links section defines which interfaces will be used to create the bond. By default, Multus names the attached interfaces as: "net", plus a consecutive number, starting with one.
    6
    A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition. In this pod example IP addresses are configured manually, so in this case,ipam is set to static.
    7
    Add additional capabilities to the device. For example, set the type field to tuning. Specify the interface-level network sysctl you want to set in the sysctl field. This example sets all interface-level network sysctl settings that can be set.
  4. Create the bond network attachment resource:

    $ oc create -f sriov-bond-network-interface.yaml
    Copy to Clipboard Toggle word wrap

Verifying that the NetworkAttachmentDefinition CR is successfully created

  • Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:

    $ oc get network-attachment-definitions -n <namespace> 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <namespace> with the networkNamespace that you specified when configuring the network attachment, for example, sysctl-tuning-test.

    Example output

    NAME                          AGE
    bond-sysctl-network           22m
    allvalidflags                 47m
    Copy to Clipboard Toggle word wrap

    Note

    There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network resource is successful

To verify that the tuning CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

  1. Create a Pod CR. For example, save the following YAML as the file examplepod.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tunepod
      namespace: sysctl-tuning-test
      annotations:
        k8s.v1.cni.cncf.io/networks: |-
          [
            {"name": "allvalidflags"}, 
    1
    
            {"name": "allvalidflags"},
            {
              "name": "bond-sysctl-network",
              "interface": "bond0",
              "mac": "0a:56:0a:83:04:0c", 
    2
    
              "ips": ["10.100.100.200/24"] 
    3
    
           }
          ]
    spec:
      containers:
      - name: podexample
        image: centos
        command: ["/bin/bash", "-c", "sleep INF"]
        securityContext:
          runAsUser: 2000
          runAsGroup: 3000
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
    Copy to Clipboard Toggle word wrap
    1
    The name of the SR-IOV network attachment definition CR.
    2
    Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
    3
    Optional: IP addresses for the SR-IOV device that are allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.
  2. Apply the YAML:

    $ oc apply -f examplepod.yaml
    Copy to Clipboard Toggle word wrap
  3. Verify that the pod is created by running the following command:

    $ oc get pod -n sysctl-tuning-test
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      READY   STATUS    RESTARTS   AGE
    tunepod   1/1     Running   0          47s
    Copy to Clipboard Toggle word wrap

  4. Log in to the pod by running the following command:

    $ oc rsh -n sysctl-tuning-test tunepod
    Copy to Clipboard Toggle word wrap
  5. Verify the values of the configured sysctl flag. Find the value net.ipv6.neigh.IFNAME.base_reachable_time_ms by running the following command::

    $ sysctl net.ipv6.neigh.bond0.base_reachable_time_ms
    Copy to Clipboard Toggle word wrap

    Example output

    net.ipv6.neigh.bond0.base_reachable_time_ms = 20000
    Copy to Clipboard Toggle word wrap

21.9. Using high performance multicast

You can use multicast on your Single Root I/O Virtualization (SR-IOV) hardware network.

21.9.1. High performance multicast

The OpenShift SDN default Container Network Interface (CNI) network provider supports multicast between pods on the default network. This is best used for low-bandwidth coordination or service discovery, and not high-bandwidth applications. For applications such as streaming media, like Internet Protocol television (IPTV) and multipoint videoconferencing, you can utilize Single Root I/O Virtualization (SR-IOV) hardware to provide near-native performance.

When using additional SR-IOV interfaces for multicast:

  • Multicast packages must be sent or received by a pod through the additional SR-IOV interface.
  • The physical network which connects the SR-IOV interfaces decides the multicast routing and topology, which is not controlled by OpenShift Container Platform.

The follow procedure creates an example SR-IOV interface for multicast.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  1. Create a SriovNetworkNodePolicy object:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-example
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: example
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      numVfs: 4
      nicSelector:
        vendor: "8086"
        pfNames: ['ens803f0']
        rootDevices: ['0000:86:00.0']
    Copy to Clipboard Toggle word wrap
  2. Create a SriovNetwork object:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: net-example
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: default
      ipam: | 
    1
    
        {
          "type": "host-local", 
    2
    
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "routes": [
            {"dst": "224.0.0.0/5"},
            {"dst": "232.0.0.0/5"}
          ],
          "gateway": "10.56.217.1"
        }
      resourceName: example
    Copy to Clipboard Toggle word wrap
    1 2
    If you choose to configure DHCP as IPAM, ensure that you provision the following default routes through your DHCP server: 224.0.0.0/5 and 232.0.0.0/5. This is to override the static multicast route set by the default network provider.
  3. Create a pod with multicast application:

    apiVersion: v1
    kind: Pod
    metadata:
      name: testpmd
      namespace: default
      annotations:
        k8s.v1.cni.cncf.io/networks: nic1
    spec:
      containers:
      - name: example
        image: rhel7:latest
        securityContext:
          capabilities:
            add: ["NET_ADMIN"] 
    1
    
        command: [ "sleep", "infinity"]
    Copy to Clipboard Toggle word wrap
    1
    The NET_ADMIN capability is required only if your application needs to assign the multicast IP address to the SR-IOV interface. Otherwise, it can be omitted.

21.10. Using DPDK and RDMA

The containerized Data Plane Development Kit (DPDK) application is supported on OpenShift Container Platform. You can use Single Root I/O Virtualization (SR-IOV) network hardware with the Data Plane Development Kit (DPDK) and with remote direct memory access (RDMA).

For information on supported devices, refer to Supported devices.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Install the SR-IOV Network Operator.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the intel-dpdk-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: intel-dpdk-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: intelnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "8086"
        deviceID: "158b"
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: vfio-pci 
    1
    Copy to Clipboard Toggle word wrap
    1
    Specify the driver type for the virtual functions to vfio-pci.
    Note

    See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f intel-dpdk-node-policy.yaml
    Copy to Clipboard Toggle word wrap
  3. Create the following SriovNetwork object, and then save the YAML in the intel-dpdk-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: intel-dpdk-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |-
    # ... 
    1
    
      vlan: <vlan>
      resourceName: intelnics
    Copy to Clipboard Toggle word wrap
    1
    Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    Note

    See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetwork object by running the following command:

    $ oc create -f intel-dpdk-network.yaml
    Copy to Clipboard Toggle word wrap
  5. Create the following Pod spec, and then save the YAML in the intel-dpdk-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-app
      namespace: <target_namespace> 
    1
    
      annotations:
        k8s.v1.cni.cncf.io/networks: intel-dpdk-network
    spec:
      containers:
      - name: testpmd
        image: <DPDK_image> 
    2
    
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 
    3
    
        volumeMounts:
        - mountPath: /mnt/huge 
    4
    
          name: hugepage
        resources:
          limits:
            openshift.io/intelnics: "1" 
    5
    
            memory: "1Gi"
            cpu: "4" 
    6
    
            hugepages-1Gi: "4Gi" 
    7
    
          requests:
            openshift.io/intelnics: "1"
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
    Copy to Clipboard Toggle word wrap
    1
    Specify the same target_namespace where the SriovNetwork object intel-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
    2
    Specify the DPDK image which includes your application and the DPDK library used by application.
    3
    Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    4
    Mount a hugepage volume to the DPDK pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
    5
    Optional: Specify the number of DPDK devices allocated to DPDK pod. This resource request and limit, if not explicitly specified, will be automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting enableInjector option to false in the default SriovOperatorConfig CR.
    6
    Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and creating a pod with Guaranteed QoS.
    7
    Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes. For example, adding kernel arguments default_hugepagesz=1GB, hugepagesz=1G and hugepages=16 will result in 16*1Gi hugepages be allocated during system boot.
  6. Create the DPDK pod by running the following command:

    $ oc create -f intel-dpdk-pod.yaml
    Copy to Clipboard Toggle word wrap

You can create a network node policy and create a Data Plane Development Kit (DPDK) pod using a virtual function in DPDK mode with a Mellanox NIC.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have installed the Single Root I/O Virtualization (SR-IOV) Network Operator.
  • You have logged in as a user with cluster-admin privileges.

Procedure

  1. Save the following SriovNetworkNodePolicy YAML configuration to an mlx-dpdk-node-policy.yaml file:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlx-dpdk-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: mlxnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "15b3"
        deviceID: "1015" 
    1
    
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: netdevice 
    2
    
      isRdma: true 
    3
    Copy to Clipboard Toggle word wrap
    1
    Specify the device hex code of the SR-IOV network device.
    2
    Specify the driver type for the virtual functions to netdevice. A Mellanox SR-IOV Virtual Function (VF) can work in DPDK mode without using the vfio-pci device type. The VF device appears as a kernel network interface inside a container.
    3
    Enable Remote Direct Memory Access (RDMA) mode. This is required for Mellanox cards to work in DPDK mode.
    Note

    See Configuring an SR-IOV network device for a detailed explanation of each option in the SriovNetworkNodePolicy object.

    When applying the configuration specified in an SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-dpdk-node-policy.yaml
    Copy to Clipboard Toggle word wrap
  3. Save the following SriovNetwork YAML configuration to an mlx-dpdk-network.yaml file:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlx-dpdk-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |- 
    1
    
    ...
      vlan: <vlan>
      resourceName: mlxnics
    Copy to Clipboard Toggle word wrap
    1
    Specify a configuration object for the IP Address Management (IPAM) Container Network Interface (CNI) plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    Note

    See Configuring an SR-IOV network device for a detailed explanation on each option in the SriovNetwork object.

    The app-netutil option library provides several API methods for gathering network information about the parent pod of a container.

  4. Create the SriovNetwork object by running the following command:

    $ oc create -f mlx-dpdk-network.yaml
    Copy to Clipboard Toggle word wrap
  5. Save the following Pod YAML configuration to an mlx-dpdk-pod.yaml file:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-app
      namespace: <target_namespace> 
    1
    
      annotations:
        k8s.v1.cni.cncf.io/networks: mlx-dpdk-network
    spec:
      containers:
      - name: testpmd
        image: <DPDK_image> 
    2
    
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 
    3
    
        volumeMounts:
        - mountPath: /mnt/huge 
    4
    
          name: hugepage
        resources:
          limits:
            openshift.io/mlxnics: "1" 
    5
    
            memory: "1Gi"
            cpu: "4" 
    6
    
            hugepages-1Gi: "4Gi" 
    7
    
          requests:
            openshift.io/mlxnics: "1"
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
    Copy to Clipboard Toggle word wrap
    1
    Specify the same target_namespace where SriovNetwork object mlx-dpdk-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and SriovNetwork object.
    2
    Specify the DPDK image which includes your application and the DPDK library used by the application.
    3
    Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    4
    Mount the hugepage volume to the DPDK pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
    5
    Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
    6
    Specify the number of CPUs. The DPDK pod usually requires that exclusive CPUs be allocated from the kubelet. To do this, set the CPU Manager policy to static and create a pod with Guaranteed Quality of Service (QoS).
    7
    Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to Nodes.
  6. Create the DPDK pod by running the following command:

    $ oc create -f mlx-dpdk-pod.yaml
    Copy to Clipboard Toggle word wrap

To achieve a specific Data Plane Development Kit (DPDK) line rate, deploy a Node Tuning Operator and configure Single Root I/O Virtualization (SR-IOV). You must also tune the DPDK settings for the following resources:

  • Isolated CPUs
  • Hugepages
  • The topology scheduler
Note

In previous versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

DPDK test environment

The following diagram shows the components of a traffic-testing environment:

  • Traffic generator: An application that can generate high-volume packet traffic.
  • SR-IOV-supporting NIC: A network interface card compatible with SR-IOV. The card runs a number of virtual functions on a physical interface.
  • Physical Function (PF): A PCI Express (PCIe) function of a network adapter that supports the SR-IOV interface.
  • Virtual Function (VF): A lightweight PCIe function on a network adapter that supports SR-IOV. The VF is associated with the PCIe PF on the network adapter. The VF represents a virtualized instance of the network adapter.
  • Switch: A network switch. Nodes can also be connected back-to-back.
  • testpmd: An example application included with DPDK. The testpmd application can be used to test the DPDK in a packet-forwarding mode. The testpmd application is also an example of how to build a fully-fledged application using the DPDK Software Development Kit (SDK).
  • worker 0 and worker 1: OpenShift Container Platform nodes.

You can use the Node Tuning Operator to configure isolated CPUs, hugepages, and a topology scheduler. You can then use the Node Tuning Operator with Single Root I/O Virtualization (SR-IOV) to achieve a specific Data Plane Development Kit (DPDK) line rate.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have installed the SR-IOV Network Operator.
  • You have logged in as a user with cluster-admin privileges.
  • You have deployed a standalone Node Tuning Operator.

    Note

    In previous versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

Procedure

  1. Create a PerformanceProfile object based on the following example:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: performance
    spec:
      globallyDisableIrqLoadBalancing: true
      cpu:
        isolated: 21-51,73-103 
    1
    
        reserved: 0-20,52-72 
    2
    
      hugepages:
        defaultHugepagesSize: 1G 
    3
    
        pages:
          - count: 32
            size: 1G
      net:
        userLevelNetworking: true
      numa:
        topologyPolicy: "single-numa-node"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
    Copy to Clipboard Toggle word wrap
    1
    If hyperthreading is enabled on the system, allocate the relevant symbolic links to the isolated and reserved CPU groups. If the system contains multiple non-uniform memory access nodes (NUMAs), allocate CPUs from both NUMAs to both groups. You can also use the Performance Profile Creator for this task. For more information, see Creating a performance profile.
    2
    You can also specify a list of devices that will have their queues set to the reserved CPU count. For more information, see Reducing NIC queues using the Node Tuning Operator.
    3
    Allocate the number and size of hugepages needed. You can specify the NUMA configuration for the hugepages. By default, the system allocates an even number to every NUMA node on the system. If needed, you can request the use of a realtime kernel for the nodes. See Provisioning a worker with real-time capabilities for more information.
  2. Save the yaml file as mlx-dpdk-perfprofile-policy.yaml.
  3. Apply the performance profile using the following command:

    $ oc create -f mlx-dpdk-perfprofile-policy.yaml
    Copy to Clipboard Toggle word wrap

You can use the Single Root I/O Virtualization (SR-IOV) Network Operator to allocate and configure Virtual Functions (VFs) from SR-IOV-supporting Physical Function NICs on the nodes.

For more information on deploying the Operator, see Installing the SR-IOV Network Operator. For more information on configuring an SR-IOV network device, see Configuring an SR-IOV network device.

There are some differences between running Data Plane Development Kit (DPDK) workloads on Intel VFs and Mellanox VFs. This section provides object configuration examples for both VF types. The following is an example of an sriovNetworkNodePolicy object used to run DPDK applications on Intel NICs:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci 
1

  needVhostNet: true 
2

  nicSelector:
    pfNames: ["ens3f0"]
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: dpdk_nic_1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  needVhostNet: true
  nicSelector:
    pfNames: ["ens3f1"]
  nodeSelector:
  node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: dpdk_nic_2
Copy to Clipboard Toggle word wrap
1
For Intel NICs, deviceType must be vfio-pci.
2
If kernel communication with DPDK workloads is required, add needVhostNet: true. This mounts the /dev/net/tun and /dev/vhost-net devices into the container so the application can create a tap device and connect the tap device to the DPDK workload.

The following is an example of an sriovNetworkNodePolicy object for Mellanox NICs:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice 
1

  isRdma: true 
2

  nicSelector:
    rootDevices:
      - "0000:5e:00.1"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  priority: 99
  resourceName: dpdk_nic_1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    rootDevices:
      - "0000:5e:00.0"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  priority: 99
  resourceName: dpdk_nic_2
Copy to Clipboard Toggle word wrap
1
For Mellanox devices the deviceType must be netdevice.
2
For Mellanox devices isRdma must be true. Mellanox cards are connected to DPDK applications using Flow Bifurcation. This mechanism splits traffic between Linux user space and kernel space, and can enhance line rate processing capability.
21.10.4.2. Example SR-IOV network operator

The following is an example definition of an sriovNetwork object. In this case, Intel and Mellanox configurations are identical:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dpdk-network-1
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.1.0/24"}]],"dataDir":
   "/run/my-orchestrator/container-ipam-state-1"}' 
1

  networkNamespace: dpdk-test 
2

  spoofChk: "off"
  trust: "on"
  resourceName: dpdk_nic_1 
3

---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dpdk-network-2
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.2.0/24"}]],"dataDir":
   "/run/my-orchestrator/container-ipam-state-1"}'
  networkNamespace: dpdk-test
  spoofChk: "off"
  trust: "on"
  resourceName: dpdk_nic_2
Copy to Clipboard Toggle word wrap
1
You can use a different IP Address Management (IPAM) implementation, such as Whereabouts. For more information, see Dynamic IP address assignment configuration with Whereabouts.
2
You must request the networkNamespace where the network attachment definition will be created. You must create the sriovNetwork CR under the openshift-sriov-network-operator namespace.
3
The resourceName value must match that of the resourceName created under the sriovNetworkNodePolicy.
21.10.4.3. Example DPDK base workload

The following is an example of a Data Plane Development Kit (DPDK) container:

apiVersion: v1
kind: Namespace
metadata:
  name: dpdk-test
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[ 
1

     {
      "name": "dpdk-network-1",
      "namespace": "dpdk-test"
     },
     {
      "name": "dpdk-network-2",
      "namespace": "dpdk-test"
     }
   ]'
    irq-load-balancing.crio.io: "disable" 
2

    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
  labels:
    app: dpdk
  name: testpmd
  namespace: dpdk-test
spec:
  runtimeClassName: performance-performance 
3

  containers:
    - command:
        - /bin/bash
        - -c
        - sleep INF
      image: registry.redhat.io/openshift4/dpdk-base-rhel8
      imagePullPolicy: Always
      name: dpdk
      resources: 
4

        limits:
          cpu: "16"
          hugepages-1Gi: 8Gi
          memory: 2Gi
        requests:
          cpu: "16"
          hugepages-1Gi: 8Gi
          memory: 2Gi
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
            - SYS_RESOURCE
            - NET_RAW
            - NET_ADMIN
        runAsUser: 0
      volumeMounts:
        - mountPath: /mnt/huge
          name: hugepages
  terminationGracePeriodSeconds: 5
  volumes:
    - emptyDir:
        medium: HugePages
      name: hugepages
Copy to Clipboard Toggle word wrap
1
Request the SR-IOV networks you need. Resources for the devices will be injected automatically.
2
Disable the CPU and IRQ load balancing base. See Disabling interrupt processing for individual pods for more information.
3
Set the runtimeClass to performance-performance. Do not set the runtimeClass to HostNetwork or privileged.
4
Request an equal number of resources for requests and limits to start the pod with Guaranteed Quality of Service (QoS).
Note

Do not start the pod with SLEEP and then exec into the pod to start the testpmd or the DPDK workload. This can add additional interrupts as the exec process is not pinned to any CPU.

21.10.4.4. Example testpmd script

The following is an example script for running testpmd:

#!/bin/bash
set -ex
export CPU=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
echo ${CPU}

dpdk-testpmd -l ${CPU} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_2} -n 4 -- -i --nb-cores=15 --rxd=4096 --txd=4096 --rxq=7 --txq=7 --forward-mode=mac --eth-peer=0,50:00:00:00:00:01 --eth-peer=1,50:00:00:00:00:02
Copy to Clipboard Toggle word wrap

This example uses two different sriovNetwork CRs. The environment variable contains the Virtual Function (VF) PCI address that was allocated for the pod. If you use the same network in the pod definition, you must split the pciAddress. It is important to configure the correct MAC addresses of the traffic generator. This example uses custom MAC addresses.

Important

RDMA over Converged Ethernet (RoCE) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

RDMA over Converged Ethernet (RoCE) is the only supported mode when using RDMA on OpenShift Container Platform.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Install the SR-IOV Network Operator.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-rdma-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlx-rdma-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: mlxnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "15b3"
        deviceID: "1015" 
    1
    
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: netdevice 
    2
    
      isRdma: true 
    3
    Copy to Clipboard Toggle word wrap
    1
    Specify the device hex code of the SR-IOV network device.
    2
    Specify the driver type for the virtual functions to netdevice.
    3
    Enable RDMA mode.
    Note

    See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-rdma-node-policy.yaml
    Copy to Clipboard Toggle word wrap
  3. Create the following SriovNetwork object, and then save the YAML in the mlx-rdma-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlx-rdma-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |- 
    1
    
    # ...
      vlan: <vlan>
      resourceName: mlxnics
    Copy to Clipboard Toggle word wrap
    1
    Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    Note

    See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-rdma-network.yaml
    Copy to Clipboard Toggle word wrap
  5. Create the following Pod spec, and then save the YAML in the mlx-rdma-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: rdma-app
      namespace: <target_namespace> 
    1
    
      annotations:
        k8s.v1.cni.cncf.io/networks: mlx-rdma-network
    spec:
      containers:
      - name: testpmd
        image: <RDMA_image> 
    2
    
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 
    3
    
        volumeMounts:
        - mountPath: /mnt/huge 
    4
    
          name: hugepage
        resources:
          limits:
            memory: "1Gi"
            cpu: "4" 
    5
    
            hugepages-1Gi: "4Gi" 
    6
    
          requests:
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
    Copy to Clipboard Toggle word wrap
    1
    Specify the same target_namespace where SriovNetwork object mlx-rdma-network is created. If you would like to create the pod in a different namespace, change target_namespace in both Pod spec and SriovNetwork object.
    2
    Specify the RDMA image which includes your application and RDMA library used by application.
    3
    Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    4
    Mount the hugepage volume to RDMA pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
    5
    Specify number of CPUs. The RDMA pod usually requires exclusive CPUs be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and create pod with Guaranteed QoS.
    6
    Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the RDMA pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes.
  6. Create the RDMA pod by running the following command:

    $ oc create -f mlx-rdma-pod.yaml
    Copy to Clipboard Toggle word wrap

The following testpmd pod demonstrates container creation with huge pages, reserved CPUs, and the SR-IOV port.

An example testpmd pod

apiVersion: v1
kind: Pod
metadata:
  name: testpmd-dpdk
  namespace: mynamespace
  annotations:
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
# ...
spec:
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk1: 1 
1

      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
        openshift.io/dpdk1: 1
    volumeMounts:
      - mountPath: /mnt/huge
        name: hugepage
        readOnly: False
  runtimeClassName: performance-cnf-performanceprofile 
2

  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
Copy to Clipboard Toggle word wrap

1
The name dpdk1 in this example is a user-created SriovNetworkNodePolicy resource. You can substitute this name for that of a resource that you create.
2
If your performance profile is not named cnf-performance profile, replace that string with the correct performance profile name.

The following testpmd pod demonstrates Open vSwitch (OVS) hardware offloading on Red Hat OpenStack Platform (RHOSP).

An example testpmd pod

apiVersion: v1
kind: Pod
metadata:
  name: testpmd-sriov
  namespace: mynamespace
  annotations:
    k8s.v1.cni.cncf.io/networks: hwoffload1
spec:
  runtimeClassName: performance-cnf-performanceprofile 
1

  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
    volumeMounts:
      - mountPath: /mnt/huge
        name: hugepage
        readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
Copy to Clipboard Toggle word wrap

1
If your performance profile is not named cnf-performance profile, replace that string with the correct performance profile name.

21.11. Using pod-level bonding

Bonding at the pod level is vital to enable workloads inside pods that require high availability and more throughput. With pod-level bonding, you can create a bond interface from multiple single root I/O virtualization (SR-IOV) virtual function interfaces in a kernel mode interface. The SR-IOV virtual functions are passed into the pod and attached to a kernel driver.

One scenario where pod level bonding is required is creating a bond interface from multiple SR-IOV virtual functions on different physical functions. Creating a bond interface from two different physical functions on the host can be used to achieve high availability and throughput at pod level.

For guidance on tasks such as creating a SR-IOV network, network policies, network attachment definitions and pods, see Configuring an SR-IOV network device.

Bonding enables multiple network interfaces to be aggregated into a single logical "bonded" interface. Bond Container Network Interface (Bond-CNI) brings bond capability into containers.

Bond-CNI can be created using Single Root I/O Virtualization (SR-IOV) virtual functions and placing them in the container network namespace.

OpenShift Container Platform only supports Bond-CNI using SR-IOV virtual functions. The SR-IOV Network Operator provides the SR-IOV CNI plugin needed to manage the virtual functions. Other CNIs or types of interfaces are not supported.

Prerequisites

  • The SR-IOV Network Operator must be installed and configured to obtain virtual functions in a container.
  • To configure SR-IOV interfaces, an SR-IOV network and policy must be created for each interface.
  • The SR-IOV Network Operator creates a network attachment definition for each SR-IOV interface, based on the SR-IOV network and policy defined.
  • The linkState is set to the default value auto for the SR-IOV virtual function.

Now that the SR-IOV virtual functions are available, you can create a bond network attachment definition.

apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: bond-net1
      namespace: demo
    spec:
      config: '{
      "type": "bond", 
1

      "cniVersion": "0.3.1",
      "name": "bond-net1",
      "mode": "active-backup", 
2

      "failOverMac": 1, 
3

      "linksInContainer": true, 
4

      "miimon": "100",
      "mtu": 1500,
      "links": [ 
5

            {"name": "net1"},
            {"name": "net2"}
        ],
      "ipam": {
            "type": "host-local",
            "subnet": "10.56.217.0/24",
            "routes": [{
            "dst": "0.0.0.0/0"
            }],
            "gateway": "10.56.217.1"
        }
      }'
Copy to Clipboard Toggle word wrap
1
The cni-type is always set to bond.
2
The mode attribute specifies the bonding mode.
Note

The bonding modes supported are:

  • balance-rr - 0
  • active-backup - 1
  • balance-xor - 2

For balance-rr or balance-xor modes, you must set the trust mode to on for the SR-IOV virtual function.

3
The failover attribute is mandatory for active-backup mode and must be set to 1.
4
The linksInContainer=true flag informs the Bond CNI that the required interfaces are to be found inside the container. By default, Bond CNI looks for these interfaces on the host which does not work for integration with SRIOV and Multus.
5
The links section defines which interfaces will be used to create the bond. By default, Multus names the attached interfaces as: "net", plus a consecutive number, starting with one.
21.11.1.2. Creating a pod using a bond interface
  1. Test the setup by creating a pod with a YAML file named for example podbonding.yaml with content similar to the following:

    apiVersion: v1
        kind: Pod
        metadata:
          name: bondpod1
          namespace: demo
          annotations:
            k8s.v1.cni.cncf.io/networks: demo/sriovnet1, demo/sriovnet2, demo/bond-net1 
    1
    
        spec:
          containers:
          - name: podexample
            image: quay.io/openshift/origin-network-interface-bond-cni:4.11.0
            command: ["/bin/bash", "-c", "sleep INF"]
    Copy to Clipboard Toggle word wrap
    1
    Note the network annotation: it contains two SR-IOV network attachments, and one bond network attachment. The bond attachment uses the two SR-IOV interfaces as bonded port interfaces.
  2. Apply the yaml by running the following command:

    $ oc apply -f podbonding.yaml
    Copy to Clipboard Toggle word wrap
  3. Inspect the pod interfaces with the following command:

    $ oc rsh -n demo bondpod1
    sh-4.4#
    sh-4.4# ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
    3: eth0@if150: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 62:b1:b5:c8:fb:7a brd ff:ff:ff:ff:ff:ff
    inet 10.244.1.122/24 brd 10.244.1.255 scope global eth0
    valid_lft forever preferred_lft forever
    4: net3: <BROADCAST,MULTICAST,UP,LOWER_UP400> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 9e:23:69:42:fb:8a brd ff:ff:ff:ff:ff:ff 
    1
    
    inet 10.56.217.66/24 scope global bond0
    valid_lft forever preferred_lft forever
    43: net1: <BROADCAST,MULTICAST,UP,LOWER_UP800> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 9e:23:69:42:fb:8a brd ff:ff:ff:ff:ff:ff 
    2
    
    44: net2: <BROADCAST,MULTICAST,UP,LOWER_UP800> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 9e:23:69:42:fb:8a brd ff:ff:ff:ff:ff:ff 
    3
    Copy to Clipboard Toggle word wrap
    1
    The bond interface is automatically named net3. To set a specific interface name add @name suffix to the pod’s k8s.v1.cni.cncf.io/networks annotation.
    2
    The net1 interface is based on an SR-IOV virtual function.
    3
    The net2 interface is based on an SR-IOV virtual function.
    Note

    If no interface names are configured in the pod annotation, interface names are assigned automatically as net<n>, with <n> starting at 1.

  4. Optional: If you want to set a specific interface name for example bond0, edit the k8s.v1.cni.cncf.io/networks annotation and set bond0 as the interface name as follows:

    annotations:
            k8s.v1.cni.cncf.io/networks: demo/sriovnet1, demo/sriovnet2, demo/bond-net1@bond0
    Copy to Clipboard Toggle word wrap

21.12. Configuring hardware offloading

As a cluster administrator, you can configure hardware offloading on compatible nodes to increase data processing performance and reduce load on host CPUs.

21.12.1. About hardware offloading

Open vSwitch hardware offloading is a method of processing network tasks by diverting them away from the CPU and offloading them to a dedicated processor on a network interface controller. As a result, clusters can benefit from faster data transfer speeds, reduced CPU workloads, and lower computing costs.

The key element for this feature is a modern class of network interface controllers known as SmartNICs. A SmartNIC is a network interface controller that is able to handle computationally-heavy network processing tasks. In the same way that a dedicated graphics card can improve graphics performance, a SmartNIC can improve network performance. In each case, a dedicated processor improves performance for a specific type of processing task.

In OpenShift Container Platform, you can configure hardware offloading for bare metal nodes that have a compatible SmartNIC. Hardware offloading is configured and enabled by the SR-IOV Network Operator.

Hardware offloading is not compatible with all workloads or application types. Only the following two communication types are supported:

  • pod-to-pod
  • pod-to-service, where the service is a ClusterIP service backed by a regular pod

In all cases, hardware offloading takes place only when those pods and services are assigned to nodes that have a compatible SmartNIC. Suppose, for example, that a pod on a node with hardware offloading tries to communicate with a service on a regular node. On the regular node, all the processing takes place in the kernel, so the overall performance of the pod-to-service communication is limited to the maximum performance of that regular node. Hardware offloading is not compatible with DPDK applications.

Enabling hardware offloading on a node, but not configuring pods to use, it can result in decreased throughput performance for pod traffic. You cannot configure hardware offloading for pods that are managed by OpenShift Container Platform.

21.12.2. Supported devices

Hardware offloading is supported on the following network interface controllers:

Expand
Table 21.15. Supported network interface controllers
ManufacturerModelVendor IDDevice ID

Mellanox

MT27800 Family [ConnectX‑5]

15b3

1017

Mellanox

MT28880 Family [ConnectX‑5 Ex]

15b3

1019

21.12.3. Prerequisites

To enable hardware offloading, you must first create a dedicated machine config pool and configure it to work with the SR-IOV Network Operator.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You have access to the cluster as a user with the cluster-admin role.

Procedure

  1. Create a machine config pool for machines you want to use hardware offloading on.

    1. Create a file, such as mcp-offloading.yaml, with content like the following example:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: mcp-offloading 
      1
      
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,mcp-offloading]} 
      2
      
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/mcp-offloading: "" 
      3
      Copy to Clipboard Toggle word wrap
      1 2
      The name of your machine config pool for hardware offloading.
      3
      This node role label is used to add nodes to the machine config pool.
    2. Apply the configuration for the machine config pool:

      $ oc create -f mcp-offloading.yaml
      Copy to Clipboard Toggle word wrap
  2. Add nodes to the machine config pool. Label each node with the node role label of your pool:

    $ oc label node worker-2 node-role.kubernetes.io/mcp-offloading=""
    Copy to Clipboard Toggle word wrap
  3. Optional: To verify that the new pool is created, run the following command:

    $ oc get nodes
    Copy to Clipboard Toggle word wrap

    Example output

    NAME       STATUS   ROLES                   AGE   VERSION
    master-0   Ready    master                  2d    v1.24.0
    master-1   Ready    master                  2d    v1.24.0
    master-2   Ready    master                  2d    v1.24.0
    worker-0   Ready    worker                  2d    v1.24.0
    worker-1   Ready    worker                  2d    v1.24.0
    worker-2   Ready    mcp-offloading,worker   47h   v1.24.0
    worker-3   Ready    mcp-offloading,worker   47h   v1.24.0
    Copy to Clipboard Toggle word wrap

  4. Add this machine config pool to the SriovNetworkPoolConfig custom resource:

    1. Create a file, such as sriov-pool-config.yaml, with content like the following example:

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkPoolConfig
      metadata:
        name: sriovnetworkpoolconfig-offload
        namespace: openshift-sriov-network-operator
      spec:
        ovsHardwareOffloadConfig:
          name: mcp-offloading 
      1
      Copy to Clipboard Toggle word wrap
      1
      The name of your machine config pool for hardware offloading.
    2. Apply the configuration:

      $ oc create -f <SriovNetworkPoolConfig_name>.yaml
      Copy to Clipboard Toggle word wrap
      Note

      When you apply the configuration specified in a SriovNetworkPoolConfig object, the SR-IOV Operator drains and restarts the nodes in the machine config pool.

      It might take several minutes for a configuration changes to apply.

You can create an SR-IOV network device configuration for a node by creating an SR-IOV network node policy. To enable hardware offloading, you must define the .spec.eSwitchMode field with the value "switchdev".

The following procedure creates an SR-IOV interface for a network interface controller with hardware offloading.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You have access to the cluster as a user with the cluster-admin role.

Procedure

  1. Create a file, such as sriov-node-policy.yaml, with content like the following example:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: sriov-node-policy <.>
      namespace: openshift-sriov-network-operator
    spec:
      deviceType: netdevice <.>
      eSwitchMode: "switchdev" <.>
      nicSelector:
        deviceID: "1019"
        rootDevices:
        - 0000:d8:00.0
        vendor: "15b3"
        pfNames:
        - ens8f0
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      numVfs: 6
      priority: 5
      resourceName: mlxnics
    Copy to Clipboard Toggle word wrap

    <.> The name for the custom resource object. <.> Required. Hardware offloading is not supported with vfio-pci. <.> Required.

  2. Apply the configuration for the policy:

    $ oc create -f sriov-node-policy.yaml
    Copy to Clipboard Toggle word wrap
    Note

    When you apply the configuration specified in a SriovNetworkPoolConfig object, the SR-IOV Operator drains and restarts the nodes in the machine config pool.

    It might take several minutes for a configuration change to apply.

The following example describes an SR-IOV interface for a network interface controller (NIC) with hardware offloading on Red Hat OpenStack Platform (RHOSP).

An SR-IOV interface for a NIC with hardware offloading on RHOSP

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ${name}
  namespace: openshift-sriov-network-operator
spec:
  deviceType: switchdev
  isRdma: true
  nicSelector:
    netFilter: openstack/NetworkID:${net_id}
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: 'true'
  numVfs: 1
  priority: 99
  resourceName: ${name}
Copy to Clipboard Toggle word wrap

21.12.6. Creating a network attachment definition

After you define the machine config pool and the SR-IOV network node policy, you can create a network attachment definition for the network interface card you specified.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You have access to the cluster as a user with the cluster-admin role.

Procedure

  1. Create a file, such as net-attach-def.yaml, with content like the following example:

    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: net-attach-def <.>
      namespace: net-attach-def <.>
      annotations:
        k8s.v1.cni.cncf.io/resourceName: openshift.io/mlxnics <.>
    spec:
      config: '{"cniVersion":"0.3.1","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{}}'
    Copy to Clipboard Toggle word wrap

    <.> The name for your network attachment definition. <.> The namespace for your network attachment definition. <.> This is the value of the spec.resourceName field you specified in the SriovNetworkNodePolicy object.

  2. Apply the configuration for the network attachment definition:

    $ oc create -f net-attach-def.yaml
    Copy to Clipboard Toggle word wrap

Verification

  • Run the following command to see whether the new definition is present:

    $ oc get net-attach-def -A
    Copy to Clipboard Toggle word wrap

    Example output

    NAMESPACE         NAME             AGE
    net-attach-def    net-attach-def   43h
    Copy to Clipboard Toggle word wrap

After you create the machine config pool, the SriovNetworkPoolConfig and SriovNetworkNodePolicy custom resources, and the network attachment definition, you can apply these configurations to your pods by adding the network attachment definition to your pod specifications.

Procedure

  • In the pod specification, add the .metadata.annotations.k8s.v1.cni.cncf.io/networks field and specify the network attachment definition you created for hardware offloading:

    ....
    metadata:
      annotations:
        v1.multus-cni.io/default-network: net-attach-def/net-attach-def <.>
    Copy to Clipboard Toggle word wrap

    <.> The value must be the name and namespace of the network attachment definition you created for hardware offloading.

21.13. Uninstalling the SR-IOV Network Operator

To uninstall the SR-IOV Network Operator, you must delete any running SR-IOV workloads, uninstall the Operator, and delete the webhooks that the Operator used.

21.13.1. Uninstalling the SR-IOV Network Operator

As a cluster administrator, you can uninstall the SR-IOV Network Operator.

Prerequisites

  • You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
  • You have the SR-IOV Network Operator installed.

Procedure

  1. Delete all SR-IOV custom resources (CRs):

    $ oc delete sriovnetwork -n openshift-sriov-network-operator --all
    Copy to Clipboard Toggle word wrap
    $ oc delete sriovnetworknodepolicy -n openshift-sriov-network-operator --all
    Copy to Clipboard Toggle word wrap
    $ oc delete sriovibnetwork -n openshift-sriov-network-operator --all
    Copy to Clipboard Toggle word wrap
  2. Follow the instructions in the "Deleting Operators from a cluster" section to remove the SR-IOV Network Operator from your cluster.
  3. Delete the SR-IOV custom resource definitions that remain in the cluster after the SR-IOV Network Operator is uninstalled:

    $ oc delete crd sriovibnetworks.sriovnetwork.openshift.io
    Copy to Clipboard Toggle word wrap
    $ oc delete crd sriovnetworknodepolicies.sriovnetwork.openshift.io
    Copy to Clipboard Toggle word wrap
    $ oc delete crd sriovnetworknodestates.sriovnetwork.openshift.io
    Copy to Clipboard Toggle word wrap
    $ oc delete crd sriovnetworkpoolconfigs.sriovnetwork.openshift.io
    Copy to Clipboard Toggle word wrap
    $ oc delete crd sriovnetworks.sriovnetwork.openshift.io
    Copy to Clipboard Toggle word wrap
    $ oc delete crd sriovoperatorconfigs.sriovnetwork.openshift.io
    Copy to Clipboard Toggle word wrap
  4. Delete the SR-IOV webhooks:

    $ oc delete mutatingwebhookconfigurations network-resources-injector-config
    Copy to Clipboard Toggle word wrap
    $ oc delete MutatingWebhookConfiguration sriov-operator-webhook-config
    Copy to Clipboard Toggle word wrap
    $ oc delete ValidatingWebhookConfiguration sriov-operator-webhook-config
    Copy to Clipboard Toggle word wrap
  5. Delete the SR-IOV Network Operator namespace:

    $ oc delete namespace openshift-sriov-network-operator
    Copy to Clipboard Toggle word wrap

OpenShift Container Platform uses a software-defined networking (SDN) approach to provide a unified cluster network that enables communication between pods across the OpenShift Container Platform cluster. This pod network is established and maintained by the OpenShift SDN, which configures an overlay network using Open vSwitch (OVS).

22.1.1. OpenShift SDN network isolation modes

OpenShift SDN provides three SDN modes for configuring the pod network:

  • Network policy mode allows project administrators to configure their own isolation policies using NetworkPolicy objects. Network policy is the default mode in OpenShift Container Platform 4.11.
  • Multitenant mode provides project-level isolation for pods and services. Pods from different projects cannot send packets to or receive packets from pods and services of a different project. You can disable isolation for a project, allowing it to send network traffic to all pods and services in the entire cluster and receive network traffic from those pods and services.
  • Subnet mode provides a flat pod network where every pod can communicate with every other pod and service. The network policy mode provides the same functionality as subnet mode.

OpenShift Container Platform offers two supported choices, OpenShift SDN and OVN-Kubernetes, for the default Container Network Interface (CNI) network provider. The following table summarizes the current feature support for both network providers:

Expand
Table 22.1. Default CNI network provider feature comparison
FeatureOpenShift SDNOVN-Kubernetes

Egress IPs

Supported

Supported

Egress firewall [1]

Supported

Supported

Egress router

Supported

Supported [2]

Hybrid networking

Not supported

Supported

IPsec encryption

Not supported

Supported

IPv6

Not supported

Supported [3] [4]

Kubernetes network policy

Supported

Supported

Kubernetes network policy logs

Not supported

Supported

Multicast

Supported

Supported

Hardware offloading

Not supported

Supported

  1. Egress firewall is also known as egress network policy in OpenShift SDN. This is not the same as network policy egress.
  2. Egress router for OVN-Kubernetes supports only redirect mode.
  3. IPv6 is supported only on bare metal clusters.
  4. IPv6 single stack does not support Kubernetes NMState.

22.2. Configuring egress IPs for a project

As a cluster administrator, you can configure the OpenShift SDN Container Network Interface (CNI) cluster network provider to assign one or more egress IP addresses to a project.

The OpenShift Container Platform egress IP address functionality allows you to ensure that the traffic from one or more pods in one or more namespaces has a consistent source IP address for services outside the cluster network.

For example, you might have a pod that periodically queries a database that is hosted on a server outside of your cluster. To enforce access requirements for the server, a packet filtering device is configured to allow traffic only from specific IP addresses. To ensure that you can reliably allow access to the server from only that specific pod, you can configure a specific egress IP address for the pod that makes the requests to the server.

An egress IP address assigned to a namespace is different from an egress router, which is used to send traffic to specific destinations.

In some cluster configurations, application pods and ingress router pods run on the same node. If you configure an egress IP address for an application project in this scenario, the IP address is not used when you send a request to a route from the application project.

An egress IP address is implemented as an additional IP address on the primary network interface of a node and must be in the same subnet as the primary IP address of the node. The additional IP address must not be assigned to any other node in the cluster.

Important

Egress IP addresses must not be configured in any Linux network configuration files, such as ifcfg-eth0.

22.2.1.1. Platform support

Support for the egress IP address functionality on various platforms is summarized in the following table:

Expand
PlatformSupported

Bare metal

Yes

VMware vSphere

Yes

Red Hat OpenStack Platform (RHOSP)

No

Amazon Web Services (AWS)

Yes

Google Cloud Platform (GCP)

Yes

Microsoft Azure

Yes

Important

The assignment of egress IP addresses to control plane nodes with the EgressIP feature is not supported on a cluster provisioned on Amazon Web Services (AWS). (BZ#2039656)

22.2.1.2. Public cloud platform considerations

For clusters provisioned on public cloud infrastructure, there is a constraint on the absolute number of assignable IP addresses per node. The maximum number of assignable IP addresses per node, or the IP capacity, can be described in the following formula:

IP capacity = public cloud default capacity - sum(current IP assignments)
Copy to Clipboard Toggle word wrap

While the Egress IPs capability manages the IP address capacity per node, it is important to plan for this constraint in your deployments. For example, for a cluster installed on bare-metal infrastructure with 8 nodes you can configure 150 egress IP addresses. However, if a public cloud provider limits IP address capacity to 10 IP addresses per node, the total number of assignable IP addresses is only 80. To achieve the same IP address capacity in this example cloud provider, you would need to allocate 7 additional nodes.

To confirm the IP capacity and subnets for any node in your public cloud environment, you can enter the oc get node <node_name> -o yaml command. The cloud.network.openshift.io/egress-ipconfig annotation includes capacity and subnet information for the node.

The annotation value is an array with a single object with fields that provide the following information for the primary network interface:

  • interface: Specifies the interface ID on AWS and Azure and the interface name on GCP.
  • ifaddr: Specifies the subnet mask for one or both IP address families.
  • capacity: Specifies the IP address capacity for the node. On AWS, the IP address capacity is provided per IP address family. On Azure and GCP, the IP address capacity includes both IPv4 and IPv6 addresses.

The following examples illustrate the annotation from nodes on several public cloud providers. The annotations are indented for readability.

Example cloud.network.openshift.io/egress-ipconfig annotation on AWS

cloud.network.openshift.io/egress-ipconfig: [
  {
    "interface":"eni-078d267045138e436",
    "ifaddr":{"ipv4":"10.0.128.0/18"},
    "capacity":{"ipv4":14,"ipv6":15}
  }
]
Copy to Clipboard Toggle word wrap

Example cloud.network.openshift.io/egress-ipconfig annotation on GCP

cloud.network.openshift.io/egress-ipconfig: [
  {
    "interface":"nic0",
    "ifaddr":{"ipv4":"10.0.128.0/18"},
    "capacity":{"ip":14}
  }
]
Copy to Clipboard Toggle word wrap

The following sections describe the IP address capacity for supported public cloud environments for use in your capacity calculation.

On AWS, constraints on IP address assignments depend on the instance type configured. For more information, see IP addresses per network interface per instance type

On GCP, the networking model implements additional node IP addresses through IP address aliasing, rather than IP address assignments. However, IP address capacity maps directly to IP aliasing capacity.

The following capacity limits exist for IP aliasing assignment:

  • Per node, the maximum number of IP aliases, both IPv4 and IPv6, is 10.
  • Per VPC, the maximum number of IP aliases is unspecified, but OpenShift Container Platform scalability testing reveals the maximum to be approximately 15,000.

For more information, see Per instance quotas and Alias IP ranges overview.

On Azure, the following capacity limits exist for IP address assignment:

  • Per NIC, the maximum number of assignable IP addresses, for both IPv4 and IPv6, is 256.
  • Per virtual network, the maximum number of assigned IP addresses cannot exceed 65,536.

For more information, see Networking limits.

22.2.1.3. Limitations

The following limitations apply when using egress IP addresses with the OpenShift SDN cluster network provider:

  • You cannot use manually assigned and automatically assigned egress IP addresses on the same nodes.
  • If you manually assign egress IP addresses from an IP address range, you must not make that range available for automatic IP assignment.
  • You cannot share egress IP addresses across multiple namespaces using the OpenShift SDN egress IP address implementation.

If you need to share IP addresses across namespaces, the OVN-Kubernetes cluster network provider egress IP address implementation allows you to span IP addresses across multiple namespaces.

Note

If you use OpenShift SDN in multitenant mode, you cannot use egress IP addresses with any namespace that is joined to another namespace by the projects that are associated with them. For example, if project1 and project2 are joined by running the oc adm pod-network join-projects --to=project1 project2 command, neither project can use an egress IP address. For more information, see BZ#1645577.

22.2.1.4. IP address assignment approaches

You can assign egress IP addresses to namespaces by setting the egressIPs parameter of the NetNamespace object. After an egress IP address is associated with a project, OpenShift SDN allows you to assign egress IP addresses to hosts in two ways:

  • In the automatically assigned approach, an egress IP address range is assigned to a node.
  • In the manually assigned approach, a list of one or more egress IP address is assigned to a node.

Namespaces that request an egress IP address are matched with nodes that can host those egress IP addresses, and then the egress IP addresses are assigned to those nodes. If the egressIPs parameter is set on a NetNamespace object, but no node hosts that egress IP address, then egress traffic from the namespace will be dropped.

High availability of nodes is automatic. If a node that hosts an egress IP address is unreachable and there are nodes that are able to host that egress IP address, then the egress IP address will move to a new node. When the unreachable node comes back online, the egress IP address automatically moves to balance egress IP addresses across nodes.

When using the automatic assignment approach for egress IP addresses the following considerations apply:

  • You set the egressCIDRs parameter of each node’s HostSubnet resource to indicate the range of egress IP addresses that can be hosted by a node. OpenShift Container Platform sets the egressIPs parameter of the HostSubnet resource based on the IP address range you specify.

If the node hosting the namespace’s egress IP address is unreachable, OpenShift Container Platform will reassign the egress IP address to another node with a compatible egress IP address range. The automatic assignment approach works best for clusters installed in environments with flexibility in associating additional IP addresses with nodes.

This approach allows you to control which nodes can host an egress IP address.

Note

If your cluster is installed on public cloud infrastructure, you must ensure that each node that you assign egress IP addresses to has sufficient spare capacity to host the IP addresses. For more information, see "Platform considerations" in a previous section.

When using the manual assignment approach for egress IP addresses the following considerations apply:

  • You set the egressIPs parameter of each node’s HostSubnet resource to indicate the IP addresses that can be hosted by a node.
  • Multiple egress IP addresses per namespace are supported.

If a namespace has multiple egress IP addresses and those addresses are hosted on multiple nodes, the following additional considerations apply:

  • If a pod is on a node that is hosting an egress IP address, that pod always uses the egress IP address on the node.
  • If a pod is not on a node that is hosting an egress IP address, that pod uses an egress IP address at random.

In OpenShift Container Platform you can enable automatic assignment of an egress IP address for a specific namespace across one or more nodes.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the OpenShift CLI (oc).

Procedure

  1. Update the NetNamespace object with the egress IP address using the following JSON:

     $ oc patch netnamespace <project_name> --type=merge -p \
      '{
        "egressIPs": [
          "<ip_address>"
        ]
      }'
    Copy to Clipboard Toggle word wrap

    where:

    <project_name>
    Specifies the name of the project.
    <ip_address>
    Specifies one or more egress IP addresses for the egressIPs array.

    For example, to assign project1 to an IP address of 192.168.1.100 and project2 to an IP address of 192.168.1.101:

    $ oc patch netnamespace project1 --type=merge -p \
      '{"egressIPs": ["192.168.1.100"]}'
    $ oc patch netnamespace project2 --type=merge -p \
      '{"egressIPs": ["192.168.1.101"]}'
    Copy to Clipboard Toggle word wrap
    Note

    Because OpenShift SDN manages the NetNamespace object, you can make changes only by modifying the existing NetNamespace object. Do not create a new NetNamespace object.

  2. Indicate which nodes can host egress IP addresses by setting the egressCIDRs parameter for each host using the following JSON:

    $ oc patch hostsubnet <node_name> --type=merge -p \
      '{
        "egressCIDRs": [
          "<ip_address_range>", "<ip_address_range>"
        ]
      }'
    Copy to Clipboard Toggle word wrap

    where:

    <node_name>
    Specifies a node name.
    <ip_address_range>
    Specifies an IP address range in CIDR format. You can specify more than one address range for the egressCIDRs array.

    For example, to set node1 and node2 to host egress IP addresses in the range 192.168.1.0 to 192.168.1.255:

    $ oc patch hostsubnet node1 --type=merge -p \
      '{"egressCIDRs": ["192.168.1.0/24"]}'
    $ oc patch hostsubnet node2 --type=merge -p \
      '{"egressCIDRs": ["192.168.1.0/24"]}'
    Copy to Clipboard Toggle word wrap

    OpenShift Container Platform automatically assigns specific egress IP addresses to available nodes in a balanced way. In this case, it assigns the egress IP address 192.168.1.100 to node1 and the egress IP address 192.168.1.101 to node2 or vice versa.

In OpenShift Container Platform you can associate one or more egress IP addresses with a namespace.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the OpenShift CLI (oc).

Procedure

  1. Update the NetNamespace object by specifying the following JSON object with the desired IP addresses:

     $ oc patch netnamespace <project_name> --type=merge -p \
      '{
        "egressIPs": [
          "<ip_address>"
        ]
      }'
    Copy to Clipboard Toggle word wrap

    where:

    <project_name>
    Specifies the name of the project.
    <ip_address>
    Specifies one or more egress IP addresses for the egressIPs array.

    For example, to assign the project1 project to the IP addresses 192.168.1.100 and 192.168.1.101:

    $ oc patch netnamespace project1 --type=merge \
      -p '{"egressIPs": ["192.168.1.100","192.168.1.101"]}'
    Copy to Clipboard Toggle word wrap

    To provide high availability, set the egressIPs value to two or more IP addresses on different nodes. If multiple egress IP addresses are set, then pods use all egress IP addresses roughly equally.

    Note

    Because OpenShift SDN manages the NetNamespace object, you can make changes only by modifying the existing NetNamespace object. Do not create a new NetNamespace object.

  2. Manually assign the egress IP address to the node hosts.

    If your cluster is installed on public cloud infrastructure, you must confirm that the node has available IP address capacity.

    Set the egressIPs parameter on the HostSubnet object on the node host. Using the following JSON, include as many IP addresses as you want to assign to that node host:

    $ oc patch hostsubnet <node_name> --type=merge -p \
      '{
        "egressIPs": [
          "<ip_address>",
          "<ip_address>"
          ]
      }'
    Copy to Clipboard Toggle word wrap

    where:

    <node_name>
    Specifies a node name.
    <ip_address>
    Specifies an IP address. You can specify more than one IP address for the egressIPs array.

    For example, to specify that node1 should have the egress IPs 192.168.1.100, 192.168.1.101, and 192.168.1.102:

    $ oc patch hostsubnet node1 --type=merge -p \
      '{"egressIPs": ["192.168.1.100", "192.168.1.101", "192.168.1.102"]}'
    Copy to Clipboard Toggle word wrap

    In the previous example, all egress traffic for project1 will be routed to the node hosting the specified egress IP, and then connected through Network Address Translation (NAT) to that IP address.

22.2.4. Additional resources

  • If you are configuring manual egress IP address assignment, see Platform considerations for information about IP capacity planning.

22.3. Configuring an egress firewall for a project

As a cluster administrator, you can create an egress firewall for a project that restricts egress traffic leaving your OpenShift Container Platform cluster.

22.3.1. How an egress firewall works in a project

As a cluster administrator, you can use an egress firewall to limit the external hosts that some or all pods can access from within the cluster. An egress firewall supports the following scenarios:

  • A pod can only connect to internal hosts and cannot initiate connections to the public internet.
  • A pod can only connect to the public internet and cannot initiate connections to internal hosts that are outside the OpenShift Container Platform cluster.
  • A pod cannot reach specified internal subnets or hosts outside the OpenShift Container Platform cluster.
  • A pod can connect to only specific external hosts.

For example, you can allow one project access to a specified IP range but deny the same access to a different project. Or you can restrict application developers from updating from Python pip mirrors, and force updates to come only from approved sources.

Note

Egress firewall does not apply to the host network namespace. Pods with host networking enabled are unaffected by egress firewall rules.

You configure an egress firewall policy by creating an EgressNetworkPolicy custom resource (CR) object. The egress firewall matches network traffic that meets any of the following criteria:

  • An IP address range in CIDR format
  • A DNS name that resolves to an IP address
Important

If your egress firewall includes a deny rule for 0.0.0.0/0, access to your OpenShift Container Platform API servers is blocked. To ensure that pods can access the OpenShift Container Platform API servers, you must include the built-in join network 100.64.0.0/16 of Open Virtual Network (OVN) to allow access when using node ports together with an EgressFirewall. You must also include the IP address range that the API servers listen on in your egress firewall rules, as in the following example:

apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
  namespace: <namespace> 
1

spec:
  egress:
  - to:
      cidrSelector: <api_server_address_range> 
2

    type: Allow
# ...
  - to:
      cidrSelector: 0.0.0.0/0 
3

    type: Deny
Copy to Clipboard Toggle word wrap
1
The namespace for the egress firewall.
2
The IP address range that includes your OpenShift Container Platform API servers.
3
A global deny rule prevents access to the OpenShift Container Platform API servers.

To find the IP address for your API servers, run oc get ep kubernetes -n default.

For more information, see BZ#1988324.

Important

You must have OpenShift SDN configured to use either the network policy or multitenant mode to configure an egress firewall.

If you use network policy mode, an egress firewall is compatible with only one policy per namespace and will not work with projects that share a network, such as global projects.

Warning

Egress firewall rules do not apply to traffic that goes through routers. Any user with permission to create a Route CR object can bypass egress firewall policy rules by creating a route that points to a forbidden destination.

22.3.1.1. Limitations of an egress firewall

An egress firewall has the following limitations:

  • No project can have more than one EgressNetworkPolicy object.

    Important

    The creation of more than one EgressNetworkPolicy object is allowed, however it should not be done. When you create more than one EgressNetworkPolicy object, the following message is returned: dropping all rules. In actuality, all external traffic is dropped, which can cause security risks for your organization.

  • A maximum of one EgressNetworkPolicy object with a maximum of 1,000 rules can be defined per project.
  • The default project cannot use an egress firewall.
  • When using the OpenShift SDN default Container Network Interface (CNI) network provider in multitenant mode, the following limitations apply:

    • Global projects cannot use an egress firewall. You can make a project global by using the oc adm pod-network make-projects-global command.
    • Projects merged by using the oc adm pod-network join-projects command cannot use an egress firewall in any of the joined projects.

Violating any of these restrictions results in a broken egress firewall for the project. Consequently, all external network traffic is dropped, which can cause security risks for your organization.

An Egress Firewall resource can be created in the kube-node-lease, kube-public, kube-system, openshift and openshift- projects.

The egress firewall policy rules are evaluated in the order that they are defined, from first to last. The first rule that matches an egress connection from a pod applies. Any subsequent rules are ignored for that connection.

If you use DNS names in any of your egress firewall policy rules, proper resolution of the domain names is subject to the following restrictions:

  • Domain name updates are polled based on a time-to-live (TTL) duration. By default, the duration is 30 seconds. When the egress firewall controller queries the local name servers for a domain name, if the response includes a TTL that is less than 30 seconds, the controller sets the duration to the returned value. If the TTL in the response is greater than 30 minutes, the controller sets the duration to 30 minutes. If the TTL is between 30 seconds and 30 minutes, the controller ignores the value and sets the duration to 30 seconds.
  • The pod must resolve the domain from the same local name servers when necessary. Otherwise the IP addresses for the domain known by the egress firewall controller and the pod can be different. If the IP addresses for a hostname differ, the egress firewall might not be enforced consistently.
  • Because the egress firewall controller and pods asynchronously poll the same local name server, the pod might obtain the updated IP address before the egress controller does, which causes a race condition. Due to this current limitation, domain name usage in EgressNetworkPolicy objects is only recommended for domains with infrequent IP address changes.
Note

The egress firewall always allows pods access to the external interface of the node that the pod is on for DNS resolution.

If you use domain names in your egress firewall policy and your DNS resolution is not handled by a DNS server on the local node, then you must add egress firewall rules that allow access to your DNS server’s IP addresses. if you are using domain names in your pods.

You can define one or more rules for an egress firewall. A rule is either an Allow rule or a Deny rule, with a specification for the traffic that the rule applies to.

The following YAML describes an EgressNetworkPolicy CR object:

EgressNetworkPolicy object

apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: <name> 
1

spec:
  egress: 
2

    ...
Copy to Clipboard Toggle word wrap

1
A name for your egress firewall policy.
2
A collection of one or more egress network policy rules as described in the following section.
22.3.2.1. EgressNetworkPolicy rules

The following YAML describes an egress firewall rule object. The egress stanza expects an array of one or more objects.

Egress policy rule stanza

egress:
- type: <type> 
1

  to: 
2

    cidrSelector: <cidr> 
3

    dnsName: <dns_name> 
4
Copy to Clipboard Toggle word wrap

1
The type of rule. The value must be either Allow or Deny.
2
A stanza describing an egress traffic match rule. A value for either the cidrSelector field or the dnsName field for the rule. You cannot use both fields in the same rule.
3
An IP address range in CIDR format.
4
A domain name.
22.3.2.2. Example EgressNetworkPolicy CR objects

The following example defines several egress firewall policy rules:

apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
spec:
  egress: 
1

  - type: Allow
    to:
      cidrSelector: 1.2.3.0/24
  - type: Allow
    to:
      dnsName: www.example.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
Copy to Clipboard Toggle word wrap
1
A collection of egress firewall policy rule objects.

22.3.3. Creating an egress firewall policy object

As a cluster administrator, you can create an egress firewall policy object for a project.

Important

If the project already has an EgressNetworkPolicy object defined, you must edit the existing policy to make changes to the egress firewall rules.

Prerequisites

  • A cluster that uses the OpenShift SDN default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster as a cluster administrator.

Procedure

  1. Create a policy rule:

    1. Create a <policy_name>.yaml file where <policy_name> describes the egress policy rules.
    2. In the file you created, define an egress policy object.
  2. Enter the following command to create the policy object. Replace <policy_name> with the name of the policy and <project> with the project that the rule applies to.

    $ oc create -f <policy_name>.yaml -n <project>
    Copy to Clipboard Toggle word wrap

    In the following example, a new EgressNetworkPolicy object is created in a project named project1:

    $ oc create -f default.yaml -n project1
    Copy to Clipboard Toggle word wrap

    Example output

    egressnetworkpolicy.network.openshift.io/v1 created
    Copy to Clipboard Toggle word wrap

  3. Optional: Save the <policy_name>.yaml file so that you can make changes later.

22.4. Editing an egress firewall for a project

As a cluster administrator, you can modify network traffic rules for an existing egress firewall.

22.4.1. Viewing an EgressNetworkPolicy object

You can view an EgressNetworkPolicy object in your cluster.

Prerequisites

  • A cluster using the OpenShift SDN default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift Command-line Interface (CLI), commonly known as oc.
  • You must log in to the cluster.

Procedure

  1. Optional: To view the names of the EgressNetworkPolicy objects defined in your cluster, enter the following command:

    $ oc get egressnetworkpolicy --all-namespaces
    Copy to Clipboard Toggle word wrap
  2. To inspect a policy, enter the following command. Replace <policy_name> with the name of the policy to inspect.

    $ oc describe egressnetworkpolicy <policy_name>
    Copy to Clipboard Toggle word wrap

    Example output

    Name:		default
    Namespace:	project1
    Created:	20 minutes ago
    Labels:		<none>
    Annotations:	<none>
    Rule:		Allow to 1.2.3.0/24
    Rule:		Allow to www.example.com
    Rule:		Deny to 0.0.0.0/0
    Copy to Clipboard Toggle word wrap

22.5. Editing an egress firewall for a project

As a cluster administrator, you can modify network traffic rules for an existing egress firewall.

22.5.1. Editing an EgressNetworkPolicy object

As a cluster administrator, you can update the egress firewall for a project.

Prerequisites

  • A cluster using the OpenShift SDN default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster as a cluster administrator.

Procedure

  1. Find the name of the EgressNetworkPolicy object for the project. Replace <project> with the name of the project.

    $ oc get -n <project> egressnetworkpolicy
    Copy to Clipboard Toggle word wrap
  2. Optional: If you did not save a copy of the EgressNetworkPolicy object when you created the egress network firewall, enter the following command to create a copy.

    $ oc get -n <project> egressnetworkpolicy <name> -o yaml > <filename>.yaml
    Copy to Clipboard Toggle word wrap

    Replace <project> with the name of the project. Replace <name> with the name of the object. Replace <filename> with the name of the file to save the YAML to.

  3. After making changes to the policy rules, enter the following command to replace the EgressNetworkPolicy object. Replace <filename> with the name of the file containing the updated EgressNetworkPolicy object.

    $ oc replace -f <filename>.yaml
    Copy to Clipboard Toggle word wrap

22.6. Removing an egress firewall from a project

As a cluster administrator, you can remove an egress firewall from a project to remove all restrictions on network traffic from the project that leaves the OpenShift Container Platform cluster.

22.6.1. Removing an EgressNetworkPolicy object

As a cluster administrator, you can remove an egress firewall from a project.

Prerequisites

  • A cluster using the OpenShift SDN default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster as a cluster administrator.

Procedure

  1. Find the name of the EgressNetworkPolicy object for the project. Replace <project> with the name of the project.

    $ oc get -n <project> egressnetworkpolicy
    Copy to Clipboard Toggle word wrap
  2. Enter the following command to delete the EgressNetworkPolicy object. Replace <project> with the name of the project and <name> with the name of the object.

    $ oc delete -n <project> egressnetworkpolicy <name>
    Copy to Clipboard Toggle word wrap

22.7.1. About an egress router pod

The OpenShift Container Platform egress router pod redirects traffic to a specified remote server from a private source IP address that is not used for any other purpose. An egress router pod can send network traffic to servers that are set up to allow access only from specific IP addresses.

Note

The egress router pod is not intended for every outgoing connection. Creating large numbers of egress router pods can exceed the limits of your network hardware. For example, creating an egress router pod for every project or application could exceed the number of local MAC addresses that the network interface can handle before reverting to filtering MAC addresses in software.

Important

The egress router image is not compatible with Amazon AWS, Azure Cloud, or any other cloud platform that does not support layer 2 manipulations due to their incompatibility with macvlan traffic.

22.7.1.1. Egress router modes

In redirect mode, an egress router pod configures iptables rules to redirect traffic from its own IP address to one or more destination IP addresses. Client pods that need to use the reserved source IP address must be configured to access the service for the egress router rather than connecting directly to the destination IP. You can access the destination service and port from the application pod by using the curl command. For example:

$ curl <router_service_IP> <port>
Copy to Clipboard Toggle word wrap

In HTTP proxy mode, an egress router pod runs as an HTTP proxy on port 8080. This mode only works for clients that are connecting to HTTP-based or HTTPS-based services, but usually requires fewer changes to the client pods to get them to work. Many programs can be told to use an HTTP proxy by setting an environment variable.

In DNS proxy mode, an egress router pod runs as a DNS proxy for TCP-based services from its own IP address to one or more destination IP addresses. To make use of the reserved, source IP address, client pods must be modified to connect to the egress router pod rather than connecting directly to the destination IP address. This modification ensures that external destinations treat traffic as though it were coming from a known source.

Redirect mode works for all services except for HTTP and HTTPS. For HTTP and HTTPS services, use HTTP proxy mode. For TCP-based services with IP addresses or domain names, use DNS proxy mode.

22.7.1.2. Egress router pod implementation

The egress router pod setup is performed by an initialization container. That container runs in a privileged context so that it can configure the macvlan interface and set up iptables rules. After the initialization container finishes setting up the iptables rules, it exits. Next the egress router pod executes the container to handle the egress router traffic. The image used varies depending on the egress router mode.

The environment variables determine which addresses the egress-router image uses. The image configures the macvlan interface to use EGRESS_SOURCE as its IP address, with EGRESS_GATEWAY as the IP address for the gateway.

Network Address Translation (NAT) rules are set up so that connections to the cluster IP address of the pod on any TCP or UDP port are redirected to the same port on IP address specified by the EGRESS_DESTINATION variable.

If only some of the nodes in your cluster are capable of claiming the specified source IP address and using the specified gateway, you can specify a nodeName or nodeSelector to identify which nodes are acceptable.

22.7.1.3. Deployment considerations

An egress router pod adds an additional IP address and MAC address to the primary network interface of the node. As a result, you might need to configure your hypervisor or cloud provider to allow the additional address.

Red Hat OpenStack Platform (RHOSP)

If you deploy OpenShift Container Platform on RHOSP, you must allow traffic from the IP and MAC addresses of the egress router pod on your OpenStack environment. If you do not allow the traffic, then communication will fail:

$ openstack port set --allowed-address \
  ip_address=<ip_address>,mac_address=<mac_address> <neutron_port_uuid>
Copy to Clipboard Toggle word wrap
Red Hat Virtualization (RHV)
If you are using RHV, you must select No Network Filter for the Virtual network interface controller (vNIC).
VMware vSphere
If you are using VMware vSphere, see the VMware documentation for securing vSphere standard switches. View and change VMware vSphere default settings by selecting the host virtual switch from the vSphere Web Client.

Specifically, ensure that the following are enabled:

22.7.1.4. Failover configuration

To avoid downtime, you can deploy an egress router pod with a Deployment resource, as in the following example. To create a new Service object for the example deployment, use the oc expose deployment/egress-demo-controller command.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: egress-demo-controller
spec:
  replicas: 1 
1

  selector:
    matchLabels:
      name: egress-router
  template:
    metadata:
      name: egress-router
      labels:
        name: egress-router
      annotations:
        pod.network.openshift.io/assign-macvlan: "true"
    spec: 
2

      initContainers:
        ...
      containers:
        ...
Copy to Clipboard Toggle word wrap
1
Ensure that replicas is set to 1, because only one pod can use a given egress source IP address at any time. This means that only a single copy of the router runs on a node.
2
Specify the Pod object template for the egress router pod.

As a cluster administrator, you can deploy an egress router pod that is configured to redirect traffic to specified destination IP addresses.

Define the configuration for an egress router pod in the Pod object. The following YAML describes the fields for the configuration of an egress router pod in redirect mode:

apiVersion: v1
kind: Pod
metadata:
  name: egress-1
  labels:
    name: egress-1
  annotations:
    pod.network.openshift.io/assign-macvlan: "true" 
1

spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE 
2

      value: <egress_router>
    - name: EGRESS_GATEWAY 
3

      value: <egress_gateway>
    - name: EGRESS_DESTINATION 
4

      value: <egress_destination>
    - name: EGRESS_ROUTER_MODE
      value: init
  containers:
  - name: egress-router-wait
    image: registry.redhat.io/openshift4/ose-pod
Copy to Clipboard Toggle word wrap
1
The annotation tells OpenShift Container Platform to create a macvlan network interface on the primary network interface controller (NIC) and move that macvlan interface into the pod’s network namespace. You must include the quotation marks around the "true" value. To have OpenShift Container Platform create the macvlan interface on a different NIC interface, set the annotation value to the name of that interface. For example, eth1.
2
IP address from the physical network that the node is on that is reserved for use by the egress router pod. Optional: You can include the subnet length, the /24 suffix, so that a proper route to the local subnet is set. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
3
Same value as the default gateway used by the node.
4
External server to direct traffic to. Using this example, connections to the pod are redirected to 203.0.113.25, with a source IP address of 192.168.12.99.

Example egress router pod specification

apiVersion: v1
kind: Pod
metadata:
  name: egress-multi
  labels:
    name: egress-multi
  annotations:
    pod.network.openshift.io/assign-macvlan: "true"
spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE
      value: 192.168.12.99/24
    - name: EGRESS_GATEWAY
      value: 192.168.12.1
    - name: EGRESS_DESTINATION
      value: |
        80   tcp 203.0.113.25
        8080 tcp 203.0.113.26 80
        8443 tcp 203.0.113.26 443
        203.0.113.27
    - name: EGRESS_ROUTER_MODE
      value: init
  containers:
  - name: egress-router-wait
    image: registry.redhat.io/openshift4/ose-pod
Copy to Clipboard Toggle word wrap

22.8.2. Egress destination configuration format

When an egress router pod is deployed in redirect mode, you can specify redirection rules by using one or more of the following formats:

  • <port> <protocol> <ip_address> - Incoming connections to the given <port> should be redirected to the same port on the given <ip_address>. <protocol> is either tcp or udp.
  • <port> <protocol> <ip_address> <remote_port> - As above, except that the connection is redirected to a different <remote_port> on <ip_address>.
  • <ip_address> - If the last line is a single IP address, then any connections on any other port will be redirected to the corresponding port on that IP address. If there is no fallback IP address then connections on other ports are rejected.

In the example that follows several rules are defined:

  • The first line redirects traffic from local port 80 to port 80 on 203.0.113.25.
  • The second and third lines redirect local ports 8080 and 8443 to remote ports 80 and 443 on 203.0.113.26.
  • The last line matches traffic for any ports not specified in the previous rules.

Example configuration

80   tcp 203.0.113.25
8080 tcp 203.0.113.26 80
8443 tcp 203.0.113.26 443
203.0.113.27
Copy to Clipboard Toggle word wrap

In redirect mode, an egress router pod sets up iptables rules to redirect traffic from its own IP address to one or more destination IP addresses. Client pods that need to use the reserved source IP address must be configured to access the service for the egress router rather than connecting directly to the destination IP. You can access the destination service and port from the application pod by using the curl command. For example:

$ curl <router_service_IP> <port>
Copy to Clipboard Toggle word wrap

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create an egress router pod.
  2. To ensure that other pods can find the IP address of the egress router pod, create a service to point to the egress router pod, as in the following example:

    apiVersion: v1
    kind: Service
    metadata:
      name: egress-1
    spec:
      ports:
      - name: http
        port: 80
      - name: https
        port: 443
      type: ClusterIP
      selector:
        name: egress-1
    Copy to Clipboard Toggle word wrap

    Your pods can now connect to this service. Their connections are redirected to the corresponding ports on the external server, using the reserved egress IP address.

As a cluster administrator, you can deploy an egress router pod configured to proxy traffic to specified HTTP and HTTPS-based services.

Define the configuration for an egress router pod in the Pod object. The following YAML describes the fields for the configuration of an egress router pod in HTTP mode:

apiVersion: v1
kind: Pod
metadata:
  name: egress-1
  labels:
    name: egress-1
  annotations:
    pod.network.openshift.io/assign-macvlan: "true" 
1

spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE 
2

      value: <egress-router>
    - name: EGRESS_GATEWAY 
3

      value: <egress-gateway>
    - name: EGRESS_ROUTER_MODE
      value: http-proxy
  containers:
  - name: egress-router-pod
    image: registry.redhat.io/openshift4/ose-egress-http-proxy
    env:
    - name: EGRESS_HTTP_PROXY_DESTINATION 
4

      value: |-
        ...
    ...
Copy to Clipboard Toggle word wrap
1
The annotation tells OpenShift Container Platform to create a macvlan network interface on the primary network interface controller (NIC) and move that macvlan interface into the pod’s network namespace. You must include the quotation marks around the "true" value. To have OpenShift Container Platform create the macvlan interface on a different NIC interface, set the annotation value to the name of that interface. For example, eth1.
2
IP address from the physical network that the node is on that is reserved for use by the egress router pod. Optional: You can include the subnet length, the /24 suffix, so that a proper route to the local subnet is set. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
3
Same value as the default gateway used by the node.
4
A string or YAML multi-line string specifying how to configure the proxy. Note that this is specified as an environment variable in the HTTP proxy container, not with the other environment variables in the init container.

22.9.2. Egress destination configuration format

When an egress router pod is deployed in HTTP proxy mode, you can specify redirection rules by using one or more of the following formats. Each line in the configuration specifies one group of connections to allow or deny:

  • An IP address allows connections to that IP address, such as 192.168.1.1.
  • A CIDR range allows connections to that CIDR range, such as 192.168.1.0/24.
  • A hostname allows proxying to that host, such as www.example.com.
  • A domain name preceded by *. allows proxying to that domain and all of its subdomains, such as *.example.com.
  • A ! followed by any of the previous match expressions denies the connection instead.
  • If the last line is *, then anything that is not explicitly denied is allowed. Otherwise, anything that is not allowed is denied.

You can also use * to allow connections to all remote destinations.

Example configuration

!*.example.com
!192.168.1.0/24
192.168.2.1
*
Copy to Clipboard Toggle word wrap

In HTTP proxy mode, an egress router pod runs as an HTTP proxy on port 8080. This mode only works for clients that are connecting to HTTP-based or HTTPS-based services, but usually requires fewer changes to the client pods to get them to work. Many programs can be told to use an HTTP proxy by setting an environment variable.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create an egress router pod.
  2. To ensure that other pods can find the IP address of the egress router pod, create a service to point to the egress router pod, as in the following example:

    apiVersion: v1
    kind: Service
    metadata:
      name: egress-1
    spec:
      ports:
      - name: http-proxy
        port: 8080 
    1
    
      type: ClusterIP
      selector:
        name: egress-1
    Copy to Clipboard Toggle word wrap
    1
    Ensure the http port is set to 8080.
  3. To configure the client pod (not the egress proxy pod) to use the HTTP proxy, set the http_proxy or https_proxy variables:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-1
      labels:
        name: app-1
    spec:
      containers:
        env:
        - name: http_proxy
          value: http://egress-1:8080/ 
    1
    
        - name: https_proxy
          value: http://egress-1:8080/
        ...
    Copy to Clipboard Toggle word wrap
    1
    The service created in the previous step.
    Note

    Using the http_proxy and https_proxy environment variables is not necessary for all setups. If the above does not create a working setup, then consult the documentation for the tool or software you are running in the pod.

As a cluster administrator, you can deploy an egress router pod configured to proxy traffic to specified DNS names and IP addresses.

Define the configuration for an egress router pod in the Pod object. The following YAML describes the fields for the configuration of an egress router pod in DNS mode:

apiVersion: v1
kind: Pod
metadata:
  name: egress-1
  labels:
    name: egress-1
  annotations:
    pod.network.openshift.io/assign-macvlan: "true" 
1

spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE 
2

      value: <egress-router>
    - name: EGRESS_GATEWAY 
3

      value: <egress-gateway>
    - name: EGRESS_ROUTER_MODE
      value: dns-proxy
  containers:
  - name: egress-router-pod
    image: registry.redhat.io/openshift4/ose-egress-dns-proxy
    securityContext:
      privileged: true
    env:
    - name: EGRESS_DNS_PROXY_DESTINATION 
4

      value: |-
        ...
    - name: EGRESS_DNS_PROXY_DEBUG 
5

      value: "1"
    ...
Copy to Clipboard Toggle word wrap
1
The annotation tells OpenShift Container Platform to create a macvlan network interface on the primary network interface controller (NIC) and move that macvlan interface into the pod’s network namespace. You must include the quotation marks around the "true" value. To have OpenShift Container Platform create the macvlan interface on a different NIC interface, set the annotation value to the name of that interface. For example, eth1.
2
IP address from the physical network that the node is on that is reserved for use by the egress router pod. Optional: You can include the subnet length, the /24 suffix, so that a proper route to the local subnet is set. If you do not specify a subnet length, then the egress router can access only the host specified with the EGRESS_GATEWAY variable and no other hosts on the subnet.
3
Same value as the default gateway used by the node.
4
Specify a list of one or more proxy destinations.
5
Optional: Specify to output the DNS proxy log output to stdout.

22.10.2. Egress destination configuration format

When the router is deployed in DNS proxy mode, you specify a list of port and destination mappings. A destination may be either an IP address or a DNS name.

An egress router pod supports the following formats for specifying port and destination mappings:

Port and remote address
You can specify a source port and a destination host by using the two field format: <port> <remote_address>.

The host can be an IP address or a DNS name. If a DNS name is provided, DNS resolution occurs at runtime. For a given host, the proxy connects to the specified source port on the destination host when connecting to the destination host IP address.

Port and remote address pair example

80 172.16.12.11
100 example.com
Copy to Clipboard Toggle word wrap

Port, remote address, and remote port
You can specify a source port, a destination host, and a destination port by using the three field format: <port> <remote_address> <remote_port>.

The three field format behaves identically to the two field version, with the exception that the destination port can be different than the source port.

Port, remote address, and remote port example

8080 192.168.60.252 80
8443 web.example.com 443
Copy to Clipboard Toggle word wrap

In DNS proxy mode, an egress router pod acts as a DNS proxy for TCP-based services from its own IP address to one or more destination IP addresses.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create an egress router pod.
  2. Create a service for the egress router pod:

    1. Create a file named egress-router-service.yaml that contains the following YAML. Set spec.ports to the list of ports that you defined previously for the EGRESS_DNS_PROXY_DESTINATION environment variable.

      apiVersion: v1
      kind: Service
      metadata:
        name: egress-dns-svc
      spec:
        ports:
          ...
        type: ClusterIP
        selector:
          name: egress-dns-proxy
      Copy to Clipboard Toggle word wrap

      For example:

      apiVersion: v1
      kind: Service
      metadata:
        name: egress-dns-svc
      spec:
        ports:
        - name: con1
          protocol: TCP
          port: 80
          targetPort: 80
        - name: con2
          protocol: TCP
          port: 100
          targetPort: 100
        type: ClusterIP
        selector:
          name: egress-dns-proxy
      Copy to Clipboard Toggle word wrap
    2. To create the service, enter the following command:

      $ oc create -f egress-router-service.yaml
      Copy to Clipboard Toggle word wrap

      Pods can now connect to this service. The connections are proxied to the corresponding ports on the external server, using the reserved egress IP address.

As a cluster administrator, you can define a ConfigMap object that specifies destination mappings for an egress router pod. The specific format of the configuration depends on the type of egress router pod. For details on the format, refer to the documentation for the specific egress router pod.

For a large or frequently-changing set of destination mappings, you can use a config map to externally maintain the list. An advantage of this approach is that permission to edit the config map can be delegated to users without cluster-admin privileges. Because the egress router pod requires a privileged container, it is not possible for users without cluster-admin privileges to edit the pod definition directly.

Note

The egress router pod does not automatically update when the config map changes. You must restart the egress router pod to get updates.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a file containing the mapping data for the egress router pod, as in the following example:

    # Egress routes for Project "Test", version 3
    
    80   tcp 203.0.113.25
    
    8080 tcp 203.0.113.26 80
    8443 tcp 203.0.113.26 443
    
    # Fallback
    203.0.113.27
    Copy to Clipboard Toggle word wrap

    You can put blank lines and comments into this file.

  2. Create a ConfigMap object from the file:

    $ oc delete configmap egress-routes --ignore-not-found
    Copy to Clipboard Toggle word wrap
    $ oc create configmap egress-routes \
      --from-file=destination=my-egress-destination.txt
    Copy to Clipboard Toggle word wrap

    In the previous command, the egress-routes value is the name of the ConfigMap object to create and my-egress-destination.txt is the name of the file that the data is read from.

    Tip

    You can alternatively apply the following YAML to create the config map:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: egress-routes
    data:
      destination: |
        # Egress routes for Project "Test", version 3
    
        80   tcp 203.0.113.25
    
        8080 tcp 203.0.113.26 80
        8443 tcp 203.0.113.26 443
    
        # Fallback
        203.0.113.27
    Copy to Clipboard Toggle word wrap
  3. Create an egress router pod definition and specify the configMapKeyRef stanza for the EGRESS_DESTINATION field in the environment stanza:

    ...
    env:
    - name: EGRESS_DESTINATION
      valueFrom:
        configMapKeyRef:
          name: egress-routes
          key: destination
    ...
    Copy to Clipboard Toggle word wrap

22.12. Enabling multicast for a project

22.12.1. About multicast

With IP multicast, data is broadcast to many IP addresses simultaneously.

Important
  • At this time, multicast is best used for low-bandwidth coordination or service discovery and not a high-bandwidth solution.
  • By default, network policies affect all connections in a namespace. However, multicast is unaffected by network policies. If multicast is enabled in the same namespace as your network policies, it is always allowed, even if there is a deny-all network policy. Cluster administrators should consider the implications to the exemption of multicast from network policies before enabling it.

Multicast traffic between OpenShift Container Platform pods is disabled by default. If you are using the OpenShift SDN default Container Network Interface (CNI) network provider, you can enable multicast on a per-project basis.

When using the OpenShift SDN network plugin in networkpolicy isolation mode:

  • Multicast packets sent by a pod will be delivered to all other pods in the project, regardless of NetworkPolicy objects. Pods might be able to communicate over multicast even when they cannot communicate over unicast.
  • Multicast packets sent by a pod in one project will never be delivered to pods in any other project, even if there are NetworkPolicy objects that allow communication between the projects.

When using the OpenShift SDN network plugin in multitenant isolation mode:

  • Multicast packets sent by a pod will be delivered to all other pods in the project.
  • Multicast packets sent by a pod in one project will be delivered to pods in other projects only if each project is joined together and multicast is enabled in each joined project.

22.12.2. Enabling multicast between pods

You can enable multicast between pods for your project.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  • Run the following command to enable multicast for a project. Replace <namespace> with the namespace for the project you want to enable multicast for.

    $ oc annotate netnamespace <namespace> \
        netnamespace.network.openshift.io/multicast-enabled=true
    Copy to Clipboard Toggle word wrap

Verification

To verify that multicast is enabled for a project, complete the following procedure:

  1. Change your current project to the project that you enabled multicast for. Replace <project> with the project name.

    $ oc project <project>
    Copy to Clipboard Toggle word wrap
  2. Create a pod to act as a multicast receiver:

    $ cat <<EOF| oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: mlistener
      labels:
        app: multicast-verify
    spec:
      containers:
        - name: mlistener
          image: registry.access.redhat.com/ubi8
          command: ["/bin/sh", "-c"]
          args:
            ["dnf -y install socat hostname && sleep inf"]
          ports:
            - containerPort: 30102
              name: mlistener
              protocol: UDP
    EOF
    Copy to Clipboard Toggle word wrap
  3. Create a pod to act as a multicast sender:

    $ cat <<EOF| oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: msender
      labels:
        app: multicast-verify
    spec:
      containers:
        - name: msender
          image: registry.access.redhat.com/ubi8
          command: ["/bin/sh", "-c"]
          args:
            ["dnf -y install socat && sleep inf"]
    EOF
    Copy to Clipboard Toggle word wrap
  4. In a new terminal window or tab, start the multicast listener.

    1. Get the IP address for the Pod:

      $ POD_IP=$(oc get pods mlistener -o jsonpath='{.status.podIP}')
      Copy to Clipboard Toggle word wrap
    2. Start the multicast listener by entering the following command:

      $ oc exec mlistener -i -t -- \
          socat UDP4-RECVFROM:30102,ip-add-membership=224.1.0.1:$POD_IP,fork EXEC:hostname
      Copy to Clipboard Toggle word wrap
  5. Start the multicast transmitter.

    1. Get the pod network IP address range:

      $ CIDR=$(oc get Network.config.openshift.io cluster \
          -o jsonpath='{.status.clusterNetwork[0].cidr}')
      Copy to Clipboard Toggle word wrap
    2. To send a multicast message, enter the following command:

      $ oc exec msender -i -t -- \
          /bin/bash -c "echo | socat STDIO UDP4-DATAGRAM:224.1.0.1:30102,range=$CIDR,ip-multicast-ttl=64"
      Copy to Clipboard Toggle word wrap

      If multicast is working, the previous command returns the following output:

      mlistener
      Copy to Clipboard Toggle word wrap

22.13. Disabling multicast for a project

22.13.1. Disabling multicast between pods

You can disable multicast between pods for your project.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  • Disable multicast by running the following command:

    $ oc annotate netnamespace <namespace> \ 
    1
    
        netnamespace.network.openshift.io/multicast-enabled-
    Copy to Clipboard Toggle word wrap
    1
    The namespace for the project you want to disable multicast for.

When your cluster is configured to use the multitenant isolation mode for the OpenShift SDN CNI plugin, each project is isolated by default. Network traffic is not allowed between pods or services in different projects in multitenant isolation mode.

You can change the behavior of multitenant isolation for a project in two ways:

  • You can join one or more projects, allowing network traffic between pods and services in different projects.
  • You can disable network isolation for a project. It will be globally accessible, accepting network traffic from pods and services in all other projects. A globally accessible project can access pods and services in all other projects.

22.14.1. Prerequisites

  • You must have a cluster configured to use the OpenShift SDN Container Network Interface (CNI) plugin in multitenant isolation mode.

22.14.2. Joining projects

You can join two or more projects to allow network traffic between pods and services in different projects.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  1. Use the following command to join projects to an existing project network:

    $ oc adm pod-network join-projects --to=<project1> <project2> <project3>
    Copy to Clipboard Toggle word wrap

    Alternatively, instead of specifying specific project names, you can use the --selector=<project_selector> option to specify projects based upon an associated label.

  2. Optional: Run the following command to view the pod networks that you have joined together:

    $ oc get netnamespaces
    Copy to Clipboard Toggle word wrap

    Projects in the same pod-network have the same network ID in the NETID column.

22.14.3. Isolating a project

You can isolate a project so that pods and services in other projects cannot access its pods and services.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  • To isolate the projects in the cluster, run the following command:

    $ oc adm pod-network isolate-projects <project1> <project2>
    Copy to Clipboard Toggle word wrap

    Alternatively, instead of specifying specific project names, you can use the --selector=<project_selector> option to specify projects based upon an associated label.

22.14.4. Disabling network isolation for a project

You can disable network isolation for a project.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  • Run the following command for the project:

    $ oc adm pod-network make-projects-global <project1> <project2>
    Copy to Clipboard Toggle word wrap

    Alternatively, instead of specifying specific project names, you can use the --selector=<project_selector> option to specify projects based upon an associated label.

22.15. Configuring kube-proxy

The Kubernetes network proxy (kube-proxy) runs on each node and is managed by the Cluster Network Operator (CNO). kube-proxy maintains network rules for forwarding connections for endpoints associated with services.

22.15.1. About iptables rules synchronization

The synchronization period determines how frequently the Kubernetes network proxy (kube-proxy) syncs the iptables rules on a node.

A sync begins when either of the following events occurs:

  • An event occurs, such as service or endpoint is added to or removed from the cluster.
  • The time since the last sync exceeds the sync period defined for kube-proxy.

22.15.2. kube-proxy configuration parameters

You can modify the following kubeProxyConfig parameters.

Note

Because of performance improvements introduced in OpenShift Container Platform 4.3 and greater, adjusting the iptablesSyncPeriod parameter is no longer necessary.

Expand
Table 22.2. Parameters
ParameterDescriptionValuesDefault

iptablesSyncPeriod

The refresh period for iptables rules.

A time interval, such as 30s or 2m. Valid suffixes include s, m, and h and are described in the Go time package documentation.

30s

proxyArguments.iptables-min-sync-period

The minimum duration before refreshing iptables rules. This parameter ensures that the refresh does not happen too frequently. By default, a refresh starts as soon as a change that affects iptables rules occurs.

A time interval, such as 30s or 2m. Valid suffixes include s, m, and h and are described in the Go time package

0s

22.15.3. Modifying the kube-proxy configuration

You can modify the Kubernetes network proxy configuration for your cluster.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to a running cluster with the cluster-admin role.

Procedure

  1. Edit the Network.operator.openshift.io custom resource (CR) by running the following command:

    $ oc edit network.operator.openshift.io cluster
    Copy to Clipboard Toggle word wrap
  2. Modify the kubeProxyConfig parameter in the CR with your changes to the kube-proxy configuration, such as in the following example CR:

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      kubeProxyConfig:
        iptablesSyncPeriod: 30s
        proxyArguments:
          iptables-min-sync-period: ["30s"]
    Copy to Clipboard Toggle word wrap
  3. Save the file and exit the text editor.

    The syntax is validated by the oc command when you save the file and exit the editor. If your modifications contain a syntax error, the editor opens the file and displays an error message.

  4. Enter the following command to confirm the configuration update:

    $ oc get networks.operator.openshift.io -o yaml
    Copy to Clipboard Toggle word wrap

    Example output

    apiVersion: v1
    items:
    - apiVersion: operator.openshift.io/v1
      kind: Network
      metadata:
        name: cluster
      spec:
        clusterNetwork:
        - cidr: 10.128.0.0/14
          hostPrefix: 23
        defaultNetwork:
          type: OpenShiftSDN
        kubeProxyConfig:
          iptablesSyncPeriod: 30s
          proxyArguments:
            iptables-min-sync-period:
            - 30s
        serviceNetwork:
        - 172.30.0.0/16
      status: {}
    kind: List
    Copy to Clipboard Toggle word wrap

  5. Optional: Enter the following command to confirm that the Cluster Network Operator accepted the configuration change:

    $ oc get clusteroperator network
    Copy to Clipboard Toggle word wrap

    Example output

    NAME      VERSION     AVAILABLE   PROGRESSING   DEGRADED   SINCE
    network   4.1.0-0.9   True        False         False      1m
    Copy to Clipboard Toggle word wrap

    The AVAILABLE field is True when the configuration update is applied successfully.

The OpenShift Container Platform cluster uses a virtualized network for pod and service networks. The OVN-Kubernetes Container Network Interface (CNI) plugin is a network provider for the default cluster network. OVN-Kubernetes is based on Open Virtual Network (OVN) and provides an overlay-based networking implementation. A cluster that uses the OVN-Kubernetes network provider also runs Open vSwitch (OVS) on each node. OVN configures OVS on each node to implement the declared network configuration.

OVN-Kubernetes is the default networking solution for single-node OpenShift deployments only.

23.1.1. OVN-Kubernetes features

The OVN-Kubernetes Container Network Interface (CNI) cluster network provider implements the following features:

  • Uses OVN (Open Virtual Network) to manage network traffic flows. OVN is a community developed, vendor-agnostic network virtualization solution.
  • Implements Kubernetes network policy support, including ingress and egress rules.
  • Uses the Geneve (Generic Network Virtualization Encapsulation) protocol rather than VXLAN to create an overlay network between nodes.

OpenShift Container Platform offers two supported choices, OpenShift SDN and OVN-Kubernetes, for the default Container Network Interface (CNI) network provider. The following table summarizes the current feature support for both network providers:

Expand
Table 23.1. Default CNI network provider feature comparison
FeatureOVN-KubernetesOpenShift SDN

Egress IPs

Supported

Supported

Egress firewall [1]

Supported

Supported

Egress router

Supported [2]

Supported

Hybrid networking

Supported

Not supported

IPsec encryption for intra-cluster communication

Supported

Not supported

IPv6

Supported [3] [4]

Not supported

Kubernetes network policy

Supported

Supported

Kubernetes network policy logs

Supported

Not supported

Hardware offloading

Supported

Not supported

Multicast

Supported

Supported

  1. Egress firewall is also known as egress network policy in OpenShift SDN. This is not the same as network policy egress.
  2. Egress router for OVN-Kubernetes supports only redirect mode.
  3. IPv6 is supported only on bare metal clusters.
  4. IPv6 single stack does not support Kubernetes NMState.

23.1.3. OVN-Kubernetes limitations

The OVN-Kubernetes Container Network Interface (CNI) cluster network provider has the following limitations:

  • The sessionAffinityConfig.clientIP.timeoutSeconds service has no effect in an OpenShift OVN environment, but does in an OpenShift SDN environment. This incompatibility can make it difficult for users to migrate from OpenShift SDN to OVN.
  • For clusters configured for dual-stack networking, both IPv4 and IPv6 traffic must use the same network interface as the default gateway. If this requirement is not met, pods on the host in the ovnkube-node daemon set enter the CrashLoopBackOff state. If you display a pod with a command such as oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-node -o yaml, the status field contains more than one message about the default gateway, as shown in the following output:

    I1006 16:09:50.985852   60651 helper_linux.go:73] Found default gateway interface br-ex 192.168.127.1
    I1006 16:09:50.985923   60651 helper_linux.go:73] Found default gateway interface ens4 fe80::5054:ff:febe:bcd4
    F1006 16:09:50.985939   60651 ovnkube.go:130] multiple gateway interfaces detected: br-ex ens4
    Copy to Clipboard Toggle word wrap

    The only resolution is to reconfigure the host networking so that both IP families use the same network interface for the default gateway.

  • For clusters configured for dual-stack networking, both the IPv4 and IPv6 routing tables must contain the default gateway. If this requirement is not met, pods on the host in the ovnkube-node daemon set enter the CrashLoopBackOff state. If you display a pod with a command such as oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-node -o yaml, the status field contains more than one message about the default gateway, as shown in the following output:

    I0512 19:07:17.589083  108432 helper_linux.go:74] Found default gateway interface br-ex 192.168.123.1
    F0512 19:07:17.589141  108432 ovnkube.go:133] failed to get default gateway interface
    Copy to Clipboard Toggle word wrap

    The only resolution is to reconfigure the host networking so that both IP families contain the default gateway.

As a cluster administrator, you can migrate to the OVN-Kubernetes Container Network Interface (CNI) cluster network provider from the OpenShift SDN CNI cluster network provider.

To learn more about OVN-Kubernetes, read About the OVN-Kubernetes network provider.

Migrating to the OVN-Kubernetes Container Network Interface (CNI) cluster network provider is a manual process that includes some downtime during which your cluster is unreachable. Although a rollback procedure is provided, the migration is intended to be a one-way process.

A migration to the OVN-Kubernetes cluster network provider is supported on the following platforms:

  • Bare metal hardware
  • Amazon Web Services (AWS)
  • Google Cloud Platform (GCP)
  • Microsoft Azure
  • Red Hat OpenStack Platform (RHOSP)
  • Red Hat Virtualization (RHV)
  • VMware vSphere
Important

Migrating to or from the OVN-Kubernetes network plugin is not supported for managed OpenShift cloud services such as Red Hat OpenShift Dedicated, Azure Red Hat OpenShift(ARO), and Red Hat OpenShift Service on AWS (ROSA).

Migrating from OpenShift SDN network plugin to OVN-Kubernetes network plugin is not supported on Nutanix.

If you have more than 150 nodes in your OpenShift Container Platform cluster, then open a support case for consultation on your migration to the OVN-Kubernetes network plugin.

The subnets assigned to nodes and the IP addresses assigned to individual pods are not preserved during the migration.

While the OVN-Kubernetes network provider implements many of the capabilities present in the OpenShift SDN network provider, the configuration is not the same.

  • If your cluster uses any of the following OpenShift SDN capabilities, you must manually configure the same capability in OVN-Kubernetes:

    • Namespace isolation
    • Egress IP addresses
    • Egress network policies
    • Egress router pods
    • Multicast
  • If your cluster uses any part of the 100.64.0.0/16 IP address range, you cannot migrate to OVN-Kubernetes because it uses this IP address range internally.

The following sections highlight the differences in configuration between the aforementioned capabilities in OVN-Kubernetes and OpenShift SDN.

Namespace isolation

OVN-Kubernetes supports only the network policy isolation mode.

Important

If your cluster uses OpenShift SDN configured in either the multitenant or subnet isolation modes, you cannot migrate to the OVN-Kubernetes network provider.

Egress IP addresses

The differences in configuring an egress IP address between OVN-Kubernetes and OpenShift SDN is described in the following table:

Expand
Table 23.2. Differences in egress IP address configuration
OVN-KubernetesOpenShift SDN
  • Create an EgressIPs object
  • Add an annotation on a Node object
  • Patch a NetNamespace object
  • Patch a HostSubnet object

For more information on using egress IP addresses in OVN-Kubernetes, see "Configuring an egress IP address".

Egress network policies

The difference in configuring an egress network policy, also known as an egress firewall, between OVN-Kubernetes and OpenShift SDN is described in the following table:

Expand
Table 23.3. Differences in egress network policy configuration
OVN-KubernetesOpenShift SDN
  • Create an EgressFirewall object in a namespace
  • Create an EgressNetworkPolicy object in a namespace

For more information on using an egress firewall in OVN-Kubernetes, see "Configuring an egress firewall for a project".

Egress router pods

OVN-Kubernetes supports egress router pods in redirect mode. OVN-Kubernetes does not support egress router pods in HTTP proxy mode or DNS proxy mode.

When you deploy an egress router with the Cluster Network Operator, you cannot specify a node selector to control which node is used to host the egress router pod.

Multicast

The difference between enabling multicast traffic on OVN-Kubernetes and OpenShift SDN is described in the following table:

Expand
Table 23.4. Differences in multicast configuration
OVN-KubernetesOpenShift SDN
  • Add an annotation on a Namespace object
  • Add an annotation on a NetNamespace object

For more information on using multicast in OVN-Kubernetes, see "Enabling multicast for a project".

Network policies

OVN-Kubernetes fully supports the Kubernetes NetworkPolicy API in the networking.k8s.io/v1 API group. No changes are necessary in your network policies when migrating from OpenShift SDN.

23.2.1.2. How the migration process works

The following table summarizes the migration process by segmenting between the user-initiated steps in the process and the actions that the migration performs in response.

Expand
Table 23.5. Migrating to OVN-Kubernetes from OpenShift SDN
User-initiated stepsMigration activity

Set the migration field of the Network.operator.openshift.io custom resource (CR) named cluster to OVNKubernetes. Make sure the migration field is null before setting it to a value.

Cluster Network Operator (CNO)
Updates the status of the Network.config.openshift.io CR named cluster accordingly.
Machine Config Operator (MCO)
Rolls out an update to the systemd configuration necessary for OVN-Kubernetes; The MCO updates a single machine per pool at a time by default, causing the total time the migration takes to increase with the size of the cluster.

Update the networkType field of the Network.config.openshift.io CR.

CNO

Performs the following actions:

  • Destroys the OpenShift SDN control plane pods.
  • Deploys the OVN-Kubernetes control plane pods.
  • Updates the Multus objects to reflect the new cluster network provider.

Reboot each node in the cluster.

Cluster
As nodes reboot, the cluster assigns IP addresses to pods on the OVN-Kubernetes cluster network.

If a rollback to OpenShift SDN is required, the following table describes the process.

Expand
Table 23.6. Performing a rollback to OpenShift SDN
User-initiated stepsMigration activity

Suspend the MCO to ensure that it does not interrupt the migration.

The MCO stops.

Set the migration field of the Network.operator.openshift.io custom resource (CR) named cluster to OpenShiftSDN. Make sure the migration field is null before setting it to a value.

CNO
Updates the status of the Network.config.openshift.io CR named cluster accordingly.

Update the networkType field.

CNO

Performs the following actions:

  • Destroys the OVN-Kubernetes control plane pods.
  • Deploys the OpenShift SDN control plane pods.
  • Updates the Multus objects to reflect the new cluster network provider.

Reboot each node in the cluster.

Cluster
As nodes reboot, the cluster assigns IP addresses to pods on the OpenShift-SDN network.

Enable the MCO after all nodes in the cluster reboot.

MCO
Rolls out an update to the systemd configuration necessary for OpenShift SDN; The MCO updates a single machine per pool at a time by default, so the total time the migration takes increases with the size of the cluster.

As a cluster administrator, you can change the default Container Network Interface (CNI) network provider for your cluster to OVN-Kubernetes. During the migration, you must reboot every node in your cluster.

Important

While performing the migration, your cluster is unavailable and workloads might be interrupted. Perform the migration only when an interruption in service is acceptable.

Prerequisites

  • A cluster configured with the OpenShift SDN CNI cluster network provider in the network policy isolation mode.
  • Install the OpenShift CLI (oc).
  • Access to the cluster as a user with the cluster-admin role.
  • A recent backup of the etcd database is available.
  • A reboot can be triggered manually for each node.
  • The cluster is in a known good state, without any errors.
  • On all cloud platforms after updating software, a security group rule must be in place to allow UDP packets on port 6081 for all nodes.

Procedure

  1. To backup the configuration for the cluster network, enter the following command:

    $ oc get Network.config.openshift.io cluster -o yaml > cluster-openshift-sdn.yaml
    Copy to Clipboard Toggle word wrap
  2. To prepare all the nodes for the migration, set the migration field on the Cluster Network Operator configuration object by entering the following command:

    $ oc patch Network.operator.openshift.io cluster --type='merge' \
      --patch '{ "spec": { "migration": {"networkType": "OVNKubernetes" } } }'
    Copy to Clipboard Toggle word wrap
    Note

    This step does not deploy OVN-Kubernetes immediately. Instead, specifying the migration field triggers the Machine Config Operator (MCO) to apply new machine configs to all the nodes in the cluster in preparation for the OVN-Kubernetes deployment.

  3. Optional: You can customize the following settings for OVN-Kubernetes to meet your network infrastructure requirements:

    • Maximum transmission unit (MTU). Consider the following before customizing the MTU for this optional step:

      • If you use the default MTU, and you want to keep the default MTU during migration, this step can be ignored.
      • If you used a custom MTU, and you want to keep the custom MTU during migration, you must declare the custom MTU value in this step.
      • This step does not work if you want to change the MTU value during migration. Instead, you must first follow the instructions for "Changing the cluster MTU". You can then keep the custom MTU value by performing this procedure and declaring the custom MTU value in this step.

        Note

        OpenShift-SDN and OVN-Kubernetes have different overlay overhead. MTU values should be selected by following the guidelines found on the "MTU value selection" page.

    • Geneve (Generic Network Virtualization Encapsulation) overlay network port

    To customize either of the previously noted settings, enter and customize the following command. If you do not need to change the default value, omit the key from the patch.

    $ oc patch Network.operator.openshift.io cluster --type=merge \
      --patch '{
        "spec":{
          "defaultNetwork":{
            "ovnKubernetesConfig":{
              "mtu":<mtu>,
              "genevePort":<port>
        }}}}'
    Copy to Clipboard Toggle word wrap
    mtu
    The MTU for the Geneve overlay network. This value is normally configured automatically, but if the nodes in your cluster do not all use the same MTU, then you must set this explicitly to 100 less than the smallest node MTU value.
    port
    The UDP port for the Geneve overlay network. If a value is not specified, the default is 6081. The port cannot be the same as the VXLAN port that is used by OpenShift SDN. The default value for the VXLAN port is 4789.

    Example patch command to update mtu field

    $ oc patch Network.operator.openshift.io cluster --type=merge \
      --patch '{
        "spec":{
          "defaultNetwork":{
            "ovnKubernetesConfig":{
              "mtu":1200
        }}}}'
    Copy to Clipboard Toggle word wrap

  4. As the MCO updates machines in each machine config pool, it reboots each node one by one. You must wait until all the nodes are updated. Check the machine config pool status by entering the following command:

    $ oc get mcp
    Copy to Clipboard Toggle word wrap

    A successfully updated node has the following status: UPDATED=true, UPDATING=false, DEGRADED=false.

    Note

    By default, the MCO updates one machine per pool at a time, causing the total time the migration takes to increase with the size of the cluster.

  5. Confirm the status of the new machine configuration on the hosts:

    1. To list the machine configuration state and the name of the applied machine configuration, enter the following command:

      $ oc describe node | egrep "hostname|machineconfig"
      Copy to Clipboard Toggle word wrap

      Example output

      kubernetes.io/hostname=master-0
      machineconfiguration.openshift.io/currentConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/desiredConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/reason:
      machineconfiguration.openshift.io/state: Done
      Copy to Clipboard Toggle word wrap

      Verify that the following statements are true:

      • The value of machineconfiguration.openshift.io/state field is Done.
      • The value of the machineconfiguration.openshift.io/currentConfig field is equal to the value of the machineconfiguration.openshift.io/desiredConfig field.
    2. To confirm that the machine config is correct, enter the following command:

      $ oc get machineconfig <config_name> -o yaml | grep ExecStart
      Copy to Clipboard Toggle word wrap

      where <config_name> is the name of the machine config from the machineconfiguration.openshift.io/currentConfig field.

      The machine config must include the following update to the systemd configuration:

      ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes
      Copy to Clipboard Toggle word wrap
    3. If a node is stuck in the NotReady state, investigate the machine config daemon pod logs and resolve any errors.

      1. To list the pods, enter the following command:

        $ oc get pod -n openshift-machine-config-operator
        Copy to Clipboard Toggle word wrap

        Example output

        NAME                                         READY   STATUS    RESTARTS   AGE
        machine-config-controller-75f756f89d-sjp8b   1/1     Running   0          37m
        machine-config-daemon-5cf4b                  2/2     Running   0          43h
        machine-config-daemon-7wzcd                  2/2     Running   0          43h
        machine-config-daemon-fc946                  2/2     Running   0          43h
        machine-config-daemon-g2v28                  2/2     Running   0          43h
        machine-config-daemon-gcl4f                  2/2     Running   0          43h
        machine-config-daemon-l5tnv                  2/2     Running   0          43h
        machine-config-operator-79d9c55d5-hth92      1/1     Running   0          37m
        machine-config-server-bsc8h                  1/1     Running   0          43h
        machine-config-server-hklrm                  1/1     Running   0          43h
        machine-config-server-k9rtx                  1/1     Running   0          43h
        Copy to Clipboard Toggle word wrap

        The names for the config daemon pods are in the following format: machine-config-daemon-<seq>. The <seq> value is a random five character alphanumeric sequence.

      2. Display the pod log for the first machine config daemon pod shown in the previous output by enter the following command:

        $ oc logs <pod> -n openshift-machine-config-operator
        Copy to Clipboard Toggle word wrap

        where pod is the name of a machine config daemon pod.

      3. Resolve any errors in the logs shown by the output from the previous command.
  6. To start the migration, configure the OVN-Kubernetes cluster network provider by using one of the following commands:

    • To specify the network provider without changing the cluster network IP address block, enter the following command:

      $ oc patch Network.config.openshift.io cluster \
        --type='merge' --patch '{ "spec": { "networkType": "OVNKubernetes" } }'
      Copy to Clipboard Toggle word wrap
    • To specify a different cluster network IP address block, enter the following command:

      $ oc patch Network.config.openshift.io cluster \
        --type='merge' --patch '{
          "spec": {
            "clusterNetwork": [
              {
                "cidr": "<cidr>",
                "hostPrefix": <prefix>
              }
            ],
            "networkType": "OVNKubernetes"
          }
        }'
      Copy to Clipboard Toggle word wrap

      where cidr is a CIDR block and prefix is the slice of the CIDR block apportioned to each node in your cluster. You cannot use any CIDR block that overlaps with the 100.64.0.0/16 CIDR block because the OVN-Kubernetes network provider uses this block internally.

      Important

      You cannot change the service network address block during the migration.

  7. Verify that the Multus daemon set rollout is complete before continuing with subsequent steps:

    $ oc -n openshift-multus rollout status daemonset/multus
    Copy to Clipboard Toggle word wrap

    The name of the Multus pods is in the form of multus-<xxxxx> where <xxxxx> is a random sequence of letters. It might take several moments for the pods to restart.

    Example output

    Waiting for daemon set "multus" rollout to finish: 1 out of 6 new pods have been updated...
    ...
    Waiting for daemon set "multus" rollout to finish: 5 of 6 updated pods are available...
    daemon set "multus" successfully rolled out
    Copy to Clipboard Toggle word wrap

  8. To complete the migration, reboot each node in your cluster. For example, you can use a bash script similar to the following example. The script assumes that you can connect to each host by using ssh and that you have configured sudo to not prompt for a password.

    #!/bin/bash
    
    for ip in $(oc get nodes  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
    do
       echo "reboot node $ip"
       ssh -o StrictHostKeyChecking=no core@$ip sudo shutdown -r -t 3
    done
    Copy to Clipboard Toggle word wrap

    If ssh access is not available, you might be able to reboot each node through the management portal for your infrastructure provider.

  9. Confirm that the migration succeeded:

    1. To confirm that the CNI cluster network provider is OVN-Kubernetes, enter the following command. The value of status.networkType must be OVNKubernetes.

      $ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'
      Copy to Clipboard Toggle word wrap
    2. To confirm that the cluster nodes are in the Ready state, enter the following command:

      $ oc get nodes
      Copy to Clipboard Toggle word wrap
    3. To confirm that your pods are not in an error state, enter the following command:

      $ oc get pods --all-namespaces -o wide --sort-by='{.spec.nodeName}'
      Copy to Clipboard Toggle word wrap

      If pods on a node are in an error state, reboot that node.

    4. To confirm that all of the cluster Operators are not in an abnormal state, enter the following command:

      $ oc get co
      Copy to Clipboard Toggle word wrap

      The status of every cluster Operator must be the following: AVAILABLE="True", PROGRESSING="False", DEGRADED="False". If a cluster Operator is not available or degraded, check the logs for the cluster Operator for more information.

  10. Complete the following steps only if the migration succeeds and your cluster is in a good state:

    1. To remove the migration configuration from the CNO configuration object, enter the following command:

      $ oc patch Network.operator.openshift.io cluster --type='merge' \
        --patch '{ "spec": { "migration": null } }'
      Copy to Clipboard Toggle word wrap
    2. To remove custom configuration for the OpenShift SDN network provider, enter the following command:

      $ oc patch Network.operator.openshift.io cluster --type='merge' \
        --patch '{ "spec": { "defaultNetwork": { "openshiftSDNConfig": null } } }'
      Copy to Clipboard Toggle word wrap
    3. To remove the OpenShift SDN network provider namespace, enter the following command:

      $ oc delete namespace openshift-sdn
      Copy to Clipboard Toggle word wrap

As a cluster administrator, you can rollback to the OpenShift SDN Container Network Interface (CNI) cluster network provider from the OVN-Kubernetes CNI cluster network provider if the migration to OVN-Kubernetes is unsuccessful.

As a cluster administrator, you can rollback your cluster to the OpenShift SDN Container Network Interface (CNI) cluster network provider. During the rollback, you must reboot every node in your cluster.

Important

Only rollback to OpenShift SDN if the migration to OVN-Kubernetes fails.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Access to the cluster as a user with the cluster-admin role.
  • A cluster installed on infrastructure configured with the OVN-Kubernetes CNI cluster network provider.

Procedure

  1. Stop all of the machine configuration pools managed by the Machine Config Operator (MCO):

    • Stop the master configuration pool:

      $ oc patch MachineConfigPool master --type='merge' --patch \
        '{ "spec": { "paused": true } }'
      Copy to Clipboard Toggle word wrap
    • Stop the worker machine configuration pool:

      $ oc patch MachineConfigPool worker --type='merge' --patch \
        '{ "spec":{ "paused" :true } }'
      Copy to Clipboard Toggle word wrap
  2. To start the migration, set the cluster network provider back to OpenShift SDN by entering the following commands:

    $ oc patch Network.operator.openshift.io cluster --type='merge' \
      --patch '{ "spec": { "migration": { "networkType": "OpenShiftSDN" } } }'
    
    $ oc patch Network.config.openshift.io cluster --type='merge' \
      --patch '{ "spec": { "networkType": "OpenShiftSDN" } }'
    Copy to Clipboard Toggle word wrap
  3. Optional: You can customize the following settings for OpenShift SDN to meet your network infrastructure requirements:

    • Maximum transmission unit (MTU)
    • VXLAN port

    To customize either or both of the previously noted settings, customize and enter the following command. If you do not need to change the default value, omit the key from the patch.

    $ oc patch Network.operator.openshift.io cluster --type=merge \
      --patch '{
        "spec":{
          "defaultNetwork":{
            "openshiftSDNConfig":{
              "mtu":<mtu>,
              "vxlanPort":<port>
        }}}}'
    Copy to Clipboard Toggle word wrap
    mtu
    The MTU for the VXLAN overlay network. This value is normally configured automatically, but if the nodes in your cluster do not all use the same MTU, then you must set this explicitly to 50 less than the smallest node MTU value.
    port
    The UDP port for the VXLAN overlay network. If a value is not specified, the default is 4789. The port cannot be the same as the Geneve port that is used by OVN-Kubernetes. The default value for the Geneve port is 6081.

    Example patch command

    $ oc patch Network.operator.openshift.io cluster --type=merge \
      --patch '{
        "spec":{
          "defaultNetwork":{
            "openshiftSDNConfig":{
              "mtu":1200
        }}}}'
    Copy to Clipboard Toggle word wrap

  4. Wait until the Multus daemon set rollout completes.

    $ oc -n openshift-multus rollout status daemonset/multus
    Copy to Clipboard Toggle word wrap

    The name of the Multus pods is in form of multus-<xxxxx> where <xxxxx> is a random sequence of letters. It might take several moments for the pods to restart.

    Example output

    Waiting for daemon set "multus" rollout to finish: 1 out of 6 new pods have been updated...
    ...
    Waiting for daemon set "multus" rollout to finish: 5 of 6 updated pods are available...
    daemon set "multus" successfully rolled out
    Copy to Clipboard Toggle word wrap

  5. To complete the rollback, reboot each node in your cluster. For example, you could use a bash script similar to the following. The script assumes that you can connect to each host by using ssh and that you have configured sudo to not prompt for a password.

    #!/bin/bash
    
    for ip in $(oc get nodes  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
    do
       echo "reboot node $ip"
       ssh -o StrictHostKeyChecking=no core@$ip sudo shutdown -r -t 3
    done
    Copy to Clipboard Toggle word wrap

    If ssh access is not available, you might be able to reboot each node through the management portal for your infrastructure provider.

  6. After the nodes in your cluster have rebooted, start all of the machine configuration pools:

    • Start the master configuration pool:

      $ oc patch MachineConfigPool master --type='merge' --patch \
        '{ "spec": { "paused": false } }'
      Copy to Clipboard Toggle word wrap
    • Start the worker configuration pool:

      $ oc patch MachineConfigPool worker --type='merge' --patch \
        '{ "spec": { "paused": false } }'
      Copy to Clipboard Toggle word wrap

    As the MCO updates machines in each config pool, it reboots each node.

    By default the MCO updates a single machine per pool at a time, so the time that the migration requires to complete grows with the size of the cluster.

  7. Confirm the status of the new machine configuration on the hosts:

    1. To list the machine configuration state and the name of the applied machine configuration, enter the following command:

      $ oc describe node | egrep "hostname|machineconfig"
      Copy to Clipboard Toggle word wrap

      Example output

      kubernetes.io/hostname=master-0
      machineconfiguration.openshift.io/currentConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/desiredConfig: rendered-master-c53e221d9d24e1c8bb6ee89dd3d8ad7b
      machineconfiguration.openshift.io/reason:
      machineconfiguration.openshift.io/state: Done
      Copy to Clipboard Toggle word wrap

      Verify that the following statements are true:

      • The value of machineconfiguration.openshift.io/state field is Done.
      • The value of the machineconfiguration.openshift.io/currentConfig field is equal to the value of the machineconfiguration.openshift.io/desiredConfig field.
    2. To confirm that the machine config is correct, enter the following command:

      $ oc get machineconfig <config_name> -o yaml
      Copy to Clipboard Toggle word wrap

      where <config_name> is the name of the machine config from the machineconfiguration.openshift.io/currentConfig field.

  8. Confirm that the migration succeeded:

    1. To confirm that the default CNI network provider is OpenShift SDN, enter the following command. The value of status.networkType must be OpenShiftSDN.

      $ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'
      Copy to Clipboard Toggle word wrap
    2. To confirm that the cluster nodes are in the Ready state, enter the following command:

      $ oc get nodes
      Copy to Clipboard Toggle word wrap
    3. If a node is stuck in the NotReady state, investigate the machine config daemon pod logs and resolve any errors.

      1. To list the pods, enter the following command:

        $ oc get pod -n openshift-machine-config-operator
        Copy to Clipboard Toggle word wrap

        Example output

        NAME                                         READY   STATUS    RESTARTS   AGE
        machine-config-controller-75f756f89d-sjp8b   1/1     Running   0          37m
        machine-config-daemon-5cf4b                  2/2     Running   0          43h
        machine-config-daemon-7wzcd                  2/2     Running   0          43h
        machine-config-daemon-fc946                  2/2     Running   0          43h
        machine-config-daemon-g2v28                  2/2     Running   0          43h
        machine-config-daemon-gcl4f                  2/2     Running   0          43h
        machine-config-daemon-l5tnv                  2/2     Running   0          43h
        machine-config-operator-79d9c55d5-hth92      1/1     Running   0          37m
        machine-config-server-bsc8h                  1/1     Running   0          43h
        machine-config-server-hklrm                  1/1     Running   0          43h
        machine-config-server-k9rtx                  1/1     Running   0          43h
        Copy to Clipboard Toggle word wrap

        The names for the config daemon pods are in the following format: machine-config-daemon-<seq>. The <seq> value is a random five character alphanumeric sequence.

      2. To display the pod log for each machine config daemon pod shown in the previous output, enter the following command:

        $ oc logs <pod> -n openshift-machine-config-operator
        Copy to Clipboard Toggle word wrap

        where pod is the name of a machine config daemon pod.

      3. Resolve any errors in the logs shown by the output from the previous command.
    4. To confirm that your pods are not in an error state, enter the following command:

      $ oc get pods --all-namespaces -o wide --sort-by='{.spec.nodeName}'
      Copy to Clipboard Toggle word wrap

      If pods on a node are in an error state, reboot that node.

  9. Complete the following steps only if the migration succeeds and your cluster is in a good state:

    1. To remove the migration configuration from the Cluster Network Operator configuration object, enter the following command:

      $ oc patch Network.operator.openshift.io cluster --type='merge' \
        --patch '{ "spec": { "migration": null } }'
      Copy to Clipboard Toggle word wrap
    2. To remove the OVN-Kubernetes configuration, enter the following command:

      $ oc patch Network.operator.openshift.io cluster --type='merge' \
        --patch '{ "spec": { "defaultNetwork": { "ovnKubernetesConfig":null } } }'
      Copy to Clipboard Toggle word wrap
    3. To remove the OVN-Kubernetes network provider namespace, enter the following command:

      $ oc delete namespace openshift-ovn-kubernetes
      Copy to Clipboard Toggle word wrap

As a cluster administrator, you can convert your IPv4 single-stack cluster to a dual-network cluster network that supports IPv4 and IPv6 address families. After converting to dual-stack, all newly created pods are dual-stack enabled.

Note

A dual-stack network is supported on clusters provisioned on bare metal, IBM Power infrastructure, and single node OpenShift clusters.

Note

While using dual-stack networking, you cannot use IPv4-mapped IPv6 addresses, such as ::FFFF:198.51.100.1, where IPv6 is required.

23.4.1. Converting to a dual-stack cluster network

As a cluster administrator, you can convert your single-stack cluster network to a dual-stack cluster network.

Note

After converting to dual-stack networking only newly created pods are assigned IPv6 addresses. Any pods created before the conversion must be recreated to receive an IPv6 address.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • Your cluster uses the OVN-Kubernetes cluster network provider.
  • The cluster nodes have IPv6 addresses.
  • You have configured an IPv6-enabled router based on your infrastructure.

Procedure

  1. To specify IPv6 address blocks for the cluster and service networks, create a file containing the following YAML:

    - op: add
      path: /spec/clusterNetwork/-
      value: 
    1
    
        cidr: fd01::/48
        hostPrefix: 64
    - op: add
      path: /spec/serviceNetwork/-
      value: fd02::/112 
    2
    Copy to Clipboard Toggle word wrap
    1
    Specify an object with the cidr and hostPrefix fields. The host prefix must be 64 or greater. The IPv6 CIDR prefix must be large enough to accommodate the specified host prefix.
    2
    Specify an IPv6 CIDR with a prefix of 112. Kubernetes uses only the lowest 16 bits. For a prefix of 112, IP addresses are assigned from 112 to 128 bits.
  2. To patch the cluster network configuration, enter the following command:

    $ oc patch network.config.openshift.io cluster \
      --type='json' --patch-file <file>.yaml
    Copy to Clipboard Toggle word wrap

    where:

    file
    Specifies the name of the file you created in the previous step.

    Example output

    network.config.openshift.io/cluster patched
    Copy to Clipboard Toggle word wrap

Verification

Complete the following step to verify that the cluster network recognizes the IPv6 address blocks that you specified in the previous procedure.

  1. Display the network configuration:

    $ oc describe network
    Copy to Clipboard Toggle word wrap

    Example output

    Status:
      Cluster Network:
        Cidr:               10.128.0.0/14
        Host Prefix:        23
        Cidr:               fd01::/48
        Host Prefix:        64
      Cluster Network MTU:  1400
      Network Type:         OVNKubernetes
      Service Network:
        172.30.0.0/16
        fd02::/112
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can convert your dual-stack cluster network to a single-stack cluster network.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • Your cluster uses the OVN-Kubernetes cluster network provider.
  • The cluster nodes have IPv6 addresses.
  • You have enabled dual-stack networking.

Procedure

  1. Edit the networks.config.openshift.io custom resource (CR) by running the following command:

    $ oc edit networks.config.openshift.io
    Copy to Clipboard Toggle word wrap
  2. Remove the IPv6 specific configuration that you have added to the cidr and hostPrefix fields in the previous procedure.

23.5. Configuring IPsec encryption

With IPsec enabled, all pod-to-pod network traffic between nodes on the OVN-Kubernetes cluster network is encrypted with IPsec Transport mode.

IPsec is disabled by default. It can be enabled either during or after installing the cluster. For information about cluster installation, see OpenShift Container Platform installation overview. If you need to enable IPsec after cluster installation, you must first resize your cluster MTU to account for the overhead of the IPsec ESP IP header.

The following documentation describes how to enable and disable IPSec after cluster installation.

23.5.1. Prerequisites

  • You have decreased the size of the cluster MTU by 46 bytes to allow for the additional overhead of the IPsec ESP header. For more information on resizing the MTU that your cluster uses, see Changing the MTU for the cluster network.

With IPsec enabled, only the following network traffic flows between pods are encrypted:

  • Traffic between pods on different nodes on the cluster network
  • Traffic from a pod on the host network to a pod on the cluster network

The following traffic flows are not encrypted:

  • Traffic between pods on the same node on the cluster network
  • Traffic between pods on the host network
  • Traffic from a pod on the cluster network to a pod on the host network

The encrypted and unencrypted flows are illustrated in the following diagram:

You must configure the network connectivity between machines to allow OpenShift Container Platform cluster components to communicate. Each machine must be able to resolve the hostnames of all other machines in the cluster.

Expand
Table 23.7. Ports used for all-machine to all-machine communications
ProtocolPortDescription

UDP

500

IPsec IKE packets

4500

IPsec NAT-T packets

ESP

N/A

IPsec Encapsulating Security Payload (ESP)

23.5.3. Encryption protocol and IPsec mode

The encrypt cipher used is AES-GCM-16-256. The integrity check value (ICV) is 16 bytes. The key length is 256 bits.

The IPsec mode used is Transport mode, a mode that encrypts end-to-end communication by adding an Encapsulated Security Payload (ESP) header to the IP header of the original packet and encrypts the packet data. OpenShift Container Platform does not currently use or support IPsec Tunnel mode for pod-to-pod communication.

The Cluster Network Operator (CNO) generates a self-signed X.509 certificate authority (CA) that is used by IPsec for encryption. Certificate signing requests (CSRs) from each node are automatically fulfilled by the CNO.

The CA is valid for 10 years. The individual node certificates are valid for 5 years and are automatically rotated after 4 1/2 years elapse.

23.5.5. Enabling IPsec encryption

As a cluster administrator, you can enable IPsec encryption after cluster installation.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.
  • You have reduced the size of your cluster MTU by 46 bytes to allow for the overhead of the IPsec ESP header.

Procedure

  • To enable IPsec encryption, enter the following command:

    $ oc patch networks.operator.openshift.io cluster --type=merge \
    -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
    Copy to Clipboard Toggle word wrap

23.5.6. Verifying that IPsec is enabled

As a cluster administrator, you can verify that IPsec is enabled.

Verification

  1. To find the names of the OVN-Kubernetes control plane pods, enter the following command:

    $ oc get pods -n openshift-ovn-kubernetes | grep ovnkube-master
    Copy to Clipboard Toggle word wrap

    Example output

    ovnkube-master-4496s        1/1     Running   0          6h39m
    ovnkube-master-d6cht        1/1     Running   0          6h42m
    ovnkube-master-skblc        1/1     Running   0          6h51m
    ovnkube-master-vf8rf        1/1     Running   0          6h51m
    ovnkube-master-w7hjr        1/1     Running   0          6h51m
    ovnkube-master-zsk7x        1/1     Running   0          6h42m
    Copy to Clipboard Toggle word wrap

  2. Verify that IPsec is enabled on your cluster:

    $ oc -n openshift-ovn-kubernetes -c nbdb rsh ovnkube-master-<XXXXX> \
      ovn-nbctl --no-leader-only get nb_global . ipsec
    Copy to Clipboard Toggle word wrap

    where:

    <XXXXX>
    Specifies the random sequence of letters for a pod from the previous step.

    Example output

    true
    Copy to Clipboard Toggle word wrap

23.5.7. Disabling IPsec encryption

As a cluster administrator, you can disable IPsec encryption only if you enabled IPsec after cluster installation.

Note

If you enabled IPsec when you installed your cluster, you cannot disable IPsec with this procedure.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster with a user with cluster-admin privileges.

Procedure

  1. To disable IPsec encryption, enter the following command:

    $ oc patch networks.operator.openshift.io/cluster --type=json \
      -p='[{"op":"remove", "path":"/spec/defaultNetwork/ovnKubernetesConfig/ipsecConfig"}]'
    Copy to Clipboard Toggle word wrap
  2. Optional: You can increase the size of your cluster MTU by 46 bytes because there is no longer any overhead from the IPsec ESP header in IP packets.

23.5.8. Additional resources

23.6. Configuring an egress firewall for a project

As a cluster administrator, you can create an egress firewall for a project that restricts egress traffic leaving your OpenShift Container Platform cluster.

23.6.1. How an egress firewall works in a project

As a cluster administrator, you can use an egress firewall to limit the external hosts that some or all pods can access from within the cluster. An egress firewall supports the following scenarios:

  • A pod can only connect to internal hosts and cannot initiate connections to the public internet.
  • A pod can only connect to the public internet and cannot initiate connections to internal hosts that are outside the OpenShift Container Platform cluster.
  • A pod cannot reach specified internal subnets or hosts outside the OpenShift Container Platform cluster.
  • A pod can connect to only specific external hosts.

For example, you can allow one project access to a specified IP range but deny the same access to a different project. Or you can restrict application developers from updating from Python pip mirrors, and force updates to come only from approved sources.

Note

Egress firewall does not apply to the host network namespace. Pods with host networking enabled are unaffected by egress firewall rules.

You configure an egress firewall policy by creating an EgressFirewall custom resource (CR) object. The egress firewall matches network traffic that meets any of the following criteria:

  • An IP address range in CIDR format
  • A DNS name that resolves to an IP address
  • A port number
  • A protocol that is one of the following protocols: TCP, UDP, and SCTP
Important

If your egress firewall includes a deny rule for 0.0.0.0/0, access to your OpenShift Container Platform API servers is blocked. To ensure that pods can access the OpenShift Container Platform API servers, you must include the built-in join network 100.64.0.0/16 of Open Virtual Network (OVN) to allow access when using node ports together with an EgressFirewall. You must also include the IP address range that the API servers listen on in your egress firewall rules, as in the following example:

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: <namespace> 
1

spec:
  egress:
  - to:
      cidrSelector: <api_server_address_range> 
2

    type: Allow
# ...
  - to:
      cidrSelector: 0.0.0.0/0 
3

    type: Deny
Copy to Clipboard Toggle word wrap
1
The namespace for the egress firewall.
2
The IP address range that includes your OpenShift Container Platform API servers.
3
A global deny rule prevents access to the OpenShift Container Platform API servers.

To find the IP address for your API servers, run oc get ep kubernetes -n default.

For more information, see BZ#1988324.

Warning

Egress firewall rules do not apply to traffic that goes through routers. Any user with permission to create a Route CR object can bypass egress firewall policy rules by creating a route that points to a forbidden destination.

23.6.1.1. Limitations of an egress firewall

An egress firewall has the following limitations:

  • No project can have more than one EgressFirewall object.
  • A maximum of one EgressFirewall object with a maximum of 8,000 rules can be defined per project.
  • If you are using the OVN-Kubernetes network plugin with shared gateway mode in Red Hat OpenShift Networking, return ingress replies are affected by egress firewall rules. If the egress firewall rules drop the ingress reply destination IP, the traffic is dropped.

Violating any of these restrictions results in a broken egress firewall for the project. Consequently, all external network traffic is dropped, which can cause security risks for your organization.

An Egress Firewall resource can be created in the kube-node-lease, kube-public, kube-system, openshift and openshift- projects.

The egress firewall policy rules are evaluated in the order that they are defined, from first to last. The first rule that matches an egress connection from a pod applies. Any subsequent rules are ignored for that connection.

If you use DNS names in any of your egress firewall policy rules, proper resolution of the domain names is subject to the following restrictions:

  • Domain name updates are polled based on a time-to-live (TTL) duration. By default, the duration is 30 minutes. When the egress firewall controller queries the local name servers for a domain name, if the response includes a TTL and the TTL is less than 30 minutes, the controller sets the duration for that DNS name to the returned value. Each DNS name is queried after the TTL for the DNS record expires.
  • The pod must resolve the domain from the same local name servers when necessary. Otherwise the IP addresses for the domain known by the egress firewall controller and the pod can be different. If the IP addresses for a hostname differ, the egress firewall might not be enforced consistently.
  • Because the egress firewall controller and pods asynchronously poll the same local name server, the pod might obtain the updated IP address before the egress controller does, which causes a race condition. Due to this current limitation, domain name usage in EgressFirewall objects is only recommended for domains with infrequent IP address changes.
Note

The egress firewall always allows pods access to the external interface of the node that the pod is on for DNS resolution.

If you use domain names in your egress firewall policy and your DNS resolution is not handled by a DNS server on the local node, then you must add egress firewall rules that allow access to your DNS server’s IP addresses. if you are using domain names in your pods.

23.6.2. EgressFirewall custom resource (CR) object

You can define one or more rules for an egress firewall. A rule is either an Allow rule or a Deny rule, with a specification for the traffic that the rule applies to.

The following YAML describes an EgressFirewall CR object:

EgressFirewall object

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: <name> 
1

spec:
  egress: 
2

    ...
Copy to Clipboard Toggle word wrap

1
The name for the object must be default.
2
A collection of one or more egress network policy rules as described in the following section.
23.6.2.1. EgressFirewall rules

The following YAML describes an egress firewall rule object. The egress stanza expects an array of one or more objects.

Egress policy rule stanza

egress:
- type: <type> 
1

  to: 
2

    cidrSelector: <cidr> 
3

    dnsName: <dns_name> 
4

  ports: 
5

      ...
Copy to Clipboard Toggle word wrap

1
The type of rule. The value must be either Allow or Deny.
2
A stanza describing an egress traffic match rule that specifies the cidrSelector field or the dnsName field. You cannot use both fields in the same rule.
3
An IP address range in CIDR format.
4
A DNS domain name.
5
Optional: A stanza describing a collection of network ports and protocols for the rule.

Ports stanza

ports:
- port: <port> 
1

  protocol: <protocol> 
2
Copy to Clipboard Toggle word wrap

1
A network port, such as 80 or 443. If you specify a value for this field, you must also specify a value for protocol.
2
A network protocol. The value must be either TCP, UDP, or SCTP.
23.6.2.2. Example EgressFirewall CR objects

The following example defines several egress firewall policy rules:

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress: 
1

  - type: Allow
    to:
      cidrSelector: 1.2.3.0/24
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
Copy to Clipboard Toggle word wrap
1
A collection of egress firewall policy rule objects.

The following example defines a policy rule that denies traffic to the host at the 172.16.1.1 IP address, if the traffic is using either the TCP protocol and destination port 80 or any protocol and destination port 443.

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Deny
    to:
      cidrSelector: 172.16.1.1
    ports:
    - port: 80
      protocol: TCP
    - port: 443
Copy to Clipboard Toggle word wrap

23.6.3. Creating an egress firewall policy object

As a cluster administrator, you can create an egress firewall policy object for a project.

Important

If the project already has an EgressFirewall object defined, you must edit the existing policy to make changes to the egress firewall rules.

Prerequisites

  • A cluster that uses the OVN-Kubernetes default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster as a cluster administrator.

Procedure

  1. Create a policy rule:

    1. Create a <policy_name>.yaml file where <policy_name> describes the egress policy rules.
    2. In the file you created, define an egress policy object.
  2. Enter the following command to create the policy object. Replace <policy_name> with the name of the policy and <project> with the project that the rule applies to.

    $ oc create -f <policy_name>.yaml -n <project>
    Copy to Clipboard Toggle word wrap

    In the following example, a new EgressFirewall object is created in a project named project1:

    $ oc create -f default.yaml -n project1
    Copy to Clipboard Toggle word wrap

    Example output

    egressfirewall.k8s.ovn.org/v1 created
    Copy to Clipboard Toggle word wrap

  3. Optional: Save the <policy_name>.yaml file so that you can make changes later.

23.7. Viewing an egress firewall for a project

As a cluster administrator, you can list the names of any existing egress firewalls and view the traffic rules for a specific egress firewall.

23.7.1. Viewing an EgressFirewall object

You can view an EgressFirewall object in your cluster.

Prerequisites

  • A cluster using the OVN-Kubernetes default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift Command-line Interface (CLI), commonly known as oc.
  • You must log in to the cluster.

Procedure

  1. Optional: To view the names of the EgressFirewall objects defined in your cluster, enter the following command:

    $ oc get egressfirewall --all-namespaces
    Copy to Clipboard Toggle word wrap
  2. To inspect a policy, enter the following command. Replace <policy_name> with the name of the policy to inspect.

    $ oc describe egressfirewall <policy_name>
    Copy to Clipboard Toggle word wrap

    Example output

    Name:		default
    Namespace:	project1
    Created:	20 minutes ago
    Labels:		<none>
    Annotations:	<none>
    Rule:		Allow to 1.2.3.0/24
    Rule:		Allow to www.example.com
    Rule:		Deny to 0.0.0.0/0
    Copy to Clipboard Toggle word wrap

23.8. Editing an egress firewall for a project

As a cluster administrator, you can modify network traffic rules for an existing egress firewall.

23.8.1. Editing an EgressFirewall object

As a cluster administrator, you can update the egress firewall for a project.

Prerequisites

  • A cluster using the OVN-Kubernetes default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster as a cluster administrator.

Procedure

  1. Find the name of the EgressFirewall object for the project. Replace <project> with the name of the project.

    $ oc get -n <project> egressfirewall
    Copy to Clipboard Toggle word wrap
  2. Optional: If you did not save a copy of the EgressFirewall object when you created the egress network firewall, enter the following command to create a copy.

    $ oc get -n <project> egressfirewall <name> -o yaml > <filename>.yaml
    Copy to Clipboard Toggle word wrap

    Replace <project> with the name of the project. Replace <name> with the name of the object. Replace <filename> with the name of the file to save the YAML to.

  3. After making changes to the policy rules, enter the following command to replace the EgressFirewall object. Replace <filename> with the name of the file containing the updated EgressFirewall object.

    $ oc replace -f <filename>.yaml
    Copy to Clipboard Toggle word wrap

23.9. Removing an egress firewall from a project

As a cluster administrator, you can remove an egress firewall from a project to remove all restrictions on network traffic from the project that leaves the OpenShift Container Platform cluster.

23.9.1. Removing an EgressFirewall object

As a cluster administrator, you can remove an egress firewall from a project.

Prerequisites

  • A cluster using the OVN-Kubernetes default Container Network Interface (CNI) network provider plugin.
  • Install the OpenShift CLI (oc).
  • You must log in to the cluster as a cluster administrator.

Procedure

  1. Find the name of the EgressFirewall object for the project. Replace <project> with the name of the project.

    $ oc get -n <project> egressfirewall
    Copy to Clipboard Toggle word wrap
  2. Enter the following command to delete the EgressFirewall object. Replace <project> with the name of the project and <name> with the name of the object.

    $ oc delete -n <project> egressfirewall <name>
    Copy to Clipboard Toggle word wrap

23.10. Configuring an egress IP address

As a cluster administrator, you can configure the OVN-Kubernetes Container Network Interface (CNI) cluster network provider to assign one or more egress IP addresses to a namespace, or to specific pods in a namespace.

The OpenShift Container Platform egress IP address functionality allows you to ensure that the traffic from one or more pods in one or more namespaces has a consistent source IP address for services outside the cluster network.

For example, you might have a pod that periodically queries a database that is hosted on a server outside of your cluster. To enforce access requirements for the server, a packet filtering device is configured to allow traffic only from specific IP addresses. To ensure that you can reliably allow access to the server from only that specific pod, you can configure a specific egress IP address for the pod that makes the requests to the server.

An egress IP address assigned to a namespace is different from an egress router, which is used to send traffic to specific destinations.

In some cluster configurations, application pods and ingress router pods run on the same node. If you configure an egress IP address for an application project in this scenario, the IP address is not used when you send a request to a route from the application project.

Important

Egress IP addresses must not be configured in any Linux network configuration files, such as ifcfg-eth0.

23.10.1.1. Platform support

Support for the egress IP address functionality on various platforms is summarized in the following table:

Expand
PlatformSupported

Bare metal

Yes

VMware vSphere

Yes

Red Hat OpenStack Platform (RHOSP)

No

Amazon Web Services (AWS)

Yes

Google Cloud Platform (GCP)

Yes

Microsoft Azure

Yes

Important

The assignment of egress IP addresses to control plane nodes with the EgressIP feature is not supported on a cluster provisioned on Amazon Web Services (AWS). (BZ#2039656)

23.10.1.2. Public cloud platform considerations

For clusters provisioned on public cloud infrastructure, there is a constraint on the absolute number of assignable IP addresses per node. The maximum number of assignable IP addresses per node, or the IP capacity, can be described in the following formula:

IP capacity = public cloud default capacity - sum(current IP assignments)
Copy to Clipboard Toggle word wrap

While the Egress IPs capability manages the IP address capacity per node, it is important to plan for this constraint in your deployments. For example, for a cluster installed on bare-metal infrastructure with 8 nodes you can configure 150 egress IP addresses. However, if a public cloud provider limits IP address capacity to 10 IP addresses per node, the total number of assignable IP addresses is only 80. To achieve the same IP address capacity in this example cloud provider, you would need to allocate 7 additional nodes.

To confirm the IP capacity and subnets for any node in your public cloud environment, you can enter the oc get node <node_name> -o yaml command. The cloud.network.openshift.io/egress-ipconfig annotation includes capacity and subnet information for the node.

The annotation value is an array with a single object with fields that provide the following information for the primary network interface:

  • interface: Specifies the interface ID on AWS and Azure and the interface name on GCP.
  • ifaddr: Specifies the subnet mask for one or both IP address families.
  • capacity: Specifies the IP address capacity for the node. On AWS, the IP address capacity is provided per IP address family. On Azure and GCP, the IP address capacity includes both IPv4 and IPv6 addresses.

The following examples illustrate the annotation from nodes on several public cloud providers. The annotations are indented for readability.

Example cloud.network.openshift.io/egress-ipconfig annotation on AWS

cloud.network.openshift.io/egress-ipconfig: [
  {
    "interface":"eni-078d267045138e436",
    "ifaddr":{"ipv4":"10.0.128.0/18"},
    "capacity":{"ipv4":14,"ipv6":15}
  }
]
Copy to Clipboard Toggle word wrap

Example cloud.network.openshift.io/egress-ipconfig annotation on GCP

cloud.network.openshift.io/egress-ipconfig: [
  {
    "interface":"nic0",
    "ifaddr":{"ipv4":"10.0.128.0/18"},
    "capacity":{"ip":14}
  }
]
Copy to Clipboard Toggle word wrap

The following sections describe the IP address capacity for supported public cloud environments for use in your capacity calculation.

On AWS, constraints on IP address assignments depend on the instance type configured. For more information, see IP addresses per network interface per instance type

On GCP, the networking model implements additional node IP addresses through IP address aliasing, rather than IP address assignments. However, IP address capacity maps directly to IP aliasing capacity.

The following capacity limits exist for IP aliasing assignment:

  • Per node, the maximum number of IP aliases, both IPv4 and IPv6, is 10.
  • Per VPC, the maximum number of IP aliases is unspecified, but OpenShift Container Platform scalability testing reveals the maximum to be approximately 15,000.

For more information, see Per instance quotas and Alias IP ranges overview.

On Azure, the following capacity limits exist for IP address assignment:

  • Per NIC, the maximum number of assignable IP addresses, for both IPv4 and IPv6, is 256.
  • Per virtual network, the maximum number of assigned IP addresses cannot exceed 65,536.

For more information, see Networking limits.

23.10.1.3. Assignment of egress IPs to pods

To assign one or more egress IPs to a namespace or specific pods in a namespace, the following conditions must be satisfied:

  • At least one node in your cluster must have the k8s.ovn.org/egress-assignable: "" label.
  • An EgressIP object exists that defines one or more egress IP addresses to use as the source IP address for traffic leaving the cluster from pods in a namespace.
Important

If you create EgressIP objects prior to labeling any nodes in your cluster for egress IP assignment, OpenShift Container Platform might assign every egress IP address to the first node with the k8s.ovn.org/egress-assignable: "" label.

To ensure that egress IP addresses are widely distributed across nodes in the cluster, always apply the label to the nodes you intent to host the egress IP addresses before creating any EgressIP objects.

23.10.1.4. Assignment of egress IPs to nodes

When creating an EgressIP object, the following conditions apply to nodes that are labeled with the k8s.ovn.org/egress-assignable: "" label:

  • An egress IP address is never assigned to more than one node at a time.
  • An egress IP address is equally balanced between available nodes that can host the egress IP address.
  • If the spec.EgressIPs array in an EgressIP object specifies more than one IP address, the following conditions apply:

    • No node will ever host more than one of the specified IP addresses.
    • Traffic is balanced roughly equally between the specified IP addresses for a given namespace.
  • If a node becomes unavailable, any egress IP addresses assigned to it are automatically reassigned, subject to the previously described conditions.

When a pod matches the selector for multiple EgressIP objects, there is no guarantee which of the egress IP addresses that are specified in the EgressIP objects is assigned as the egress IP address for the pod.

Additionally, if an EgressIP object specifies multiple egress IP addresses, there is no guarantee which of the egress IP addresses might be used. For example, if a pod matches a selector for an EgressIP object with two egress IP addresses, 10.10.20.1 and 10.10.20.2, either might be used for each TCP connection or UDP conversation.

The following diagram depicts an egress IP address configuration. The diagram describes four pods in two different namespaces running on three nodes in a cluster. The nodes are assigned IP addresses from the 192.168.126.0/18 CIDR block on the host network.

Both Node 1 and Node 3 are labeled with k8s.ovn.org/egress-assignable: "" and thus available for the assignment of egress IP addresses.

The dashed lines in the diagram depict the traffic flow from pod1, pod2, and pod3 traveling through the pod network to egress the cluster from Node 1 and Node 3. When an external service receives traffic from any of the pods selected by the example EgressIP object, the source IP address is either 192.168.126.10 or 192.168.126.102. The traffic is balanced roughly equally between these two nodes.

The following resources from the diagram are illustrated in detail:

Namespace objects

The namespaces are defined in the following manifest:

Namespace objects

apiVersion: v1
kind: Namespace
metadata:
  name: namespace1
  labels:
    env: prod
---
apiVersion: v1
kind: Namespace
metadata:
  name: namespace2
  labels:
    env: prod
Copy to Clipboard Toggle word wrap

EgressIP object

The following EgressIP object describes a configuration that selects all pods in any namespace with the env label set to prod. The egress IP addresses for the selected pods are 192.168.126.10 and 192.168.126.102.

EgressIP object

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressips-prod
spec:
  egressIPs:
  - 192.168.126.10
  - 192.168.126.102
  namespaceSelector:
    matchLabels:
      env: prod
status:
  items:
  - node: node1
    egressIP: 192.168.126.10
  - node: node3
    egressIP: 192.168.126.102
Copy to Clipboard Toggle word wrap

For the configuration in the previous example, OpenShift Container Platform assigns both egress IP addresses to the available nodes. The status field reflects whether and where the egress IP addresses are assigned.

23.10.2. EgressIP object

The following YAML describes the API for the EgressIP object. The scope of the object is cluster-wide; it is not created in a namespace.

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: <name> 
1

spec:
  egressIPs: 
2

  - <ip_address>
  namespaceSelector: 
3

    ...
  podSelector: 
4

    ...
Copy to Clipboard Toggle word wrap
1
The name for the EgressIPs object.
2
An array of one or more IP addresses.
3
One or more selectors for the namespaces to associate the egress IP addresses with.
4
Optional: One or more selectors for pods in the specified namespaces to associate egress IP addresses with. Applying these selectors allows for the selection of a subset of pods within a namespace.

The following YAML describes the stanza for the namespace selector:

Namespace selector stanza

namespaceSelector: 
1

  matchLabels:
    <label_name>: <label_value>
Copy to Clipboard Toggle word wrap

1
One or more matching rules for namespaces. If more than one match rule is provided, all matching namespaces are selected.

The following YAML describes the optional stanza for the pod selector:

Pod selector stanza

podSelector: 
1

  matchLabels:
    <label_name>: <label_value>
Copy to Clipboard Toggle word wrap

1
Optional: One or more matching rules for pods in the namespaces that match the specified namespaceSelector rules. If specified, only pods that match are selected. Others pods in the namespace are not selected.

In the following example, the EgressIP object associates the 192.168.126.11 and 192.168.126.102 egress IP addresses with pods that have the app label set to web and are in the namespaces that have the env label set to prod:

Example EgressIP object

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-group1
spec:
  egressIPs:
  - 192.168.126.11
  - 192.168.126.102
  podSelector:
    matchLabels:
      app: web
  namespaceSelector:
    matchLabels:
      env: prod
Copy to Clipboard Toggle word wrap

In the following example, the EgressIP object associates the 192.168.127.30 and 192.168.127.40 egress IP addresses with any pods that do not have the environment label set to development:

Example EgressIP object

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-group2
spec:
  egressIPs:
  - 192.168.127.30
  - 192.168.127.40
  namespaceSelector:
    matchExpressions:
    - key: environment
      operator: NotIn
      values:
      - development
Copy to Clipboard Toggle word wrap

You can apply the k8s.ovn.org/egress-assignable="" label to a node in your cluster so that OpenShift Container Platform can assign one or more egress IP addresses to the node.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster as a cluster administrator.

Procedure

  • To label a node so that it can host one or more egress IP addresses, enter the following command:

    $ oc label nodes <node_name> k8s.ovn.org/egress-assignable="" 
    1
    Copy to Clipboard Toggle word wrap
    1
    The name of the node to label.
    Tip

    You can alternatively apply the following YAML to add the label to a node:

    apiVersion: v1
    kind: Node
    metadata:
      labels:
        k8s.ovn.org/egress-assignable: ""
      name: <node_name>
    Copy to Clipboard Toggle word wrap

23.10.4. Next steps

23.11. Assigning an egress IP address

As a cluster administrator, you can assign an egress IP address for traffic leaving the cluster from a namespace or from specific pods in a namespace.

You can assign one or more egress IP addresses to a namespace or to specific pods in a namespace.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster as a cluster administrator.
  • Configure at least one node to host an egress IP address.

Procedure

  1. Create an EgressIP object:

    1. Create a <egressips_name>.yaml file where <egressips_name> is the name of the object.
    2. In the file that you created, define an EgressIP object, as in the following example:

      apiVersion: k8s.ovn.org/v1
      kind: EgressIP
      metadata:
        name: egress-project1
      spec:
        egressIPs:
        - 192.168.127.10
        - 192.168.127.11
        namespaceSelector:
          matchLabels:
            env: qa
      Copy to Clipboard Toggle word wrap
  2. To create the object, enter the following command.

    $ oc apply -f <egressips_name>.yaml 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <egressips_name> with the name of the object.

    Example output

    egressips.k8s.ovn.org/<egressips_name> created
    Copy to Clipboard Toggle word wrap

  3. Optional: Save the <egressips_name>.yaml file so that you can make changes later.
  4. Add labels to the namespace that requires egress IP addresses. To add a label to the namespace of an EgressIP object defined in step 1, run the following command:

    $ oc label ns <namespace> env=qa 
    1
    Copy to Clipboard Toggle word wrap
    1
    Replace <namespace> with the namespace that requires egress IP addresses.

23.12.1. About an egress router pod

The OpenShift Container Platform egress router pod redirects traffic to a specified remote server from a private source IP address that is not used for any other purpose. An egress router pod can send network traffic to servers that are set up to allow access only from specific IP addresses.

Note

The egress router pod is not intended for every outgoing connection. Creating large numbers of egress router pods can exceed the limits of your network hardware. For example, creating an egress router pod for every project or application could exceed the number of local MAC addresses that the network interface can handle before reverting to filtering MAC addresses in software.

Important

The egress router image is not compatible with Amazon AWS, Azure Cloud, or any other cloud platform that does not support layer 2 manipulations due to their incompatibility with macvlan traffic.

23.12.1.1. Egress router modes

In redirect mode, an egress router pod configures iptables rules to redirect traffic from its own IP address to one or more destination IP addresses. Client pods that need to use the reserved source IP address must be configured to access the service for the egress router rather than connecting directly to the destination IP. You can access the destination service and port from the application pod by using the curl command. For example:

$ curl <router_service_IP> <port>
Copy to Clipboard Toggle word wrap
Note

The egress router CNI plugin supports redirect mode only. This is a difference with the egress router implementation that you can deploy with OpenShift SDN. Unlike the egress router for OpenShift SDN, the egress router CNI plugin does not support HTTP proxy mode or DNS proxy mode.

23.12.1.2. Egress router pod implementation

The egress router implementation uses the egress router Container Network Interface (CNI) plugin. The plugin adds a secondary network interface to a pod.

An egress router is a pod that has two network interfaces. For example, the pod can have eth0 and net1 network interfaces. The eth0 interface is on the cluster network and the pod continues to use the interface for ordinary cluster-related network traffic. The net1 interface is on a secondary network and has an IP address and gateway for that network. Other pods in the OpenShift Container Platform cluster can access the egress router service and the service enables the pods to access external services. The egress router acts as a bridge between pods and an external system.

Traffic that leaves the egress router exits through a node, but the packets have the MAC address of the net1 interface from the egress router pod.

When you add an egress router custom resource, the Cluster Network Operator creates the following objects:

  • The network attachment definition for the net1 secondary network interface of the pod.
  • A deployment for the egress router.

If you delete an egress router custom resource, the Operator deletes the two objects in the preceding list that are associated with the egress router.

23.12.1.3. Deployment considerations

An egress router pod adds an additional IP address and MAC address to the primary network interface of the node. As a result, you might need to configure your hypervisor or cloud provider to allow the additional address.

Red Hat OpenStack Platform (RHOSP)

If you deploy OpenShift Container Platform on RHOSP, you must allow traffic from the IP and MAC addresses of the egress router pod on your OpenStack environment. If you do not allow the traffic, then communication will fail:

$ openstack port set --allowed-address \
  ip_address=<ip_address>,mac_address=<mac_address> <neutron_port_uuid>
Copy to Clipboard Toggle word wrap
Red Hat Virtualization (RHV)
If you are using RHV, you must select No Network Filter for the Virtual network interface controller (vNIC).
VMware vSphere
If you are using VMware vSphere, see the VMware documentation for securing vSphere standard switches. View and change VMware vSphere default settings by selecting the host virtual switch from the vSphere Web Client.

Specifically, ensure that the following are enabled:

23.12.1.4. Failover configuration

To avoid downtime, the Cluster Network Operator deploys the egress router pod as a deployment resource. The deployment name is egress-router-cni-deployment. The pod that corresponds to the deployment has a label of app=egress-router-cni.

To create a new service for the deployment, use the oc expose deployment/egress-router-cni-deployment --port <port_number> command or create a file like the following example:

apiVersion: v1
kind: Service
metadata:
  name: app-egress
spec:
  ports:
  - name: tcp-8080
    protocol: TCP
    port: 8080
  - name: tcp-8443
    protocol: TCP
    port: 8443
  - name: udp-80
    protocol: UDP
    port: 80
  type: ClusterIP
  selector:
    app: egress-router-cni
Copy to Clipboard Toggle word wrap

As a cluster administrator, you can deploy an egress router pod to redirect traffic to specified destination IP addresses from a reserved source IP address.

The egress router implementation uses the egress router Container Network Interface (CNI) plugin.

23.13.1. Egress router custom resource

Define the configuration for an egress router pod in an egress router custom resource. The following YAML describes the fields for the configuration of an egress router in redirect mode:

apiVersion: network.operator.openshift.io/v1
kind: EgressRouter
metadata:
  name: <egress_router_name>
  namespace: <namespace>  <.>
spec:
  addresses: [  <.>
    {
      ip: "<egress_router>",  <.>
      gateway: "<egress_gateway>"  <.>
    }
  ]
  mode: Redirect
  redirect: {
    redirectRules: [  <.>
      {
        destinationIP: "<egress_destination>",
        port: <egress_router_port>,
        targetPort: <target_port>,  <.>
        protocol: <network_protocol>  <.>
      },
      ...
    ],
    fallbackIP: "<egress_destination>" <.>
  }
Copy to Clipboard Toggle word wrap

<.> Optional: The namespace field specifies the namespace to create the egress router in. If you do not specify a value in the file or on the command line, the default namespace is used.

<.> The addresses field specifies the IP addresses to configure on the secondary network interface.

<.> The ip field specifies the reserved source IP address and netmask from the physical network that the node is on to use with egress router pod. Use CIDR notation to specify the IP address and netmask.

<.> The gateway field specifies the IP address of the network gateway.

<.> Optional: The redirectRules field specifies a combination of egress destination IP address, egress router port, and protocol. Incoming connections to the egress router on the specified port and protocol are routed to the destination IP address.

<.> Optional: The targetPort field specifies the network port on the destination IP address. If this field is not specified, traffic is routed to the same network port that it arrived on.

<.> The protocol field supports TCP, UDP, or SCTP.

<.> Optional: The fallbackIP field specifies a destination IP address. If you do not specify any redirect rules, the egress router sends all traffic to this fallback IP address. If you specify redirect rules, any connections to network ports that are not defined in the rules are sent by the egress router to this fallback IP address. If you do not specify this field, the egress router rejects connections to network ports that are not defined in the rules.

Example egress router specification

apiVersion: network.operator.openshift.io/v1
kind: EgressRouter
metadata:
  name: egress-router-redirect
spec:
  networkInterface: {
    macvlan: {
      mode: "Bridge"
    }
  }
  addresses: [
    {
      ip: "192.168.12.99/24",
      gateway: "192.168.12.1"
    }
  ]
  mode: Redirect
  redirect: {
    redirectRules: [
      {
        destinationIP: "10.0.0.99",
        port: 80,
        protocol: UDP
      },
      {
        destinationIP: "203.0.113.26",
        port: 8080,
        targetPort: 80,
        protocol: TCP
      },
      {
        destinationIP: "203.0.113.27",
        port: 8443,
        targetPort: 443,
        protocol: TCP
      }
    ]
  }
Copy to Clipboard Toggle word wrap

You can deploy an egress router to redirect traffic from its own reserved source IP address to one or more destination IP addresses.

After you add an egress router, the client pods that need to use the reserved source IP address must be modified to connect to the egress router rather than connecting directly to the destination IP.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create an egress router definition.
  2. To ensure that other pods can find the IP address of the egress router pod, create a service that uses the egress router, as in the following example:

    apiVersion: v1
    kind: Service
    metadata:
      name: egress-1
    spec:
      ports:
      - name: web-app
        protocol: TCP
        port: 8080
      type: ClusterIP
      selector:
        app: egress-router-cni <.>
    Copy to Clipboard Toggle word wrap

    <.> Specify the label for the egress router. The value shown is added by the Cluster Network Operator and is not configurable.

    After you create the service, your pods can connect to the service. The egress router pod redirects traffic to the corresponding port on the destination IP address. The connections originate from the reserved source IP address.

Verification

To verify that the Cluster Network Operator started the egress router, complete the following procedure:

  1. View the network attachment definition that the Operator created for the egress router:

    $ oc get network-attachment-definition egress-router-cni-nad
    Copy to Clipboard Toggle word wrap

    The name of the network attachment definition is not configurable.

    Example output

    NAME                    AGE
    egress-router-cni-nad   18m
    Copy to Clipboard Toggle word wrap

  2. View the deployment for the egress router pod:

    $ oc get deployment egress-router-cni-deployment
    Copy to Clipboard Toggle word wrap

    The name of the deployment is not configurable.

    Example output

    NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
    egress-router-cni-deployment   1/1     1            1           18m
    Copy to Clipboard Toggle word wrap

  3. View the status of the egress router pod:

    $ oc get pods -l app=egress-router-cni
    Copy to Clipboard Toggle word wrap

    Example output

    NAME                                            READY   STATUS    RESTARTS   AGE
    egress-router-cni-deployment-575465c75c-qkq6m   1/1     Running   0          18m
    Copy to Clipboard Toggle word wrap

  4. View the logs and the routing table for the egress router pod.
  1. Get the node name for the egress router pod:

    $ POD_NODENAME=$(oc get pod -l app=egress-router-cni -o jsonpath="{.items[0].spec.nodeName}")
    Copy to Clipboard Toggle word wrap
  2. Enter into a debug session on the target node. This step instantiates a debug pod called <node_name>-debug:

    $ oc debug node/$POD_NODENAME
    Copy to Clipboard Toggle word wrap
  3. Set /host as the root directory within the debug shell. The debug pod mounts the root file system of the host in /host within the pod. By changing the root directory to /host, you can run binaries from the executable paths of the host:

    # chroot /host
    Copy to Clipboard Toggle word wrap
  4. From within the chroot environment console, display the egress router logs:

    # cat /tmp/egress-router-log
    Copy to Clipboard Toggle word wrap

    Example output

    2021-04-26T12:27:20Z [debug] Called CNI ADD
    2021-04-26T12:27:20Z [debug] Gateway: 192.168.12.1
    2021-04-26T12:27:20Z [debug] IP Source Addresses: [192.168.12.99/24]
    2021-04-26T12:27:20Z [debug] IP Destinations: [80 UDP 10.0.0.99/30 8080 TCP 203.0.113.26/30 80 8443 TCP 203.0.113.27/30 443]
    2021-04-26T12:27:20Z [debug] Created macvlan interface
    2021-04-26T12:27:20Z [debug] Renamed macvlan to "net1"
    2021-04-26T12:27:20Z [debug] Adding route to gateway 192.168.12.1 on macvlan interface
    2021-04-26T12:27:20Z [debug] deleted default route {Ifindex: 3 Dst: <nil> Src: <nil> Gw: 10.128.10.1 Flags: [] Table: 254}
    2021-04-26T12:27:20Z [debug] Added new default route with gateway 192.168.12.1
    2021-04-26T12:27:20Z [debug] Added iptables rule: iptables -t nat PREROUTING -i eth0 -p UDP --dport 80 -j DNAT --to-destination 10.0.0.99
    2021-04-26T12:27:20Z [debug] Added iptables rule: iptables -t nat PREROUTING -i eth0 -p TCP --dport 8080 -j DNAT --to-destination 203.0.113.26:80
    2021-04-26T12:27:20Z [debug] Added iptables rule: iptables -t nat PREROUTING -i eth0 -p TCP --dport 8443 -j DNAT --to-destination 203.0.113.27:443
    2021-04-26T12:27:20Z [debug] Added iptables rule: iptables -t nat -o net1 -j SNAT --to-source 192.168.12.99
    Copy to Clipboard Toggle word wrap

    The logging file location and logging level are not configurable when you start the egress router by creating an EgressRouter object as described in this procedure.

  5. From within the chroot environment console, get the container ID:

    # crictl ps --name egress-router-cni-pod | awk '{print $1}'
    Copy to Clipboard Toggle word wrap

    Example output

    CONTAINER
    bac9fae69ddb6
    Copy to Clipboard Toggle word wrap

  6. Determine the process ID of the container. In this example, the container ID is bac9fae69ddb6:

    # crictl inspect -o yaml bac9fae69ddb6 | grep 'pid:' | awk '{print $2}'
    Copy to Clipboard Toggle word wrap

    Example output

    68857
    Copy to Clipboard Toggle word wrap

  7. Enter the network namespace of the container:

    # nsenter -n -t 68857
    Copy to Clipboard Toggle word wrap
  8. Display the routing table:

    # ip route
    Copy to Clipboard Toggle word wrap

    In the following example output, the net1 network interface is the default route. Traffic for the cluster network uses the eth0 network interface. Traffic for the 192.168.12.0/24 network uses the net1 network interface and originates from the reserved source IP address 192.168.12.99. The pod routes all other traffic to the gateway at IP address 192.168.12.1. Routing for the service network is not shown.

    Example output

    default via 192.168.12.1 dev net1
    10.128.10.0/23 dev eth0 proto kernel scope link src 10.128.10.18
    192.168.12.0/24 dev net1 proto kernel scope link src 192.168.12.99
    192.168.12.1 dev net1
    Copy to Clipboard Toggle word wrap

23.14. Enabling multicast for a project

23.14.1. About multicast

With IP multicast, data is broadcast to many IP addresses simultaneously.

Important
  • At this time, multicast is best used for low-bandwidth coordination or service discovery and not a high-bandwidth solution.
  • By default, network policies affect all connections in a namespace. However, multicast is unaffected by network policies. If multicast is enabled in the same namespace as your network policies, it is always allowed, even if there is a deny-all network policy. Cluster administrators should consider the implications to the exemption of multicast from network policies before enabling it.

Multicast traffic between OpenShift Container Platform pods is disabled by default. If you are using the OVN-Kubernetes default Container Network Interface (CNI) network provider, you can enable multicast on a per-project basis.

23.14.2. Enabling multicast between pods

You can enable multicast between pods for your project.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  • Run the following command to enable multicast for a project. Replace <namespace> with the namespace for the project you want to enable multicast for.

    $ oc annotate namespace <namespace> \
        k8s.ovn.org/multicast-enabled=true
    Copy to Clipboard Toggle word wrap
    Tip

    You can alternatively apply the following YAML to add the annotation:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: <namespace>
      annotations:
        k8s.ovn.org/multicast-enabled: "true"
    Copy to Clipboard Toggle word wrap

Verification

To verify that multicast is enabled for a project, complete the following procedure:

  1. Change your current project to the project that you enabled multicast for. Replace <project> with the project name.

    $ oc project <project>
    Copy to Clipboard Toggle word wrap
  2. Create a pod to act as a multicast receiver:

    $ cat <<EOF| oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: mlistener
      labels:
        app: multicast-verify
    spec:
      containers:
        - name: mlistener
          image: registry.access.redhat.com/ubi8
          command: ["/bin/sh", "-c"]
          args:
            ["dnf -y install socat hostname && sleep inf"]
          ports:
            - containerPort: 30102
              name: mlistener
              protocol: UDP
    EOF
    Copy to Clipboard Toggle word wrap
  3. Create a pod to act as a multicast sender:

    $ cat <<EOF| oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: msender
      labels:
        app: multicast-verify
    spec:
      containers:
        - name: msender
          image: registry.access.redhat.com/ubi8
          command: ["/bin/sh", "-c"]
          args:
            ["dnf -y install socat && sleep inf"]
    EOF
    Copy to Clipboard Toggle word wrap
  4. In a new terminal window or tab, start the multicast listener.

    1. Get the IP address for the Pod:

      $ POD_IP=$(oc get pods mlistener -o jsonpath='{.status.podIP}')
      Copy to Clipboard Toggle word wrap
    2. Start the multicast listener by entering the following command:

      $ oc exec mlistener -i -t -- \
          socat UDP4-RECVFROM:30102,ip-add-membership=224.1.0.1:$POD_IP,fork EXEC:hostname
      Copy to Clipboard Toggle word wrap
  5. Start the multicast transmitter.

    1. Get the pod network IP address range:

      $ CIDR=$(oc get Network.config.openshift.io cluster \
          -o jsonpath='{.status.clusterNetwork[0].cidr}')
      Copy to Clipboard Toggle word wrap
    2. To send a multicast message, enter the following command:

      $ oc exec msender -i -t -- \
          /bin/bash -c "echo | socat STDIO UDP4-DATAGRAM:224.1.0.1:30102,range=$CIDR,ip-multicast-ttl=64"
      Copy to Clipboard Toggle word wrap

      If multicast is working, the previous command returns the following output:

      mlistener
      Copy to Clipboard Toggle word wrap

23.15. Disabling multicast for a project

23.15.1. Disabling multicast between pods

You can disable multicast between pods for your project.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  • Disable multicast by running the following command:

    $ oc annotate namespace <namespace> \ 
    1
    
        k8s.ovn.org/multicast-enabled-
    Copy to Clipboard Toggle word wrap
    1
    The namespace for the project you want to disable multicast for.
    Tip

    You can alternatively apply the following YAML to delete the annotation:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: <namespace>
      annotations:
        k8s.ovn.org/multicast-enabled: null
    Copy to Clipboard Toggle word wrap

23.16. Tracking network flows

As a cluster administrator, you can collect information about pod network flows from your cluster to assist with the following areas:

  • Monitor ingress and egress traffic on the pod network.
  • Troubleshoot performance issues.
  • Gather data for capacity planning and security audits.

When you enable the collection of the network flows, only the metadata about the traffic is collected. For example, packet data is not collected, but the protocol, source address, destination address, port numbers, number of bytes, and other packet-level information is collected.

The data is collected in one or more of the following record formats:

  • NetFlow
  • sFlow
  • IPFIX

When you configure the Cluster Network Operator (CNO) with one or more collector IP addresses and port numbers, the Operator configures Open vSwitch (OVS) on each node to send the network flows records to each collector.

You can configure the Operator to send records to more than one type of network flow collector. For example, you can send records to NetFlow collectors and also send records to sFlow collectors.

When OVS sends data to the collectors, each type of collector receives identical records. For example, if you configure two NetFlow collectors, OVS on a node sends identical records to the two collectors. If you also configure two sFlow collectors, the two sFlow collectors receive identical records. However, each collector type has a unique record format.

Collecting the network flows data and sending the records to collectors affects performance. Nodes process packets at a slower rate. If the performance impact is too great, you can delete the destinations for collectors to disable collecting network flows data and restore performance.

Note

Enabling network flow collectors might have an impact on the overall performance of the cluster network.

The fields for configuring network flows collectors in the Cluster Network Operator (CNO) are shown in the following table:

Expand
Table 23.8. Network flows configuration
FieldTypeDescription

metadata.name

string

The name of the CNO object. This name is always cluster.

spec.exportNetworkFlows

object

One or more of netFlow, sFlow, or ipfix.

spec.exportNetworkFlows.netFlow.collectors

array

A list of IP address and network port pairs for up to 10 collectors.

spec.exportNetworkFlows.sFlow.collectors

array

A list of IP address and network port pairs for up to 10 collectors.

spec.exportNetworkFlows.ipfix.collectors

array

A list of IP address and network port pairs for up to 10 collectors.

After applying the following manifest to the CNO, the Operator configures Open vSwitch (OVS) on each node in the cluster to send network flows records to the NetFlow collector that is listening at 192.168.1.99:2056.

Example configuration for tracking network flows

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  exportNetworkFlows:
    netFlow:
      collectors:
        - 192.168.1.99:2056
Copy to Clipboard Toggle word wrap

As a cluster administrator, you can configure the Cluster Network Operator (CNO) to send network flows metadata about the pod network to a network flows collector.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.
  • You have a network flows collector and know the IP address and port that it listens on.

Procedure

  1. Create a patch file that specifies the network flows collector type and the IP address and port information of the collectors:

    spec:
      exportNetworkFlows:
        netFlow:
          collectors:
            - 192.168.1.99:2056
    Copy to Clipboard Toggle word wrap
  2. Configure the CNO with the network flows collectors:

    $ oc patch network.operator cluster --type merge -p "$(cat <file_name>.yaml)"
    Copy to Clipboard Toggle word wrap

    Example output

    network.operator.openshift.io/cluster patched
    Copy to Clipboard Toggle word wrap

Verification

Verification is not typically necessary. You can run the following command to confirm that Open vSwitch (OVS) on each node is configured to send network flows records to one or more collectors.

  1. View the Operator configuration to confirm that the exportNetworkFlows field is configured:

    $ oc get network.operator cluster -o jsonpath="{.spec.exportNetworkFlows}"
    Copy to Clipboard Toggle word wrap

    Example output

    {"netFlow":{"collectors":["192.168.1.99:2056"]}}
    Copy to Clipboard Toggle word wrap

  2. View the network flows configuration in OVS from each node:

    $ for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -o jsonpath='{range@.items[*]}{.metadata.name}{"\n"}{end}');
      do ;
        echo;
        echo $pod;
        oc -n openshift-ovn-kubernetes exec -c ovnkube-node $pod \
          -- bash -c 'for type in ipfix sflow netflow ; do ovs-vsctl find $type ; done';
    done
    Copy to Clipboard Toggle word wrap

    Example output

    ovnkube-node-xrn4p
    _uuid               : a4d2aaca-5023-4f3d-9400-7275f92611f9
    active_timeout      : 60
    add_id_to_interface : false
    engine_id           : []
    engine_type         : []
    external_ids        : {}
    targets             : ["192.168.1.99:2056"]
    
    ovnkube-node-z4vq9
    _uuid               : 61d02fdb-9228-4993-8ff5-b27f01a29bd6
    active_timeout      : 60
    add_id_to_interface : false
    engine_id           : []
    engine_type         : []
    external_ids        : {}
    targets             : ["192.168.1.99:2056"]-
    
    ...
    Copy to Clipboard Toggle word wrap

As a cluster administrator, you can configure the Cluster Network Operator (CNO) to stop sending network flows metadata to a network flows collector.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in to the cluster with a user with cluster-admin privileges.

Procedure

  1. Remove all network flows collectors:

    $ oc patch network.operator cluster --type='json' \
        -p='[{"op":"remove", "path":"/spec/exportNetworkFlows"}]'
    Copy to Clipboard Toggle word wrap

    Example output

    network.operator.openshift.io/cluster patched
    Copy to Clipboard Toggle word wrap

23.17. Configuring hybrid networking

As a cluster administrator, you can configure the OVN-Kubernetes Container Network Interface (CNI) cluster network provider to allow Linux and Windows nodes to host Linux and Windows workloads, respectively.

You can configure your cluster to use hybrid networking with OVN-Kubernetes. This allows a hybrid cluster that supports different node networking configurations. For example, this is necessary to run both Linux and Windows nodes in a cluster.

Important

You must configure hybrid networking with OVN-Kubernetes during the installation of your cluster. You cannot switch to hybrid networking after the installation process.

Prerequisites

  • You defined OVNKubernetes for the networking.networkType parameter in the install-config.yaml file. See the installation documentation for configuring OpenShift Container Platform network customizations on your chosen cloud provider for more information.

Procedure

  1. Change to the directory that contains the installation program and create the manifests:

    $ ./openshift-install create manifests --dir <installation_directory>
    Copy to Clipboard Toggle word wrap

    where:

    <installation_directory>
    Specifies the name of the directory that contains the install-config.yaml file for your cluster.
  2. Create a stub manifest file for the advanced network configuration that is named cluster-network-03-config.yml in the <installation_directory>/manifests/ directory:

    $ cat <<EOF > <installation_directory>/manifests/cluster-network-03-config.yml
    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
    EOF
    Copy to Clipboard Toggle word wrap

    where:

    <installation_directory>
    Specifies the directory name that contains the manifests/ directory for your cluster.
  3. Open the cluster-network-03-config.yml file in an editor and configure OVN-Kubernetes with hybrid networking, such as in the following example:

    Specify a hybrid networking configuration

    apiVersion: operator.openshift.io/v1
    kind: Network
    metadata:
      name: cluster
    spec:
      defaultNetwork:
        ovnKubernetesConfig:
          hybridOverlayConfig:
            hybridClusterNetwork: 
    1
    
            - cidr: 10.132.0.0/14
              hostPrefix: 23
            hybridOverlayVXLANPort: 9898 
    2
    Copy to Clipboard Toggle word wrap

    1
    Specify the CIDR configuration used for nodes on the additional overlay network. The hybridClusterNetwork CIDR cannot overlap with the clusterNetwork CIDR.
    2
    Specify a custom VXLAN port for the additional overlay network. This is required for running Windows nodes in a cluster installed on vSphere, and must not be configured for any other cloud provider. The custom port can be any open port excluding the default 4789 port. For more information on this requirement, see the Microsoft documentation on Pod-to-pod connectivity between hosts is broken.
    Note

    Windows Server Long-Term Servicing Channel (LTSC): Windows Server 2019 is not supported on clusters with a custom hybridOverlayVXLANPort value because this Windows server version does not support selecting a custom VXLAN port.

  4. Save the cluster-network-03-config.yml file and quit the text editor.
  5. Optional: Back up the manifests/cluster-network-03-config.yml file. The installation program deletes the manifests/ directory when creating the cluster.

Complete any further installation configurations, and then create your cluster. Hybrid networking is enabled when the installation process is finished.

Chapter 24. Configuring Routes

24.1. Route configuration

24.1.1. Creating an HTTP-based route

A route allows you to host your application at a public URL. It can either be secure or unsecured, depending on the network security configuration of your application. An HTTP-based route is an unsecured route that uses the basic HTTP routing protocol and exposes a service on an unsecured application port.

The following procedure describes how to create a simple HTTP-based route to a web application, using the hello-openshift application as an example.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in as an administrator.
  • You have a web application that exposes a port and a TCP endpoint listening for traffic on the port.

Procedure

  1. Create a project called hello-openshift by running the following command:

    $ oc new-project hello-openshift
    Copy to Clipboard Toggle word wrap
  2. Create a pod in the project by running the following command:

    $ oc create -f https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json
    Copy to Clipboard Toggle word wrap
  3. Create a service called hello-openshift by running the following command:

    $ oc expose pod/hello-openshift
    Copy to Clipboard Toggle word wrap
  4. Create an unsecured route to the hello-openshift application by running the following command:

    $ oc expose svc hello-openshift
    Copy to Clipboard Toggle word wrap

Verification

  • To verify that the route resource that you created, run the following command:

    $ oc get routes -o yaml <name of resource> 
    1
    Copy to Clipboard Toggle word wrap
    1
    In this example, the route is named hello-openshift.

Sample YAML definition of the created unsecured route:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-openshift
spec:
  host: hello-openshift-hello-openshift.<Ingress_Domain> 
1

  port:
    targetPort: 8080 
2

  to:
    kind: Service
    name: hello-openshift
Copy to Clipboard Toggle word wrap

1
<Ingress_Domain> is the default ingress domain name. The ingresses.config/cluster object is created during the installation and cannot be changed. If you want to specify a different domain, you can specify an alternative cluster domain using the appsDomain option.
2
targetPort is the target port on pods that is selected by the service that this route points to.
Note

To display your default ingress domain, run the following command:

$ oc get ingresses.config/cluster -o jsonpath={.spec.domain}
Copy to Clipboard Toggle word wrap

A route allows you to host your application at a URL. In this case, the hostname is not set and the route uses a subdomain instead. When you specify a subdomain, you automatically use the domain of the Ingress Controller that exposes the route. For situations where a route is exposed by multiple Ingress Controllers, the route is hosted at multiple URLs.

The following procedure describes how to create a route for Ingress Controller sharding, using the hello-openshift application as an example.

Ingress Controller sharding is useful when balancing incoming traffic load among a set of Ingress Controllers and when isolating traffic to a specific Ingress Controller. For example, company A goes to one Ingress Controller and company B to another.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You are logged in as a project administrator.
  • You have a web application that exposes a port and an HTTP or TLS endpoint listening for traffic on the port.
  • You have configured the Ingress Controller for sharding.

Procedure

  1. Create a project called hello-openshift by running the following command:

    $ oc new-project hello-openshift
    Copy to Clipboard Toggle word wrap
  2. Create a pod in the project by running the following command:

    $ oc create -f https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json
    Copy to Clipboard Toggle word wrap
  3. Create a service called hello-openshift by running the following command:

    $ oc expose pod/hello-openshift
    Copy to Clipboard Toggle word wrap
  4. Create a route definition called hello-openshift-route.yaml:

    YAML definition of the created route for sharding:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      labels:
        type: sharded 
    1
    
      name: hello-openshift-edge
      namespace: hello-openshift
    spec:
      subdomain: hello-openshift 
    2
    
      tls:
        termination: edge
      to:
        kind: Service
        name: hello-openshift
    Copy to Clipboard Toggle word wrap

    1
    Both the label key and its corresponding label value must match the ones specified in the Ingress Controller. In this example, the Ingress Controller has the label key and value type: sharded.
    2
    The route will be exposed using the value of the subdomain field. When you specify the subdomain field, you must leave the hostname unset. If you specify both the host and subdomain fields, then the route will use the value of the host field, and ignore the subdomain field.
  5. Use hello-openshift-route.yaml to create a route to the hello-openshift application by running the following command:

    $ oc -n hello-openshift create -f hello-openshift-route.yaml
    Copy to Clipboard Toggle word wrap

Verification

  • Get the status of the route with the following command:

    $ oc -n hello-openshift get routes/hello-openshift-edge -o yaml
    Copy to Clipboard Toggle word wrap

    The resulting Route resource should look similar to the following:

    Example output

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      labels:
        type: sharded
      name: hello-openshift-edge
      namespace: hello-openshift
    spec:
      subdomain: hello-openshift
      tls:
        termination: edge
      to:
        kind: Service
        name: hello-openshift
    status:
      ingress:
      - host: hello-openshift.<apps-sharded.basedomain.example.net> 
    1
    
        routerCanonicalHostname: router-sharded.<apps-sharded.basedomain.example.net> 
    2
    
        routerName: sharded 
    3
    Copy to Clipboard Toggle word wrap

    1
    The hostname the Ingress Controller, or router, uses to expose the route. The value of the host field is automatically determined by the Ingress Controller, and uses its domain. In this example, the domain of the Ingress Controller is <apps-sharded.basedomain.example.net>.
    2
    The hostname of the Ingress Controller.
    3
    The name of the Ingress Controller. In this example, the Ingress Controller has the name sharded.

24.1.3. Configuring route timeouts

You can configure the default timeouts for an existing route when you have services in need of a low timeout, which is required for Service Level Availability (SLA) purposes, or a high timeout, for cases with a slow back end.

Prerequisites

  • You need a deployed Ingress Controller on a running cluster.

Procedure

  1. Using the oc annotate command, add the timeout to the route:

    $ oc annotate route <route_name> \
        --overwrite haproxy.router.openshift.io/timeout=<timeout><time_unit> 
    1
    Copy to Clipboard