Chapter 9. Remote worker nodes on the network edge
9.1. Using remote worker nodes at the network edge
You can configure OpenShift Container Platform clusters with nodes located at your network edge. In this topic, they are called remote worker nodes. A typical cluster with remote worker nodes combines on-premise master and worker nodes with worker nodes in other locations that connect to the cluster. This topic is intended to provide guidance on best practices for using remote worker nodes and does not contain specific configuration details.
There are multiple use cases across different industries, such as telecommunications, retail, manufacturing, and government, for using a deployment pattern with remote worker nodes. For example, you can separate and isolate your projects and workloads by combining the remote worker nodes into Kubernetes zones.
However, having remote worker nodes can introduce higher latency, intermittent loss of network connectivity, and other issues. Among the challenges in a cluster with remote worker node are:
- Network separation: The OpenShift Container Platform control plane and the remote worker nodes must be able communicate with each other. Because of the distance between the control plane and the remote worker nodes, network issues could prevent this communication. See Network separation with remote worker nodes for information on how OpenShift Container Platform responds to network separation and for methods to diminish the impact to your cluster.
- Power outage: Because the control plane and remote worker nodes are in separate locations, a power outage at the remote location or at any point between the two can negatively impact your cluster. See Power loss on remote worker nodes for information on how OpenShift Container Platform responds to a node losing power and for methods to diminish the impact to your cluster.
- Latency spikes or temporary reduction in throughput: As with any network, any changes in network conditions between your cluster and the remote worker nodes can negatively impact your cluster. OpenShift Container Platform offers multiple worker latency profiles that let you control the reaction of the cluster to latency issues.
Note the following limitations when planning a cluster with remote worker nodes:
- OpenShift Container Platform does not support remote worker nodes that use a different cloud provider than the on-premise cluster uses.
- Moving workloads from one Kubernetes zone to a different Kubernetes zone can be problematic due to system and environment issues, such as a specific type of memory not being available in a different zone.
- Proxies and firewalls can present additional limitations that are beyond the scope of this document. See the relevant OpenShift Container Platform documentation for how to address such limitations, such as Configuring your firewall.
- You are responsible for configuring and maintaining L2/L3-level network connectivity between the control plane and the network-edge nodes.
9.1.1. Adding remote worker nodes
Adding remote worker nodes to a cluster involves some additional considerations.
- You must ensure that a route or a default gateway is in place to route traffic between the control plane and every remote worker node.
- You must place the Ingress VIP on the control plane.
- Adding remote worker nodes with user-provisioned infrastructure is identical to adding other worker nodes.
-
To add remote worker nodes to an installer-provisioned cluster at install time, specify the subnet for each worker node in the
install-config.yaml
file before installation. There are no additional settings required for the DHCP server. You must use virtual media, because the remote worker nodes will not have access to the local provisioning network. -
To add remote worker nodes to an installer-provisioned cluster deployed with a provisioning network, ensure that
virtualMediaViaExternalNetwork
flag is set totrue
in theinstall-config.yaml
file so that it will add the nodes using virtual media. Remote worker nodes will not have access to the local provisioning network. They must be deployed with virtual media rather than PXE. Additionally, specify each subnet for each group of remote worker nodes and the control plane nodes in the DHCP server.
9.1.2. Network separation with remote worker nodes
All nodes send heartbeats to the Kubernetes Controller Manager Operator (kube controller) in the OpenShift Container Platform cluster every 10 seconds. If the cluster does not receive heartbeats from a node, OpenShift Container Platform responds using several default mechanisms.
OpenShift Container Platform is designed to be resilient to network partitions and other disruptions. You can mitigate some of the more common disruptions, such as interruptions from software upgrades, network splits, and routing issues. Mitigation strategies include ensuring that pods on remote worker nodes request the correct amount of CPU and memory resources, configuring an appropriate replication policy, using redundancy across zones, and using Pod Disruption Budgets on workloads.
If the kube controller loses contact with a node after a configured period, the node controller on the control plane updates the node health to Unhealthy
and marks the node Ready
condition as Unknown
. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable
taint with a NoExecute
effect to the node and schedules pods on the node for eviction after five minutes, by default.
If a workload controller, such as a Deployment
object or StatefulSet
object, is directing traffic to pods on the unhealthy node and other nodes can reach the cluster, OpenShift Container Platform routes the traffic away from the pods on the node. Nodes that cannot reach the cluster do not get updated with the new traffic routing. As a result, the workloads on those nodes might continue to attempt to reach the unhealthy node.
You can mitigate the effects of connection loss by:
- using daemon sets to create pods that tolerate the taints
- using static pods that automatically restart if a node goes down
- using Kubernetes zones to control pod eviction
- configuring pod tolerations to delay or avoid pod eviction
- configuring the kubelet to control the timing of when it marks nodes as unhealthy.
For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.
9.1.3. Power loss on remote worker nodes
If a remote worker node loses power or restarts ungracefully, OpenShift Container Platform responds using several default mechanisms.
If the Kubernetes Controller Manager Operator (kube controller) loses contact with a node after a configured period, the control plane updates the node health to Unhealthy
and marks the node Ready
condition as Unknown
. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable
taint with a NoExecute
effect to the node and schedules pods on the node for eviction after five minutes, by default.
On the node, the pods must be restarted when the node recovers power and reconnects with the control plane.
If you want the pods to restart immediately upon restart, use static pods.
After the node restarts, the kubelet also restarts and attempts to restart the pods that were scheduled on the node. If the connection to the control plane takes longer than the default five minutes, the control plane cannot update the node health and remove the node.kubernetes.io/unreachable
taint. On the node, the kubelet terminates any running pods. When these conditions are cleared, the scheduler can start scheduling pods to that node.
You can mitigate the effects of power loss by:
- using daemon sets to create pods that tolerate the taints
- using static pods that automatically restart with a node
- configuring pods tolerations to delay or avoid pod eviction
- configuring the kubelet to control the timing of when the node controller marks nodes as unhealthy.
For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.
9.1.4. Latency spikes or temporary reduction in throughput to remote workers
If the cluster administrator has performed latency tests for platform verification, they can discover the need to adjust the operation of the cluster to ensure stability in cases of high latency. The cluster administrator needs to change only one parameter, recorded in a file, which controls four parameters affecting how supervisory processes read status and interpret the health of the cluster. Changing only the one parameter provides cluster tuning in an easy, supportable manner.
The Kubelet
process provides the starting point for monitoring cluster health. The Kubelet
sets status values for all nodes in the OpenShift Container Platform cluster. The Kubernetes Controller Manager (kube controller
) reads the status values every 10 seconds, by default. If the kube controller
cannot read a node status value, it loses contact with that node after a configured period. The default behavior is:
-
The node controller on the control plane updates the node health to
Unhealthy
and marks the nodeReady
condition`Unknown`. - In response, the scheduler stops scheduling pods to that node.
-
The Node Lifecycle Controller adds a
node.kubernetes.io/unreachable
taint with aNoExecute
effect to the node and schedules any pods on the node for eviction after five minutes, by default.
This behavior can cause problems if your network is prone to latency issues, especially if you have nodes at the network edge. In some cases, the Kubernetes Controller Manager might not receive an update from a healthy node due to network latency. The Kubelet
evicts pods from the node even though the node is healthy.
To avoid this problem, you can use worker latency profiles to adjust the frequency that the Kubelet
and the Kubernetes Controller Manager wait for status updates before taking action. These adjustments help to ensure that your cluster runs properly if network latency between the control plane and the worker nodes is not optimal.
These worker latency profiles contain three sets of parameters that are predefined with carefully tuned values to control the reaction of the cluster to increased latency. There is no need to experimentally find the best values manually.
You can configure worker latency profiles when installing a cluster or at any time you notice increased latency in your cluster network.
9.1.5. Remote worker node strategies
If you use remote worker nodes, consider which objects to use to run your applications.
It is recommended to use daemon sets or static pods based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.
- Daemon sets
- Daemon sets are the best approach to managing pods on remote worker nodes for the following reasons:
-
Daemon sets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OpenShift Container Platform does not change the state of daemon set pods, and leaves the pods in the state they last reported. For example, if a daemon set pod is in the
Running
state, when a node stops communicating, the pod keeps running and is assumed to be running by OpenShift Container Platform. Daemon set pods, by default, are created with
NoExecute
tolerations for thenode.kubernetes.io/unreachable
andnode.kubernetes.io/not-ready
taints with notolerationSeconds
value. These default values ensure that daemon set pods are never evicted if the control plane cannot reach a node. For example:Tolerations added to daemon set pods by default
tolerations: - key: node.kubernetes.io/not-ready operator: Exists effect: NoExecute - key: node.kubernetes.io/unreachable operator: Exists effect: NoExecute - key: node.kubernetes.io/disk-pressure operator: Exists effect: NoSchedule - key: node.kubernetes.io/memory-pressure operator: Exists effect: NoSchedule - key: node.kubernetes.io/pid-pressure operator: Exists effect: NoSchedule - key: node.kubernetes.io/unschedulable operator: Exists effect: NoSchedule
- Daemon sets can use labels to ensure that a workload runs on a matching worker node.
- You can use an OpenShift Container Platform service endpoint to load balance daemon set pods.
Daemon sets do not schedule pods after a reboot of the node if OpenShift Container Platform cannot reach the node.
- Static pods
- If you want pods restart if a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as node restarts.
Static pods cannot use secrets and config maps.
- Kubernetes zones
- Kubernetes zones can slow down the rate or, in some cases, completely stop pod evictions.
When the control plane cannot reach a node, the node controller, by default, applies node.kubernetes.io/unreachable
taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.
If a zone is fully disrupted, where all nodes in the zone have a Ready
condition that is False
or Unknown
, the control plane does not apply the node.kubernetes.io/unreachable
taint to the nodes in that zone.
For partially disrupted zones, where more than 55% of the nodes have a False
or Unknown
condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for these behavior to take effect.
You assign a node to a specific zone by applying the topology.kubernetes.io/region
label in the node specification.
Sample node labels for Kubernetes zones
kind: Node apiVersion: v1 metadata: labels: topology.kubernetes.io/region=east
KubeletConfig
objects
You can adjust the amount of time that the kubelet checks the state of each node.
To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unhealthy
or Unreachable
condition, create a KubeletConfig
object that contains the node-status-update-frequency
and node-status-report-frequency
parameters.
The kubelet on each node determines the node status as defined by the node-status-update-frequency
setting and reports that status to the cluster based on the node-status-report-frequency
setting. By default, the kubelet determines the pod status every 10 seconds and reports the status every minute. However, if the node state changes, the kubelet reports the change to the cluster immediately. OpenShift Container Platform uses the node-status-report-frequency
setting only when the Node Lease feature gate is enabled, which is the default state in OpenShift Container Platform clusters. If the Node Lease feature gate is disabled, the node reports its status based on the node-status-update-frequency
setting.
Example kubelet config
apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: disable-cpu-units spec: machineConfigPoolSelector: matchLabels: machineconfiguration.openshift.io/role: worker 1 kubeletConfig: node-status-update-frequency: 2 - "10s" node-status-report-frequency: 3 - "1m"
- 1
- Specify the type of node type to which this
KubeletConfig
object applies using the label from theMachineConfig
object. - 2
- Specify the frequency that the kubelet checks the status of a node associated with this
MachineConfig
object. The default value is10s
. If you change this default, thenode-status-report-frequency
value is changed to the same value. - 3
- Specify the frequency that the kubelet reports the status of a node associated with this
MachineConfig
object. The default value is1m
.
The node-status-update-frequency
parameter works with the node-monitor-grace-period
parameter.
-
The
node-monitor-grace-period
parameter specifies how long OpenShift Container Platform waits after a node associated with aMachineConfig
object is markedUnhealthy
if the controller manager does not receive the node heartbeat. Workloads on the node continue to run after this time. If the remote worker node rejoins the cluster afternode-monitor-grace-period
expires, pods continue to run. New pods can be scheduled to that node. Thenode-monitor-grace-period
interval is40s
. Thenode-status-update-frequency
value must be lower than thenode-monitor-grace-period
value.
Modifying the node-monitor-grace-period
parameter is not supported.
- Tolerations
-
You can use pod tolerations to mitigate the effects if the on-premise node controller adds a
node.kubernetes.io/unreachable
taint with aNoExecute
effect to a node it cannot reach.
A taint with the NoExecute
effect affects pods that are running on the node in the following ways:
- Pods that do not tolerate the taint are queued for eviction.
-
Pods that tolerate the taint without specifying a
tolerationSeconds
value in their toleration specification remain bound forever. -
Pods that tolerate the taint with a specified
tolerationSeconds
value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.
Unless tolerations are explicitly set, Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready
and node.kubernetes.io/unreachable
with tolerationSeconds=300
, meaning that pods remain bound for 5 minutes if either of these taints is detected.
You can delay or avoid pod eviction by configuring pods tolerations with the NoExecute
effect for the node.kubernetes.io/unreachable
and node.kubernetes.io/not-ready
taints.
Example toleration in a pod spec
... tolerations: - key: "node.kubernetes.io/unreachable" operator: "Exists" effect: "NoExecute" 1 - key: "node.kubernetes.io/not-ready" operator: "Exists" effect: "NoExecute" 2 tolerationSeconds: 600 3 ...
- 1
- The
NoExecute
effect withouttolerationSeconds
lets pods remain forever if the control plane cannot reach the node. - 2
- The
NoExecute
effect withtolerationSeconds
: 600 lets pods remain for 10 minutes if the control plane marks the node asUnhealthy
. - 3
- You can specify your own
tolerationSeconds
value.
- Other types of OpenShift Container Platform objects
- You can use replica sets, deployments, and replication controllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.
When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
stateful sets do not get restarted when there is an outage. The pods remain in the terminating
state until the control plane can acknowledge that the pods are terminated.
To avoid scheduling a to a node that does not have access to the same type of persistent storage, OpenShift Container Platform cannot migrate pods that require persistent volumes to other zones in the case of network separation.
Additional resources
- For more information on Daemonesets, see DaemonSets.
- For more information on taints and tolerations, see Controlling pod placement using node taints.
-
For more information on configuring
KubeletConfig
objects, see Creating a KubeletConfig CRD. - For more information on replica sets, see ReplicaSets.
- For more information on deployments, see Deployments.
- For more information on replication controllers, see Replication controllers.
- For more information on the controller manager, see Kubernetes Controller Manager Operator.