Chapter 1. About node remediation, fencing, and maintenance
Hardware is imperfect and software contains bugs. When node-level failures occur, such as a kernel hang or a network interface controller (NIC) failure, the work required from the cluster does not decrease, and workloads from the affected nodes must be restarted elsewhere. However, some workloads, such as those using ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics.
Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as fencing, before initiating recovery of the workload, known as remediation, and ideally, recovery of the node as well.
It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, Red Hat OpenShift provides multiple components that automate failure detection, fencing, and remediation.
1.1. Self Node Remediation
The Self Node Remediation Operator is a Red Hat OpenShift add-on Operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI), or an API for node provisioning.
Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.
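For illustration, the following is a minimal sketch of how such a failure detection system might reference Self Node Remediation through a remediation template; the template name and namespace are assumptions and can differ in your cluster:
remediationTemplate:
  apiVersion: self-node-remediation.medik8s.io/v1alpha1
  kind: SelfNodeRemediationTemplate
  namespace: openshift-workload-availability
  name: self-node-remediation-automatic-strategy-template   # assumed default template name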
1.2. Fence Agents Remediation
The Fence Agents Remediation (FAR) Operator is a Red Hat OpenShift add-on Operator that automatically remediates unhealthy nodes, similar to the Self Node Remediation Operator. You can use well-known fence agents to fence and remediate unhealthy nodes. The remediation includes rebooting the unhealthy node by using a fence agent, and then evicting workloads from the unhealthy node, depending on the remediation strategy.
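For illustration, the following is a minimal sketch of a fence agent remediation template, assuming the fence-agents-remediation.medik8s.io/v1alpha1 API and the fence_ipmilan agent; the credentials, node name, and IP address are placeholders:
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-template-ipmilan   # illustrative name
  namespace: openshift-workload-availability
spec:
  template:
    spec:
      agent: fence_ipmilan                 # well-known fence agent
      sharedparameters:                    # parameters common to all nodes (placeholders)
        "--username": "admin"
        "--password": "password"
        "--lanplus": ""
      nodeparameters:                      # per-node parameters (placeholders)
        "--ip":
          worker-0: "192.168.111.1"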
1.3. Machine Deletion Remediation
The Machine Deletion Remediation (MDR) Operator is a Red Hat OpenShift add-on Operator that uses the Machine API to reprovision unhealthy nodes. MDR works with NodeHealthCheck (NHC) to create a Custom Resource (CR) for MDR with information about the unhealthy node.
MDR follows the annotation on the node to the associated machine object and confirms that it has an owning controller. MDR proceeds to delete the machine, and then the owning controller recreates a replacement machine.
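For illustration, the following is a minimal sketch of the remediation template that NHC might instantiate for MDR, assuming the machine-deletion-remediation.medik8s.io/v1alpha1 API; the name and namespace are placeholders:
apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediationTemplate
metadata:
  name: machine-deletion-remediation-template   # illustrative name
  namespace: openshift-workload-availability
spec:
  template:
    spec: {}   # MDR needs no extra parameters; NHC instantiates the template for each unhealthy node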
1.4. Machine Health Check
Machine Health Check is a Red Hat OpenShift built-in failure detection, fencing, and remediation system that monitors the status of machines and the conditions of nodes. Machine Health Checks can be configured to trigger external fencing and remediation systems, such as Self Node Remediation.
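For illustration, the following is a minimal sketch of a Machine Health Check that delegates remediation to an external remediator; the labels, timeouts, maxUnhealthy value, and template name are assumptions that you would adapt to your cluster:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-machine-health-check       # illustrative name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker   # adapt to your MachineSet labels
  unhealthyConditions:
  - type: Ready
    status: "Unknown"
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
  maxUnhealthy: 40%
  remediationTemplate:                      # optional: delegate fencing to an external remediator such as SNR
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-automatic-strategy-template   # assumed default template name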
1.5. Node Health Check
The Node Health Check Operator is a Red Hat OpenShift add-on Operator that implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system and so must be configured with an external system that provides these features. By default, it is configured to utilize the Self Node Remediation system.
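For illustration, the following is a minimal sketch of a Node Health Check configured with the default Self Node Remediation system; the selector, durations, and template name are assumptions that you would adapt to your cluster:
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample              # illustrative name
spec:
  minHealthy: 51%                           # pause remediation if too few matching nodes remain healthy
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
  unhealthyConditions:
  - type: Ready
    status: "Unknown"
    duration: 300s
  - type: Ready
    status: "False"
    duration: 300s
  remediationTemplate:                      # by default, points at Self Node Remediation
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: openshift-workload-availability
    name: self-node-remediation-automatic-strategy-template   # assumed default template name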
1.6. Node Maintenance
Administrators face situations where they need to interrupt the cluster, for example, to replace a drive, RAM, or a NIC.
In advance of this maintenance, affected nodes should be cordoned and drained. When a node is cordoned, new workloads cannot be scheduled on that node. When a node is drained, to avoid or minimize downtime, workloads on the affected node are transferred to other nodes.
While this maintenance can be performed by using command-line tools, the Node Maintenance Operator offers a declarative approach by using a custom resource. When such a resource exists for a node, the Operator cordons and drains the node until the resource is deleted.
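For illustration, the following is a minimal sketch of such a custom resource, assuming the nodemaintenance.medik8s.io/v1beta1 API of the Node Maintenance Operator; the node name and reason are placeholders:
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-worker-0            # illustrative name
spec:
  nodeName: worker-0                        # node to cordon and drain
  reason: "Replacing a faulty NIC"          # free-form reason recorded on the resource
Creating the resource cordons and drains the node, comparable to running oc adm cordon and oc adm drain manually; deleting the resource makes the node schedulable again.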
1.7. Flow of events during fencing and remediation
When a node becomes unhealthy, multiple events occur to detect, fence, and remediate the node, in order to restore the workloads, and ideally the node, to health. Some events are triggered by the OpenShift cluster, and some are reactions by the Workload Availability Operators. Understanding this flow of events, and the duration between them, is important for making informed decisions, such as which remediation provider to use and how to configure the Node Health Check Operator and the chosen remediation provider.
The following example describes a common use case and the phased flow of events. The phases act as follows only when the Node Health Check Operator is configured with the Ready=Unknown unhealthy condition.
1.7.1. Phase 1 - Kubernetes Health Check (Core OpenShift)
The unhealthy node stops communicating with the API server. After approximately 50 seconds, the API server sets the "Ready" condition of the node to "Unknown", that is, Ready=Unknown.
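As an illustrative check, you can query the condition directly with a generic kubectl command; the node name is a placeholder:
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'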
1.7.2. Phase 2 - Node Health Check (NHC)
If the Ready=Unknown condition is present for longer than the configured duration, the Node Health Check Operator starts a remediation.
The user-configured duration in this phase represents the tolerance that the Operator has towards the duration of the unhealthy condition. It takes into account that while the workload is restarting as requested, the resource is expected to be "Unready".
For example:
- If you have a workload that takes a long time to restart, then you need a longer timeout.
- Likewise, if the workload restarts quickly, then a shorter timeout is sufficient, as shown in the sketch after this list.
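For illustration, the following is a minimal sketch of how this tolerance might appear in the unhealthyConditions of a NodeHealthCheck resource; the 300s durations are placeholders, not recommended values:
unhealthyConditions:
- type: Ready
  status: "Unknown"
  duration: 300s     # tolerance before remediation starts; tune to your workload restart time
- type: Ready
  status: "False"
  duration: 300s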
1.7.3. Phase 3 - Remediate Host / Remediate API (depending on the configured remediation operator)
Using Machine Deletion Remediation (MDR), Self Node Remediation (SNR), or Fence Agents Remediation (FAR), the remediator fences and isolates the node by rebooting it in order to reach a safe state.
The details of this phase are configured by the user and depend on the workload requirements.
For example:
- Machine Deletion Remediation - The choice of platform influences the time it takes to reprovision the machine, and therefore the duration of the remediation. MDR is only applicable to clusters that use the Machine API.
- Self Node Remediation - The remediation time depends on many parameters, including the safe time it takes to automatically reboot unhealthy nodes, and the watchdog devices used to ensure that the machine enters a safe state when an error condition is detected (see the sketch after this list).
- Fence Agents Remediation - The fencing agent time depends on many parameters, including the cluster nodes, the management interface, and the agent parameters.
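For illustration, the following is a minimal sketch of how these Self Node Remediation parameters might be tuned in its configuration resource; the field names and values are assumptions based on the Self Node Remediation Operator and can differ between versions:
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config        # the Operator typically manages a single configuration resource
  namespace: openshift-workload-availability
spec:
  watchdogFilePath: /dev/watchdog                 # watchdog device used to force the node into a safe state
  safeTimeToAssumeNodeRebootedSeconds: 180        # how long to wait before assuming the node has rebooted
  isSoftwareRebootEnabled: true                   # fall back to a software reboot if no watchdog is available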
1.7.4. Phase 4 - Workload starting
When MDR is used, the remediator deletes the resources. When FAR or SNR is used, varying remediation strategies are available for them to use. One strategy is OutOfServiceTaint, which uses the out-of-service taint to permit the deletion of resources in the cluster. In both cases, deleting the resources enables faster rescheduling of the affected workload. The workload is then rescheduled and restarted.
This phase is initiated automatically by the remediators when fencing is complete. If fencing does not complete, and an escalating remediation is required, the user must configure the timeout, in seconds, for the entire remediation process. If the timeout passes and the node is still unhealthy, NHC tries the next remediator in line to remediate the unhealthy node.
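For illustration, the following is a minimal sketch of how such an escalation order and per-remediator timeout might be expressed in a NodeHealthCheck resource, assuming its escalatingRemediations field; the template names and timeouts are placeholders:
escalatingRemediations:
- remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: openshift-workload-availability
    name: self-node-remediation-automatic-strategy-template   # assumed default template name
  order: 1                                  # tried first
  timeout: 300s                             # escalate if the node is still unhealthy after this time
- remediationTemplate:
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    namespace: openshift-workload-availability
    name: fence-agents-remediation-template-ipmilan           # illustrative name from the earlier sketch
  order: 2
  timeout: 300s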
1.8. About metrics for workload availability operators
The addition of data analysis enhances observability for the workload availability operators. The data provides metrics about the activity of the operators, and the effect on the cluster. These metrics improve decision-making capabilities, enable data-driven optimization, and enhance overall system performance.
You can use metrics to do these tasks:
- Access comprehensive tracking data for operators, to monitor overall system efficiency.
- Access actionable insights derived from tracking data, such as identifying frequently failing nodes, or downtime due to the Operator’s remediations.
- Visualize how the Operator’s remediations enhance system efficiency.
1.8.1. Configuring metrics for workload availability operators
You can configure metrics for the workload availability operators by creating a secret and a ServiceMonitor for the Node Health Check Operator, as shown in the following procedure.
Prerequisites
- You must first configure the monitoring stack. For more information, see Configuring the monitoring stack.
- You must enable monitoring for user-defined projects. For more information, see Enabling monitoring for user-defined projects.
Procedure
Create the prometheus-user-workload-token secret in the Operator namespace by copying the existing prometheus-user-workload-token secret as follows:
existingPrometheusTokenSecret=$(kubectl get secret --namespace openshift-user-workload-monitoring | grep prometheus-user-workload-token | awk '{print $1}') 1
kubectl get secret ${existingPrometheusTokenSecret} --namespace=openshift-user-workload-monitoring -o yaml | \
  sed '/namespace: .*==/d;/ca.crt:/d;/serviceCa.crt/d;/creationTimestamp:/d;/resourceVersion:/d;/uid:/d;/annotations/d;/kubernetes.io/d;' | \
  sed 's/namespace: .*/namespace: openshift-workload-availability/' | \ 2
  sed 's/name: .*/name: prometheus-user-workload-token/' | \ 3
  sed 's/type: .*/type: Opaque/' \
  > prom-token.yaml

kubectl apply -f prom-token.yaml
- 1
- The new prometheus-user-workload-token secret is required by the metrics ServiceMonitor, created in the next step.
- 2
- Ensure that the new secret's namespace is the one where the NHC Operator is installed, for example, openshift-workload-availability.
- 3
- The prometheus-user-workload-token secret only exists if monitoring for user-defined projects (User Workload Prometheus scraping) is enabled.
Create the ServiceMonitor as follows:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-healthcheck-metrics-monitor
  namespace: openshift-workload-availability 1
  labels:
    app.kubernetes.io/component: controller-manager
spec:
  endpoints:
  - interval: 30s
    port: https
    scheme: https
    authorization:
      type: Bearer
      credentials:
        name: prometheus-user-workload-token
        key: token
    tlsConfig:
      ca:
        configMap:
          name: nhc-serving-certs-ca-bundle
          key: service-ca.crt
      serverName: node-healthcheck-controller-manager-metrics-service.openshift-workload-availability.svc 2
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/name: node-healthcheck-operator
      app.kubernetes.io/instance: metrics
- 1
- Specify the namespace where you want to configure the metrics, for example, openshift-workload-availability.
- 2
- The serverName must contain the namespace where the Operator is installed. In the example, openshift-workload-availability is placed after the metrics service name and before the .svc suffix.
Verification
To confirm that the configuration is successful, check that the Observe > Targets tab in the OCP web console shows the endpoint as Up.
1.8.2. Example metrics for workload availability operators
The following are example metrics from the various workload availability operators.
The metrics include information on the following indicators:
- Operator availability: Showing if and when each Operator is up and running.
- Node remediation count: Showing the number of remediations across the same node, and across all nodes.
- Node remediation duration: Showing the remediation downtime or recovery time.
- Node remediation gauge: Showing the number of ongoing remediations.