Chapter 1. About node remediation, fencing, and maintenance

Hardware is imperfect and software contains bugs. When node-level failures occur, such as kernel hangs or network interface controller (NIC) failures, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. However, some workloads, such as ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics.

Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as fencing, before initiating recovery of the workload, known as remediation, and ideally, recovery of the node as well.

It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, Red Hat OpenShift provides multiple components that automate failure detection, fencing, and remediation.

1.1. Self Node Remediation

The Self Node Remediation Operator is a Red Hat OpenShift add-on Operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI), or an API for node provisioning.

Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.
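For example, a remediation template similar to the following can be referenced by a failure detection system to trigger Self Node Remediation. This is an illustrative sketch; the exact API version, template name, and supported remediation strategies depend on the installed Operator version:

    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    metadata:
      name: self-node-remediation-automatic-strategy-template
      namespace: openshift-workload-availability # namespace where the Operator is installed
    spec:
      template:
        spec:
          remediationStrategy: Automatic # the Operator selects how to fence and remediate the node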

1.2. Fence Agents Remediation

The Fence Agents Remediation (FAR) Operator is a Red Hat OpenShift add-on Operator that automatically remediates unhealthy nodes, similar to the Self Node Remediation Operator. You can use well-known agents to fence and remediate unhealthy nodes. The remediation includes rebooting the unhealthy node by using a fence agent, and then evicting workloads from the unhealthy node, depending on the remediation strategy.
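For example, a Fence Agents Remediation template for the well-known fence_ipmilan agent might look similar to the following. The field layout, parameter names, and values are an illustrative sketch; consult the fence agent's documentation for the options that your hardware requires:

    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    metadata:
      name: fence-agents-remediation-template-fence-ipmilan
      namespace: openshift-workload-availability
    spec:
      template:
        spec:
          agent: fence_ipmilan # fence agent used to power-cycle the unhealthy node
          sharedparameters: # parameters common to all nodes (illustrative values)
            --lanplus: ""
            --username: admin
            --password: password
          nodeparameters: # per-node parameters, for example the BMC IP address of each node
            --ip:
              worker-0: 192.168.111.1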

1.3. Machine Deletion Remediation

The Machine Deletion Remediation (MDR) Operator is a Red Hat OpenShift add-on Operator that uses the Machine API to reprovision unhealthy nodes. MDR works with NodeHealthCheck (NHC) to create a Custom Resource (CR) for MDR with information about the unhealthy node.

MDR follows the annotation on the node to the associated machine object and confirms that it has an owning controller. MDR proceeds to delete the machine, and then the owning controller recreates a replacement machine.
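The remediation template that NHC references for MDR is typically minimal, because the owning controller does the reprovisioning work. The following is a sketch, assuming the Operator is installed in the openshift-workload-availability namespace:

    apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
    kind: MachineDeletionRemediationTemplate
    metadata:
      name: machinedeletionremediationtemplate-sample
      namespace: openshift-workload-availability
    spec:
      template:
        spec: {} # no tuning required; MDR deletes the machine and the owning controller recreates it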

1.4. Machine Health Check

Machine Health Check utilizes a Red Hat OpenShift built-in failure detection, fencing, and remediation system, which monitors the status of machines and the conditions of nodes. Machine Health Checks can be configured to trigger external fencing and remediation systems, such as Self Node Remediation.
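For example, a MachineHealthCheck resource similar to the following monitors worker machines and, optionally, delegates fencing and remediation to an external system through a remediation template. The values shown are illustrative:

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: example-machine-health-check
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machine-role: worker # monitor worker machines only
      unhealthyConditions: # node conditions that mark a machine as unhealthy
      - type: Ready
        status: "False"
        timeout: 300s
      - type: Ready
        status: Unknown
        timeout: 300s
      maxUnhealthy: 40% # stop remediating if too many machines are unhealthy at once
      remediationTemplate: # optional: trigger an external system such as Self Node Remediation
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-automatic-strategy-template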

1.5. Node Health Check

The Node Health Check Operator is a Red Hat OpenShift add-on Operator that implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system and so must be configured with an external system that provides these features. By default, it is configured to utilize the Self Node Remediation system.
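For example, a NodeHealthCheck resource similar to the following marks a worker node as unhealthy when its Ready condition is False or Unknown for 300 seconds, and triggers Self Node Remediation through the referenced template. Treat the field values as an illustrative sketch:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nodehealthcheck-sample
    spec:
      minHealthy: 51% # remediate only while at least this share of selected nodes is healthy
      selector:
        matchExpressions:
        - key: node-role.kubernetes.io/worker
          operator: Exists
      remediationTemplate: # external fencing and remediation system to use
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-automatic-strategy-template
        namespace: openshift-workload-availability
      unhealthyConditions:
      - type: Ready
        status: "False"
        duration: 300s
      - type: Ready
        status: Unknown
        duration: 300s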

1.6. Node Maintenance

Administrators face situations where they need to interrupt the cluster, for example, to replace a drive, RAM, or a NIC.

In advance of this maintenance, affected nodes should be cordoned and drained. When a node is cordoned, new workloads cannot be scheduled on that node. When a node is drained, workloads on the affected node are transferred to other nodes to avoid or minimize downtime.

While this maintenance can be performed by using command-line tools, the Node Maintenance Operator offers a declarative approach by using a custom resource. When such a resource exists for a node, the Operator cordons and drains the node until the resource is deleted.
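For reference, the imperative equivalent uses oc adm cordon and oc adm drain, while the declarative approach creates a NodeMaintenance custom resource. The resource fields shown are an illustrative sketch and the node name is a placeholder:

    # Imperative approach with command-line tools
    oc adm cordon <node-name>
    oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force

    # Declarative approach: the node stays cordoned and drained while this resource exists
    apiVersion: nodemaintenance.medik8s.io/v1beta1
    kind: NodeMaintenance
    metadata:
      name: node-maintenance-example
    spec:
      nodeName: <node-name>
      reason: "Replacing a faulty NIC"

Deleting the NodeMaintenance resource ends the maintenance window and makes the node schedulable again.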

1.7. About metrics for workload availability operators

The addition of data analysis enhances observability for the workload availability operators. The data provides metrics about the activity of the operators and their effect on the cluster. These metrics improve decision-making capabilities, enable data-driven optimization, and enhance overall system performance.

You can use metrics to do these tasks:

  • Access comprehensive tracking data for operators, to monitor overall system efficiency.
  • Access actionable insights derived from tracking data, such as identifying frequently failing nodes or downtime caused by an operator’s remediations.
  • Visualize how the operator’s remediations enhance overall system efficiency.

1.7.1. Configuring metrics for workload availability operators

You can configure metrics for the workload availability operators by creating a token secret and a ServiceMonitor resource in the namespace where the Operator is installed. The following procedure uses the Node Health Check Operator in the openshift-workload-availability namespace as an example.

Prerequisites

  • The Node Health Check Operator is installed, for example, in the openshift-workload-availability namespace.
  • User workload monitoring is enabled. The prometheus-user-workload-token secret exists only when user workload monitoring is enabled.

Procedure

  1. Create the prometheus-user-workload-token secret in the Operator namespace by copying the existing prometheus-user-workload-token secret from the openshift-user-workload-monitoring namespace, as follows:

    existingPrometheusTokenSecret=$(kubectl get secret --namespace openshift-user-workload-monitoring | grep prometheus-user-workload-token | awk '{print $1}') 1
    
    kubectl get secret ${existingPrometheusTokenSecret} --namespace=openshift-user-workload-monitoring -o yaml | \
        sed '/namespace: .*==/d;/ca.crt:/d;/serviceCa.crt/d;/creationTimestamp:/d;/resourceVersion:/d;/uid:/d;/annotations/d;/kubernetes.io/d;' | \
        sed 's/namespace: .*/namespace: openshift-workload-availability/' | \ 2
        sed 's/name: .*/name: prometheus-user-workload-token/' | \ 3
        sed 's/type: .*/type: Opaque/' \
        > prom-token.yaml
    
    kubectl apply -f prom-token.yaml
    1
    This token secret is required by the metrics ServiceMonitor that is created in the next step.
    2
    Ensure that the new secret’s namespace is the one where the Node Health Check Operator is installed, for example, openshift-workload-availability.
    3
    The prometheus-user-workload-token secret exists only if user workload Prometheus scraping is enabled.
  2. Create the ServiceMonitor as follows:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: node-healthcheck-metrics-monitor
      namespace: openshift-workload-availability 1
      labels:
        app.kubernetes.io/component: controller-manager
    spec:
      endpoints:
      - interval: 30s
        port: https
        scheme: https
        authorization:
          type: Bearer
          credentials:
            name: prometheus-user-workload-token
            key: token
        tlsConfig:
          ca:
            configMap:
              name: nhc-serving-certs-ca-bundle
              key: service-ca.crt
          serverName: node-healthcheck-controller-manager-metrics-service.openshift-workload-availability.svc 2
      selector:
        matchLabels:
          app.kubernetes.io/component: controller-manager
          app.kubernetes.io/name: node-healthcheck-operator
          app.kubernetes.io/instance: metrics
    1
    Specify the namespace where you want to configure the metrics, for example, openshift-workload-availability.
    2
    The serverName must contain the namespace where the Operator is installed. In the example, openshift-workload-availability is placed after the metrics service name and before the svc suffix.

Verification

To confirm that the configuration is successful, check that the Observe > Targets tab in the Red Hat OpenShift web console shows the endpoint status as Up.
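You can also check from the command line that the ServiceMonitor exists in the Operator namespace, for example:

    oc get servicemonitor node-healthcheck-metrics-monitor -n openshift-workload-availability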

1.7.2. Example metrics for workload availability operators

The following are example metrics from the various workload availability operators.

The metrics include information on the following indicators:

  • Operator availability: Showing if and when each Operator is up and running.
  • Node remediation count: Showing the number of remediations on the same node, and across all nodes.
  • Node remediation duration: Showing the remediation downtime or recovery time.
  • Node remediation gauge: Showing the number of ongoing remediations.
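After the ServiceMonitor is scraping the Operator, you can chart these indicators from the Observe > Metrics tab or through the Prometheus API. The metric name in the following query is a hypothetical placeholder; check the metrics exported by your installed Operator version for the exact names:

    # Hypothetical example: total remediations per node over the last 24 hours
    sum by (node) (increase(node_remediation_total[24h]))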