Workload Availability for Red Hat OpenShift 24.1: Remediation, fencing, and maintenance

Chapter 1. About node remediation, fencing, and maintenance

Hardware is imperfect and software contains bugs. When node-level failures occur, such as a kernel hang or a failed network interface controller (NIC), the work required from the cluster does not decrease, and workloads from the affected nodes need to be restarted somewhere. However, some workloads, such as those that use ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics.

Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as fencing, before initiating recovery of the workload, known as remediation, and ideally recovery of the node as well.
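
The ordering described above (fence first, then remediate) can be illustrated with a minimal Python sketch. This is a toy in-memory model, not how any Operator is implemented; the `Node`, `Cluster`, `fence`, and `remediate` names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    fenced: bool = False  # True once the node is known to be in a safe state


@dataclass
class Cluster:
    # Maps workload name -> name of the node currently running it.
    placements: dict = field(default_factory=dict)


def fence(node: Node) -> None:
    """Hypothetical stand-in for a reboot or power cycle that makes the node safe."""
    node.fenced = True


def remediate(cluster: Cluster, failed: Node, target: Node) -> list:
    """Move workloads off the failed node, but only after it has been fenced."""
    if not failed.fenced:
        # Restarting an at-most-one workload now could leave two copies running.
        raise RuntimeError(f"{failed.name} is not fenced; refusing to recover workloads")
    moved = [w for w, n in cluster.placements.items() if n == failed.name]
    for workload in moved:
        cluster.placements[workload] = target.name
    return moved
```

Refusing to remediate an unfenced node is the point of the sketch: it is what preserves at-most-one semantics for workloads such as StatefulSets.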

It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, Red Hat OpenShift provides multiple components that automate failure detection, fencing, and remediation.

1.1. Self Node Remediation

The Self Node Remediation Operator is a Red Hat OpenShift add-on Operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI), or an API for node provisioning.

Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.

1.2. Fence Agents Remediation

The Fence Agents Remediation (FAR) Operator is a Red Hat OpenShift add-on Operator that automatically remediates unhealthy nodes, similar to the Self Node Remediation Operator. Using a management interface or traditional API, FAR runs a fence agent that remediates an unhealthy node by power-cycling it.

FAR is designed to run an existing set of upstream fencing agents in environments that expose a traditional API endpoint, for example IPMI, for power-cycling cluster nodes.

1.3. Machine Deletion Remediation

The Machine Deletion Remediation (MDR) Operator is a Red Hat OpenShift add-on Operator that uses the Machine API to reprovision unhealthy nodes. MDR works with Node Health Check (NHC), which creates a custom resource (CR) for MDR that contains information about the unhealthy node.

MDR follows the annotation on the node to the associated machine object and confirms that it has an owning controller. MDR proceeds to delete the machine, and then the owning controller recreates a replacement machine.
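
The lookup-and-delete flow described above can be sketched in Python over plain dictionaries. This is a minimal illustration, not the MDR implementation: the function names are hypothetical, the `machine.openshift.io/machine` annotation key and its `namespace/name` value format are assumptions, and the `machines` dictionary stands in for the cluster's machine objects.

```python
# Assumed annotation key that links a node to its machine ("namespace/name").
MACHINE_ANNOTATION = "machine.openshift.io/machine"


def machine_ref_for_node(node: dict):
    """Return the "namespace/name" machine reference stored on the node, if any."""
    return node.get("metadata", {}).get("annotations", {}).get(MACHINE_ANNOTATION)


def has_owning_controller(machine: dict) -> bool:
    """True if one of the machine's owner references is marked as its controller."""
    owners = machine.get("metadata", {}).get("ownerReferences", [])
    return any(owner.get("controller") for owner in owners)


def remediate_by_deletion(node: dict, machines: dict) -> bool:
    """Delete the node's machine only if a controller exists to recreate it.

    `machines` maps "namespace/name" to machine objects. Returns True when
    the machine was deleted and False when remediation was skipped.
    """
    ref = machine_ref_for_node(node)
    machine = machines.get(ref) if ref is not None else None
    if machine is None or not has_owning_controller(machine):
        # Without an owning controller, nothing would create a replacement.
        return False
    del machines[ref]
    return True
```

Checking for an owning controller before deleting matters because a machine without one would be removed permanently instead of being replaced.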

1.4. Machine Health Check

Machine Health Check is a failure detection, fencing, and remediation system that is built into Red Hat OpenShift. It monitors the status of machines and the conditions of nodes. Machine Health Checks can be configured to trigger external fencing and remediation systems, such as Self Node Remediation.

1.5. Node Health Check

The Node Health Check Operator is a Red Hat OpenShift add-on Operator that implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system, so it must be configured with an external system that provides these features. By default, it is configured to use the Self Node Remediation system.
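
As a rough illustration of condition-based failure detection, the following Python sketch flags a node as unhealthy when one of its conditions has stayed in a bad status for long enough. It is a simplified model, not the Operator's logic: the rule list, the 300-second durations, and the `is_unhealthy` function are hypothetical, and conditions are plain dictionaries whose `lastTransitionTime` is assumed to be already parsed into a `datetime`.

```python
from datetime import datetime, timedelta

# Hypothetical rules: (condition type, unhealthy status, minimum duration).
UNHEALTHY_RULES = [
    ("Ready", "False", timedelta(seconds=300)),
    ("Ready", "Unknown", timedelta(seconds=300)),
]


def is_unhealthy(conditions: list, now: datetime, rules=UNHEALTHY_RULES) -> bool:
    """Return True if any condition has matched a rule for at least its duration."""
    for cond in conditions:
        for cond_type, status, duration in rules:
            matches = cond["type"] == cond_type and cond["status"] == status
            if matches and now - cond["lastTransitionTime"] >= duration:
                return True
    return False
```

Requiring a minimum duration avoids reacting to brief, self-correcting blips. A detector like this only decides that a node is unhealthy; fencing and remediation are then delegated to an external system.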

1.6. Node Maintenance

Administrators face situations where they need to interrupt the cluster, for example, to replace a drive, RAM, or a NIC.

In advance of this maintenance, affected nodes should be cordoned and drained. When a node is cordoned, new workloads cannot be scheduled on that node. When a node is drained, workloads on the affected node are transferred to other nodes to avoid or minimize downtime.

While this maintenance can be performed by using command-line tools, the Node Maintenance Operator offers a declarative approach that uses a custom resource. While such a resource exists for a node, the Operator keeps the node cordoned and drained; the node stays in maintenance until the resource is deleted.
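
The declarative behavior described above can be pictured as a reconcile loop. The Python sketch below is a toy in-memory model, not the Operator's code: the `reconcile` function and its data structures are hypothetical, the set of maintenance requests stands in for the custom resources, and the drain step simply moves pods to the first schedulable node.

```python
def reconcile(nodes: dict, pods: dict, maintenance_requests: set) -> None:
    """Bring node state in line with the set of maintenance requests.

    nodes: node name -> {"unschedulable": bool}
    pods:  pod name  -> name of the node that hosts it
    maintenance_requests: names of nodes that have a maintenance resource
    """
    for name, node in nodes.items():
        if name in maintenance_requests:
            # Cordon: stop new workloads from being scheduled on the node.
            node["unschedulable"] = True
            # Drain: move existing workloads to a node that is still schedulable.
            targets = [
                other for other, state in nodes.items()
                if other not in maintenance_requests and not state["unschedulable"]
            ]
            for pod, host in list(pods.items()):
                if host == name and targets:
                    pods[pod] = targets[0]
        else:
            # Resource deleted: make the node schedulable again.
            # (This toy model assumes no node is cordoned for other reasons.)
            node["unschedulable"] = False
```

Calling `reconcile` with a node name in `maintenance_requests` cordons and drains that node; calling it again after removing the name makes the node schedulable again, which mirrors deleting the custom resource.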
