Chapter 6. Remediating nodes with Node Health Checks
You can use the Node Health Check Operator to identify unhealthy nodes. The Operator then uses other remediation providers to remediate the unhealthy nodes.
The Node Health Check Operator can be used with other remediation providers, including:
- The Self Node Remediation Operator.
- The Fence Agents Remediation Operator.
- The Machine Deletion Remediation Operator.
Because Red Hat OpenShift Service on AWS (ROSA) clusters include preinstalled machine health checks, the Node Health Check Operator cannot function in that environment.
The Node Health Check Operator is a "Rolling Stream" Operator, meaning updates are available asynchronously from OpenShift Container Platform releases. For more information, see OpenShift Operator Life Cycles on the Red Hat Customer Portal.
6.1. About the Node Health Check Operator
The Node Health Check Operator detects the health of the nodes in a cluster. The NodeHealthCheck controller creates the NodeHealthCheck custom resource (CR), which defines a set of criteria and thresholds to determine the health of a node.
When the Node Health Check Operator detects an unhealthy node, it creates a remediation CR that triggers the remediation provider. For example, the controller creates the SelfNodeRemediation CR, which triggers the Self Node Remediation Operator to remediate the unhealthy node.
The NodeHealthCheck CR resembles the following YAML file, with self-node-remediation as the remediation provider:
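A minimal sketch of such a CR follows. The CR name, the SelfNodeRemediationTemplate name, and the timeout and duration values are illustrative placeholders, and the API versions shown are those used by the medik8s project and might differ by release. Both remediationTemplate and escalatingRemediations appear here only so that every numbered callout below has an anchor; in a real CR you configure only one of the two.

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nodehealthcheck-sample
    spec:
      minHealthy: 51%                                # 1
      pauseRequests:                                 # 2
        - pause-test-cluster
      remediationTemplate:                           # 3
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-template         # placeholder; use a template that exists in your cluster
        namespace: openshift-workload-availability
      escalatingRemediations:                        # 4
        - remediationTemplate:
            apiVersion: self-node-remediation.medik8s.io/v1alpha1
            kind: SelfNodeRemediationTemplate
            name: self-node-remediation-template     # placeholder
            namespace: openshift-workload-availability
          order: 1
          timeout: 300s
      selector:                                      # 5
        matchExpressions:
          - key: node-role.kubernetes.io/worker
            operator: Exists
      unhealthyConditions:                           # 6
        - type: Ready
          status: "False"
          duration: 300s                             # 7
        - type: Ready
          status: Unknown
          duration: 300s                             # 8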
1. Specifies the number of healthy nodes (as a percentage or a count) required for a remediation provider to concurrently remediate nodes in the targeted pool. If the number of healthy nodes equals or exceeds the limit set by minHealthy, remediation occurs. The default value is 51%.
2. Prevents any new remediation from starting, while allowing any ongoing remediations to persist. The default value is empty. However, you can enter an array of strings that identify the cause of pausing the remediation, for example, pause-test-cluster.
   Note: During the upgrade process, nodes in the cluster might become temporarily unavailable and get identified as unhealthy. In the case of worker nodes, when the Operator detects that the cluster is upgrading, it stops remediating new unhealthy nodes to prevent such nodes from rebooting.
3. Specifies a remediation template from the remediation provider, for example, from the Self Node Remediation Operator. remediationTemplate is mutually exclusive with escalatingRemediations.
4. Specifies a list of RemediationTemplates with order and timeout fields. To obtain a healthy node, use this field to sequence and configure multiple remediations. This strategy increases the likelihood of obtaining a healthy node, instead of depending on a single remediation that might not be successful. The order field determines the order in which the remediations are invoked (lower order = earlier invocation). The timeout field determines when the next remediation is invoked. escalatingRemediations is mutually exclusive with remediationTemplate.
   Note: When escalatingRemediations is used, the remediation providers Self Node Remediation Operator and Fence Agents Remediation Operator can be used multiple times with different remediationTemplate configurations. However, you cannot use the same Machine Deletion Remediation configuration with different remediationTemplate configurations.
5. Specifies a selector that matches labels or expressions that you want to check. Avoid selecting both control-plane and worker nodes in one CR.
6. Specifies a list of the conditions that determine whether a node is considered unhealthy.
7, 8. Specifies the timeout duration for a node condition. If a condition is met for the duration of the timeout, the node is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy node.
The NodeHealthCheck CR resembles the following YAML file, with metal3 as the remediation provider:
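The following is a minimal sketch; the CR name, the minHealthy value, the Metal3RemediationTemplate name, and the selector are illustrative, and the template is assumed to live in the openshift-machine-api namespace:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nhc-worker-metal3
    spec:
      minHealthy: 30%
      remediationTemplate:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3RemediationTemplate
        name: metal3-remediation
        namespace: openshift-machine-api
      selector:
        matchExpressions:
          - key: node-role.kubernetes.io/worker
            operator: Exists
      unhealthyConditions:
        - type: Ready
          status: "False"
          duration: 300s
        - type: Ready
          status: Unknown
          duration: 300s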
The matchExpressions are examples only; you must map your machine groups based on your specific needs.
The Metal3RemediationTemplate resembles the following YAML file, with metal3 as the remediation provider:
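A minimal sketch follows; the template name and namespace, and the strategy values (type, retryLimit, timeout), are illustrative and should be adjusted to your environment:

    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3RemediationTemplate
    metadata:
      name: metal3-remediation
      namespace: openshift-machine-api
    spec:
      template:
        spec:
          strategy:
            type: Reboot
            retryLimit: 1
            timeout: 5m0s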
In addition to creating a NodeHealthCheck CR, you must also create the Metal3RemediationTemplate.
6.1.1. Understanding the Node Health Check Operator workflow
When a node is identified as unhealthy, the Node Health Check Operator checks how many other nodes are unhealthy. If the number of healthy nodes exceeds the amount that is specified in the minHealthy field of the NodeHealthCheck CR, the controller creates a remediation CR from the details that are provided in the external remediation template by the remediation provider. After remediation, the kubelet updates the node’s health status.
When the node turns healthy, the controller deletes the external remediation template.
6.1.2. About how node health checks prevent conflicts with machine health checks
When both node health checks and machine health checks are deployed, the node health check avoids conflict with the machine health check.
Red Hat OpenShift deploys machine-api-termination-handler as the default MachineHealthCheck resource.
The following list summarizes the system behavior when node health checks and machine health checks are deployed:
If only the default machine health check exists, the node health check continues to identify unhealthy nodes. However, the node health check ignores unhealthy nodes in a Terminating state. The default machine health check handles the unhealthy nodes with a Terminating state.
Example log message
INFO MHCChecker ignoring unhealthy Node, it is terminating and will be handled by MHC {"NodeName": "node-1.example.com"}

If the default machine health check is modified (for example, the unhealthyConditions is Ready), or if additional machine health checks are created, the node health check is disabled.
Example log message
INFO controllers.NodeHealthCheck disabling NHC in order to avoid conflict with custom MHCs configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}

When only the default machine health check exists again, the node health check is re-enabled.
Example log message
INFO controllers.NodeHealthCheck re-enabling NHC, no conflicting MHC configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}
6.2. Control plane fencing
In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.
Do not use the same NodeHealthCheck CR for worker nodes and control plane nodes. Grouping worker nodes and control plane nodes together can result in incorrect evaluation of the minimum healthy node count, and cause unexpected or missing remediations. This is because of the way the Node Health Check Operator handles control plane nodes. You should group the control plane nodes in their own group and the worker nodes in their own group. If required, you can also create multiple groups of worker nodes.
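For example, you might keep the control plane nodes in a NodeHealthCheck CR of their own with a control-plane-only selector. This is a minimal sketch; the CR name, template reference, and duration are illustrative placeholders:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nhc-control-plane
    spec:
      minHealthy: 51%
      selector:
        matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
      remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-template     # placeholder; use a template that exists in your cluster
        namespace: openshift-workload-availability
      unhealthyConditions:
        - type: Ready
          status: "False"
          duration: 300s

A separate CR with a node-role.kubernetes.io/worker selector would then cover the worker nodes.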
Considerations for remediation strategies:
- Avoid Node Health Check configurations that involve multiple configurations overlapping the same nodes because they can result in unexpected behavior. This suggestion applies to both worker and control plane nodes.
- The Node Health Check Operator implements a hardcoded limitation of remediating a maximum of one control plane node at a time. Multiple control plane nodes should not be remediated at the same time.
6.3. Installing the Node Health Check Operator by using the web console
You can use the Red Hat OpenShift web console to install the Node Health Check Operator.
Prerequisites
- Log in as a user with cluster-admin privileges.
Procedure
- In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
- Select the Node Health Check Operator, then click Install.
- Keep the default selection of Installation mode and namespace to ensure that the Operator will be installed to the openshift-workload-availability namespace.
- Ensure that the Console plug-in is set to Enable.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
- Check that the Operator is installed in the openshift-workload-availability namespace and that its status is Succeeded.
If the Operator is not installed successfully:
- Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
- Navigate to the Workloads → Pods page and check the logs in any pods in the openshift-workload-availability project that are reporting issues.
6.4. Installing the Node Health Check Operator by using the CLI
You can use the OpenShift CLI (oc) to install the Node Health Check Operator.
You can install the Node Health Check Operator in your own namespace or in the openshift-workload-availability namespace.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
- Create a Namespace custom resource (CR) for the Node Health Check Operator:
  Define the Namespace CR and save the YAML file, for example, node-health-check-namespace.yaml:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-workload-availability

  To create the Namespace CR, run the following command:

    $ oc create -f node-health-check-namespace.yaml
- Create an OperatorGroup CR:
  Define the OperatorGroup CR and save the YAML file, for example, workload-availability-operator-group.yaml:

    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: workload-availability-operator-group
      namespace: openshift-workload-availability

  To create the OperatorGroup CR, run the following command:

    $ oc create -f workload-availability-operator-group.yaml
- Create a Subscription CR:
  Define the Subscription CR and save the YAML file, for example, node-health-check-subscription.yaml:
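  The following sketch shows one possible Subscription; the source and sourceNamespace values are the usual Red Hat catalog defaults and are assumptions that might differ in your cluster:

    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: node-healthcheck-operator
      namespace: openshift-workload-availability  # 1
    spec:
      channel: stable                              # 2
      installPlanApproval: Manual                  # 3
      name: node-healthcheck-operator
      source: redhat-operators                     # assumed catalog source
      sourceNamespace: openshift-marketplace       # assumed catalog namespace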
  1. Specify the Namespace where you want to install the Node Health Check Operator. To install the Node Health Check Operator in the openshift-workload-availability namespace, specify openshift-workload-availability in the Subscription CR.
  2. Specify the channel name for your subscription. To upgrade to the latest version of the Node Health Check Operator, you must manually change the channel name for your subscription from candidate to stable.
  3. Set the approval strategy to Manual in case your specified version is superseded by a later version in the catalog. This plan prevents an automatic upgrade to a later version and requires manual approval before the starting CSV can complete the installation.
  To create the Subscription CR, run the following command:

    $ oc create -f node-health-check-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n openshift-workload-availability

Example output
NAME                               DISPLAY                      VERSION   REPLACES                           PHASE
node-healthcheck-operator.v0.7.0   Node Health Check Operator   0.7.0     node-healthcheck-operator.v0.6.1   Succeeded

Verify that the Node Health Check Operator is up and running:
$ oc get deployment -n openshift-workload-availability

Example output
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
node-healthcheck-controller-manager   2/2     2            2           10d
6.5. Creating a node health check
Using the web console, you can create a node health check to identify unhealthy nodes and specify the remediation type and strategy to fix them.
Procedure
- From the Administrator perspective of the Red Hat OpenShift web console, click Compute → NodeHealthChecks → CreateNodeHealthCheck.
- Specify whether to configure the node health check using the Form view or the YAML view.
- Enter a Name for the node health check. The name must consist of lowercase alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.
- Specify the Remediator type, as Self node remediation or Other. You must install the Self Node Remediation Operator manually. Selecting Other requires you to enter an API version, Kind, Name, and Namespace, which point to the remediation template resource of a remediator.
- Make a Nodes selection by specifying the labels of the nodes you want to remediate. The selection matches labels that you want to check. If more than one label is specified, the nodes must contain each label. The default value is empty, which selects both worker and control-plane nodes.
  Note: When creating a node health check with the Self Node Remediation Operator, you must select either node-role.kubernetes.io/worker or node-role.kubernetes.io/control-plane as the value.
- Specify the minimum number of healthy nodes, using either a percentage or a number, required for a NodeHealthCheck to remediate nodes in the targeted pool. If the number of healthy nodes equals or exceeds the limit set by Min healthy, remediation occurs. The default value is 51%.
- Specify a list of Unhealthy conditions that determine whether a node is considered unhealthy and requires remediation. You can specify the Type, Status, and Duration. You can also create your own custom type.
- Click Create to create the node health check.
Verification
- Navigate to the Compute → NodeHealthCheck page and verify that the corresponding node health check is listed, and that its status is displayed. Once created, node health checks can be paused, modified, and deleted.
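You can also list the checks from the command line with the OpenShift CLI; this assumes oc is logged in to the cluster:

    $ oc get nodehealthcheck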
6.6. Gathering data about the Node Health Check Operator
To collect debugging information about the Node Health Check Operator, use the must-gather tool. For information about the must-gather image for the Node Health Check Operator, see Gathering data about specific features.