Este conteúdo não está disponível no idioma selecionado.

Chapter 5. Remediating nodes with Machine Health Checks


Machine health checks automatically repair unhealthy machines in a particular machine pool.

5.1. About machine health checks

Note

You can only apply a machine health check to control plane machines on clusters that use control plane machine sets.

To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the NotReady status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.

The controller that observes a MachineHealthCheck resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a machine deleted event.

To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy threshold allows for in the targeted pool of machines, remediation stops and therefore enables manual intervention.

Note

Consider the timeouts carefully, accounting for workloads and requirements.

  • Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
  • Too short timeouts can result in a remediation loop. For example, the timeout for checking the NotReady status must be long enough to allow the machine to complete the startup process.

To stop the check, remove the resource.

5.1.1. Limitations when deploying machine health checks

There are limitations to consider before deploying a machine health check:

  • Only machines owned by a machine set are remediated by a machine health check.
  • If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
  • If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.
  • A machine is remediated immediately if the Machine resource phase is Failed.

5.2. Configuring machine health checks to use the Self Node Remediation Operator

Use the following procedure to configure the worker or control-plane machine health checks to use the Self Node Remediation Operator as a remediation provider.

Note

To use the Self Node Remediation Operator as a remediation provider for machine health checks, a machine must have an associated node in the cluster.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a SelfNodeRemediationTemplate CR:

    1. Define the SelfNodeRemediationTemplate CR:

      apiVersion: self-node-remediation.medik8s.io/v1alpha1
      kind: SelfNodeRemediationTemplate
      metadata:
        namespace: openshift-machine-api
        name: selfnoderemediationtemplate-sample
      spec:
        template:
          spec:
            remediationStrategy: Automatic 
      1
      Copy to Clipboard Toggle word wrap
      1
      Specifies the remediation strategy. The default remediation strategy is Automatic.
    2. To create the SelfNodeRemediationTemplate CR, run the following command:

      $ oc create -f <snrt-name>.yaml
      Copy to Clipboard Toggle word wrap
  2. Create or update the MachineHealthCheck CR to point to the SelfNodeRemediationTemplate CR:

    1. Define or update the MachineHealthCheck CR:

      apiVersion: machine.openshift.io/v1beta1
      kind: MachineHealthCheck
      metadata:
        name: machine-health-check
        namespace: openshift-machine-api
      spec:
        selector:
          matchLabels: 
      1
      
            machine.openshift.io/cluster-api-machine-role: "worker"
            machine.openshift.io/cluster-api-machine-type: "worker"
        unhealthyConditions:
        - type:    "Ready"
          timeout: "300s"
          status: "False"
        - type:    "Ready"
          timeout: "300s"
          status: "Unknown"
        maxUnhealthy: "40%"
        nodeStartupTimeout: "10m"
        remediationTemplate: 
      2
      
          kind: SelfNodeRemediationTemplate
          apiVersion: self-node-remediation.medik8s.io/v1alpha1
          name: selfnoderemediationtemplate-sample
      Copy to Clipboard Toggle word wrap
      1
      Selects whether the machine health check is for worker or control-plane nodes. The label can also be user-defined.
      2
      Specifies the details for the remediation template.
    2. To create a MachineHealthCheck CR, run the following command:

      $ oc create -f <mhc-name>.yaml
      Copy to Clipboard Toggle word wrap
    3. To update a MachineHealthCheck CR, run the following command:

      $ oc apply -f <mhc-name>.yaml
      Copy to Clipboard Toggle word wrap
Voltar ao topo
Red Hat logoGithubredditYoutubeTwitter

Aprender

Experimente, compre e venda

Comunidades

Sobre a documentação da Red Hat

Ajudamos os usuários da Red Hat a inovar e atingir seus objetivos com nossos produtos e serviços com conteúdo em que podem confiar. Explore nossas atualizações recentes.

Tornando o open source mais inclusivo

A Red Hat está comprometida em substituir a linguagem problemática em nosso código, documentação e propriedades da web. Para mais detalhes veja o Blog da Red Hat.

Sobre a Red Hat

Fornecemos soluções robustas que facilitam o trabalho das empresas em plataformas e ambientes, desde o data center principal até a borda da rede.

Theme

© 2025 Red Hat