2.3. Deploying machine health checks

Understand and deploy machine health checks.

Important

This process is not applicable to clusters where you manually provisioned the machines. You can use the advanced machine management and scaling capabilities only in clusters where the Machine API is operational.

2.3.1. About machine health checks

You can define conditions under which machines in a cluster are considered unhealthy by using a MachineHealthCheck resource. Machines matching the conditions are automatically remediated.

To monitor machine health, create a MachineHealthCheck custom resource (CR) that includes a label for the set of machines to monitor and a condition to check, such as staying in the NotReady status for 15 minutes or displaying a permanent condition in the node-problem-detector.

The controller that observes a MachineHealthCheck CR checks for the condition that you defined. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event.

Note

For machines with the master role, the machine health check reports the number of unhealthy nodes, but the machine is not deleted. For example:

Example output

$ oc get machinehealthcheck example -n openshift-machine-api

NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
example   40%            3                  1

To limit the disruptive impact of machine deletions, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy threshold allows for in the targeted pool of machines, the controller stops deleting machines and you must manually intervene.

To stop the check, remove the custom resource.

2.3.1.1. MachineHealthChecks on Bare Metal

Machine deletion on a bare metal cluster triggers reprovisioning of the bare metal host. Bare metal reprovisioning is usually a lengthy process, during which the cluster is missing compute resources and applications might be interrupted. To change the default remediation process from machine deletion to host power-cycle, annotate the MachineHealthCheck resource with the machine.openshift.io/remediation-strategy: external-baremetal annotation.

After you set the annotation, unhealthy machines are power-cycled by using BMC credentials.
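As a minimal sketch, the annotation can also be added to an existing MachineHealthCheck resource from the command line instead of by editing its YAML. The function name enable_power_cycle_remediation is hypothetical and used only for illustration; the namespace and the resource name example are taken from the sample resource in this section, and the --overwrite flag is assumed to be acceptable because it replaces any remediation strategy that is already set.

```shell
# Hypothetical helper: annotate an existing MachineHealthCheck so that
# unhealthy bare metal hosts are power-cycled instead of deleted.
# $1 = name of the MachineHealthCheck resource
enable_power_cycle_remediation() {
  oc annotate machinehealthcheck "$1" \
    -n openshift-machine-api \
    machine.openshift.io/remediation-strategy=external-baremetal \
    --overwrite
}

# Usage (requires a logged-in oc session):
#   enable_power_cycle_remediation example
```

Wrapping the command in a function keeps the annotation key and namespace in one place if several health checks need the same change.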

2.3.1.2. Limitations when deploying machine health checks

There are limitations to consider before deploying a machine health check:

Only machines owned by a machine set are remediated by a machine health check.
Control plane machines are not currently supported and are not remediated if they are unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.
A machine is remediated immediately if the Machine resource phase is Failed.

2.3.2. Sample MachineHealthCheck resource

The MachineHealthCheck resource resembles one of the following YAML files:

MachineHealthCheck for bare metal

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example # 1
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal # 2
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> # 3
      machine.openshift.io/cluster-api-machine-type: <role> # 4
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> # 5
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s" # 6
    status: "False"
  - type: "Ready"
    timeout: "300s" # 7
    status: "Unknown"
  maxUnhealthy: "40%" # 8
  nodeStartupTimeout: "10m" # 9


1: Specify the name of the machine health check to deploy.
2: For bare metal clusters, you must include the machine.openshift.io/remediation-strategy: external-baremetal annotation in the annotations section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
3, 4: Specify a label for the machine pool that you want to check.
5: Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
6, 7: Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
8: Specify the number of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer.
9: Specify how long the machine health check waits for a node to join the cluster before the machine is determined to be unhealthy.

Note

The matchLabels are examples only; you must map your machine groups based on your specific needs.

MachineHealthCheck for all other installation types

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example # 1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> # 2
      machine.openshift.io/cluster-api-machine-type: <role> # 3
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> # 4
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s" # 5
    status: "False"
  - type: "Ready"
    timeout: "300s" # 6
    status: "Unknown"
  maxUnhealthy: "40%" # 7
  nodeStartupTimeout: "10m" # 8


1: Specify the name of the machine health check to deploy.
2, 3: Specify a label for the machine pool that you want to check.
4: Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
5, 6: Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
7: Specify the number of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer.
8: Specify how long the machine health check waits for a node to join the cluster before the machine is determined to be unhealthy.

Note

The matchLabels are examples only; you must map your machine groups based on your specific needs.

2.3.2.1. Short-circuiting machine health check remediation

Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the maxUnhealthy field in the MachineHealthCheck resource.

If the user defines a value for the maxUnhealthy field, before remediating any machines, the MachineHealthCheck compares the value of maxUnhealthy with the number of machines within its target pool that it has determined to be unhealthy. Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy limit.

Important

If maxUnhealthy is not set, the value defaults to 100% and the machines are remediated regardless of the state of the cluster.

The maxUnhealthy field can be set as either an integer or percentage. There are different remediation implementations depending on the maxUnhealthy value.

2.3.2.1.1. Setting maxUnhealthy by using an absolute value

If maxUnhealthy is set to 2:

Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy

These values are independent of how many machines are being checked by the machine health check.

2.3.2.1.2. Setting maxUnhealthy by using percentages

If maxUnhealthy is set to 40% and there are 25 machines being checked:

Remediation will be performed if 10 or fewer nodes are unhealthy
Remediation will not be performed if 11 or more nodes are unhealthy

If maxUnhealthy is set to 40% and there are 6 machines being checked:

Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy

Note

When the maxUnhealthy percentage of the checked machines is not a whole number, the allowed number of unhealthy machines is rounded down.
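The threshold arithmetic described in this section can be sketched in shell as follows. This is an illustration of the documented rule only, not the controller's implementation; the function names allowed_unhealthy and should_remediate are hypothetical. An absolute value is used as is, and a percentage is applied to the number of checked machines with integer division, which rounds down.

```shell
# Hypothetical helper: print the largest number of unhealthy machines
# at which remediation is still performed.
# $1 = maxUnhealthy value (integer such as 2, or percentage such as 40%)
# $2 = number of machines being checked
allowed_unhealthy() {
  case "$1" in
    *%) echo $(( $2 * ${1%\%} / 100 )) ;;  # percentage: integer division rounds down
    *)  echo "$1" ;;                        # absolute value: independent of pool size
  esac
}

# Hypothetical helper: succeed (exit status 0) when remediation is allowed.
# $1 = maxUnhealthy value, $2 = machines checked, $3 = unhealthy machines
should_remediate() {
  [ "$3" -le "$(allowed_unhealthy "$1" "$2")" ]
}

allowed_unhealthy 40% 25   # prints 10
allowed_unhealthy 40% 6    # prints 2 (2.4 rounded down)
allowed_unhealthy 2 25     # prints 2
```

For example, should_remediate 40% 6 2 succeeds, while should_remediate 40% 6 3 fails, which matches the cases listed above.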

2.3.3. Creating a MachineHealthCheck resource

You can create a MachineHealthCheck resource for all MachineSets in your cluster. You should not create a MachineHealthCheck resource that targets control plane machines.

Prerequisites

Install the oc command line interface.

Procedure

Create a healthcheck.yml file that contains the definition of your machine health check.
Apply the healthcheck.yml file to your cluster:
```
$ oc apply -f healthcheck.yml
```

2.3.4. Scaling a machine set manually

If you must add or remove an instance of a machine in a machine set, you can manually scale the machine set.

This guidance is relevant to fully automated, installer-provisioned infrastructure installations. Customized, user-provisioned infrastructure installations do not have machine sets.

Prerequisites

Install an OpenShift Container Platform cluster and the oc command line.
Log in to oc as a user with cluster-admin permission.

Procedure

View the machine sets that are in the cluster:
```
$ oc get machinesets -n openshift-machine-api
```
The machine sets are listed in the form of <clusterid>-worker-<aws-region-az>.

Scale the machine set:

$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

Or:

$ oc edit machineset <machineset> -n openshift-machine-api

You can scale the machine set up or down. It takes several minutes for the new machines to be available.

2.3.5. Understanding the difference between machine sets and the machine config pool

MachineSet objects describe OpenShift Container Platform nodes with respect to the cloud or machine provider.

The MachineConfigPool object allows MachineConfigController components to define and provide the status of machines in the context of upgrades.

The MachineConfigPool object allows users to configure how upgrades are rolled out to the OpenShift Container Platform nodes in the machine config pool.

The NodeSelector object can be replaced with a reference to the MachineSet object.
