Chapter 14. Deploying machine health checks
You can configure and deploy a machine health check to automatically repair damaged machines in a machine pool.
You can use the advanced machine management and scaling capabilities only in clusters where the Machine API is operational. Clusters with user-provisioned infrastructure require additional validation and configuration to use the Machine API.
Clusters with the infrastructure platform type none cannot use the Machine API.
To view the platform type for your cluster, run the following command:
$ oc get infrastructure cluster -o jsonpath='{.status.platform}'
14.1. About machine health checks
You can only apply a machine health check to machines that are managed by compute machine sets or control plane machine sets.
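To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the NotReady status for five minutes, and a label for the set of machines to monitor.
The controller that observes a MachineHealthCheck resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event.
To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy threshold allows in the targeted pool of machines, remediation stops so that you can intervene manually.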
Consider the timeouts carefully, accounting for workloads and requirements.
- Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
- Timeouts that are too short can result in a remediation loop. For example, the timeout for checking the NotReady status must be long enough to allow the machine to complete the startup process.
To stop the check, remove the resource.
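As a minimal sketch of removing the resource, the following shell function wraps the oc delete command. The function name stop_machine_health_check is hypothetical, and the default namespace is assumed to be openshift-machine-api, as in the sample resource in this chapter.

```shell
# Hypothetical wrapper (not part of oc): delete a MachineHealthCheck by name.
stop_machine_health_check() {
  local name="$1"                               # name of the MachineHealthCheck resource
  local namespace="${2:-openshift-machine-api}" # assumed default namespace
  oc delete machinehealthcheck "$name" -n "$namespace"
}
```

For example, stop_machine_health_check example would remove a health check named example. Machines that were already remediated are not restored by deleting the resource.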
14.1.1. Limitations when deploying machine health checks
There are limitations to consider before deploying a machine health check:
- Only machines owned by a machine set are remediated by a machine health check.
- If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
- If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.
- A machine is remediated immediately if the Machine resource phase is Failed.
14.2. Sample MachineHealthCheck resource
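The MachineHealthCheck resource for all cloud-based installation types, other than bare metal, resembles the following YAML file: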
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role>
      machine.openshift.io/cluster-api-machine-type: <role>
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
- name: Specify the name of the machine health check to deploy.
- machine.openshift.io/cluster-api-machine-role and machine.openshift.io/cluster-api-machine-type: Specify a label for the machine pool that you want to check.
- machine.openshift.io/cluster-api-machineset: Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
- timeout (in each unhealthyConditions entry): Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
- maxUnhealthy: Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed.
- nodeStartupTimeout: Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
The matchLabels are examples only; you must map your machine groups based on your specific needs.
14.2.1. Short-circuiting machine health check remediation
Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the maxUnhealthy field in the MachineHealthCheck resource.
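If the user defines a value for the maxUnhealthy field, then before remediating any machines, the MachineHealthCheck compares the value of maxUnhealthy with the number of machines within its target pool that it has determined to be unhealthy. Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy limit.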
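If maxUnhealthy is not set, the value defaults to 100% and the machines are remediated regardless of the state of the cluster.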
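The appropriate maxUnhealthy value depends on the scale of the cluster you deploy and how many machines the MachineHealthCheck covers. For example, you can use the maxUnhealthy value to cover multiple compute machine sets across multiple availability zones so that if you lose an entire zone, your maxUnhealthy setting prevents further remediation within the cluster.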
If you configure a MachineHealthCheck resource for the control plane, set the value of maxUnhealthy to 1.
This configuration ensures that the machine health check takes no action when multiple control plane machines appear to be unhealthy. Multiple unhealthy control plane machines can indicate that the etcd cluster is degraded or that a scaling operation to replace a failed machine is in progress.
If the etcd cluster is degraded, manual intervention might be required. If a scaling operation is in progress, the machine health check should allow it to finish.
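The control plane guidance above can be sketched as a resource definition written to a local file. This is an illustration only: the file name, the resource name control-plane-health, and the master role and type labels are assumptions that you should verify against the machines in your own cluster before applying anything.

```shell
# Write a hypothetical control plane health check definition to a local file.
cat > control-plane-healthcheck.yml <<'EOF'
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health          # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # Assumed labels for control plane machines; confirm them on your cluster.
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: 1                     # no remediation if more than one machine is unhealthy
EOF
```

After reviewing the generated file, you could apply it with oc apply -f control-plane-healthcheck.yml.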
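The maxUnhealthy field can be set as either an integer or a percentage. Remediation behaves differently depending on the form of the maxUnhealthy value.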
14.2.1.1. Setting maxUnhealthy by using an absolute value
If maxUnhealthy is set to 2:
- Remediation will be performed if 2 or fewer nodes are unhealthy
- Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
14.2.1.2. Setting maxUnhealthy by using percentages
If maxUnhealthy is set to 40% and there are 25 machines being checked:
- Remediation will be performed if 10 or fewer nodes are unhealthy
- Remediation will not be performed if 11 or more nodes are unhealthy
If maxUnhealthy is set to 40% and there are 6 machines being checked:
- Remediation will be performed if 2 or fewer nodes are unhealthy
- Remediation will not be performed if 3 or more nodes are unhealthy
The allowed number of machines is rounded down when the percentage of maxUnhealthy machines that are checked is not a whole number.
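The percentage arithmetic can be checked with a short shell sketch. The helper names max_unhealthy_allowed and remediation_allowed are hypothetical and only mirror the rounding rule described in this section; they are not the controller's implementation. The pool sizes of 25 and 6 machines are assumed because they are consistent with the limits of 10 and 2 listed above.

```shell
# Hypothetical helper: how many unhealthy machines a percentage-based
# maxUnhealthy value tolerates for a pool of a given size.
max_unhealthy_allowed() {
  local total="$1"    # number of machines selected by the health check
  local percent="$2"  # maxUnhealthy percentage, without the % sign
  # Shell integer division truncates, which rounds down for non-negative values.
  echo $(( total * percent / 100 ))
}

# Hypothetical helper: succeeds (exit status 0) if remediation would proceed.
remediation_allowed() {
  local unhealthy="$1" total="$2" percent="$3"
  [ "$unhealthy" -le "$(max_unhealthy_allowed "$total" "$percent")" ]
}

max_unhealthy_allowed 25 40   # prints 10
max_unhealthy_allowed 6 40    # prints 2, because 2.4 is rounded down
```

With these helpers, remediation_allowed 2 6 40 succeeds, while remediation_allowed 3 6 40 fails, which matches the 40% example for a small pool.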
14.3. Creating a machine health check resource
You can create a MachineHealthCheck resource for machine sets in your cluster.
You can only apply a machine health check to machines that are managed by compute machine sets or control plane machine sets.
Prerequisites
- Install the oc command-line interface.
Procedure
- Create a healthcheck.yml file that contains the definition of your machine health check.
- Apply the healthcheck.yml file to your cluster:
$ oc apply -f healthcheck.yml
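As a hedged sketch, the apply step can be combined with a quick check that the resource now exists. The function name apply_machine_health_check is hypothetical, and the namespace is assumed to be openshift-machine-api, as in the sample resource.

```shell
# Hypothetical wrapper: apply a health check definition, then list the
# MachineHealthCheck resources so you can confirm that it was created.
apply_machine_health_check() {
  local file="${1:-healthcheck.yml}"   # definition file from the previous step
  oc apply -f "$file" && oc get machinehealthcheck -n openshift-machine-api
}
```

Calling apply_machine_health_check with no arguments uses healthcheck.yml from the current directory. The listing runs only if the apply command succeeds.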
You can configure and deploy a machine health check to detect and repair unhealthy bare metal nodes.
14.4. About power-based remediation of bare metal
In a bare metal cluster, remediation of nodes is critical to ensuring the overall health of the cluster. Physically remediating a cluster can be challenging. Any delay in putting the machine into a safe or operational state increases both the time that the cluster remains in a degraded state and the risk that subsequent failures might bring the cluster offline. Power-based remediation helps counter such challenges.
Instead of reprovisioning the nodes, power-based remediation uses a power controller to power off an inoperable node. This type of remediation is also called power fencing.
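OpenShift Container Platform uses the MachineHealthCheck controller to detect faulty bare metal nodes. Power-based remediation is fast and reboots faulty nodes instead of removing them from the cluster.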
Power-based remediation provides the following capabilities:
- Allows the recovery of control plane nodes
- Reduces the risk of data loss in hyperconverged environments
- Reduces the downtime associated with recovering physical machines
14.4.1. MachineHealthChecks on bare metal
Machine deletion on a bare metal cluster triggers reprovisioning of a bare metal host. Bare metal reprovisioning is usually a lengthy process, during which the cluster is missing compute resources and applications might be interrupted. To change the default remediation process from machine deletion to host power-cycle, annotate the MachineHealthCheck resource with the machine.openshift.io/remediation-strategy: external-baremetal annotation.
After you set the annotation, unhealthy machines are power-cycled by using BMC credentials.
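One way to set the annotation on an existing resource is with oc annotate, sketched below. The function name enable_power_cycle_remediation is hypothetical, and the default namespace is assumed to be openshift-machine-api. You can also declare the annotation directly in the resource definition.

```shell
# Hypothetical wrapper: switch an existing MachineHealthCheck from machine
# deletion to host power-cycle remediation by adding the annotation.
enable_power_cycle_remediation() {
  local name="$1"                               # name of the MachineHealthCheck resource
  local namespace="${2:-openshift-machine-api}" # assumed default namespace
  oc annotate machinehealthcheck "$name" -n "$namespace" \
    machine.openshift.io/remediation-strategy=external-baremetal
}
```

For example, enable_power_cycle_remediation example annotates a health check named example.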
14.4.2. Understanding the remediation process
The remediation process operates as follows:
- The MachineHealthCheck (MHC) controller detects that a node is unhealthy.
- The MHC notifies the bare metal machine controller, which requests that the unhealthy node be powered off.
- After the power is off, the node is deleted, which allows the cluster to reschedule the affected workload on other nodes.
- The bare metal machine controller requests to power on the node.
- After the node is up, the node re-registers itself with the cluster, resulting in the creation of a new node.
- After the node is recreated, the bare metal machine controller restores the annotations and labels that existed on the unhealthy node before its deletion.
If the power operations do not complete, the bare metal machine controller triggers reprovisioning of the unhealthy node, unless it is a control plane node or a node that was provisioned externally.
14.4.3. Creating a MachineHealthCheck resource for bare metal
Prerequisites
- OpenShift Container Platform is installed using installer-provisioned infrastructure (IPI).
- You have access to Baseboard Management Controller (BMC) credentials, or BMC access to each node.
- You have network access to the BMC interface of the unhealthy node.
Procedure
- Create a healthcheck.yaml file that contains the definition of your machine health check.
- Apply the healthcheck.yaml file to your cluster by using the following command:
$ oc apply -f healthcheck.yaml
Sample MachineHealthCheck resource for bare metal
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role>
      machine.openshift.io/cluster-api-machine-type: <role>
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
- name: Specify the name of the machine health check to deploy.
- annotations: For bare metal clusters, you must include the machine.openshift.io/remediation-strategy: external-baremetal annotation in the annotations section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster.
- machine.openshift.io/cluster-api-machine-role and machine.openshift.io/cluster-api-machine-type: Specify a label for the machine pool that you want to check.
- machine.openshift.io/cluster-api-machineset: Specify the compute machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
- timeout (in each unhealthyConditions entry): Specify the timeout duration for the node condition. If the condition is met for the duration of the timeout, the machine is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
- maxUnhealthy: Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed.
- nodeStartupTimeout: Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
The matchLabels are examples only; you must map your machine groups based on your specific needs.
14.4.4. Troubleshooting issues with power-based remediation
To troubleshoot an issue with power-based remediation, verify the following:
- You have access to the BMC.
- The BMC is connected to the control plane node that is responsible for running the remediation task.