Chapter 4. Using Machine Deletion Remediation
You can use the Machine Deletion Remediation Operator to reprovision unhealthy nodes using the Machine API. You can use the Machine Deletion Remediation Operator in conjunction with the Node Health Check Operator.
4.1. About the Machine Deletion Remediation Operator
The Machine Deletion Remediation (MDR) Operator works with the NodeHealthCheck controller to reprovision unhealthy nodes using the Machine API. MDR follows the annotation on the node to the associated machine object, confirms that the machine has an owning controller (for example, MachineSetController), and deletes the machine. Once the machine CR is deleted, the owning controller creates a replacement.
The prerequisites for MDR include:
- a Machine API-based cluster that is able to programmatically destroy and create cluster nodes,
- nodes that are associated with machines, and
- declaratively managed machines.
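The last two prerequisites can be spot-checked from the CLI. The following is a minimal sketch, not the Operator's own logic. It assumes that the node-to-machine link is the machine.openshift.io/machine annotation with a <namespace>/<name> value, which this chapter does not name. It also treats any owner reference as sufficient, whereas MDR looks for an owning controller. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that a node maps to a machine and that the machine has an owner.
# Assumption (not from this chapter): the annotation key machine.openshift.io/machine
# and its "<namespace>/<name>" value format.

# Print "<namespace>/<machine-name>" for a node, or nothing if the annotation is absent.
node_machine() {
  oc get node "$1" \
    -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'
}

# Print the kinds of the owner references of a machine given as "<namespace>/<name>".
machine_owner_kinds() {
  local ref="$1"
  oc get machines.machine.openshift.io "${ref#*/}" -n "${ref%%/*}" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}'
}

# Return 0 only if the node has a machine and that machine has at least one owner.
mdr_can_remediate() {
  local ref owners
  ref="$(node_machine "$1")"
  [ -n "$ref" ] || { echo "node $1 has no machine annotation" >&2; return 1; }
  owners="$(machine_owner_kinds "$ref")"
  [ -n "$owners" ] || { echo "machine $ref has no owner" >&2; return 1; }
  echo "node $1 -> machine $ref (owned by: $owners)"
}
```

After sourcing the functions, a check for one node might look like this:

$ mdr_can_remediate <node-name>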
You can then modify the NodeHealthCheck CR to use MDR as its remediator. An example MDR template object and NodeHealthCheck configuration are provided in the documentation.
The MDR process works as follows:
- The Node Health Check Operator detects an unhealthy node and creates an MDR CR.
- The MDR Operator watches for the MDR CR associated with the unhealthy node and, if the node's machine has an owning controller, deletes the machine.
- When the node is healthy again, the NodeHealthCheck controller deletes the MDR CR.
4.2. Installing the Machine Deletion Remediation Operator by using the web console
You can use the Red Hat OpenShift web console to install the Machine Deletion Remediation Operator.
Prerequisites
- Log in as a user with cluster-admin privileges.
Procedure
- In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
- Select the Machine Deletion Remediation Operator, or MDR, from the list of available Operators, and then click Install.
- Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-workload-availability namespace.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
- Check that the Operator is installed in the openshift-workload-availability namespace and its status is Succeeded.
If the Operator is not installed successfully:
- Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
- Navigate to the Workloads → Pods page and check the log of the pod in the openshift-workload-availability project for any reported issues.
4.3. Installing the Machine Deletion Remediation Operator by using the CLI
You can use the OpenShift CLI (oc) to install the Machine Deletion Remediation Operator.
You can install the Machine Deletion Remediation Operator in your own namespace or in the openshift-workload-availability namespace.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
- Create a Namespace custom resource (CR) for the Machine Deletion Remediation Operator:
  - Define the Namespace CR and save the YAML file, for example, workload-availability-namespace.yaml:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-workload-availability

  - To create the Namespace CR, run the following command:

    $ oc create -f workload-availability-namespace.yaml
- Create an OperatorGroup CR:
  - Define the OperatorGroup CR and save the YAML file, for example, workload-availability-operator-group.yaml:

    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: workload-availability-operator-group
      namespace: openshift-workload-availability

  - To create the OperatorGroup CR, run the following command:

    $ oc create -f workload-availability-operator-group.yaml
- Create a Subscription CR:
  - Define the Subscription CR and save the YAML file, for example, machine-deletion-remediation-subscription.yaml:

    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: machine-deletion-remediation-operator
      namespace: openshift-workload-availability # (1)
    spec:
      channel: stable
      name: machine-deletion-remediation-operator
      source: redhat-operators
      sourceNamespace: openshift-marketplace
      package: machine-deletion-remediation

    (1) Specify the Namespace where you want to install the Machine Deletion Remediation Operator. When installing the Machine Deletion Remediation Operator in the openshift-workload-availability Subscription CR, the Namespace and OperatorGroup CRs will already exist.

  - To create the Subscription CR, run the following command:

    $ oc create -f machine-deletion-remediation-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n openshift-workload-availability
Example output
NAME                                  DISPLAY                                 VERSION   REPLACES                              PHASE
machine-deletion-remediation.v0.3.0   Machine Deletion Remediation Operator   0.3.0     machine-deletion-remediation.v0.2.1   Succeeded
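Because the CSV can take a while to reach the Succeeded phase, the check above is sometimes wrapped in a polling loop. The following sketch shows one way to do that. The function names, the retry count, and the delay are hypothetical, and the CSV name prefix is taken from the example output.

```shell
#!/usr/bin/env bash
# Print the phase of the first CSV in a namespace whose name starts with a prefix.
csv_phase() {
  local ns="$1" prefix="$2"
  oc get csv -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    awk -v p="$prefix" 'index($1, p) == 1 { print $2; exit }'
}

# Poll until the CSV reports Succeeded, or give up after the given number of tries.
# Arguments: namespace, CSV name prefix, tries (default 30), delay in seconds (default 10).
wait_for_csv() {
  local ns="$1" prefix="$2" tries="${3:-30}" delay="${4:-10}" phase i
  for ((i = 1; i <= tries; i++)); do
    phase="$(csv_phase "$ns" "$prefix")"
    echo "attempt $i: phase=${phase:-<none>}"
    [ "$phase" = "Succeeded" ] && return 0
    sleep "$delay"
  done
  return 1
}
```

For example, after sourcing the functions:

$ wait_for_csv openshift-workload-availability machine-deletion-remediation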
4.4. Configuring the Machine Deletion Remediation Operator
You can use the Machine Deletion Remediation Operator, with the Node Health Check Operator, to create the MachineDeletionRemediationTemplate custom resource (CR). This CR defines the remediation strategy for the nodes.
The MachineDeletionRemediationTemplate CR resembles the following YAML file:

apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediationTemplate
metadata:
  name: machinedeletionremediationtemplate-sample
  namespace: openshift-workload-availability
spec:
  template:
    spec: {}
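For the template to take effect, a NodeHealthCheck CR has to reference it as its remediator. The following sketch writes such a manifest to a file. The NodeHealthCheck apiVersion, the CR name, the worker selector, the minHealthy value, and the unhealthy conditions are assumptions that do not come from this chapter. Check them against the Node Health Check Operator documentation for your version. The function name write_nhc_manifest is hypothetical.

```shell
#!/usr/bin/env bash
# Write a NodeHealthCheck manifest that points at an MDR template.
# Arguments: output file, template name, template namespace.
# Assumed values (not from this chapter): the NodeHealthCheck apiVersion, the CR name,
# minHealthy, the worker selector, and the unhealthy conditions.
write_nhc_manifest() {
  local file="$1" template="$2" ns="$3"
  cat > "$file" <<EOF
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  minHealthy: 51%
  remediationTemplate:
    apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
    kind: MachineDeletionRemediationTemplate
    name: ${template}
    namespace: ${ns}
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
EOF
}
```

After sourcing the function, the manifest can be generated, reviewed, and then applied:

$ write_nhc_manifest nhc.yaml machinedeletionremediationtemplate-sample openshift-workload-availability
$ oc apply -f nhc.yaml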
4.5. Troubleshooting the Machine Deletion Remediation Operator
4.5.1. General troubleshooting
- Issue
- You want to troubleshoot issues with the Machine Deletion Remediation Operator.
- Resolution
Check the Operator logs.
$ oc logs <machine-deletion-remediation-controller-manager-name> -c manager -n <namespace-name>
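The pod name placeholder in the command above has to be looked up first. The following sketch combines the lookup and the log retrieval. It assumes that the controller pod name contains machine-deletion-remediation-controller-manager, as the placeholder suggests. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the names of pods in a namespace whose name contains the given fragment.
find_pods() {
  local ns="$1" fragment="$2"
  oc get pods -n "$ns" -o name | grep -- "$fragment" | sed 's|^pod/||'
}

# Print the "manager" container log of each matching MDR controller pod.
# Assumption: the pod name contains "machine-deletion-remediation-controller-manager".
mdr_logs() {
  local ns="$1" pod
  for pod in $(find_pods "$ns" "machine-deletion-remediation-controller-manager"); do
    echo "== $pod =="
    oc logs "$pod" -c manager -n "$ns"
  done
}
```

For example, after sourcing the functions:

$ mdr_logs openshift-workload-availability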
4.5.2. Unsuccessful remediation
- Issue
- An unhealthy node was not remediated.
- Resolution
Verify that the MachineDeletionRemediation CR was created by running the following command:
$ oc get mdr -A
If the NodeHealthCheck controller did not create the MachineDeletionRemediation CR when the node became unhealthy, check the logs of the NodeHealthCheck controller. Additionally, ensure that the NodeHealthCheck CR includes the required specification to use the remediation template.
If the MachineDeletionRemediation CR was created, ensure that its name matches the name of the unhealthy node object.
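The name comparison can be scripted when there are many nodes. The following sketch lists MachineDeletionRemediation CRs whose name does not correspond to any node. It relies only on the name-matching rule stated above, and the function name mdr_without_node is hypothetical.

```shell
#!/usr/bin/env bash
# List MachineDeletionRemediation CR names that do not match any node name.
mdr_without_node() {
  local nodes name
  # Collect all node names, one per line.
  nodes="$(oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
  # Print each MDR CR name that is not an exact match for a node name.
  oc get mdr -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' |
    while IFS= read -r name; do
      [ -n "$name" ] || continue
      printf '%s\n' "$nodes" | grep -qxF -- "$name" || echo "$name"
    done
}
```

An empty result means that every MDR CR name matches a node name.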
4.5.3. Machine Deletion Remediation Operator resources exist even after uninstalling the Operator
- Issue
- The Machine Deletion Remediation Operator resources, such as the remediation CR and the remediation template CR, exist even after uninstalling the Operator.
- Resolution
To remove the Machine Deletion Remediation Operator resources, select the Delete all operand instances for this operator checkbox before uninstalling. This checkbox is available only in Red Hat OpenShift 4.13 and later. In all versions of Red Hat OpenShift, you can delete the resources by running the relevant command for each resource type:
$ oc delete mdr <machine-deletion-remediation> -n <namespace>
$ oc delete mdrt <machine-deletion-remediation-template> -n <namespace>
The remediation CR mdr must be created and deleted by the same entity, for example, NHC. If the remediation CR mdr is still present, it is deleted, together with the MDR operator.
The remediation template CR mdrt only exists if you use MDR with NHC. When the MDR operator is deleted using the web console, the remediation template CR mdrt is also deleted.
4.6. Gathering data about the Machine Deletion Remediation Operator
To collect debugging information about the Machine Deletion Remediation Operator, use the must-gather tool. For information about the must-gather image for the Machine Deletion Remediation Operator, see Gathering data about specific features.