Chapter 3. Using Fence Agents Remediation
You can use the Fence Agents Remediation Operator to automatically remediate unhealthy nodes, similar to the Self Node Remediation Operator. Using a management interface or a traditional API, this Operator runs a fence agent that remediates an unhealthy node by power-cycling it.
3.1. About the Fence Agents Remediation Operator
The Fence Agents Remediation (FAR) Operator uses external tools to fence unhealthy nodes. These tools are a set of fence agents. Each fence agent targets a particular environment and fences a node by using a traditional Application Programming Interface (API) call that reboots the node. By doing so, FAR can minimize downtime for stateful applications, restore compute capacity if transient failures occur, and increase the availability of workloads.
FAR not only fences a node when it becomes unhealthy, it also tries to remediate the node from being unhealthy to healthy. It adds a taint to evict stateless pods, fences the node with a fence agent, and after a reboot, it completes the remediation with resource deletion to remove any remaining workloads (mostly stateful workloads). Adding the taint and deleting the workloads accelerates the workload rescheduling.
The Operator watches for new or deleted custom resources (CRs) called FenceAgentsRemediation which trigger a fence agent to remediate a node, based on the CR’s name. FAR uses the NodeHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the NodeHealthCheck resource creates the FenceAgentsRemediation CR, based on the FenceAgentsRemediationTemplate CR, which then triggers the Fence Agents Remediation Operator.
FAR uses a fence agent to fence a Kubernetes node. Generally, fencing is the process of bringing an unresponsive or unhealthy computer into a safe state and isolating it. A fence agent is software that uses a management interface to perform fencing. Most fencing is power-based, which means power-cycling, resetting, or turning off the computer. An example fence agent is fence_ipmilan, which is used for Intelligent Platform Management Interface (IPMI) environments.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: node-name
  namespace: openshift-operators
spec:

In this CR, node-name should match the name of the unhealthy cluster node.
The Operator includes a set of fence agents that are also available in the Red Hat High Availability Add-On. These agents use a management interface, such as IPMI or an API, to provision or reboot a node for bare metal servers, virtual machines, and cloud platforms.
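The FenceAgentsRemediation CR shown earlier in this section has an empty spec. As a rough sketch only, and assuming that the CR spec carries the same agent, nodeparameters, and sharedparameters fields that appear under spec.template.spec in the FenceAgentsRemediationTemplate CR in section 3.4, a CR for a node named worker-0-0 might resemble the following. The node name and all parameter values are illustrative and borrowed from the section 3.4 example. In normal operation the NodeHealthCheck controller creates this CR, so you would not usually write it by hand.

```yaml
# Hypothetical example: field layout assumed to mirror the template's spec.template.spec.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: worker-0-0              # must match the name of the unhealthy node
  namespace: openshift-operators
spec:
  agent: fence_ipmilan          # fence agent to run
  nodeparameters:
    --ipport:
      worker-0-0: '6233'        # per-node value, taken from the section 3.4 example
  sharedparameters:             # values shared by every node, taken from section 3.4
    '--action': reboot
    '--ip': 192.168.123.1
    '--lanplus': ''
    '--password': password
    '--username': admin
```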
3.2. Installing the Fence Agents Remediation Operator by using the web console
You can use the Red Hat OpenShift web console to install the Fence Agents Remediation Operator.
Prerequisites
- Log in as a user with cluster-admin privileges.
Procedure
- In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
- Select the Fence Agents Remediation Operator, or FAR, from the list of available Operators, and then click Install.
- Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-operators namespace.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
- Check that the Operator is installed in the openshift-operators namespace and its status is Succeeded.
If the Operator is not installed successfully:
- Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
- Navigate to the Workloads → Pods page and check the log of the fence-agents-remediation-controller-manager pod for any reported issues.
3.3. Installing the Fence Agents Remediation Operator by using the CLI
You can use the OpenShift CLI (oc) to install the Fence Agents Remediation Operator.
You can install the Fence Agents Remediation Operator in your own namespace or in the openshift-operators namespace.
To install the Operator in your own namespace, follow the steps in the procedure.
To install the Operator in the openshift-operators namespace, skip to step 3 of the procedure because the steps to create a new Namespace custom resource (CR) and an OperatorGroup CR are not required.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
1. Create a Namespace custom resource (CR) for the Fence Agents Remediation Operator:

   a. Define the Namespace CR and save the YAML file, for example, fence-agents-remediation-namespace.yaml:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: fence-agents-remediation-namespace

   b. To create the Namespace CR, run the following command:

      $ oc create -f fence-agents-remediation-namespace.yaml

2. Create an OperatorGroup CR:

   a. Define the OperatorGroup CR and save the YAML file, for example, fence-agents-remediation-operator-group.yaml:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: fence-agents-remediation-operator-group
        namespace: fence-agents-remediation-namespace

   b. To create the OperatorGroup CR, run the following command:

      $ oc create -f fence-agents-remediation-operator-group.yaml

3. Create a Subscription CR:

   a. Define the Subscription CR and save the YAML file, for example, fence-agents-remediation-subscription.yaml:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: fence-agents-remediation-subscription
        namespace: fence-agents-remediation-namespace
      spec:
        channel: stable
        name: fence-agents-remediation
        source: redhat-operators
        sourceNamespace: openshift-marketplace
        package: fence-agents-remediation

      For metadata.namespace, specify the Namespace where you want to install the Fence Agents Remediation Operator, for example, the fence-agents-remediation-namespace outlined earlier in this procedure. You can install the Subscription CR for the Fence Agents Remediation Operator in the openshift-operators namespace where there is already a matching OperatorGroup CR.

   b. To create the Subscription CR, run the following command:

      $ oc create -f fence-agents-remediation-subscription.yaml
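If you install the Operator in the openshift-operators namespace instead, as described at the start of this section, only the metadata.namespace value of the Subscription CR changes. The following sketch assumes that the CR name from the procedure above is reused; every other field is copied unchanged from that Subscription CR.

```yaml
# Variant of the Subscription CR for the openshift-operators namespace.
# No Namespace or OperatorGroup CR is needed, because a matching
# OperatorGroup already exists in openshift-operators.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: fence-agents-remediation-subscription
  namespace: openshift-operators   # the only value that differs from the procedure
spec:
  channel: stable
  name: fence-agents-remediation
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  package: fence-agents-remediation
```

With this variant, run the verification commands with -n openshift-operators instead of -n fence-agents-remediation-namespace.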
Verification
1. Verify that the installation succeeded by inspecting the CSV resource:

   $ oc get csv -n fence-agents-remediation-namespace

   Example output

   NAME                               DISPLAY                             VERSION   REPLACES   PHASE
   fence-agents-remediation.v.0.2.0   Fence Agents Remediation Operator   v.0.2.0              Succeeded

2. Verify that the Fence Agents Remediation Operator is up and running:

   $ oc get deploy -n fence-agents-remediation-namespace

   Example output

   NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
   fence-agents-remediation-controller-manager   1/1     1            1           28h
3.4. Configuring the Fence Agents Remediation Operator
You can use the Fence Agents Remediation Operator to create the FenceAgentsRemediationTemplate custom resource (CR), which is used by the Node Health Check Operator (NHC). This CR defines the fence agent to be used in the cluster, along with all the parameters required for remediating the nodes. When NHC is used, it can select a FenceAgentsRemediationTemplate as the remediationTemplate for power-cycling the node.
In the current release, there may be many FenceAgentsRemediationTemplate CRs, but at most one for each fence agent. This is a known limitation that will be addressed in a future release.
The FenceAgentsRemediationTemplate CR resembles the following YAML file:
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-template-fence-ipmilan
  namespace: openshift-operators
spec:
  template:
    spec:
      agent: fence_ipmilan
      nodeparameters:
        --ipport:
          master-0-0: '6230'
          master-0-1: '6231'
          master-0-2: '6232'
          worker-0-0: '6233'
          worker-0-1: '6234'
          worker-0-2: '6235'
      sharedparameters:
        '--action': reboot
        '--ip': 192.168.123.1
        '--lanplus': ''
        '--password': password
        '--username': admin
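For context, the following sketch shows how a NodeHealthCheck CR might reference this template through its remediationTemplate field. It assumes that the Node Health Check Operator is installed and serves the remediation.medik8s.io/v1alpha1 API. The CR name nodehealthcheck-sample, the worker selector, the minHealthy value, and the durations are illustrative choices, not values taken from this chapter.

```yaml
# Hypothetical NodeHealthCheck CR that selects the FAR template above.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample      # illustrative name
spec:
  minHealthy: 51%                   # illustrative threshold
  remediationTemplate:              # points NHC at the FAR template
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    name: fence-agents-remediation-template-fence-ipmilan
    namespace: openshift-operators
  selector:                         # illustrative: watch worker nodes only
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:              # illustrative conditions and durations
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
```

The name and namespace under remediationTemplate must match the metadata of the FenceAgentsRemediationTemplate CR, otherwise NHC cannot create the FenceAgentsRemediation CR from it.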
3.5. Troubleshooting the Fence Agents Remediation Operator
3.5.1. General troubleshooting
Issue: You want to troubleshoot issues with the Fence Agents Remediation Operator.
Resolution: Check the Operator logs.

$ oc logs <fence-agents-remediation-controller-manager-name> -c manager -n <namespace-name>
3.5.2. Unsuccessful remediation
Issue: An unhealthy node was not remediated.
Resolution: Verify that the FenceAgentsRemediation CR was created by running the following command:

$ oc get far -A

If the NodeHealthCheck controller did not create the FenceAgentsRemediation CR when the node turned unhealthy, check the logs of the NodeHealthCheck controller. Additionally, ensure that the NodeHealthCheck CR includes the required specification to use the remediation template.

If the FenceAgentsRemediation CR was created, ensure that its name matches the unhealthy node object.
3.5.3. Fence Agents Remediation Operator resources exist after uninstalling the Operator
Issue: The Fence Agents Remediation Operator resources, such as the remediation CR and the remediation template CR, exist after uninstalling the Operator.
Resolution: To remove the Fence Agents Remediation Operator resources, select the "Delete all operand instances for this operator" checkbox before uninstalling. This checkbox is available only in Red Hat OpenShift 4.13 and later. In all versions of Red Hat OpenShift, you can delete the resources by running the relevant command for each resource type:

$ oc delete far <fence-agents-remediation> -n <namespace>

$ oc delete fartemplate <fence-agents-remediation-template> -n <namespace>

The remediation CR far must be created and deleted by the same entity, for example, NHC. If the remediation CR far is still present, it is deleted together with the FAR Operator.

The remediation template CR fartemplate only exists if you use FAR with NHC. When the FAR Operator is deleted by using the web console, the remediation template CR fartemplate is also deleted.
3.6. Gathering data about the Fence Agents Remediation Operator
To collect debugging information about the Fence Agents Remediation Operator, use the must-gather tool. For information about the must-gather image for the Fence Agents Remediation Operator, see Gathering data about specific features.