Chapter 3. Using Fence Agents Remediation
You can use the Fence Agents Remediation (FAR) Operator to automatically remediate unhealthy nodes, similar to the Self Node Remediation Operator. FAR is designed to run an existing set of upstream fencing agents in environments that have a traditional API endpoint, for example IPMI, for power cycling cluster nodes, while their pods are quickly evicted based on the remediation strategy.
3.1. About the Fence Agents Remediation Operator
The Fence Agents Remediation (FAR) Operator uses external tools to fence unhealthy nodes. These tools are a set of fence agents. Each fence agent targets a particular environment and fences a node by using a traditional Application Programming Interface (API) call that reboots the node. By doing so, FAR can minimize downtime for stateful applications, restore compute capacity if transient failures occur, and increase the availability of workloads.
FAR not only fences a node when it becomes unhealthy, it also tries to remediate the node from being unhealthy to healthy. It adds a taint to evict stateless pods, fences the node with a fence agent, and after a reboot, it completes the remediation with resource deletion to remove any remaining workloads (mostly stateful workloads). Adding the taint and deleting the workloads accelerates the workload rescheduling.
				The Operator watches for new or deleted custom resources (CRs) called FenceAgentsRemediation which trigger a fence agent to remediate a node, based on the CR’s name. FAR uses the NodeHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the NodeHealthCheck resource creates the FenceAgentsRemediation CR, based on the FenceAgentsRemediationTemplate CR, which then triggers the Fence Agents Remediation Operator.
			
FAR uses a fence agent to fence a Kubernetes node. Generally, fencing is the process of taking an unresponsive or unhealthy computer into a safe state and isolating it. A fence agent is software that uses a management interface to perform fencing, mostly power-based fencing, which enables power-cycling, resetting, or turning off the computer. An example fence agent is fence_ipmilan, which is used for Intelligent Platform Management Interface (IPMI) environments.
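The numbered callouts that follow refer to a FenceAgentsRemediation manifest that is not reproduced in this copy of the page. As a rough sketch only, such a CR might look like the excerpt below. The API group and version are recalled from the upstream project rather than taken from this chapter, and the excerpt shows only the two fields that the callouts describe. A CR created by NodeHealthCheck would presumably also carry the agent and parameters copied from the template, so check the CRD installed in your cluster before relying on it.

```yaml
# Sketch only. The apiVersion is an assumption recalled from the upstream project.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: node-name                          # (1) must match the unhealthy node's name
  namespace: openshift-workload-availability
spec:
  remediationStrategy: ResourceDeletion    # (2) or OutOfServiceTaint
```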
			
- (1) The node-name should match the name of the unhealthy cluster node.
- (2) Specifies the remediation strategy for the nodes. For more information on the remediation strategies available, see the Understanding the Fence Agents Remediation Template configuration topic.
The Operator includes a set of fence agents that are also available in the Red Hat High Availability Add-On. These agents use a management interface, such as IPMI or an API, to provision or reboot a node for bare metal servers, virtual machines, and cloud platforms.
3.2. Installing the Fence Agents Remediation Operator by using the web console
You can use the Red Hat OpenShift web console to install the Fence Agents Remediation Operator.
Prerequisites
- Log in as a user with cluster-admin privileges.
Procedure
- In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
- Select the Fence Agents Remediation Operator, or FAR, from the list of available Operators, and then click Install.
- Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-workload-availability namespace.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
- Check that the Operator is installed in the openshift-workload-availability namespace and its status is Succeeded.
If the Operator is not installed successfully:
- Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
- Navigate to the Workloads → Pods page and check the log of the fence-agents-remediation-controller-manager pod for any reported issues.
3.3. Installing the Fence Agents Remediation Operator by using the CLI
				You can use the OpenShift CLI (oc) to install the Fence Agents Remediation Operator.
			
				You can install the Fence Agents Remediation Operator in your own namespace or in the openshift-workload-availability namespace.
			
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
- Create a Namespace custom resource (CR) for the Fence Agents Remediation Operator:
  - Define the Namespace CR and save the YAML file, for example, workload-availability-namespace.yaml:

        apiVersion: v1
        kind: Namespace
        metadata:
          name: openshift-workload-availability

  - To create the Namespace CR, run the following command:

        $ oc create -f workload-availability-namespace.yaml
 
- Create an OperatorGroup CR:
  - Define the OperatorGroup CR and save the YAML file, for example, workload-availability-operator-group.yaml:

        apiVersion: operators.coreos.com/v1
        kind: OperatorGroup
        metadata:
          name: workload-availability-operator-group
          namespace: openshift-workload-availability

  - To create the OperatorGroup CR, run the following command:

        $ oc create -f workload-availability-operator-group.yaml
 
- Create a Subscription CR:
  - Define the Subscription CR and save the YAML file, for example, fence-agents-remediation-subscription.yaml.

    (1) Specify the Namespace where you want to install the Fence Agents Remediation Operator, for example, the openshift-workload-availability namespace outlined earlier in this procedure. You can install the Subscription CR for the Fence Agents Remediation Operator in the openshift-workload-availability namespace where there is already a matching OperatorGroup CR.

  - To create the Subscription CR, run the following command:

        $ oc create -f fence-agents-remediation-subscription.yaml
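The Subscription manifest for the last step of the procedure is missing from this copy of the page. The following is a sketch of what it might contain, not the documented manifest. The Subscription kind and its spec fields are standard Operator Lifecycle Manager fields. The metadata name, the channel, and the catalog source are assumptions, and the package name is inferred from the CSV name shown in the verification output. Check the values offered by your cluster's catalog before applying it.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: fence-agents-remediation-subscription    # hypothetical name
  namespace: openshift-workload-availability      # (1) namespace that holds the OperatorGroup
spec:
  channel: stable                                 # assumption: verify the channel in your catalog
  name: fence-agents-remediation                  # package name, inferred from the CSV name
  source: redhat-operators                        # assumption: catalog source
  sourceNamespace: openshift-marketplace          # assumption: namespace of the catalog source
```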
 
Verification
- Verify that the installation succeeded by inspecting the CSV resource:

      $ oc get csv -n openshift-workload-availability

  Example output:

      NAME                              DISPLAY                             VERSION   REPLACES                          PHASE
      fence-agents-remediation.v0.3.0   Fence Agents Remediation Operator   0.3.0     fence-agents-remediation.v0.2.1   Succeeded

- Verify that the Fence Agents Remediation Operator is up and running:

      $ oc get deployment -n openshift-workload-availability

  Example output:

      NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
      fence-agents-remediation-controller-manager   2/2     2            2           110m
3.4. Configuring the Fence Agents Remediation Operator
You can use the Fence Agents Remediation Operator to create the FenceAgentsRemediationTemplate custom resource (CR), which is used by the Node Health Check Operator (NHC). This CR defines the fence agent to be used in the cluster, with all the parameters required for remediating the nodes. There can be many FenceAgentsRemediationTemplate CRs, at most one for each fence agent. When NHC is in use, it can select a FenceAgentsRemediationTemplate as the remediationTemplate for power-cycling the node.
			
				The FenceAgentsRemediationTemplate CR resembles the following YAML file:
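The manifest itself is missing from this copy of the page. The sketch below is a reconstruction under stated assumptions, numbered to line up with the callouts that follow. The API group and the exact spelling of the field names are recalled from the upstream project and should be confirmed against the CRD installed in your cluster. The template name, node names, ports, and user name are hypothetical, and credentials are deliberately left out.

```yaml
# Sketch only. Field names are assumptions; all values are illustrative.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-template-fence-ipmilan   # hypothetical name
  namespace: openshift-workload-availability
spec:
  template:
    spec:
      agent: fence_ipmilan        # (1) fence agent to execute
      nodeparameters:             # (2) node-specific parameters
        '--ipport':
          worker-0-0: '6233'      # hypothetical node name and port
          worker-0-1: '6234'
      sharedparameters:           # (3) cluster-wide parameters
        '--username': admin       # hypothetical user name
        '--action': reboot
      retrycount: 5               # (4) retries on failure
      retryinterval: 5s           # (5) interval between retries
      timeout: 1m0s               # (6) 60 seconds, written in minutes and seconds
```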
			
- (1) Displays the name of the fence agent to be executed, for example, fence_ipmilan.
- (2) Displays the node-specific parameters for executing the fence agent, for example, ipport.
- (3) Displays the cluster-wide parameters for executing the fence agent, for example, username.
- (4) Displays the number of times to retry the fence agent command in case of failure. The default number of attempts is 5.
- (5) Displays the interval between retries in seconds. The default is 5 seconds.
- (6) Displays the timeout for the fence agent command. The default is 60 seconds. For values of 60 seconds or greater, the timeout value is expressed in both minutes and seconds in the YAML file.
3.4.1. Understanding the Fence Agents Remediation Template configuration
The Fence Agents Remediation Operator also creates the FenceAgentsRemediationTemplate Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes, which aims to recover workloads faster. The following remediation strategies are available:
				
- ResourceDeletion: This remediation strategy removes the pods on the node.
- OutOfServiceTaint: This remediation strategy implicitly causes the removal of the pods and associated volume attachments on the node. It achieves this by placing the OutOfServiceTaint taint on the node. The OutOfServiceTaint strategy also represents a non-graceful node shutdown. A non-graceful node shutdown occurs when a node shuts down without the shutdown being detected, rather than through a shutdown triggered from within the operating system. This strategy has been supported as a Technology Preview feature since OpenShift Container Platform version 4.13, and as generally available since OpenShift Container Platform version 4.15.
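To make the OutOfServiceTaint strategy more concrete, the excerpt below sketches how an out-of-service taint appears on a Node object. The key and effect are the ones described in the Kubernetes documentation for non-graceful node shutdown. The node name is hypothetical, and the taint value that FAR itself sets is an assumption here.

```yaml
# Excerpt of a Node object while the taint is in place (sketch only).
apiVersion: v1
kind: Node
metadata:
  name: worker-0-0                           # hypothetical node name
spec:
  taints:
    - key: node.kubernetes.io/out-of-service
      value: nodeshutdown                    # value from the Kubernetes docs; FAR's value is assumed
      effect: NoExecute                      # evicts pods that do not tolerate the taint
```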
					The FenceAgentsRemediationTemplate CR resembles the following YAML file:
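The manifest is missing from this copy of the page. A sketch that matches the two callouts that follow might look like this. The API group is recalled from the upstream project and is an assumption, the name uses the example given in the first callout, and only the strategy field is shown.

```yaml
# Sketch only. The apiVersion is an assumption recalled from the upstream project.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-resource-deletion-template   # (1) "resource" or "taint" variant
  namespace: openshift-workload-availability
spec:
  template:
    spec:
      remediationStrategy: ResourceDeletion                   # (2) or OutOfServiceTaint
```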
				
- (1) Specifies the type of remediation template based on the remediation strategy. Replace <remediation_object> with either resource or taint; for example, fence-agents-remediation-resource-deletion-template.
- (2) Specifies the remediation strategy. The remediation strategy can be either ResourceDeletion or OutOfServiceTaint.
3.5. Troubleshooting the Fence Agents Remediation Operator
3.5.1. General troubleshooting
- Issue: You want to troubleshoot issues with the Fence Agents Remediation Operator.
- Resolution: Check the Operator logs.

      $ oc logs <fence-agents-remediation-controller-manager-name> -c manager -n <namespace-name>
3.5.2. Unsuccessful remediation
- Issue: An unhealthy node was not remediated.
- Resolution: Verify that the FenceAgentsRemediation CR was created by running the following command:

      $ oc get far -A

  If the NodeHealthCheck controller did not create the FenceAgentsRemediation CR when the node turned unhealthy, check the logs of the NodeHealthCheck controller. Additionally, ensure that the NodeHealthCheck CR includes the required specification to use the remediation template.

  If the FenceAgentsRemediation CR was created, ensure that its name matches the unhealthy node object.
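As an illustration of the specification that the NodeHealthCheck CR needs in order to use the remediation template, the sketch below shows a NodeHealthCheck CR that points at a FenceAgentsRemediationTemplate. It is not taken from this chapter. The resource names, the selector, and the thresholds are hypothetical, and the API groups are recalled from the upstream projects, so compare it with the Node Health Check documentation for your version.

```yaml
# Sketch only. Names, selector, and durations are illustrative assumptions.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample                # hypothetical name
spec:
  minHealthy: 51%                             # skip remediation if fewer nodes are healthy
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
  remediationTemplate:                        # must reference an existing template CR
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    name: fence-agents-remediation-template-fence-ipmilan   # hypothetical template name
    namespace: openshift-workload-availability
```

If the kind, name, or namespace under remediationTemplate does not match an existing template, NHC has nothing to instantiate, which is one plausible reason for a missing FenceAgentsRemediation CR.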
3.5.3. Fence Agents Remediation Operator resources exist after uninstalling the Operator
- Issue: The Fence Agents Remediation Operator resources, such as the remediation CR and the remediation template CR, exist after uninstalling the Operator.
- Resolution: To remove the Fence Agents Remediation Operator resources, select the "Delete all operand instances for this operator" checkbox before uninstalling. This checkbox is available only in Red Hat OpenShift version 4.13 and later. For all versions of Red Hat OpenShift, you can delete the resources by running the relevant command for each resource type:

      $ oc delete far <fence-agents-remediation> -n <namespace>

      $ oc delete fartemplate <fence-agents-remediation-template> -n <namespace>

  The remediation CR far must be created and deleted by the same entity, for example, NHC. If the remediation CR far is still present, it is deleted together with the FAR operator.

  The remediation template CR fartemplate only exists if you use FAR with NHC. When the FAR operator is deleted using the web console, the remediation template CR fartemplate is also deleted.
3.6. Gathering data about the Fence Agents Remediation Operator
				To collect debugging information about the Fence Agents Remediation Operator, use the must-gather tool. For information about the must-gather image for the Fence Agents Remediation Operator, see Gathering data about specific features.
			
3.7. Agents supported by the Fence Agents Remediation Operator
This section describes the agents currently supported by the Fence Agents Remediation Operator.
Most of the supported agents can be grouped by the node's hardware vendor and usage, as follows:
- BareMetal
- Virtualization
- Intel
- HP
- IBM
- VMware
- Cisco
- APC
- Dell
- Other

BareMetal agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with Out-of-Band controllers that support Redfish APIs. |
| fence_ipmilan [a] | An I/O Fencing agent that can be used with machines controlled by IPMI. |

[a] This description also applies to the agents fence_ilo3, fence_ilo4, fence_ilo5, fence_imm, fence_idrac, and fence_ipmilanplus.


Virtualization agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with RHEV-M REST API to fence virtual machines. |
| | An I/O Fencing agent that can be used with virtual machines. |

[a] This description also applies to the agent fence_xvm.


Intel agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with Intel AMT (WS). |
| | An I/O Fencing agent that can be used with Intel Modular device (tested on Intel MFSYS25, should also work with MFSYS35). |


HP agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used for HP servers with the Integrated Lights-Out (iLO) PCI card. |
| | A fencing agent that can be used to connect to an iLO device. It logs in to the device via SSH and reboots a specified outlet. |
| | An I/O Fencing agent that can be used with HP Moonshot iLO. |
| | An I/O Fencing agent that can be used with HP iLO MP. |
| | An I/O Fencing agent that can be used with HP BladeSystem and HP Integrity Superdome X. |

[a] This description also applies to the agent fence_ilo2.
[b] This description also applies to the agents fence_ilo3_ssh, fence_ilo4_ssh, and fence_ilo5_ssh.


IBM agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with IBM BladeCenters with recent enough firmware that includes telnet support. |
| | An I/O Fencing agent that can be used with IBM BladeCenter chassis. |
| | An I/O Fencing agent that can be used with the IBM iPDU network power switch. |
| | An I/O Fencing agent that can be used with the IBM RSA II management interface. |


VMware agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with VMware API to fence virtual machines. |
| | An I/O Fencing agent that can be used with the virtual machines managed by VMware products that have SOAP API v4.1+. |


Cisco agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with any SNMP-enabled Cisco MDS 9000 series device. |
| | An I/O Fencing agent that can be used with Cisco UCS to fence machines. |


APC agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with the APC network power switch. |
| | An I/O Fencing agent that can be used with the APC network power switch or Tripplite PDU devices. |


Dell agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with the Dell Remote Access Card v5 or CMC (DRAC). |


Other agents

| Agent | Description |
|---|---|
| | An I/O Fencing agent that can be used with Brocade FC switches. |
| | A resource that can be used to tell Nova that compute nodes are down and to reschedule flagged instances. |
| | An I/O Fencing agent that can be used with the Eaton network power switch. |
| | An I/O Fencing agent that can be used with MPX and MPH2 managed rack PDU. |
| | An I/O Fencing agent that can be used with the ePowerSwitch 8M+ power switch to fence connected machines. |
| | A resource that can be used to reschedule flagged instances. |
| | A resource that can be used with ping-heuristics to control execution of another fence agent on the same fencing level. |
| | An I/O Fencing agent that can be used with any SNMP IF-MIB capable device. |
| | An I/O Fencing agent that can be used with the kdump crash recovery service. |
| | An I/O Fencing agent that can be used with SCSI-3 persistent reservations to control access to multipath devices. |
| | An I/O Fencing agent that can be used with the Fujitsu-Siemens RSB management interface. |
| | An I/O Fencing agent that can be used in environments where sbd can be used (shared storage). |
| | An I/O Fencing agent that can be used with SCSI-3 persistent reservations to control access to shared storage devices. |
| | An I/O Fencing agent that can be used with the WTI Network Power Switch (NPS). |