Chapter 3. Using Fence Agents Remediation

3.1. About the Fence Agents Remediation Operator

The Fence Agents Remediation (FAR) Operator uses external tools to fence unhealthy nodes. These tools are a set of fence agents. Each fence agent targets a particular environment and fences a node by using a traditional Application Programming Interface (API) call that reboots the node. By doing so, FAR can minimize downtime for stateful applications, restore compute capacity if transient failures occur, and increase the availability of workloads.

FAR not only fences a node when it becomes unhealthy, but also tries to return the node to a healthy state. It adds a taint to evict stateless pods, fences the node with a fence agent, and, after a reboot, completes the remediation by deleting resources to remove any remaining workloads (mostly stateful workloads). Adding the taint and deleting the workloads accelerates workload rescheduling.

The Operator watches for new or deleted custom resources (CRs) called FenceAgentsRemediation, which trigger a fence agent to remediate a node, based on the CR’s name. FAR uses the NodeHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the NodeHealthCheck resource creates the FenceAgentsRemediation CR, based on the FenceAgentsRemediationTemplate CR, which then triggers the Fence Agents Remediation Operator.

FAR uses a fence agent to fence a Kubernetes node. Generally, fencing is the process of bringing an unresponsive or unhealthy computer into a safe state and isolating it. A fence agent is software that uses a management interface to perform fencing, mostly power-based fencing, which enables power-cycling, resetting, or turning off the computer. An example fence agent is fence_ipmilan, which is used for Intelligent Platform Management Interface (IPMI) environments.

The FenceAgentsRemediation CR resembles the following YAML file:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: node-name  # 1
  namespace: openshift-workload-availability
spec:

1: The node-name should match the name of the unhealthy cluster node.

The Operator includes a set of fence agents that are also available in the Red Hat High Availability Add-On. These agents use a management interface, such as IPMI or an API, to provision or reboot a node on bare-metal servers, virtual machines, and cloud platforms.
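
To make the empty spec in the CR above more concrete, the following is a minimal sketch of what a populated FenceAgentsRemediation CR might look like for a single node. It assumes that the CR accepts the same agent, nodeparameters, and sharedparameters fields that the FenceAgentsRemediationTemplate example later in this chapter uses. The node name, port, address, and credentials are illustrative values borrowed from that example, not required settings.

```yaml
# Illustrative sketch only: field names are assumed to mirror the
# FenceAgentsRemediationTemplate example in section 3.4.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: worker-0-0                 # must match the unhealthy node's name
  namespace: openshift-workload-availability
spec:
  agent: fence_ipmilan             # fence agent to run
  nodeparameters:
    '--ipport':
      worker-0-0: '6233'           # per-node value (example)
  sharedparameters:
    '--action': reboot
    '--ip': 192.168.123.1          # example management address
    '--lanplus': ''
    '--username': admin            # example credentials
    '--password': password
```

In normal operation the NodeHealthCheck controller creates this CR from a template, so writing one by hand is mainly useful for understanding what the Operator reacts to.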

3.2. Installing the Fence Agents Remediation Operator by using the web console

You can use the Red Hat OpenShift web console to install the Fence Agents Remediation Operator.

Prerequisites

Log in as a user with cluster-admin privileges.

Procedure

In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
Select the Fence Agents Remediation Operator, or FAR, from the list of available Operators, and then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-workload-availability namespace.
Click Install.

Verification

To confirm that the installation is successful:

Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the openshift-workload-availability namespace and its status is Succeeded.

If the Operator is not installed successfully:

Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the log of the fence-agents-remediation-controller-manager pod for any reported issues.

3.3. Installing the Fence Agents Remediation Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Fence Agents Remediation Operator.

You can install the Fence Agents Remediation Operator in your own namespace or in the openshift-workload-availability namespace.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a Namespace custom resource (CR) for the Fence Agents Remediation Operator:

Define the Namespace CR and save the YAML file, for example, workload-availability-namespace.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-workload-availability

To create the Namespace CR, run the following command:

$ oc create -f workload-availability-namespace.yaml

Create an OperatorGroup CR:

Define the OperatorGroup CR and save the YAML file, for example, workload-availability-operator-group.yaml:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: workload-availability-operator-group
  namespace: openshift-workload-availability

To create the OperatorGroup CR, run the following command:

$ oc create -f workload-availability-operator-group.yaml

Create a Subscription CR:

Define the Subscription CR and save the YAML file, for example, fence-agents-remediation-subscription.yaml:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
    name: fence-agents-remediation-subscription
    namespace: openshift-workload-availability  # 1
spec:
    channel: stable
    name: fence-agents-remediation
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    package: fence-agents-remediation

1: Specify the namespace where you want to install the Fence Agents Remediation Operator, for example, the openshift-workload-availability namespace created earlier in this procedure. You can create the Subscription CR in the openshift-workload-availability namespace because it already contains a matching OperatorGroup CR.

To create the Subscription CR, run the following command:

$ oc create -f fence-agents-remediation-subscription.yaml
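
The procedure above uses the openshift-workload-availability namespace throughout. For installing in your own namespace instead, the three CRs could be combined into one file along the following lines. This is a sketch under stated assumptions: the namespace name example-far and the OperatorGroup name are hypothetical placeholders, and only the channel, package name, and catalog source values are taken from the Subscription shown earlier.

```yaml
# Hypothetical names: "example-far" and "example-far-operator-group"
# are placeholders, not values required by the Operator.
apiVersion: v1
kind: Namespace
metadata:
  name: example-far
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: example-far-operator-group
  namespace: example-far           # same namespace as the Subscription
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: fence-agents-remediation-subscription
  namespace: example-far
spec:
  channel: stable
  name: fence-agents-remediation   # package name in the catalog
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```

If you use a file like this, pass your own namespace to the -n option in the verification commands that follow.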

Verification

Verify that the installation succeeded by inspecting the CSV resource:

$ oc get csv -n openshift-workload-availability

Example output

NAME                              DISPLAY                             VERSION   REPLACES                          PHASE
fence-agents-remediation.v0.3.0   Fence Agents Remediation Operator   0.3.0     fence-agents-remediation.v0.2.1   Succeeded

Verify that the Fence Agents Remediation Operator is up and running:

$ oc get deployment -n openshift-workload-availability

Example output

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
fence-agents-remediation-controller-manager   2/2     2            2           110m

3.4. Configuring the Fence Agents Remediation Operator

You can use the Fence Agents Remediation Operator to create the FenceAgentsRemediationTemplate custom resource (CR), which is used by the Node Health Check Operator (NHC). This CR defines the fence agent to be used in the cluster, together with all the parameters required for remediating the nodes. When NHC is used, it can select a FenceAgentsRemediationTemplate CR as the remediationTemplate for power-cycling the node.

Note

In the current release, there might be many FenceAgentsRemediationTemplate CRs, but at most one for each fence agent. This is a known limitation that will be addressed in a future release.

The FenceAgentsRemediationTemplate CR resembles the following YAML file:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-template-fence-ipmilan
  namespace: openshift-workload-availability
spec:
  template:
    spec:
      agent: fence_ipmilan  # 1
      nodeparameters:  # 2
        '--ipport':
          master-0-0: '6230'
          master-0-1: '6231'
          master-0-2: '6232'
          worker-0-0: '6233'
          worker-0-1: '6234'
          worker-0-2: '6235'
      sharedparameters:  # 3
        '--action': reboot
        '--ip': 192.168.123.1
        '--lanplus': ''
        '--password': password
        '--username': admin
      retryCount: '5'  # 4
      retryInterval: '5s'  # 5
      timeout: '60s'  # 6

1: Specifies the name of the fence agent to be executed, for example, fence_ipmilan.
2: Specifies the node-specific parameters for executing the fence agent, for example, ipport.
3: Specifies the cluster-wide parameters for executing the fence agent, for example, username.
4: Specifies the number of times to retry the fence agent command in case of failure. The default number of attempts is 5.
5: Specifies the interval between retries in seconds. The default is 5 seconds.
6: Specifies the timeout for the fence agent command. The default is 60 seconds. For values of 60 seconds or greater, the timeout value is expressed in both minutes and seconds in the YAML file.
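
To show how NHC can select this template as its remediationTemplate, the following is a minimal sketch of a NodeHealthCheck CR that references the fence_ipmilan template above. The CR name, the worker-node selector, the minHealthy threshold, and the unhealthy-condition durations are assumptions chosen for illustration. Only the remediationTemplate reference is derived from the example in this section.

```yaml
# Illustrative NodeHealthCheck CR; name, selector, and thresholds are
# assumptions and should be adapted to your cluster.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-far-example    # hypothetical name
spec:
  minHealthy: 51%                      # assumed threshold
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  remediationTemplate:                 # points at the template above
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    name: fence-agents-remediation-template-fence-ipmilan
    namespace: openshift-workload-availability
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s                   # assumed duration
    - type: Ready
      status: Unknown
      duration: 300s
```

The name and namespace under remediationTemplate must match the metadata of the FenceAgentsRemediationTemplate CR. Otherwise NHC has nothing to base the FenceAgentsRemediation CR on.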

3.4.1. Understanding the Fence Agents Remediation Template configuration

The Fence Agents Remediation Operator also creates the FenceAgentsRemediationTemplate Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes, which aims to recover workloads faster. The following remediation strategies are available:

ResourceDeletion: This remediation strategy removes the pods on the node.
OutOfServiceTaint: This remediation strategy implicitly causes the removal of the pods and associated volume attachments on the node. It achieves this by placing the OutOfServiceTaint taint on the node. The OutOfServiceTaint strategy also represents a non-graceful node shutdown. A non-graceful node shutdown occurs when a node shuts down without the shutdown being detected, rather than through a shutdown triggered from within the operating system. This strategy has been supported as a Technology Preview feature since OpenShift Container Platform version 4.13, and has been generally available since OpenShift Container Platform version 4.15.

The FenceAgentsRemediationTemplate CR resembles the following YAML file:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-<remediation_object>-deletion-template  # 1
  namespace: openshift-workload-availability
spec:
  template:
    spec:
      remediationStrategy: <remediation_strategy>  # 2

1: Specifies the type of remediation template based on the remediation strategy. Replace <remediation_object> with either resource or taint; for example, fence-agents-remediation-resource-deletion-template.
2: Specifies the remediation strategy. The remediation strategy can be either ResourceDeletion or OutOfServiceTaint.
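
As an illustration of the two placeholders, the following sketch fills them in for the OutOfServiceTaint strategy: <remediation_object> becomes taint and <remediation_strategy> becomes OutOfServiceTaint. A complete template would presumably also carry the agent and its parameters, as in the fence_ipmilan example in section 3.4. They are left out here to match the placeholder form above.

```yaml
# Placeholder form above with both values substituted for the
# OutOfServiceTaint strategy; agent settings intentionally omitted.
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fence-agents-remediation-taint-deletion-template
  namespace: openshift-workload-availability
spec:
  template:
    spec:
      remediationStrategy: OutOfServiceTaint
```

For the ResourceDeletion strategy, the corresponding name is fence-agents-remediation-resource-deletion-template with remediationStrategy set to ResourceDeletion.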

3.5. Troubleshooting the Fence Agents Remediation Operator

3.5.1. General troubleshooting

Issue

You want to troubleshoot issues with the Fence Agents Remediation Operator.

Resolution

Check the Operator logs.

$ oc logs <fence-agents-remediation-controller-manager-name> -c manager -n <namespace-name>

3.5.2. Unsuccessful remediation

Issue

An unhealthy node was not remediated.

Resolution

Verify that the FenceAgentsRemediation CR was created by running the following command:

$ oc get far -A

If the NodeHealthCheck controller did not create the FenceAgentsRemediation CR when the node turned unhealthy, check the logs of the NodeHealthCheck controller. Additionally, ensure that the NodeHealthCheck CR includes the required specification to use the remediation template.

If the FenceAgentsRemediation CR was created, ensure that its name matches the name of the unhealthy node object.

3.5.3. Fence Agents Remediation Operator resources exist after uninstalling the Operator

Issue

The Fence Agents Remediation Operator resources, such as the remediation CR and the remediation template CR, exist after uninstalling the Operator.

Resolution

To remove the Fence Agents Remediation Operator resources, select the "Delete all operand instances for this operator" checkbox before uninstalling. This checkbox is available only in Red Hat OpenShift version 4.13 and later. For all versions of Red Hat OpenShift, you can delete the resources by running the relevant command for each resource type:

$ oc delete far <fence-agents-remediation> -n <namespace>

$ oc delete fartemplate <fence-agents-remediation-template> -n <namespace>

The far remediation CR must be created and deleted by the same entity, for example, NHC. If the far remediation CR is still present, it is deleted together with the FAR Operator.

The fartemplate remediation template CR exists only if you use FAR with NHC. When the FAR Operator is deleted by using the web console, the fartemplate remediation template CR is also deleted.

3.6. Gathering data about the Fence Agents Remediation Operator

To collect debugging information about the Fence Agents Remediation Operator, use the must-gather tool. For information about the must-gather image for the Fence Agents Remediation Operator, see Gathering data about specific features.
