Chapter 10. Nodes
10.1. Node maintenance
Nodes can be placed into maintenance mode by using the oc adm utility or NodeMaintenance custom resources (CRs).
The node-maintenance-operator (NMO) is no longer shipped with OpenShift Virtualization. It is deployed as a standalone Operator from the OperatorHub in the Red Hat OpenShift Service on AWS web console or by using the OpenShift CLI (oc).
For more information on remediation, fencing, and maintaining nodes, see the Workload Availability for Red Hat OpenShift documentation.
Virtual machines (VMs) must have a persistent volume claim (PVC) with a shared ReadWriteMany (RWX) access mode to be live migrated.
The Node Maintenance Operator watches for new or deleted NodeMaintenance CRs. When a new NodeMaintenance CR is detected, no new workloads are scheduled and the node is cordoned off from the rest of the cluster. All pods that can be evicted are evicted from the node. When a NodeMaintenance CR is deleted, the node that is referenced in the CR is made available for new workloads.
Using a NodeMaintenance CR for node maintenance tasks achieves the same results as the oc adm cordon and oc adm drain commands, but does so through standard Red Hat OpenShift Service on AWS custom resource processing.
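For comparison, the manual route that a NodeMaintenance CR stands in for can be sketched as a pair of command lines. This is a minimal sketch, not the Operator's implementation: the helper name maintenance_commands is hypothetical, and the drain flags shown (--ignore-daemonsets, --delete-emptydir-data) are assumed choices for illustration, not requirements stated in this section.

```python
def maintenance_commands(node_name, enter=True):
    """Return oc command lines (as argv lists) for entering or leaving maintenance.

    Hypothetical helper for illustration only. It builds the commands
    but does not run them.
    """
    if enter:
        return [
            # Mark the node unschedulable so that no new workloads land on it.
            ["oc", "adm", "cordon", node_name],
            # Evict the pods that can be evicted. The two flags are assumed
            # defaults for this sketch; adjust them to your cluster policy.
            ["oc", "adm", "drain", node_name,
             "--ignore-daemonsets", "--delete-emptydir-data"],
        ]
    # Leaving maintenance: make the node schedulable again.
    return [["oc", "adm", "uncordon", node_name]]


if __name__ == "__main__":
    for argv in maintenance_commands("<node_name>"):
        print(" ".join(argv))
```

Creating a NodeMaintenance CR corresponds to the enter=True commands, and deleting the CR corresponds to the uncordon step.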
10.1.1. Eviction strategies
Placing a node into maintenance marks the node as unschedulable and drains all the VMs and pods from it.
You can configure eviction strategies for virtual machines (VMs) or for the cluster.
- VM eviction strategy
The VM LiveMigrate eviction strategy ensures that a virtual machine instance (VMI) is not interrupted if the node is placed into maintenance or drained. VMIs with this eviction strategy are live migrated to another node.
You can configure eviction strategies for virtual machines (VMs) by using the OpenShift Virtualization web console or the command line.
Important
The default eviction strategy is LiveMigrate. A non-migratable VM with a LiveMigrate eviction strategy might prevent nodes from draining or block an infrastructure upgrade because the VM is not evicted from the node. This situation causes a migration to remain in a Pending or Scheduling state unless you shut down the VM manually.
You must set the eviction strategy of non-migratable VMs to LiveMigrateIfPossible, which does not block an upgrade, or to None, for VMs that should not be migrated.
10.1.1.1. Configuring a VM eviction strategy using the command line
You can configure an eviction strategy for a virtual machine (VM) by using the command line.
The default eviction strategy is LiveMigrate. A non-migratable VM with a LiveMigrate eviction strategy might prevent nodes from draining or block an infrastructure upgrade because the VM is not evicted from the node. This situation causes a migration to remain in a Pending or Scheduling state unless you shut down the VM manually.
You must set the eviction strategy of non-migratable VMs to LiveMigrateIfPossible, which does not block an upgrade, or to None, for VMs that should not be migrated.
Procedure
Edit the VirtualMachine resource by running the following command:
    $ oc edit vm <vm_name> -n <namespace>
Example eviction strategy
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: <vm_name>
    spec:
      template:
        spec:
          evictionStrategy: LiveMigrateIfPossible # 1
    # ...
- 1: Specify the eviction strategy. The default value is LiveMigrate.
Restart the VM to apply the changes:
$ virtctl restart <vm_name> -n <namespace>
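If you prefer a non-interactive change over oc edit, the same field can be set with a JSON merge patch. The following is a minimal sketch under stated assumptions: the helper names eviction_strategy_patch and patch_command are hypothetical, and the list of accepted values is limited to the three strategies named in this section. The restart step above still applies after patching.

```python
import json

# Values named in this section. The API might accept others.
EVICTION_STRATEGIES = ("LiveMigrate", "LiveMigrateIfPossible", "None")


def eviction_strategy_patch(strategy):
    """Build a JSON merge patch that sets spec.template.spec.evictionStrategy."""
    if strategy not in EVICTION_STRATEGIES:
        raise ValueError(f"unknown eviction strategy: {strategy!r}")
    return {"spec": {"template": {"spec": {"evictionStrategy": strategy}}}}


def patch_command(vm_name, namespace, strategy):
    """Return the argv for `oc patch` as an alternative to `oc edit vm`.

    Hypothetical helper: it only builds the command and does not run it.
    """
    patch = json.dumps(eviction_strategy_patch(strategy))
    return ["oc", "patch", "vm", vm_name, "-n", namespace,
            "--type", "merge", "-p", patch]


if __name__ == "__main__":
    print(patch_command("<vm_name>", "<namespace>", "LiveMigrateIfPossible"))
```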
10.1.2. Run strategies
A virtual machine (VM) configured with spec.running: true is immediately restarted. The spec.runStrategy key provides greater flexibility for determining how a VM behaves under certain conditions.
The spec.runStrategy and spec.running keys are mutually exclusive. Only one of them can be used.
A VM configuration with both keys is invalid.
10.1.2.1. Run strategies
The spec.runStrategy key has four possible values:
Always
- The virtual machine instance (VMI) is always present when a virtual machine (VM) is created on another node. A new VMI is created if the original stops for any reason. This is the same behavior as running: true.
RerunOnFailure
- The VMI is re-created on another node if the previous instance fails. The instance is not re-created if the VM stops successfully, such as when it is shut down.
Manual
- You control the VMI state manually with the start, stop, and restart virtctl client commands. The VM is not automatically restarted.
Halted
- No VMI is present when a VM is created. This is the same behavior as running: false.
Different combinations of the virtctl start, stop, and restart commands affect the run strategy.
The following table describes a VM’s transition between states. The first column shows the VM’s initial run strategy. The remaining columns show a virtctl command and the new run strategy after that command is run.
Initial run strategy | Start | Stop | Restart |
---|---|---|---|
Always | - | Halted | Always |
RerunOnFailure | - | Halted | RerunOnFailure |
Manual | Manual | Manual | Manual |
Halted | Always | - | - |
If a node in a cluster installed by using installer-provisioned infrastructure fails the machine health check and is unavailable, VMs with runStrategy: Always or runStrategy: RerunOnFailure are rescheduled on a new node.
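The transition table above can be read as a simple lookup. The following sketch encodes the table as printed; the names TRANSITIONS and next_run_strategy are hypothetical, and the "-" cells are represented as None, which this sketch treats as "no transition listed".

```python
# TRANSITIONS[initial][command] is the run strategy after the virtctl
# command, or None where the table shows "-".
TRANSITIONS = {
    "Always": {"start": None, "stop": "Halted", "restart": "Always"},
    "RerunOnFailure": {"start": None, "stop": "Halted", "restart": "RerunOnFailure"},
    "Manual": {"start": "Manual", "stop": "Manual", "restart": "Manual"},
    "Halted": {"start": "Always", "stop": None, "restart": None},
}


def next_run_strategy(initial, command):
    """Look up the run strategy that results from a virtctl command.

    Raises KeyError for a run strategy or command that is not in the table.
    """
    return TRANSITIONS[initial][command]


if __name__ == "__main__":
    for initial, row in TRANSITIONS.items():
        for command, result in row.items():
            print(f"{initial:15} {command:8} -> {result or '-'}")
```

For example, stopping a VM whose run strategy is Always leaves it at Halted, and starting a Halted VM moves it to Always.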
10.1.2.2. Configuring a VM run strategy by using the command line
You can configure a run strategy for a virtual machine (VM) by using the command line.
The spec.runStrategy and spec.running keys are mutually exclusive. A VM configuration that contains values for both keys is invalid.
Procedure
Edit the VirtualMachine resource by running the following command:
    $ oc edit vm <vm_name> -n <namespace>
Example run strategy
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    spec:
      runStrategy: Always
    # ...
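Before applying a change, you might want to check a manifest against the two rules stated in this section: the keys are mutually exclusive, and spec.runStrategy takes one of four values. The following is a minimal sketch of such a check on a manifest that is already loaded as a Python dict. The function name check_run_config is hypothetical, and this is not how the API server validates the resource.

```python
RUN_STRATEGIES = {"Always", "RerunOnFailure", "Manual", "Halted"}


def check_run_config(vm):
    """Return a list of problems with the run settings in a VM manifest dict."""
    spec = vm.get("spec", {})
    problems = []
    # Rule 1: spec.running and spec.runStrategy cannot both be set.
    if "running" in spec and "runStrategy" in spec:
        problems.append("spec.running and spec.runStrategy are mutually exclusive")
    # Rule 2: spec.runStrategy must be one of the four documented values.
    strategy = spec.get("runStrategy")
    if strategy is not None and strategy not in RUN_STRATEGIES:
        problems.append(f"unknown runStrategy: {strategy!r}")
    return problems


if __name__ == "__main__":
    manifest = {"kind": "VirtualMachine",
                "spec": {"running": True, "runStrategy": "Always"}}
    print(check_run_config(manifest))
```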
10.1.3. Maintaining bare metal nodes
When you deploy Red Hat OpenShift Service on AWS on bare metal infrastructure, you must take additional considerations into account compared to deploying on cloud infrastructure. Unlike cloud environments, where cluster nodes are considered ephemeral, re-provisioning a bare metal node requires significantly more time and effort for maintenance tasks.
When a bare metal node fails, for example, because of a fatal kernel error or a NIC hardware failure, workloads on the failed node need to be restarted elsewhere on the cluster while the problem node is repaired or replaced. Node maintenance mode allows cluster administrators to gracefully power down nodes, moving workloads to other parts of the cluster and ensuring that workloads are not interrupted. Detailed progress and node status information are provided during maintenance.
10.2. Managing node labeling for obsolete CPU models
You can schedule a virtual machine (VM) on a node as long as the VM CPU model and policy are supported by the node.
10.2.1. About node labeling for obsolete CPU models
The OpenShift Virtualization Operator uses a predefined list of obsolete CPU models to ensure that a node supports only valid CPU models for scheduled VMs.
By default, the following CPU models are eliminated from the list of labels generated for the node:
Example 10.1. Obsolete CPU models
"486" Conroe athlon core2duo coreduo kvm32 kvm64 n270 pentium pentium2 pentium3 pentiumpro phenom qemu32 qemu64
This predefined list is not visible in the HyperConverged CR. You cannot remove CPU models from this list, but you can add to the list by editing the spec.obsoleteCPUs.cpuModels field of the HyperConverged CR.
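The add-only behavior described above amounts to a set union. The following sketch models it with the predefined list from Example 10.1; the function name effective_obsolete_models is hypothetical, and the model name Opteron_G2 in the usage line is only an example value, not a recommendation.

```python
# Predefined list from Example 10.1.
PREDEFINED_OBSOLETE = {
    "486", "Conroe", "athlon", "core2duo", "coreduo", "kvm32", "kvm64",
    "n270", "pentium", "pentium2", "pentium3", "pentiumpro", "phenom",
    "qemu32", "qemu64",
}


def effective_obsolete_models(cpu_models):
    """Union of the predefined list and the spec.obsoleteCPUs.cpuModels entries.

    Entries can only add to the predefined list. Nothing in cpu_models
    can remove a predefined model.
    """
    return PREDEFINED_OBSOLETE | set(cpu_models)


if __name__ == "__main__":
    # "Opteron_G2" is an example value chosen for this sketch.
    print(sorted(effective_obsolete_models(["Opteron_G2"])))
```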
10.2.2. About node labeling for CPU features
Through the process of iteration, the base CPU features in the minimum CPU model are eliminated from the list of labels generated for the node.
For example:
- An environment might have two supported CPU models: Penryn and Haswell.
- If Penryn is specified as the CPU model for minCPU, each base CPU feature for Penryn is compared to the list of CPU features supported by Haswell.
Example 10.2. CPU features supported by Penryn
apic clflush cmov cx16 cx8 de fpu fxsr lahf_lm lm mca mce mmx msr mtrr nx pae pat pge pni pse pse36 sep sse sse2 sse4.1 ssse3 syscall tsc
Example 10.3. CPU features supported by Haswell
aes apic avx avx2 bmi1 bmi2 clflush cmov cx16 cx8 de erms fma fpu fsgsbase fxsr hle invpcid lahf_lm lm mca mce mmx movbe msr mtrr nx pae pat pcid pclmuldq pge pni popcnt pse pse36 rdtscp rtm sep smep sse sse2 sse4.1 sse4.2 ssse3 syscall tsc tsc-deadline x2apic xsave
- If both Penryn and Haswell support a specific CPU feature, a label is not created for that feature. Labels are generated for CPU features that are supported only by Haswell and not by Penryn.
Example 10.4. Node labels created for CPU features after iteration
aes avx avx2 bmi1 bmi2 erms fma fsgsbase hle invpcid movbe pcid pclmuldq popcnt rdtscp rtm sse4.2 tsc-deadline x2apic xsave
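The comparison described in this example behaves like a set difference: features of the node's CPU model minus the base features of the minimum CPU model. The following sketch applies that idea to the two lists as printed in Examples 10.2 and 10.3. The function name feature_labels is hypothetical, and this is a model of the described behavior, not the node-labeller's code. Note that a plain set difference over these lists also yields smep, which Example 10.4 does not list; the sketch does not attempt to explain that difference.

```python
PENRYN = set(
    "apic clflush cmov cx16 cx8 de fpu fxsr lahf_lm lm mca mce mmx msr mtrr "
    "nx pae pat pge pni pse pse36 sep sse sse2 sse4.1 ssse3 syscall tsc".split()
)

HASWELL = set(
    "aes apic avx avx2 bmi1 bmi2 clflush cmov cx16 cx8 de erms fma fpu "
    "fsgsbase fxsr hle invpcid lahf_lm lm mca mce mmx movbe msr mtrr nx pae "
    "pat pcid pclmuldq pge pni popcnt pse pse36 rdtscp rtm sep smep sse sse2 "
    "sse4.1 sse4.2 ssse3 syscall tsc tsc-deadline x2apic xsave".split()
)


def feature_labels(min_cpu_features, node_features):
    """Features that the node supports beyond the minimum model's base set."""
    return sorted(set(node_features) - set(min_cpu_features))


if __name__ == "__main__":
    print(" ".join(feature_labels(PENRYN, HASWELL)))
```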
10.2.3. Configuring obsolete CPU models
You can configure a list of obsolete CPU models by editing the HyperConverged custom resource (CR).
Procedure
Edit the HyperConverged custom resource, specifying the obsolete CPU models in the obsoleteCPUs array. For example:
    apiVersion: hco.kubevirt.io/v1beta1
    kind: HyperConverged
    metadata:
      name: kubevirt-hyperconverged
      namespace: openshift-cnv
    spec:
      obsoleteCPUs:
        cpuModels: # 1
          - "<obsolete_cpu_1>"
          - "<obsolete_cpu_2>"
        minCPUModel: "<minimum_cpu_model>" # 2
- 1: Replace the example values in the cpuModels array with obsolete CPU models. Any value that you specify is added to a predefined list of obsolete CPU models. The predefined list is not visible in the CR.
- 2: Replace this value with the minimum CPU model that you want to use for basic CPU features. If you do not specify a value, Penryn is used by default.
10.3. Preventing node reconciliation
Use the skip-node annotation to prevent the node-labeller from reconciling a node.
10.3.1. Using skip-node annotation
If you want the node-labeller to skip a node, annotate that node by using the oc CLI.
Prerequisites
- You have installed the OpenShift CLI (oc).
Procedure
Annotate the node that you want to skip by running the following command:
    $ oc annotate node <node_name> node-labeller.kubevirt.io/skip-node=true # 1
- 1: Replace <node_name> with the name of the relevant node to skip.
Reconciliation resumes on the next cycle after the node annotation is removed or set to false.
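The skip rule above can be sketched as a check on a node's annotations, together with the command lines for setting and removing the annotation. The names is_skipped and annotate_command are hypothetical. The sketch assumes that only the exact string "true" causes a skip, which matches the statement above but is not confirmed for other spellings. The --overwrite flag is included so that an existing value can be replaced.

```python
SKIP_ANNOTATION = "node-labeller.kubevirt.io/skip-node"


def is_skipped(annotations):
    """True only when the skip-node annotation is present and set to "true".

    Assumption for this sketch: a missing annotation or any other value,
    including "false", means that the node is reconciled.
    """
    return annotations.get(SKIP_ANNOTATION) == "true"


def annotate_command(node_name, skip=True):
    """Return the oc argv that sets (skip=True) or removes the annotation."""
    if skip:
        return ["oc", "annotate", "node", node_name,
                f"{SKIP_ANNOTATION}=true", "--overwrite"]
    # A trailing "-" after the key removes the annotation.
    return ["oc", "annotate", "node", node_name, f"{SKIP_ANNOTATION}-"]


if __name__ == "__main__":
    print(" ".join(annotate_command("<node_name>")))
    print(" ".join(annotate_command("<node_name>", skip=False)))
```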