Chapter 3. Recommended cluster scaling practices
The guidance in this section is only relevant for installations with cloud provider integration.
Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
3.1. Recommended practices for scaling the cluster
When scaling up the cluster to higher node counts:
- Spread nodes across all of the available zones for higher availability.
- Scale up by no more than 25 to 50 machines at once.
- Consider creating new machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.
Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
The controller might not be able to create the machines if the replicas in the machine sets are set to higher numbers all at one time. The number of requests the cloud platform, which OpenShift Container Platform is deployed on top of, is able to handle impacts the process. The controller will start to query more while trying to create, check, and update the machines with the status. The cloud platform on which OpenShift Container Platform is deployed has API request limits and excessive queries might lead to machine creation failures due to cloud platform limitations.
Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.
When scaling large and dense clusters to lower node counts, it might take large amounts of time as the process involves draining or evicting the objects running on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client QPS and burst rates are currently set to 5
and 10
respectively and they cannot be modified in OpenShift Container Platform.
3.2. Modifying a machine set
To make changes to a machine set, edit the MachineSet
YAML. Then, remove all machines associated with the machine set by deleting each machine or scaling down the machine set to 0
replicas. Then, scale the replicas back to the desired number. Changes you make to a machine set do not affect existing machines.
If you need to scale a machine set without making other changes, you do not need to delete the machines.
By default, the OpenShift Container Platform router pods are deployed on workers. Because the router is required to access some cluster resources, including the web console, do not scale the worker machine set to 0
unless you first relocate the router pods.
Prerequisites
-
Install an OpenShift Container Platform cluster and the
oc
command line. -
Log in to
oc
as a user withcluster-admin
permission.
Procedure
Edit the machine set:
$ oc edit machineset <machineset> -n openshift-machine-api
Scale down the machine set to
0
:$ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api
Or:
$ oc edit machineset <machineset> -n openshift-machine-api
Wait for the machines to be removed.
Scale up the machine set as needed:
$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
Or:
$ oc edit machineset <machineset> -n openshift-machine-api
Wait for the machines to start. The new machines contain changes you made to the machine set.
3.3. About machine health checks
You can define conditions under which machines in a cluster are considered unhealthy by using a MachineHealthCheck
resource. Machines matching the conditions are automatically remediated.
To monitor machine health, create a MachineHealthCheck
custom resource (CR) that includes a label for the set of machines to monitor and a condition to check, such as staying in the NotReady
status for 15 minutes or displaying a permanent condition in the node-problem-detector.
The controller that observes a MachineHealthCheck
CR checks for the condition that you defined. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted
event.
For machines with the master role, the machine health check reports the number of unhealthy nodes, but the machine is not deleted. For example:
Example output
$ oc get machinehealthcheck example -n openshift-machine-api
NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY example 40% 3 1
To limit the disruptive impact of machine deletions, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy
threshold allows for in the targeted pool of machines, the controller stops deleting machines and you must manually intervene.
To stop the check, remove the custom resource.
3.3.1. MachineHealthChecks on Bare Metal
Machine deletion on bare metal cluster triggers reprovisioning of a bare metal host. Usually bare metal reprovisioning is a lengthy process, during which the cluster is missing compute resources and applications might be interrupted. To change the default remediation process from machine deletion to host power-cycle, annotate the MachineHealthCheck resource with the machine.openshift.io/remediation-strategy: external-baremetal
annotation.
After you set the annotation, unhealthy machines are power-cycled by using BMC credentials.
3.3.2. Limitations when deploying machine health checks
There are limitations to consider before deploying a machine health check:
- Only machines owned by a machine set are remediated by a machine health check.
- Control plane machines are not currently supported and are not remediated if they are unhealthy.
- If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
-
If the corresponding node for a machine does not join the cluster after the
nodeStartupTimeout
, the machine is remediated. -
A machine is remediated immediately if the
Machine
resource phase isFailed
.
3.4. Sample MachineHealthCheck resource
The MachineHealthCheck
resource resembles one of the following YAML files:
MachineHealthCheck
for bare metal
apiVersion: machine.openshift.io/v1beta1 kind: MachineHealthCheck metadata: name: example 1 namespace: openshift-machine-api annotations: machine.openshift.io/remediation-strategy: external-baremetal 2 spec: selector: matchLabels: machine.openshift.io/cluster-api-machine-role: <role> 3 machine.openshift.io/cluster-api-machine-type: <role> 4 machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> 5 unhealthyConditions: - type: "Ready" timeout: "300s" 6 status: "False" - type: "Ready" timeout: "300s" 7 status: "Unknown" maxUnhealthy: "40%" 8 nodeStartupTimeout: "10m" 9
- 1
- Specify the name of the machine health check to deploy.
- 2
- For bare metal clusters, you must include the
machine.openshift.io/remediation-strategy: external-baremetal
annotation in theannotations
section to enable power-cycle remediation. With this remediation strategy, unhealthy hosts are rebooted instead of removed from the cluster. - 3 4
- Specify a label for the machine pool that you want to check.
- 5
- Specify the machine set to track in
<cluster_name>-<label>-<zone>
format. For example,prod-node-us-east-1a
. - 6 7
- Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
- 8
- Specify the amount of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer.
- 9
- Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
The matchLabels
are examples only; you must map your machine groups based on your specific needs.
MachineHealthCheck
for all other installation types
apiVersion: machine.openshift.io/v1beta1 kind: MachineHealthCheck metadata: name: example 1 namespace: openshift-machine-api spec: selector: matchLabels: machine.openshift.io/cluster-api-machine-role: <role> 2 machine.openshift.io/cluster-api-machine-type: <role> 3 machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> 4 unhealthyConditions: - type: "Ready" timeout: "300s" 5 status: "False" - type: "Ready" timeout: "300s" 6 status: "Unknown" maxUnhealthy: "40%" 7 nodeStartupTimeout: "10m" 8
- 1
- Specify the name of the machine health check to deploy.
- 2 3
- Specify a label for the machine pool that you want to check.
- 4
- Specify the machine set to track in
<cluster_name>-<label>-<zone>
format. For example,prod-node-us-east-1a
. - 5 6
- Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
- 7
- Specify the amount of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer.
- 8
- Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
The matchLabels
are examples only; you must map your machine groups based on your specific needs.
3.4.1. Short-circuiting machine health check remediation
Short circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the maxUnhealthy
field in the MachineHealthCheck
resource.
If the user defines a value for the maxUnhealthy
field, before remediating any machines, the MachineHealthCheck
compares the value of maxUnhealthy
with the number of machines within its target pool that it has determined to be unhealthy. Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy
limit.
If maxUnhealthy
is not set, the value defaults to 100%
and the machines are remediated regardless of the state of the cluster.
The maxUnhealthy
field can be set as either an integer or percentage. There are different remediation implementations depending on the maxUnhealthy
value.
3.4.1.1. Setting maxUnhealthy
by using an absolute value
If maxUnhealthy
is set to 2
:
- Remediation will be performed if 2 or fewer nodes are unhealthy
- Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
3.4.1.2. Setting maxUnhealthy
by using percentages
If maxUnhealthy
is set to 40%
and there are 25 machines being checked:
- Remediation will be performed if 10 or fewer nodes are unhealthy
- Remediation will not be performed if 11 or more nodes are unhealthy
If maxUnhealthy
is set to 40%
and there are 6 machines being checked:
- Remediation will be performed if 2 or fewer nodes are unhealthy
- Remediation will not be performed if 3 or more nodes are unhealthy
The allowed number of machines is rounded down when the percentage of maxUnhealthy
machines that are checked is not a whole number.
3.5. Creating a MachineHealthCheck resource
You can create a MachineHealthCheck
resource for all MachineSets
in your cluster. You should not create a MachineHealthCheck
resource that targets control plane machines.
Prerequisites
-
Install the
oc
command line interface.
Procedure
-
Create a
healthcheck.yml
file that contains the definition of your machine health check. Apply the
healthcheck.yml
file to your cluster:$ oc apply -f healthcheck.yml