Chapter 3. Recommended cluster scaling practices
The guidance in this section is only relevant for installations with cloud provider integration.
These guidelines apply to OpenShift Container Platform with software-defined networking (SDN), not Open Virtual Network (OVN).
Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
3.1. Recommended practices for scaling the cluster
When scaling up the cluster to higher node counts:
- Spread nodes across all of the available zones for higher availability.
- Scale up by no more than 25 to 50 machines at once.
- Consider creating new machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.
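As a sketch, an alternative instance type is set in the provider specification of a compute machine set. The following fragment is an assumption-laden illustration for AWS, not a complete manifest: the machine set name, zone, and replica count are placeholders, and only the fields relevant to the instance type are shown.

```yaml
# Hypothetical fragment of an AWS compute machine set that uses an
# alternative instance type of similar size (m5d.large instead of
# m5.large). All names and the availability zone are placeholders.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <cluster_name>-worker-alt-us-east-1a
  namespace: openshift-machine-api
spec:
  replicas: 2
  template:
    spec:
      providerSpec:
        value:
          instanceType: m5d.large   # alternative to m5.large
          placement:
            availabilityZone: us-east-1a
```

Creating one such machine set per zone gives the cluster a fallback pool to scale if the primary instance type hits a capacity constraint in that zone.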
Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
If the replicas in the machine sets are set to higher numbers all at one time, the controller might not be able to create the machines. The process is bounded by the number of requests that the cloud platform on which OpenShift Container Platform is deployed can handle: the controller queries the platform repeatedly while creating, checking, and updating the status of the machines, and the cloud platform enforces API request limits. Excessive queries might therefore lead to machine creation failures due to cloud platform limitations.
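The gradual-scaling advice above can be sketched as a small script. This is a minimal sketch under assumptions, not a documented procedure: the machine set name and target count are placeholders, the batch size of 25 follows the guidance above, and the oc wait condition on status.readyReplicas is one possible readiness check.

```shell
# Sketch: grow a machine set toward a target replica count in batches
# of at most 25, waiting for each batch to become ready before the next.
next_batch() {
  # Echo the next replica count: current + 25, capped at the target.
  local current=$1 target=$2
  local next=$(( current + 25 ))
  if [ "$next" -gt "$target" ]; then next=$target; fi
  echo "$next"
}

MACHINESET="<machine_set_name>"   # placeholder
TARGET=120                        # hypothetical target node count

if command -v oc >/dev/null 2>&1; then
  current=$(oc get machineset "$MACHINESET" -n openshift-machine-api \
    -o jsonpath='{.spec.replicas}')
  while [ "$current" -lt "$TARGET" ]; do
    current=$(next_batch "$current" "$TARGET")
    oc scale --replicas="$current" machineset "$MACHINESET" -n openshift-machine-api
    # Wait until the machine set reports the new machines as ready.
    oc wait machineset "$MACHINESET" -n openshift-machine-api \
      --for=jsonpath='{.status.readyReplicas}'="$current" --timeout=30m
  done
fi
```

Waiting between batches keeps the controller's API request volume under the cloud provider's quota instead of issuing all creation requests at once.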
Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the machine conditions and automatically repair unhealthy machines.
When scaling large and dense clusters to lower node counts, the process might take a large amount of time because it involves draining or evicting the objects running on the terminating nodes in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client QPS and burst rates are currently set to 5 and 10, respectively.
3.2. Modifying a machine set by using the CLI
When you modify a machine set, your changes only apply to machines that are created after you save the updated MachineSet custom resource (CR).
If you need to scale a machine set without making other changes, you do not need to delete the machines.
By default, the OpenShift Container Platform router pods are deployed on machines. Because the router is required to access some cluster resources, including the web console, do not scale the machine set to 0 unless you first relocate the router pods.
Prerequisites
- Your OpenShift Container Platform cluster uses the Machine API.
- You are logged in to the cluster as an administrator by using the OpenShift CLI (oc).
Procedure
Edit the machine set:
$ oc edit machineset <machine_set_name> -n openshift-machine-api

Note the value of the spec.replicas field, as you need it when scaling the machine set to apply the changes.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <machine_set_name>
  namespace: openshift-machine-api
spec:
  replicas: 2 1
# ...

- 1
- The examples in this procedure show a machine set that has a replicas value of 2.
- Update the machine set CR with the configuration options that you want and save your changes.
List the machines that are managed by the updated machine set by running the following command:
$ oc get -n openshift-machine-api machines -l machine.openshift.io/cluster-api-machineset=<machine_set_name>

Example output

NAME                        PHASE     TYPE         REGION      ZONE         AGE
<machine_name_original_1>   Running   m6i.xlarge   us-west-1   us-west-1a   4h
<machine_name_original_2>   Running   m6i.xlarge   us-west-1   us-west-1a   4h

For each machine that is managed by the updated machine set, set the delete annotation by running the following command:

$ oc annotate machine/<machine_name_original_1> \
    -n openshift-machine-api \
    machine.openshift.io/delete-machine="true"

Scale the machine set to twice the number of replicas by running the following command:

$ oc scale --replicas=4 \ 1
    machineset <machine_set_name> \
    -n openshift-machine-api

- 1
- The original example value of 2 is doubled to 4.
List the machines that are managed by the updated machine set by running the following command:
$ oc get -n openshift-machine-api machines -l machine.openshift.io/cluster-api-machineset=<machine_set_name>

Example output

NAME                        PHASE          TYPE         REGION      ZONE         AGE
<machine_name_original_1>   Running        m6i.xlarge   us-west-1   us-west-1a   4h
<machine_name_original_2>   Running        m6i.xlarge   us-west-1   us-west-1a   4h
<machine_name_updated_1>    Provisioned    m6i.xlarge   us-west-1   us-west-1a   55s
<machine_name_updated_2>    Provisioning   m6i.xlarge   us-west-1   us-west-1a   55s

When the new machines are in the Running phase, you can scale the machine set to the original number of replicas.

Scale the machine set to the original number of replicas by running the following command:

$ oc scale --replicas=2 \ 1
    machineset <machine_set_name> \
    -n openshift-machine-api

- 1
- The original example value of 2.
Verification
To verify that the machines without the updated configuration are deleted, list the machines that are managed by the updated machine set by running the following command:
$ oc get -n openshift-machine-api machines -l machine.openshift.io/cluster-api-machineset=<machine_set_name>

Example output while deletion is in progress

NAME                        PHASE      TYPE         REGION      ZONE         AGE
<machine_name_original_1>   Deleting   m6i.xlarge   us-west-1   us-west-1a   4h
<machine_name_original_2>   Deleting   m6i.xlarge   us-west-1   us-west-1a   4h
<machine_name_updated_1>    Running    m6i.xlarge   us-west-1   us-west-1a   5m41s
<machine_name_updated_2>    Running    m6i.xlarge   us-west-1   us-west-1a   5m41s

Example output when deletion is complete

NAME                       PHASE     TYPE         REGION      ZONE         AGE
<machine_name_updated_1>   Running   m6i.xlarge   us-west-1   us-west-1a   6m30s
<machine_name_updated_2>   Running   m6i.xlarge   us-west-1   us-west-1a   6m30s

To verify that a machine created by the updated machine set has the correct configuration, examine the relevant fields in the CR for one of the new machines by running the following command:
$ oc describe machine <machine_name_updated_1> -n openshift-machine-api
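The procedure above can also be sketched end to end as a single script. This is a hedged sketch, not the documented procedure: <machine_set_name> stays a placeholder as in the steps above, the wait for the new machines to reach the Running phase is elided to a comment, and the doubled helper only mirrors the doubling shown in the example.

```shell
# Sketch of the rolling-replacement flow: annotate the original machines
# for deletion, double the replicas so replacement machines are created,
# then scale back so the annotated machines are the ones removed.
doubled() { echo $(( $1 * 2 )); }   # helper: twice the replica count

if command -v oc >/dev/null 2>&1; then
  MACHINESET="<machine_set_name>"   # placeholder, as in the procedure
  NS=openshift-machine-api
  ORIGINAL=$(oc get machineset "$MACHINESET" -n "$NS" -o jsonpath='{.spec.replicas}')

  # Mark every existing machine so it is chosen first when scaling down.
  for m in $(oc get machines -n "$NS" \
      -l machine.openshift.io/cluster-api-machineset="$MACHINESET" \
      -o jsonpath='{.items[*].metadata.name}'); do
    oc annotate machine/"$m" -n "$NS" machine.openshift.io/delete-machine="true"
  done

  oc scale --replicas="$(doubled "$ORIGINAL")" machineset "$MACHINESET" -n "$NS"
  # ...wait for the new machines to reach the Running phase, then:
  oc scale --replicas="$ORIGINAL" machineset "$MACHINESET" -n "$NS"
fi
```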
3.3. About machine health checks
Machine health checks automatically repair unhealthy machines in a particular machine pool.
To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as a machine staying in the NotReady status for a specified duration.
You cannot apply a machine health check to a machine with the master role.
The controller that observes a MachineHealthCheck resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event.
To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy threshold allows for in the targeted pool of machines, remediation stops so that you can intervene manually.
Consider the timeouts carefully, accounting for workloads and requirements.
- Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
- Too short timeouts can result in a remediation loop. For example, the timeout for checking the NotReady status must be long enough to allow the machine to complete the startup process.
To stop the check, remove the resource.
3.3.1. Limitations when deploying machine health checks
There are limitations to consider before deploying a machine health check:
- Only machines owned by a machine set are remediated by a machine health check.
- Control plane machines are not currently supported and are not remediated if they are unhealthy.
- If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
- If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.
A machine is remediated immediately if the Machine resource phase is Failed.
3.4. Sample MachineHealthCheck resource
The MachineHealthCheck resource for all cloud-based installation types resembles the following YAML file:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example 1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> 2
      machine.openshift.io/cluster-api-machine-type: <role> 3
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> 4
  unhealthyConditions:
    - type: "Ready"
      timeout: "300s" 5
      status: "False"
    - type: "Ready"
      timeout: "300s" 6
      status: "Unknown"
  maxUnhealthy: "40%" 7
  nodeStartupTimeout: "10m" 8
- 1
- Specify the name of the machine health check to deploy.
- 2 3
- Specify a label for the machine pool that you want to check.
- 4
- Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
- 5 6
- Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
- 7
- Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed.
- 8
- Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.
The matchLabels are examples only; you must map your machine groups based on your specific needs.
3.4.1. Short-circuiting machine health check remediation
Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the maxUnhealthy field in the MachineHealthCheck resource.
If the user defines a value for the maxUnhealthy field, the MachineHealthCheck compares the maxUnhealthy value against the number of machines within its target pool that it has determined to be unhealthy before remediating any machines. Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy limit.
If maxUnhealthy is not set, the value defaults to 100% and the machines are remediated regardless of the state of the cluster.
The appropriate maxUnhealthy value depends on the scale of the cluster you deploy and how many machines the MachineHealthCheck covers. For example, you can use the maxUnhealthy value to cover multiple machine sets across multiple availability zones so that if you lose an entire zone, your maxUnhealthy setting prevents further remediation within the cluster.
The maxUnhealthy field can be set as either an integer or percentage. There are different remediation implementations depending on the maxUnhealthy value.
3.4.1.1. Setting maxUnhealthy by using an absolute value
If maxUnhealthy is set to 2:
- Remediation will be performed if 2 or fewer nodes are unhealthy
- Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
3.4.1.2. Setting maxUnhealthy by using percentages
If maxUnhealthy is set to 40% and there are 25 machines being checked:
- Remediation will be performed if 10 or fewer nodes are unhealthy
- Remediation will not be performed if 11 or more nodes are unhealthy
If maxUnhealthy is set to 40% and there are 6 machines being checked:
- Remediation will be performed if 2 or fewer nodes are unhealthy
- Remediation will not be performed if 3 or more nodes are unhealthy
The allowed number of machines is rounded down when the percentage of maxUnhealthy machines that are checked is not a whole number.
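The rounding behavior can be checked with plain integer arithmetic. The helper below is only an illustration of the calculation, not part of any API.

```shell
# Illustration: the absolute number of machines that a percentage
# maxUnhealthy value allows to be unhealthy before remediation stops,
# for a given pool size. Integer division rounds down, matching the
# behavior described above.
allowed_unhealthy() {
  local percent=$1 total=$2
  echo $(( percent * total / 100 ))
}

allowed_unhealthy 40 25   # 10: up to 10 unhealthy machines are remediated
allowed_unhealthy 40 6    # 2: 40% of 6 is 2.4, rounded down to 2
```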
3.5. Creating a MachineHealthCheck resource
You can create a MachineHealthCheck resource for all MachineSets in your cluster. You should not create a MachineHealthCheck resource that targets control plane machines.
Prerequisites
- Install the oc command line interface.
Procedure
- Create a file named healthcheck.yml that contains the definition of your machine health check.
- Apply the healthcheck.yml file to your cluster:

$ oc apply -f healthcheck.yml