OpenShift Container Platform
4.8
Scalability and performance

Chapter 3. Recommended cluster scaling practices

Important

The guidance in this section is only relevant for installations with cloud provider integration.

These guidelines apply to OpenShift Container Platform with software-defined networking (SDN), not Open Virtual Network (OVN).

Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.

3.1. Recommended practices for scaling the cluster

When scaling up the cluster to higher node counts:

Spread nodes across all of the available zones for higher availability.
Scale up by no more than 25 to 50 machines at once.
Consider creating new machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.

Note

Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.

The controller might not be able to create the machines if the replicas in the machine sets are raised to much higher numbers all at one time. The process is limited by the number of requests that the cloud platform on which OpenShift Container Platform is deployed can handle. The controller issues more queries as it creates, checks, and updates the machines with their status. Because the cloud platform enforces API request limits, excessive queries might lead to machine creation failures.

Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.
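
The scale-up guidance above can be sketched as a small planning helper. This is only an illustration, not part of OpenShift Container Platform: the function names, the default step of 50 machines (the upper end of the recommended range), and the zone names are assumptions chosen for the example.

```python
from typing import Dict, List


def scale_up_steps(current: int, target: int, max_step: int = 50) -> List[int]:
    """Return the intermediate replica totals for a gradual scale-up.

    Each step adds at most `max_step` machines (hypothetical default: 50).
    """
    if max_step < 1:
        raise ValueError("max_step must be at least 1")
    if target < current:
        raise ValueError("target must not be lower than current")
    steps = []
    replicas = current
    while replicas < target:
        replicas = min(replicas + max_step, target)
        steps.append(replicas)
    return steps


def spread_across_zones(total: int, zones: List[str]) -> Dict[str, int]:
    """Split a replica total as evenly as possible across the given zones."""
    if not zones:
        raise ValueError("at least one zone is required")
    base, extra = divmod(total, len(zones))
    # The first `extra` zones each receive one additional machine.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


if __name__ == "__main__":
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # example zone names
    for total in scale_up_steps(current=30, target=150):
        print(total, spread_across_zones(total, zones))
```

Each printed line is one scaling step: the cluster-wide replica total and the per-zone counts you might set on the machine set in each zone before waiting for the machines to become ready.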

Note

When scaling large and dense clusters to lower node counts, the process might take a long time because it involves draining or evicting the objects running on the nodes that are being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client QPS and burst rates cannot be modified in OpenShift Container Platform.

3.2. Modifying a machine set

To make changes to a machine set, edit the `MachineSet` YAML. Then, remove all machines associated with the machine set by deleting each machine or scaling down the machine set to `0` replicas. Then, scale the replicas back to the desired number. Changes you make to a machine set do not affect existing machines.

If you need to scale a machine set without making other changes, you do not need to delete the machines.

Note

By default, the OpenShift Container Platform router pods are deployed on workers. Because the router is required to access some cluster resources, including the web console, do not scale the worker machine set to `0` unless you first relocate the router pods.

Prerequisites

Install an OpenShift Container Platform cluster and the `oc` command line.
Log in to `oc` as a user with `cluster-admin` permission.

Procedure

Edit the machine set:

$ oc edit machineset <machineset> -n openshift-machine-api

Scale down the machine set to `0`:

$ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api

Or:

$ oc edit machineset <machineset> -n openshift-machine-api

Tip

You can alternatively apply the following YAML to scale the machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <machineset>
  namespace: openshift-machine-api
spec:
  replicas: 0

Wait for the machines to be removed.

Scale up the machine set as needed:

$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

Or:

$ oc edit machineset <machineset> -n openshift-machine-api

Tip

You can alternatively apply the following YAML to scale the machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <machineset>
  namespace: openshift-machine-api
spec:
  replicas: 2

Wait for the machines to start. The new machines contain changes you made to the machine set.
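
The scale-down and scale-up steps of this procedure can be sketched as a helper that only builds the `oc scale` command lines shown above, without running them. The function name and the `worker-us-east-1a` machine set name are hypothetical, and the sketch does not wait for the machines to be removed between the two commands, which the procedure requires.

```python
import shlex
from typing import List

NAMESPACE = "openshift-machine-api"


def recreate_commands(machineset: str, replicas: int) -> List[List[str]]:
    """Build the two `oc scale` invocations that replace a machine set's machines.

    The first command scales the machine set to 0, the second scales it back
    to `replicas`. The caller must wait for the old machines to be removed
    before running the second command.
    """
    if replicas < 0:
        raise ValueError("replicas must not be negative")

    def scale(count: int) -> List[str]:
        return ["oc", "scale", f"--replicas={count}",
                "machineset", machineset, "-n", NAMESPACE]

    return [scale(0), scale(replicas)]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in recreate_commands("worker-us-east-1a", 2):
        print(shlex.join(argv))
```

Returning argument lists rather than a single shell string keeps the machine set name from being reinterpreted by a shell if you later pass the lists to a process runner.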

3.3. About machine health checks

Machine health checks automatically repair unhealthy machines in a particular machine pool.

To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the `NotReady` status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.

Note

You cannot apply a machine health check to a machine with the master role.

The controller that observes a `MachineHealthCheck` resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a `machine deleted` event.

To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the `maxUnhealthy` threshold allows for in the targeted pool of machines, remediation stops so that you can intervene manually.

Note

Consider the timeouts carefully, accounting for workloads and requirements.

Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
Timeouts that are too short can result in a remediation loop. For example, the timeout for checking the `NotReady` status must be long enough to allow the machine to complete the startup process.

To stop the check, remove the resource.

For example, you should stop the check during the upgrade process because the nodes in the cluster might become temporarily unavailable. The `MachineHealthCheck` might identify such nodes as unhealthy and reboot them. To avoid rebooting such nodes, remove any `MachineHealthCheck` resource that you have deployed before updating the cluster. However, a `MachineHealthCheck` resource that is deployed by default (such as `machine-api-termination-handler`) cannot be removed and will be recreated.

3.3.1. Limitations when deploying machine health checks

There are limitations to consider before deploying a machine health check:

Only machines owned by a machine set are remediated by a machine health check.
Control plane machines are not currently supported and are not remediated if they are unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the `nodeStartupTimeout`, the machine is remediated.
A machine is remediated immediately if the `Machine` resource phase is `Failed`.
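
The limitations listed above can be read as a small decision rule. The following sketch models only those rules, not the `unhealthyConditions` checks, and it is not the controller's actual implementation: the `MachineState` fields and the function name are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    """Hypothetical snapshot of the facts the limitations refer to."""
    owned_by_machine_set: bool
    is_control_plane: bool
    phase: str                     # for example "Running" or "Failed"
    node_joined: bool              # a node joined the cluster for this machine
    node_present: bool             # that node is still part of the cluster
    seconds_since_creation: float


def remediated_by_limitation_rules(machine: MachineState,
                                   node_startup_timeout_s: float) -> bool:
    """Return True if the listed rules alone would trigger remediation."""
    # Only machines owned by a machine set are remediated, and control
    # plane machines are not supported.
    if not machine.owned_by_machine_set or machine.is_control_plane:
        return False
    # A machine in the Failed phase is remediated immediately.
    if machine.phase == "Failed":
        return True
    # A machine whose node never joined is remediated after the timeout.
    if not machine.node_joined:
        return machine.seconds_since_creation > node_startup_timeout_s
    # A machine whose node was removed from the cluster is unhealthy.
    return not machine.node_present
```

For example, a worker machine whose node has not joined after 700 seconds would be remediated when the startup timeout is 600 seconds (the `10m` value used in the sample resource), while a control plane machine is never remediated by these rules.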

3.4. Sample MachineHealthCheck resource

The `MachineHealthCheck` resource for all cloud-based installation types, other than bare metal, resembles the following YAML file:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example # 1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> # 2
      machine.openshift.io/cluster-api-machine-type: <role> # 3
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> # 4
  unhealthyConditions:
  - type:    "Ready"
    timeout: "300s" # 5
    status: "False"
  - type:    "Ready"
    timeout: "300s" # 6
    status: "Unknown"
  maxUnhealthy: "40%" # 7
  nodeStartupTimeout: "10m" # 8

1: Specify the name of the machine health check to deploy.
2 3: Specify a label for the machine pool that you want to check.
4: Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
5 6: Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
7: Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy, remediation is not performed.
8: Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.

Note

The `matchLabels` are examples only; you must map your machine groups based on your specific needs.
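
If you generate health checks for several machine sets, the sample resource can be built programmatically. The sketch below is one possible approach, not an official tool: the function name and its parameters are hypothetical, and the defaults simply mirror the values in the sample YAML. It emits JSON, which `oc apply -f` accepts in the same way as YAML.

```python
import json
from typing import Dict, Union


def machine_health_check(name: str,
                         match_labels: Dict[str, str],
                         max_unhealthy: Union[int, str] = "40%",
                         ready_timeout: str = "300s",
                         node_startup_timeout: str = "10m") -> dict:
    """Build a MachineHealthCheck manifest shaped like the sample resource."""
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineHealthCheck",
        "metadata": {"name": name, "namespace": "openshift-machine-api"},
        "spec": {
            "selector": {"matchLabels": dict(match_labels)},
            # One entry per Ready status that counts as unhealthy.
            "unhealthyConditions": [
                {"type": "Ready", "status": status, "timeout": ready_timeout}
                for status in ("False", "Unknown")
            ],
            "maxUnhealthy": max_unhealthy,
            "nodeStartupTimeout": node_startup_timeout,
        },
    }


if __name__ == "__main__":
    labels = {
        # Example label value only; map labels to your own machine groups.
        "machine.openshift.io/cluster-api-machine-role": "worker",
    }
    print(json.dumps(machine_health_check("example", labels), indent=2))
```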

3.4.1. Short-circuiting machine health check remediation

Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy. Short-circuiting is configured through the `maxUnhealthy` field in the `MachineHealthCheck` resource.

If the user defines a value for the `maxUnhealthy` field, before remediating any machines, the `MachineHealthCheck` compares the value of `maxUnhealthy` with the number of machines within its target pool that it has determined to be unhealthy. Remediation is not performed if the number of unhealthy machines exceeds the `maxUnhealthy` limit.

Important

If `maxUnhealthy` is not set, the value defaults to `100%` and the machines are remediated regardless of the state of the cluster.

The appropriate `maxUnhealthy` value depends on the scale of the cluster you deploy and how many machines the `MachineHealthCheck` covers. For example, you can use the `maxUnhealthy` value to cover multiple machine sets across multiple availability zones so that if you lose an entire zone, your `maxUnhealthy` setting prevents further remediation within the cluster.
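
To make the zone-loss example concrete, the following sketch checks whether losing any single zone would exceed a percentage-based `maxUnhealthy` limit and therefore halt remediation. It is an illustration under stated assumptions: the function name and zone names are hypothetical, all machines are assumed to be covered by one `MachineHealthCheck`, and the limit is rounded down as described in the note on percentages in this chapter.

```python
from typing import Dict


def zone_loss_halts_remediation(machines_per_zone: Dict[str, int],
                                max_unhealthy_percent: int) -> Dict[str, bool]:
    """For each zone, report whether losing it would stop remediation.

    Remediation stops when the number of unhealthy machines exceeds the
    limit, which is the percentage of all covered machines, rounded down.
    """
    total = sum(machines_per_zone.values())
    limit = total * max_unhealthy_percent // 100
    return {zone: count > limit for zone, count in machines_per_zone.items()}


if __name__ == "__main__":
    zones = {"zone-a": 5, "zone-b": 5, "zone-c": 5}  # example layout
    print(zone_loss_halts_remediation(zones, 30))  # limit is 4 machines
    print(zone_loss_halts_remediation(zones, 40))  # limit is 6 machines
```

With 15 machines in three equal zones, a 30% limit allows 4 unhealthy machines, so losing a 5-machine zone halts remediation. A 40% limit allows 6, so the same outage would still be remediated automatically.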

The `maxUnhealthy` field can be set as either an integer or percentage. There are different remediation implementations depending on the `maxUnhealthy` value.

3.4.1.1. Setting maxUnhealthy by using an absolute value

If `maxUnhealthy` is set to `2`:

Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy

These values are independent of how many machines are being checked by the machine health check.

3.4.1.2. Setting maxUnhealthy by using percentages

If `maxUnhealthy` is set to `40%` and there are 25 machines being checked:

Remediation will be performed if 10 or fewer nodes are unhealthy
Remediation will not be performed if 11 or more nodes are unhealthy

If `maxUnhealthy` is set to `40%` and there are 6 machines being checked:

Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy

Note

When the `maxUnhealthy` percentage of the checked machines is not a whole number, the allowed number of machines is rounded down.
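
The absolute and percentage cases above reduce to a short calculation. The sketch below reproduces the worked numbers from this section; it is not the controller's code, and the function names are hypothetical. It assumes percentages are written as whole numbers followed by `%` and are rounded down, as the preceding note states.

```python
from typing import Union


def allowed_unhealthy(max_unhealthy: Union[int, str], checked: int) -> int:
    """Return how many unhealthy machines still permit remediation."""
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy[:-1])
        # Integer division rounds the allowed number down.
        return checked * percent // 100
    # An absolute value does not depend on how many machines are checked.
    return int(max_unhealthy)


def remediation_allowed(max_unhealthy: Union[int, str],
                        checked: int, unhealthy: int) -> bool:
    """Remediation is performed only while unhealthy machines are within the limit."""
    return unhealthy <= allowed_unhealthy(max_unhealthy, checked)


if __name__ == "__main__":
    print(allowed_unhealthy("40%", 25))  # 10
    print(allowed_unhealthy("40%", 6))   # 2, because 2.4 is rounded down
    print(allowed_unhealthy(2, 25))      # 2
```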

3.5. Creating a MachineHealthCheck resource

You can create a `MachineHealthCheck` resource for all `MachineSets` in your cluster. You should not create a `MachineHealthCheck` resource that targets control plane machines.

Prerequisites

Install the `oc` command line interface.

Procedure

Create a `healthcheck.yml` file that contains the definition of your machine health check.

Apply the `healthcheck.yml` file to your cluster:

$ oc apply -f healthcheck.yml
