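5.7. Taints and tolerations

5.7.1. Understanding taints and tolerations

A taint allows a node to refuse to schedule a pod unless that pod has a matching toleration.

You apply taints to a node through the Node specification (NodeSpec) and apply tolerations to a pod through the Pod specification (PodSpec). When you apply a taint to a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.

Example taint in a node specification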

spec:
  taints:
  - effect: NoExecute
    key: key1
    value: value1
....

Example toleration in a Pod spec

spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
....

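Taints and tolerations consist of a key, a value, and an effect.

Taint and toleration components

Parameter	Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

`NoSchedule` [1]	New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
`PreferNoSchedule`	New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
`NoExecute`	New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.

operator

`Equal`	The `key`/`value`/`effect` parameters must match. This is the default.
`Exists`	The `key`/`effect` parameters must match. You must leave the `value` parameter blank, which matches any value.

[1] If you add a NoSchedule taint to a control plane node (also known as a master node), the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

For example: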

apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
    machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
...
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
...

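A toleration matches a taint:

If the operator parameter is set to Equal:
- the key parameters are the same;
- the value parameters are the same;
- the effect parameters are the same.
If the operator parameter is set to Exists:
- the key parameters are the same;
- the effect parameters are the same.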
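
The matching rules above can be sketched in a few lines of Python. This is only an illustrative model, not the scheduler's implementation: the `Taint` and `Toleration` classes and the `tolerates` function are hypothetical names introduced here. In addition to the rules listed above, the sketch assumes the standard Kubernetes wildcard behavior, where an empty key with the Exists operator matches every key and an empty effect matches every effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    operator: str = "Equal"  # Equal is the default operator
    key: str = ""
    value: str = ""
    effect: str = ""


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (hypothetical helper)."""
    # A non-empty effect must be identical; an empty effect matches any effect.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # An empty key is only meaningful with Exists, where it matches every key.
    if not toleration.key:
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    # Exists ignores the value; Equal requires the values to be the same.
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value
```

For example, a toleration with key1, Equal, value1, and NoExecute matches the taint key1=value1:NoExecute, while the same toleration with the NoSchedule effect does not.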

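The following taints are built into OpenShift Container Platform:

node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.
node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.
node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.
node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.
node.kubernetes.io/network-unavailable: The node network is unavailable.
node.kubernetes.io/unschedulable: The node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.
node.kubernetes.io/pid-pressure: The node has pid pressure. This corresponds to the node condition PIDPressure=True.
Important
OpenShift Container Platform does not set a default pid.available evictionHard.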

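5.7.1.1. Understanding how to use toleration seconds to delay pod evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the Pod specification or MachineSet object. If a taint with the NoExecute effect is added to a node, a pod that tolerates the taint and sets the tolerationSeconds parameter is not evicted until that time period expires.

Example toleration with tolerationSeconds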

spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600

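Here, if this pod is running and a matching taint is added to the node, the pod stays bound to the node for 3,600 seconds and is then evicted. If the taint is removed before that time, the pod is not evicted.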
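
The timing can be worked through with a small Python sketch. It is a simplified model under stated assumptions, not the controller's code: `eviction_deadline` is a hypothetical helper, and the treatment of zero or negative values as immediate eviction follows the Kubernetes API description of tolerationSeconds.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def eviction_deadline(taint_added_at: datetime,
                      toleration_seconds: Optional[int]) -> Optional[datetime]:
    """Return when a tolerating pod would be evicted, or None if never.

    Hypothetical helper: None for toleration_seconds models a toleration
    that does not set tolerationSeconds, so the pod stays bound indefinitely.
    """
    if toleration_seconds is None:
        return None
    # Zero and negative values are treated as 0, that is, evict immediately.
    return taint_added_at + timedelta(seconds=max(toleration_seconds, 0))


# Taint added at 12:00 UTC with tolerationSeconds: 3600.
added = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(eviction_deadline(added, 3600))
```

If the taint is removed before the returned deadline, the pod is not evicted, so the deadline only applies while the taint remains on the node.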

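5.7.1.2. Understanding how to use multiple taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

Ignore the taints for which the pod has a matching toleration.
The remaining unmatched taints have the indicated effects on the pod:
- If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule the pod onto that node.
- If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
- If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node if it is already running on the node, or does not schedule the pod onto the node if it is not yet running on the node.
  - Pods that do not tolerate the taint are evicted immediately.
  - Pods that tolerate the taint without specifying tolerationSeconds in their Pod specification remain bound forever.
  - Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

Add the following taints to the node: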

$ oc adm taint nodes node1 key1=value1:NoSchedule

$ oc adm taint nodes node1 key1=value1:NoExecute

$ oc adm taint nodes node1 key2=value2:NoSchedule

The pod has the following tolerations:

spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"

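In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. If the pod is already running on the node when the taints are added, it continues to run, because the third taint is the only one of the three that the pod does not tolerate, and a NoSchedule taint does not evict running pods.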
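
The processing order described in this section can be traced with a short Python sketch that uses the three taints and two tolerations from the example. The `tolerates` and `evaluate` functions are hypothetical names, and the match is deliberately simplified to the Equal and Exists rules; it is a model of the documented behavior, not the scheduler itself.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified match: same key and effect, and same value unless Exists."""
    if toleration.get("key") != taint["key"]:
        return False
    if toleration.get("effect") != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def evaluate(taints: list, tolerations: list, already_running: bool) -> str:
    """Apply the effects of the taints that the pod does not tolerate."""
    untolerated = [t for t in taints
                   if not any(tolerates(tol, t) for tol in tolerations)]
    effects = {t["effect"] for t in untolerated}
    if "NoExecute" in effects:
        return "evict" if already_running else "do not schedule"
    if already_running:
        # NoSchedule and PreferNoSchedule never remove a running pod.
        return "keep running"
    if "NoSchedule" in effects:
        return "do not schedule"
    if "PreferNoSchedule" in effects:
        return "avoid if possible"
    return "schedule"


node_taints = [
    {"key": "key1", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "value": "value1", "effect": "NoExecute"},
    {"key": "key2", "value": "value2", "effect": "NoSchedule"},
]
pod_tolerations = [
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoExecute"},
]

print(evaluate(node_taints, pod_tolerations, already_running=False))
print(evaluate(node_taints, pod_tolerations, already_running=True))
```

With these inputs the first call reports that the pod is not scheduled and the second reports that a running pod stays on the node, which agrees with the explanation above.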

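5.7.1.3. Understanding pod scheduling and node conditions (Taint Nodes By Condition)

The Taint Nodes By Condition feature, which is enabled by default, automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node unless the pod has a matching toleration.

The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you can configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.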
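
As a rough illustration of why a toleration lets a pod ignore a node condition, the following Python sketch derives taints from conditions and then checks only the taints. The condition-to-taint mapping is taken from the list of built-in taints earlier in this section; the function names and the dictionary layout are hypothetical, and the real feature is implemented by cluster controllers rather than code like this.

```python
# Node condition (type, status) mapped to the built-in taint key it produces.
CONDITION_TAINTS = {
    ("Ready", "False"): "node.kubernetes.io/not-ready",
    ("Ready", "Unknown"): "node.kubernetes.io/unreachable",
    ("MemoryPressure", "True"): "node.kubernetes.io/memory-pressure",
    ("DiskPressure", "True"): "node.kubernetes.io/disk-pressure",
    ("PIDPressure", "True"): "node.kubernetes.io/pid-pressure",
}


def taints_for_conditions(conditions: dict) -> list:
    """Return the NoSchedule taints implied by the reported conditions."""
    taints = []
    for (cond_type, status), key in CONDITION_TAINTS.items():
        if conditions.get(cond_type) == status:
            taints.append({"key": key, "effect": "NoSchedule"})
    return taints


def schedulable(conditions: dict, tolerated_keys: set) -> bool:
    """The check looks at taints only, never at the conditions themselves."""
    return all(t["key"] in tolerated_keys
               for t in taints_for_conditions(conditions))


node = {"Ready": "True", "MemoryPressure": "True"}
print(schedulable(node, set()))
print(schedulable(node, {"node.kubernetes.io/memory-pressure"}))
```

The node reports memory pressure in both calls; only the pod that tolerates the corresponding taint key is considered schedulable.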

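To ensure backward compatibility, the daemon set controller automatically adds the following tolerations to all daemons:

node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/unschedulable (1.10 or later)
node.kubernetes.io/network-unavailable (host network only)

You can also add arbitrary tolerations to daemon sets.

Note

The control plane also adds the node.kubernetes.io/memory-pressure toleration to pods whose QoS class is Guaranteed or Burstable, because Kubernetes treats pods in those classes as able to cope with memory pressure. New BestEffort pods are not scheduled onto the affected node.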

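5.7.1.4. Understanding evicting pods by condition (taint-based evictions)

The Taint-Based Evictions feature, which is enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.

Taint-Based Evictions have a NoExecute effect, where any pod that does not tolerate the taint is evicted immediately and any pod that does tolerate the taint is never evicted, unless the pod uses the tolerationSeconds parameter.

The tolerationSeconds parameter allows you to specify how long a pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSeconds period, the taint remains on the node and the pods with a matching toleration are evicted. If the condition clears before the tolerationSeconds period, pods with matching tolerations are not removed.

If you use the tolerationSeconds parameter with no value, pods are never evicted because of the not ready and unreachable node conditions.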

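Note

OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

By default, if more than 55% of nodes in a given zone are unhealthy, the node lifecycle controller changes that zone's state to PartialDisruption and the rate of pod evictions is reduced. For small clusters (by default, 50 nodes or fewer) in this state, nodes in this zone are not tainted and evictions are stopped.

For more information, see Rate limits on eviction in the Kubernetes documentation.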
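
The thresholds in the note above can be expressed as a small decision function. This Python sketch follows the wording of the note only; `eviction_mode` and its parameter names are hypothetical, and the actual node lifecycle controller applies additional conditions that are not modeled here.

```python
def eviction_mode(unhealthy: int, total: int,
                  unhealthy_threshold: float = 0.55,
                  small_cluster_max: int = 50) -> str:
    """Classify eviction behavior for one zone (illustrative model only).

    Returns "normal", "reduced", or "stopped".
    """
    if total == 0 or unhealthy / total <= unhealthy_threshold:
        return "normal"
    # More than 55% of the nodes are unhealthy: PartialDisruption.
    if total <= small_cluster_max:
        return "stopped"   # small cluster: no tainting, evictions stop
    return "reduced"       # larger cluster: evictions continue at a lower rate


print(eviction_mode(unhealthy=6, total=10))
print(eviction_mode(unhealthy=60, total=100))
```

Under these assumptions, 6 unhealthy nodes out of 10 stops evictions because the cluster is small, while 60 out of 100 only reduces the eviction rate.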

OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod configuration specifies either toleration.

spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300 # 1
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300

1: These tolerations ensure that the default pod behavior is to remain bound for five minutes after one of these node condition problems is detected.
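
The defaulting behavior can be modeled as follows. This Python sketch assumes the check is made separately for each of the two taint keys, so a pod that already sets one of the tolerations still receives the other; that per-key behavior, the function name `add_default_tolerations`, and the dictionary layout are assumptions for illustration rather than a description of the actual admission code.

```python
DEFAULT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def add_default_tolerations(tolerations: list, seconds: int = 300) -> list:
    """Return a copy of the tolerations with missing defaults appended."""
    result = [dict(t) for t in tolerations]
    for key in DEFAULT_KEYS:
        already_set = any(
            t.get("key") == key and t.get("effect") in (None, "", "NoExecute")
            for t in result
        )
        if not already_set:
            result.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": seconds,
            })
    return result


# A pod that wants to stay bound longer during a network partition.
custom = [{
    "key": "node.kubernetes.io/unreachable",
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 6000,
}]
for toleration in add_default_tolerations(custom):
    print(toleration["key"], toleration["tolerationSeconds"])
```

In this run the pod keeps its own 6000-second toleration for unreachable and gains the 300-second default for not-ready.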

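You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want to keep the pods bound to the node for a longer time in the event of a network partition, allowing for the partition to recover and avoiding pod eviction.

Pods spawned by a daemon set are created with NoExecute tolerations for the following taints with no tolerationSeconds: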

node.kubernetes.io/unreachable
node.kubernetes.io/not-ready

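As a result, daemon set pods are never evicted because of these node conditions.

5.7.1.5. Tolerating all taints

You can configure a pod to tolerate all taints by adding an operator: "Exists" toleration with no key and value parameters. Pods with this toleration are not removed from a node that has taints.

Pod spec for tolerating all taints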

spec:
  tolerations:
  - operator: "Exists"

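5.7.2. Adding taints and tolerations

You add tolerations to pods and taints to nodes to allow the node to control which pods should or should not be scheduled on them. For existing pods and nodes, you should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.

Procedure

Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:
Sample pod configuration file with an Equal operator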
```
spec:
  tolerations:
  - key: "key1" # 1
    value: "value1"
    operator: "Equal"
    effect: "NoExecute"
    tolerationSeconds: 3600 # 2
```
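1
The toleration parameters, as described in the Taint and toleration components table.
2
The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.
For example:
Sample pod configuration file with an Exists operator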
```
spec:
  tolerations:
  - key: "key1"
    operator: "Exists" # 1
    effect: "NoExecute"
    tolerationSeconds: 3600
```
1
The Exists operator does not take a value.
Add a taint to a node by using the following command with the parameters described in the Taint and toleration components table:
```
$ oc adm taint nodes <node_name> <key>=<value>:<effect>
```
For example:
```
$ oc adm taint nodes node1 key1=value1:NoExecute
```
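This command places a taint on node1 that has the key key1, the value value1, and the effect NoExecute.
Note
If you add a NoSchedule taint to a control plane node (also known as a master node), the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.
For example: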
```
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
    machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
...
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
...
```
The tolerations on the pod match the taint on the node. A pod with either toleration can be scheduled onto node1.

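5.7.3. Adding taints and tolerations using a machine set

You can add taints to nodes using a machine set. All nodes associated with the MachineSet object are updated with the taint. Tolerations respond to taints added by a machine set in the same manner as taints added directly to the nodes.

Procedure

Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:
Sample pod configuration file with an Equal operator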
```
spec:
  tolerations:
  - key: "key1" # 1
    value: "value1"
    operator: "Equal"
    effect: "NoExecute"
    tolerationSeconds: 3600 # 2
```
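1
The toleration parameters, as described in the Taint and toleration components table.
2
The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.
For example:
Sample pod configuration file with an Exists operator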
```
spec:
  tolerations:
  - key: "key1"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600
```
Add the taint to the MachineSet object:
1. Edit the MachineSet YAML for the nodes you want to taint, or create a new MachineSet object:
```
$ oc edit machineset <machineset> -n openshift-machine-api
```
2. Add the taint to the spec.template.spec section:
  Example taint in a machine set specification
```
spec:
....
  template:
....
    spec:
      taints:
      - effect: NoExecute
        key: key1
        value: value1
....
```
  This example places a taint that has the key key1, the value value1, and the taint effect NoExecute on the nodes.
3. Scale down the machine set to 0:
```
$ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api
```
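  Tip
  You can alternatively apply the following YAML to scale the machine set:
  apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    name: <machineset>
    namespace: openshift-machine-api
  spec:
    replicas: 0
  Wait for the machines to be removed.
4. Scale up the machine set as needed: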
```
$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
```
  Or:
```
$ oc edit machineset <machineset> -n openshift-machine-api
```
  Wait for the machines to start. The taint is added to the nodes associated with the MachineSet object.

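5.7.4. Binding a user to a node using taints and tolerations

If you want to dedicate a set of nodes for exclusive use by a particular set of users, add a toleration to their pods. Then, add a corresponding taint to those nodes. The pods with the tolerations are allowed to use the tainted nodes or any other nodes in the cluster.

If you want to ensure that the pods are scheduled only to those tainted nodes, also add a label to the same set of nodes and add a node affinity to the pods so that the pods can only be scheduled onto nodes with that label.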
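
As a rough illustration of the pod-side configuration described above, the following Python sketch assembles a PodSpec fragment that combines a toleration with a required node affinity. The taint key dedicated and the value groupName come from the procedure in this section; labeling the nodes with the same key and value is an assumption made for the example, and `dedicated_pod_spec_fragment` is a hypothetical helper, not part of any OpenShift tooling.

```python
import json


def dedicated_pod_spec_fragment(group: str, label_key: str = "dedicated") -> dict:
    """Build the tolerations and affinity stanzas for a dedicated node group."""
    return {
        # Lets the pod onto nodes tainted with dedicated=<group>:NoSchedule.
        "tolerations": [{
            "key": "dedicated",
            "operator": "Equal",
            "value": group,
            "effect": "NoSchedule",
        }],
        # Keeps the pod off every node that lacks the (assumed) label.
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": label_key,
                            "operator": "In",
                            "values": [group],
                        }],
                    }],
                },
            },
        },
    }


fragment = dedicated_pod_spec_fragment("groupName")
print(json.dumps(fragment, indent=2))
```

The toleration alone only permits the pod to use the tainted nodes; the node affinity is what restricts the pod to them.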

Procedure

To configure a node so that users can use only that node:

Add a corresponding taint to those nodes:

For example:

$ oc adm taint nodes node1 dedicated=groupName:NoSchedule

Tip

You can alternatively apply the following YAML to add the taint:

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    ...
spec:
  taints:
    - key: dedicated
      value: groupName
      effect: NoSchedule

Add a toleration to the pods by writing a custom admission controller.

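5.7.5. Controlling nodes with special hardware using taints and tolerations

In a cluster where a small subset of nodes have specialized hardware, you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

You can achieve this by adding a toleration to pods that need the special hardware and tainting the nodes that have the specialized hardware.

Procedure

To ensure nodes with specialized hardware are reserved for specific pods:

Add a toleration to pods that need the special hardware.

For example: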

spec:
  tolerations:
    - key: "disktype"
      value: "ssd"
      operator: "Equal"
      effect: "NoSchedule"

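Taint the nodes that have the specialized hardware using one of the following commands:

$ oc adm taint nodes <node-name> disktype=ssd:NoSchedule

Or:

$ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule

Tip

You can alternatively apply the following YAML to add the taint: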

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    ...
spec:
  taints:
    - key: disktype
      value: ssd
      effect: PreferNoSchedule

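5.7.6. Removing taints and tolerations

You can remove taints from nodes and tolerations from pods as needed. Remove the taint from the node first, then remove the toleration from the pod, so that pods are not evicted from the node while the taint is still in place.

Procedure

To remove taints and tolerations:

To remove a taint from a node:

$ oc adm taint nodes <node-name> <key>-

For example: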

$ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-