7.5. Troubleshooting NUMA-aware scheduling
To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:
$ oc get crd | grep noderesourcetopologies
Example output
NAME                                          CREATED AT
noderesourcetopologies.topology.node.k8s.io   2022-01-18T08:28:06Z
Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
topo-aware-scheduler
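For reference, a NUMA-aware workload selects this scheduler through the schedulerName field in its pod template. The following is a minimal sketch of a Deployment that targets the scheduler name returned above; the deployment name, namespace, image, and resource values are illustrative assumptions only:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: numa-workload-example
  namespace: openshift-numaresources
spec:
  replicas: 1
  selector:
    matchLabels:
      app: numa-workload-example
  template:
    metadata:
      labels:
        app: numa-workload-example
    spec:
      schedulerName: topo-aware-scheduler   # must match the name reported by the previous command
      containers:
      - name: app
        image: registry.access.redhat.com/ubi9/ubi-minimal
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "2"
            memory: 256Mi
          limits:                            # requests equal limits so the pod gets guaranteed QoS
            cpu: "2"
            memory: 256Mi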
Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io
Example output
NAME                    AGE
compute-0.example.com   17h
compute-1.example.com   17h
Note: The number of nodes should equal the number of worker nodes that are configured in the machine config pool (mcp) worker definition. A command to cross-check this count is suggested after the example output below.
Verify the NUMA zone granularity for all schedulable nodes by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
Example output
apiVersion: v1
items:
- apiVersion: topology.node.k8s.io/v1
  kind: NodeResourceTopology
  metadata:
    annotations:
      k8stopoawareschedwg/rte-update: periodic
    creationTimestamp: "2022-06-16T08:55:38Z"
    generation: 63760
    name: worker-0
    resourceVersion: "8450223"
    uid: 8b77be46-08c0-4074-927b-d49361471590
  topologyPolicies:
  - SingleNUMANodeContainerLevel
  zones:
  - costs:
    - name: node-0
      value: 10
    - name: node-1
      value: 21
    name: node-0
    resources:
    - allocatable: "38"
      available: "38"
      capacity: "40"
      name: cpu
    - allocatable: "134217728"
      available: "134217728"
      capacity: "134217728"
      name: hugepages-2Mi
    - allocatable: "262352048128"
      available: "262352048128"
      capacity: "270107316224"
      name: memory
    - allocatable: "6442450944"
      available: "6442450944"
      capacity: "6442450944"
      name: hugepages-1Gi
    type: Node
  - costs:
    - name: node-0
      value: 21
    - name: node-1
      value: 10
    name: node-1
    resources:
    - allocatable: "268435456"
      available: "268435456"
      capacity: "268435456"
      name: hugepages-2Mi
    - allocatable: "269231067136"
      available: "269231067136"
      capacity: "270573244416"
      name: memory
    - allocatable: "40"
      available: "40"
      capacity: "40"
      name: cpu
    - allocatable: "1073741824"
      available: "1073741824"
      capacity: "1073741824"
      name: hugepages-1Gi
    type: Node
- apiVersion: topology.node.k8s.io/v1
  kind: NodeResourceTopology
  metadata:
    annotations:
      k8stopoawareschedwg/rte-update: periodic
    creationTimestamp: "2022-06-16T08:55:37Z"
    generation: 62061
    name: worker-1
    resourceVersion: "8450129"
    uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3
  topologyPolicies:
  - SingleNUMANodeContainerLevel
  zones:
  - costs:
    - name: node-0
      value: 10
    - name: node-1
      value: 21
    name: node-0
    resources:
    - allocatable: "38"
      available: "38"
      capacity: "40"
      name: cpu
    - allocatable: "6442450944"
      available: "6442450944"
      capacity: "6442450944"
      name: hugepages-1Gi
    - allocatable: "134217728"
      available: "134217728"
      capacity: "134217728"
      name: hugepages-2Mi
    - allocatable: "262391033856"
      available: "262391033856"
      capacity: "270146301952"
      name: memory
    type: Node
  - costs:
    - name: node-0
      value: 21
    - name: node-1
      value: 10
    name: node-1
    resources:
    - allocatable: "40"
      available: "40"
      capacity: "40"
      name: cpu
    - allocatable: "1073741824"
      available: "1073741824"
      capacity: "1073741824"
      name: hugepages-1Gi
    - allocatable: "268435456"
      available: "268435456"
      capacity: "268435456"
      name: hugepages-2Mi
    - allocatable: "269192085504"
      available: "269192085504"
      capacity: "270534262784"
      name: memory
    type: Node
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
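As mentioned in the note above, the number of noderesourcetopologies objects should match the number of worker nodes in the worker machine config pool. One way to cross-check that count, assuming the pool is named worker, is to read the machineCount field from the pool status:
$ oc get mcp worker -o jsonpath="{.status.machineCount}"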
7.5.1. Reporting more exact resource availability
Enable the cacheResyncPeriod specification to help the NUMA Resources Operator report more exact resource availability by monitoring pending resources on nodes and synchronizing this information in the scheduler cache at a defined interval. This also helps to minimize Topology Affinity Error errors caused by non-optimal scheduling decisions. The lower the interval, the greater the network load. The cacheResyncPeriod specification is disabled by default.
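Before you change this setting, it can be useful to confirm whether cacheResyncPeriod is currently set on the scheduler resource. A quick check, assuming the default resource name numaresourcesscheduler, might look like the following; an empty result means the field is not set:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o jsonpath="{.spec.cacheResyncPeriod}"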
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   92m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-cacheresync.yaml. This example sets the cache resync period to 5 seconds:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.16"
  cacheResyncPeriod: "5s" 1
1 - Enter an interval value in seconds for synchronization of the scheduler cache. A value of 5s is typical for most implementations.
Create the updated NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-cacheresync.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler is successfully deployed:
Check that the CRD is created successfully by running the following command:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                 CREATED AT
numaresourcesschedulers.nodetopology.openshift.io    2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check the logs for the scheduler:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                  2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                  2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc                1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
...
I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
7.5.2. Checking the NUMA-aware scheduler logs
Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.
To change the log level of the secondary scheduler, delete the running scheduler resource and redeploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.
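Before you delete and redeploy the scheduler, it can be useful to confirm the currently configured log level. A minimal check, assuming the default resource name numaresourcesscheduler, might be:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o jsonpath="{.spec.logLevel}"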
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   90m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.16"
  logLevel: Debug
Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-debug.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler is successfully deployed:
Check that the CRD is created successfully by running the following command:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                 CREATED AT
numaresourcesschedulers.nodetopology.openshift.io    2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check that the logs for the scheduler show the increased log level:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                  2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                  2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc                1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
...
I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
7.5.3. Troubleshooting the resource topology exporter
Troubleshoot noderesourcetopologies objects where unexpected results are occurring by checking the corresponding resource-topology-exporter logs.
It is recommended that NUMA resource topology exporter instances in the cluster are named for the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.
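To spot a naming mismatch quickly, you can list the worker nodes and the noderesourcetopologies objects side by side and compare the names. The following commands are one way to do this, assuming your worker nodes carry the standard node-role.kubernetes.io/worker label:
$ oc get nodes -l node-role.kubernetes.io/worker -o name
$ oc get noderesourcetopologies.topology.node.k8s.io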
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the daemon sets (daemonset) managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
Example output
{"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
Get the label for the daemonset of interest, using the value for name from the previous step:
$ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
Example output
{"name":"resource-topology"}
Get the pods using the resource-topology label by running the following command:
$ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:
$ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
Example output
I0221 13:38:18.334140       1 main.go:206] using sysinfo:
reservedCpus: 0,1
reservedMemory:
  "0": 1178599424
I0221 13:38:18.334370       1 main.go:67] === System information ===
I0221 13:38:18.334381       1 sysinfo.go:231] cpus: reserved "0-1"
I0221 13:38:18.334493       1 sysinfo.go:237] cpus: online "0-103"
I0221 13:38:18.546750       1 main.go:72] cpus: allocatable "2-103"
hugepages-1Gi:
  numa cell 0 -> 6
  numa cell 1 -> 1
hugepages-2Mi:
  numa cell 0 -> 64
  numa cell 1 -> 128
memory:
  numa cell 0 -> 45758Mi
  numa cell 1 -> 48372Mi
7.5.4. Correcting a missing resource topology exporter config map
If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances, the Operator is shown as active but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:
Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"
This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing the numaresourcesoperator-worker configmap custom resource (CR):
$ oc get configmap
Example output
NAME                          DATA   AGE
0e2a6bd3.openshift-kni.io     0      6d21h
kube-root-ca.crt              1      6d21h
openshift-service-ca.crt     1      6d21h
topo-aware-scheduler-config   1      6d18h
In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap CR.
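You can also query for the config map by name. Assuming that the Operator creates it in the openshift-numaresources namespace, a direct check might look like the following; a NotFound error confirms that the config map is missing:
$ oc get configmap numaresourcesoperator-worker -n openshift-numaresources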
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Compare the value of spec.machineConfigPoolSelector.matchLabels in the kubeletconfig and the value of metadata.labels in the MachineConfigPool (mcp) worker CR by using the following commands:
Check the kubeletconfig labels by running the following command:
$ oc get kubeletconfig -o yaml
Example output
machineConfigPoolSelector:
  matchLabels:
    cnf-worker-tuning: enabled
Check the mcp labels by running the following command:
$ oc get mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.
Edit the MachineConfigPool CR to include the missing label, for example:
$ oc edit mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
  cnf-worker-tuning: enabled
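As an alternative to editing the CR interactively, you can add the missing label with the oc label command. A sketch, assuming the label key and value shown above:
$ oc label mcp worker cnf-worker-tuning=enabled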
- Apply the label change and wait for the cluster to apply the updated configuration; one way to watch the rollout is suggested below.
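For example, you can monitor the worker machine config pool until the UPDATED column reports True. This is only a suggestion, not a required step:
$ oc get mcp worker --watch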
Verification
Check that the missing numaresourcesoperator-worker configmap CR is applied:
$ oc get configmap
Example output
NAME                           DATA   AGE
0e2a6bd3.openshift-kni.io      0      6d21h
kube-root-ca.crt               1      6d21h
numaresourcesoperator-worker   1      5m
openshift-service-ca.crt       1      6d21h
topo-aware-scheduler-config    1      6d18h
7.5.5. Collecting NUMA Resources Operator data
You can use the oc adm must-gather CLI command to collect information about your cluster, including features and objects associated with the NUMA Resources Operator.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
To collect NUMA Resources Operator data with must-gather, you must specify the NUMA Resources Operator must-gather image:
$ oc adm must-gather --image=registry.redhat.io/numaresources-must-gather/numaresources-must-gather-rhel9:v4.16
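If you want to write the collected data to a specific local directory rather than the default location, the oc adm must-gather command also accepts a --dest-dir option; the directory name here is only an example:
$ oc adm must-gather --image=registry.redhat.io/numaresources-must-gather/numaresources-must-gather-rhel9:v4.16 --dest-dir=must-gather-numa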