This documentation is for a release that is no longer maintained
See documentation for the latest supported version 3 or the latest supported version 4.
6.6. Troubleshooting NUMA-aware scheduling
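To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.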
Procedure
Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:
$ oc get crd | grep noderesourcetopologies
Example output
noderesourcetopologies.topology.node.k8s.io   2022-01-18T08:28:06Z
Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
"topo-aware-scheduler"
Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io
Example output
NAME                    AGE
compute-0.example.com   17h
compute-1.example.com   17h
Note: The number of nodes should equal the number of worker nodes that are configured by the machine config pool (mcp) worker definition.
Verify the NUMA zone granularity for all schedulable nodes by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
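The checks in this procedure can be combined into a few small helper functions. The following is a minimal sketch, not part of the documented procedure: the function names are hypothetical, the scheduler name is read with -o jsonpath instead of jq, and the status.machineCount field of the worker machine config pool is assumed to reflect the number of worker nodes.

```shell
# Hypothetical helpers wrapping the oc commands from the procedure above.
# Assumes oc is on PATH and you are logged in with cluster-admin privileges.

# Succeeds only if the noderesourcetopologies CRD is deployed.
nrt_crd_present() {
  oc get crd | grep noderesourcetopologies >/dev/null
}

# Prints the scheduler name that NUMA-aware workloads must reference.
numa_scheduler_name() {
  oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler \
    -o jsonpath='{.status.schedulerName}'
}

# Succeeds only if there is one noderesourcetopologies object per machine in
# the worker pool. Assumption: status.machineCount of the worker mcp is the
# right number to compare against.
nrt_count_matches_workers() {
  nrt_count=$(oc get noderesourcetopologies.topology.node.k8s.io --no-headers | wc -l | tr -d ' ')
  worker_count=$(oc get mcp worker -o jsonpath='{.status.machineCount}')
  [ "$nrt_count" -eq "$worker_count" ]
}
```

Each function returns a non-zero status when its check fails, so the functions can be chained with && in a script.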
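6.6.1. Checking the NUMA-aware scheduler logs
Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.
To change the log level of the secondary scheduler, delete the running scheduler resource and redeploy it with the changed log level. The scheduler cannot schedule new workloads during this downtime.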
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   90m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the YAML for the NUMAResourcesScheduler resource in the file nro-scheduler-debug.yaml, with spec.logLevel set to Debug.
Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-debug.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
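The procedure refers to a NUMAResourcesScheduler manifest saved as nro-scheduler-debug.yaml. As a rough sketch, a manifest of that kind could be generated as follows. The function name is hypothetical, and the apiVersion (nodetopology.openshift.io/v1alpha1 here) and the spec.imageSpec field are assumptions that depend on the installed Operator version. Confirm both against your cluster, for example with oc explain numaresourcesscheduler, before you use the output.

```shell
# Hypothetical helper: write a NUMAResourcesScheduler manifest.
# Usage: write_scheduler_manifest <image> <log-level> <output-file>
# Assumptions: the default apiVersion and the imageSpec field match your
# Operator version; set NRS_API_VERSION to override the apiVersion.
write_scheduler_manifest() {
  case "$2" in
    Normal|Debug|Trace) ;;   # the values accepted by spec.logLevel
    *) echo "unsupported log level: $2" >&2; return 1 ;;
  esac
  cat > "$3" <<EOF
apiVersion: ${NRS_API_VERSION:-nodetopology.openshift.io/v1alpha1}
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "$1"
  logLevel: $2
EOF
}
```

For example, write_scheduler_manifest "$SCHEDULER_IMAGE" Debug nro-scheduler-debug.yaml, where SCHEDULER_IMAGE is a placeholder for the scheduler image used in your original deployment.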
Verification steps
Check that the NUMA-aware scheduler was successfully deployed:
Check that the CRD was created successfully by running the following command:
$ oc get crd | grep numaresourcesschedulers
Example output
numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check that the logs for the scheduler show the increased log level:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                               READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
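Because the name of the secondary scheduler pod changes each time the scheduler is redeployed, it can help to look the pod up instead of copying its name. A minimal sketch, assuming the pod name always starts with secondary-scheduler- as in the example output above; the function names are hypothetical:

```shell
# Hypothetical helpers; assume oc is on PATH and you are logged in.

# Prints the first pod whose name starts with secondary-scheduler-,
# in the pod/<name> form produced by -o name.
secondary_scheduler_pod() {
  oc get pods -n openshift-numaresources -o name \
    | grep '^pod/secondary-scheduler-' | head -n 1
}

# Prints the logs of that pod, or fails if no such pod exists.
secondary_scheduler_logs() {
  pod=$(secondary_scheduler_pod)
  [ -n "$pod" ] || { echo "no secondary-scheduler pod found" >&2; return 1; }
  oc logs "$pod" -n openshift-numaresources
}
```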
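6.6.2. Troubleshooting the resource topology exporter
Troubleshoot noderesourcetopologies objects that show unexpected results by inspecting the corresponding resource-topology-exporter logs.
It is recommended that NUMA resource topology exporter instances in the cluster are named for the nodes they refer to. For example, a worker node named worker should have a corresponding noderesourcetopologies object called worker.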
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
Example output
{"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
Get the label for the daemonset of interest by using the name value from the previous step:
$ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
Example output
{"name":"resource-topology"}
Get the pods that use the resource-topology label by running the following command:
$ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:
$ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
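The last two steps can be combined so that the exporter logs are fetched by node name. A minimal sketch, assuming the name=resource-topology label shown above and the standard spec.nodeName field selector for pods; the function name is hypothetical:

```shell
# Hypothetical helper: print the resource-topology-exporter logs for one node.
# Usage: rte_logs_for_node <node-name>
rte_logs_for_node() {
  # Select the exporter pod scheduled on the given node.
  pod=$(oc get pods -n openshift-numaresources -l name=resource-topology \
    --field-selector "spec.nodeName=$1" -o name | head -n 1)
  [ -n "$pod" ] || { echo "no resource-topology pod found on node $1" >&2; return 1; }
  oc logs -n openshift-numaresources -c resource-topology-exporter "$pod"
}
```

For example, rte_logs_for_node compute-1.example.com prints the logs of the exporter pod that runs on that node.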
6.6.3. Correcting a missing resource topology exporter config map
If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances the Operator is shown as active, but the logs of the resource topology exporter (RTE) daemonset pods show that the configuration for the RTE is missing, for example:
Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"
This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing the numaresourcesoperator-worker configmap:
$ oc get configmap
Example output
NAME                          DATA   AGE
0e2a6bd3.openshift-kni.io     0      6d21h
kube-root-ca.crt              1      6d21h
openshift-service-ca.crt      1      6d21h
topo-aware-scheduler-config   1      6d18h
In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap.
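To look for this message in RTE pod logs, the log output can be piped through a small filter. A minimal sketch; the function name is hypothetical, and the match relies only on the message text quoted above:

```shell
# Hypothetical helper: reads RTE log text on stdin and succeeds if it contains
# the missing-configuration message.
rte_config_missing() {
  grep -F "couldn't find configuration in" >/dev/null
}

# Example use (requires a cluster; <rte-pod> is a placeholder for a real pod name):
#   oc logs -n openshift-numaresources -c resource-topology-exporter <rte-pod> \
#     | rte_config_missing && echo "RTE configuration is missing"
```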
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Compare the values of spec.machineConfigPoolSelector.matchLabels in kubeletconfig and metadata.labels in the MachineConfigPool (mcp) worker CR by using the following commands:
Check the kubeletconfig labels by running the following command:
$ oc get kubeletconfig -o yaml
Example output
machineConfigPoolSelector:
  matchLabels:
    cnf-worker-tuning: enabled
Check the mcp labels by running the following command:
$ oc get mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.
Edit the MachineConfigPool CR to include the missing label, for example:
$ oc edit mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
  cnf-worker-tuning: enabled
Apply the label changes and wait for the cluster to apply the updated configuration.
Verification
Check that the missing numaresourcesoperator-worker configmap is applied:
$ oc get configmap
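As an alternative to editing the MachineConfigPool interactively, the label can be added with oc label, and the verification can be scripted. The following is a sketch under stated assumptions, not the documented procedure: the function names are hypothetical, the label key must not contain dots (they would need escaping in the jsonpath expression), and the config map is assumed to live in the openshift-numaresources namespace.

```shell
# Hypothetical helpers; assume oc is on PATH and you are logged in.

# Adds the label to the machine config pool unless it is already set.
# Usage: ensure_mcp_label <pool> <key> <value>
ensure_mcp_label() {
  current=$(oc get mcp "$1" -o jsonpath="{.metadata.labels.$2}")
  if [ "$current" = "$3" ]; then
    echo "label $2=$3 already present on $1"
  else
    oc label mcp "$1" "$2=$3" --overwrite
  fi
}

# Succeeds only if the numaresourcesoperator-worker config map exists.
# Assumption: the config map is in the openshift-numaresources namespace.
rte_configmap_present() {
  oc get configmap -n openshift-numaresources -o name \
    | grep -x 'configmap/numaresourcesoperator-worker' >/dev/null
}
```

For the example in this section, ensure_mcp_label worker cnf-worker-tuning enabled adds the missing label, and rte_configmap_present can be polled until the cluster has applied the updated configuration.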