7.5. Troubleshooting NUMA-aware scheduling
To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:
$ oc get crd | grep noderesourcetopologies
Example output
NAME                                          CREATED AT
noderesourcetopologies.topology.node.k8s.io   2022-01-18T08:28:06Z
Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
topo-aware-scheduler
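For reference, a NUMA-aware workload selects this scheduler through the schedulerName field in its pod template. The following is a minimal sketch of a Deployment that targets the scheduler name returned above; the deployment name, namespace, image, and resource values are illustrative assumptions only:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: numa-workload-example
  namespace: openshift-numaresources
spec:
  replicas: 1
  selector:
    matchLabels:
      app: numa-workload-example
  template:
    metadata:
      labels:
        app: numa-workload-example
    spec:
      schedulerName: topo-aware-scheduler   # must match the name reported by the previous command
      containers:
      - name: app
        image: registry.access.redhat.com/ubi9/ubi-minimal
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "2"
            memory: 256Mi
          limits:                            # requests equal limits so the pod gets guaranteed QoS
            cpu: "2"
            memory: 256Mi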
Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io
Example output
NAME                    AGE
compute-0.example.com   17h
compute-1.example.com   17h
Note: The number of nodes should equal the number of worker nodes that are configured in the machine config pool (mcp) worker definition. A command to cross-check this count is suggested after the example output below.
Verify the NUMA zone granularity for all schedulable nodes by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
Example output
apiVersion: v1
items:
- apiVersion: topology.node.k8s.io/v1
  kind: NodeResourceTopology
  metadata:
    annotations:
      k8stopoawareschedwg/rte-update: periodic
    creationTimestamp: "2022-06-16T08:55:38Z"
    generation: 63760
    name: worker-0
    resourceVersion: "8450223"
    uid: 8b77be46-08c0-4074-927b-d49361471590
  topologyPolicies:
  - SingleNUMANodeContainerLevel
  zones:
  - costs:
    - name: node-0
      value: 10
    - name: node-1
      value: 21
    name: node-0
    resources:
    - allocatable: "38"
      available: "38"
      capacity: "40"
      name: cpu
    - allocatable: "134217728"
      available: "134217728"
      capacity: "134217728"
      name: hugepages-2Mi
    - allocatable: "262352048128"
      available: "262352048128"
      capacity: "270107316224"
      name: memory
    - allocatable: "6442450944"
      available: "6442450944"
      capacity: "6442450944"
      name: hugepages-1Gi
    type: Node
  - costs:
    - name: node-0
      value: 21
    - name: node-1
      value: 10
    name: node-1
    resources:
    - allocatable: "268435456"
      available: "268435456"
      capacity: "268435456"
      name: hugepages-2Mi
    - allocatable: "269231067136"
      available: "269231067136"
      capacity: "270573244416"
      name: memory
    - allocatable: "40"
      available: "40"
      capacity: "40"
      name: cpu
    - allocatable: "1073741824"
      available: "1073741824"
      capacity: "1073741824"
      name: hugepages-1Gi
    type: Node
- apiVersion: topology.node.k8s.io/v1
  kind: NodeResourceTopology
  metadata:
    annotations:
      k8stopoawareschedwg/rte-update: periodic
    creationTimestamp: "2022-06-16T08:55:37Z"
    generation: 62061
    name: worker-1
    resourceVersion: "8450129"
    uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3
  topologyPolicies:
  - SingleNUMANodeContainerLevel
  zones:
  - costs:
    - name: node-0
      value: 10
    - name: node-1
      value: 21
    name: node-0
    resources:
    - allocatable: "38"
      available: "38"
      capacity: "40"
      name: cpu
    - allocatable: "6442450944"
      available: "6442450944"
      capacity: "6442450944"
      name: hugepages-1Gi
    - allocatable: "134217728"
      available: "134217728"
      capacity: "134217728"
      name: hugepages-2Mi
    - allocatable: "262391033856"
      available: "262391033856"
      capacity: "270146301952"
      name: memory
    type: Node
  - costs:
    - name: node-0
      value: 21
    - name: node-1
      value: 10
    name: node-1
    resources:
    - allocatable: "40"
      available: "40"
      capacity: "40"
      name: cpu
    - allocatable: "1073741824"
      available: "1073741824"
      capacity: "1073741824"
      name: hugepages-1Gi
    - allocatable: "268435456"
      available: "268435456"
      capacity: "268435456"
      name: hugepages-2Mi
    - allocatable: "269192085504"
      available: "269192085504"
      capacity: "270534262784"
      name: memory
    type: Node
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
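As mentioned in the note above, the number of noderesourcetopologies objects should match the number of worker nodes in the worker machine config pool. One way to cross-check that count, assuming the pool is named worker, is to read the machineCount field from the pool status:
$ oc get mcp worker -o jsonpath="{.status.machineCount}"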
7.5.1. Reporting more exact resource availability
Enable the cacheResyncPeriod specification to help the NUMA Resources Operator report more exact resource availability by monitoring pending resources on nodes and synchronizing this information in the scheduler cache at a defined interval. This also helps to minimize Topology Affinity Error errors caused by non-optimal scheduling decisions. The lower the interval, the greater the network load. The cacheResyncPeriod specification is disabled by default.
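Before you change this setting, it can be useful to confirm whether cacheResyncPeriod is currently set on the scheduler resource. A quick check, assuming the default resource name numaresourcesscheduler, might look like the following; an empty result means the field is not set:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o jsonpath="{.spec.cacheResyncPeriod}"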
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   92m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-cacheresync.yaml. This example sets the cache resync period to 5 seconds:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.16"
  cacheResyncPeriod: "5s" 1
1 - Enter an interval value in seconds for synchronization of the scheduler cache. A value of 5s is typical for most implementations.
Create the updated NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-cacheresync.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler is successfully deployed:
Check that the CRD is created successfully by running the following command:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                 CREATED AT
numaresourcesschedulers.nodetopology.openshift.io    2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check the logs for the scheduler:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                  2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                  2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc                1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
...
I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
7.5.2. Checking the NUMA-aware scheduler logs
Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.
To change the log level of the secondary scheduler, delete the running scheduler resource and redeploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.
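Before you delete and redeploy the scheduler, it can be useful to confirm the currently configured log level. A minimal check, assuming the default resource name numaresourcesscheduler, might be:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o jsonpath="{.spec.logLevel}"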
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   90m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.16"
  logLevel: Debug
Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-debug.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler is successfully deployed:
Check that the CRD is created successfully by running the following command:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                 CREATED AT
numaresourcesschedulers.nodetopology.openshift.io    2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check that the logs for the scheduler show the increased log level:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                  2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                  2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc                1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
...
I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
7.5.3. Troubleshooting the resource topology exporter
Troubleshoot noderesourcetopologies objects where unexpected results are occurring by checking the corresponding resource-topology-exporter logs.
It is recommended that NUMA resource topology exporter instances in the cluster are named for the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.
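To spot a naming mismatch quickly, you can list the worker nodes and the noderesourcetopologies objects side by side and compare the names. The following commands are one way to do this, assuming your worker nodes carry the standard node-role.kubernetes.io/worker label:
$ oc get nodes -l node-role.kubernetes.io/worker -o name
$ oc get noderesourcetopologies.topology.node.k8s.io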
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the daemon sets (daemonset) managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
Example output
{"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
Get the label for the daemonset of interest, using the value for name from the previous step:
$ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
Example output
{"name":"resource-topology"}
Get the pods using the resource-topology label by running the following command:
$ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:
$ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
Example output
I0221 13:38:18.334140       1 main.go:206] using sysinfo:
reservedCpus: 0,1
reservedMemory:
  "0": 1178599424
I0221 13:38:18.334370       1 main.go:67] === System information ===
I0221 13:38:18.334381       1 sysinfo.go:231] cpus: reserved "0-1"
I0221 13:38:18.334493       1 sysinfo.go:237] cpus: online "0-103"
I0221 13:38:18.546750       1 main.go:72] cpus: allocatable "2-103"
hugepages-1Gi:
  numa cell 0 -> 6
  numa cell 1 -> 1
hugepages-2Mi:
  numa cell 0 -> 64
  numa cell 1 -> 128
memory:
  numa cell 0 -> 45758Mi
  numa cell 1 -> 48372Mi
7.5.4. Correcting a missing resource topology exporter config map
If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances, the Operator is shown as active but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:
Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"
This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing the numaresourcesoperator-worker configmap custom resource (CR):
$ oc get configmap
Example output
NAME                          DATA   AGE
0e2a6bd3.openshift-kni.io     0      6d21h
kube-root-ca.crt              1      6d21h
openshift-service-ca.crt     1      6d21h
topo-aware-scheduler-config   1      6d18h
In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap CR.
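You can also query for the config map by name. Assuming that the Operator creates it in the openshift-numaresources namespace, a direct check might look like the following; a NotFound error confirms that the config map is missing:
$ oc get configmap numaresourcesoperator-worker -n openshift-numaresources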
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Compare the value of spec.machineConfigPoolSelector.matchLabels in the kubeletconfig and the value of metadata.labels in the MachineConfigPool (mcp) worker CR by using the following commands:
Check the kubeletconfig labels by running the following command:
$ oc get kubeletconfig -o yaml
Example output
machineConfigPoolSelector:
  matchLabels:
    cnf-worker-tuning: enabled
Check the mcp labels by running the following command:
$ oc get mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.
Edit the MachineConfigPool CR to include the missing label, for example:
$ oc edit mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
  cnf-worker-tuning: enabled
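As an alternative to editing the CR interactively, you can add the missing label with the oc label command. A sketch, assuming the label key and value shown above:
$ oc label mcp worker cnf-worker-tuning=enabled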
- Apply the label change and wait for the cluster to apply the updated configuration; one way to watch the rollout is suggested below.
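For example, you can monitor the worker machine config pool until the UPDATED column reports True. This is only a suggestion, not a required step:
$ oc get mcp worker --watch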
Verification
Check that the missing numaresourcesoperator-worker configmap CR is applied:
$ oc get configmap
Example output
NAME                           DATA   AGE
0e2a6bd3.openshift-kni.io      0      6d21h
kube-root-ca.crt               1      6d21h
numaresourcesoperator-worker   1      5m
openshift-service-ca.crt       1      6d21h
topo-aware-scheduler-config    1      6d18h
7.5.5. Collecting NUMA Resources Operator data
You can use the oc adm must-gather CLI command to collect information about your cluster, including features and objects associated with the NUMA Resources Operator.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
To collect NUMA Resources Operator data with must-gather, you must specify the NUMA Resources Operator must-gather image:
$ oc adm must-gather --image=registry.redhat.io/numaresources-must-gather/numaresources-must-gather-rhel9:v4.16
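If you want to write the collected data to a specific local directory rather than the default location, the oc adm must-gather command also accepts a --dest-dir option; the directory name here is only an example:
$ oc adm must-gather --image=registry.redhat.io/numaresources-must-gather/numaresources-must-gather-rhel9:v4.16 --dest-dir=must-gather-numa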