7.3. Checking that the recommended cluster configurations are applied
You can check that the cluster is running the correct configuration. The following procedure describes how to check the various configurations that you need to deploy a DU application in an OpenShift Container Platform 4.16 cluster.
Prerequisites
- You have deployed a cluster and tuned it for vDU workloads.
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Check that the default OperatorHub sources are disabled. Run the following command:
$ oc get operatorhub cluster -o yaml
Example output
spec:
  disableAllDefaultSources: true
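If you only need the flag value rather than the full spec, a jsonpath query returns it directly. This is a convenience variant of the command above, not part of the reference procedure:
$ oc get operatorhub cluster -o jsonpath='{.spec.disableAllDefaultSources}{"\n"}'
The expected output is true.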
Check that all required CatalogSource resources are annotated for workload partitioning (PreferredDuringScheduling) by running the following command:
$ oc get catalogsource -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.target\.workload\.openshift\.io/management}{"\n"}{end}'
Example output
certified-operators -- {"effect": "PreferredDuringScheduling"}
community-operators -- {"effect": "PreferredDuringScheduling"}
ran-operators  1
redhat-marketplace -- {"effect": "PreferredDuringScheduling"}
redhat-operators -- {"effect": "PreferredDuringScheduling"}
- 1
- CatalogSource resources that are not annotated are also returned. In this example, the ran-operators CatalogSource resource is not annotated and does not have the PreferredDuringScheduling annotation. A sketch for adding a missing annotation follows the note below.
Note: In a properly configured vDU cluster, only a single annotated catalog source is listed.
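If a required catalog source is missing the workload partitioning annotation, you can add it with oc annotate. The following is a minimal sketch; the catalog source name (ran-operators) and the openshift-marketplace namespace are assumptions that you should replace with your own values:
$ oc annotate catalogsource ran-operators -n openshift-marketplace \
    target.workload.openshift.io/management='{"effect": "PreferredDuringScheduling"}'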
Check that all applicable OpenShift Container Platform Operator namespaces are annotated for workload partitioning. This includes all Operators installed with core OpenShift Container Platform and the set of additional Operators included in the reference DU tuning configuration. Run the following command:
$ oc get namespaces -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.workload\.openshift\.io/allowed}{"\n"}{end}'
Example output
default --
openshift-apiserver -- management
openshift-apiserver-operator -- management
openshift-authentication -- management
openshift-authentication-operator -- management
Important: Additional Operators must not be annotated for workload partitioning. In the output from the previous command, additional Operators should be listed without any value on the right side of the -- separator.
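To list only the namespaces that carry the annotation, you can pipe the same jsonpath output through grep. This is a convenience sketch, not part of the reference check:
$ oc get namespaces -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.workload\.openshift\.io/allowed}{"\n"}{end}' | grep ' -- management$'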
Check that the ClusterLogging configuration is correct. Run the following commands:
Verify that the appropriate input and output logs are configured:
$ oc get -n openshift-logging ClusterLogForwarder instance -o yaml
Example output
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  creationTimestamp: "2022-07-19T21:51:41Z"
  generation: 1
  name: instance
  namespace: openshift-logging
  resourceVersion: "1030342"
  uid: 8c1a842d-80c5-447a-9150-40350bdf40f0
spec:
  inputs:
  - infrastructure: {}
    name: infra-logs
  outputs:
  - name: kafka-open
    type: kafka
    url: tcp://10.46.55.190:9092/test
  pipelines:
  - inputRefs:
    - audit
    name: audit-logs
    outputRefs:
    - kafka-open
  - inputRefs:
    - infrastructure
    name: infrastructure-logs
    outputRefs:
    - kafka-open
...
Check that the curation schedule is appropriate for your application:
$ oc get -n openshift-logging clusterloggings.logging.openshift.io instance -o yaml
Example output
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  creationTimestamp: "2022-07-07T18:22:56Z"
  generation: 1
  name: instance
  namespace: openshift-logging
  resourceVersion: "235796"
  uid: ef67b9b8-0e65-4a10-88ff-ec06922ea796
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  managementState: Managed
...
Check that the web console is disabled (managementState: Removed) by running the following command:
$ oc get consoles.operator.openshift.io cluster -o jsonpath="{ .spec.managementState }"
Example output
Removed
Check that chronyd is disabled on the cluster node by running the following command:
$ oc debug node/<node_name>
Check the status of chronyd on the node:
sh-4.4# chroot /host
sh-4.4# systemctl status chronyd
Example output
● chronyd.service - NTP client/server
    Loaded: loaded (/usr/lib/systemd/system/chronyd.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
      Docs: man:chronyd(8)
            man:chrony.conf(5)
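You can also run the same check non-interactively. A minimal sketch, assuming oc debug can schedule a debug pod on the node:
$ oc debug node/<node_name> -- chroot /host systemctl is-enabled chronyd
The expected output is disabled.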
Check that the PTP interface is successfully synchronized to the primary clock by using a remote shell connection to the linuxptp-daemon container and the PTP Management Client (pmc) tool:
Set the $PTP_POD_NAME variable with the name of the linuxptp-daemon pod by running the following command:
$ PTP_POD_NAME=$(oc get pods -n openshift-ptp -l app=linuxptp-daemon -o name)
Check the sync status of the PTP device by running the following command:
$ oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} pmc -u -f /var/run/ptp4l.0.config -b 0 'GET PORT_DATA_SET'
Example output
sending: GET PORT_DATA_SET
  3cecef.fffe.7a7020-1 seq 0 RESPONSE MANAGEMENT PORT_DATA_SET
    portIdentity             3cecef.fffe.7a7020-1
    portState                SLAVE
    logMinDelayReqInterval   -4
    peerMeanPathDelay        0
    logAnnounceInterval      1
    announceReceiptTimeout   3
    logSyncInterval          0
    delayMechanism           1
    logMinPdelayReqInterval  0
    versionNumber            2
  3cecef.fffe.7a7020-2 seq 0 RESPONSE MANAGEMENT PORT_DATA_SET
    portIdentity             3cecef.fffe.7a7020-2
    portState                LISTENING
    logMinDelayReqInterval   0
    peerMeanPathDelay        0
    logAnnounceInterval      1
    announceReceiptTimeout   3
    logSyncInterval          0
    delayMechanism           1
    logMinPdelayReqInterval  0
    versionNumber            2
Check the status of the PTP clock by running the following pmc command:
$ oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} pmc -u -f /var/run/ptp4l.0.config -b 0 'GET TIME_STATUS_NP'
Example output
sending: GET TIME_STATUS_NP
  3cecef.fffe.7a7020-0 seq 0 RESPONSE MANAGEMENT TIME_STATUS_NP
    master_offset              10  1
    ingress_time               1657275432697400530
    cumulativeScaledRateOffset +0.000000000
    scaledLastGmPhaseChange    0
    gmTimeBaseIndicator        0
    lastGmPhaseChange          0x0000'0000000000000000.0000
    gmPresent                  true  2
    gmIdentity                 3c2c30.ffff.670e00
Check that the expected master offset value corresponding to the value in /var/run/ptp4l.0.config is found in the linuxptp-daemon-container log:
$ oc logs $PTP_POD_NAME -n openshift-ptp -c linuxptp-daemon-container
Example output
phc2sys[56020.341]: [ptp4l.1.config] CLOCK_REALTIME phc offset  -1731092 s2 freq -1546242 delay    497
ptp4l[56020.390]: [ptp4l.1.config] master offset         -2 s2 freq   -5863 path delay       541
ptp4l[56020.390]: [ptp4l.0.config] master offset         -8 s2 freq  -10699 path delay       533
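To narrow the log to the most recent master offset samples, you can filter the same command with grep. This is a convenience sketch only:
$ oc logs $PTP_POD_NAME -n openshift-ptp -c linuxptp-daemon-container | grep 'master offset' | tail -n 5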
Check that the SR-IOV configuration is correct by running the following commands:
Check that the disableDrain value in the SriovOperatorConfig resource is set to true:
$ oc get sriovoperatorconfig -n openshift-sriov-network-operator default -o jsonpath="{.spec.disableDrain}{'\n'}"
Example output
true
Check that the SriovNetworkNodeState sync status is Succeeded by running the following command:
$ oc get SriovNetworkNodeStates -n openshift-sriov-network-operator -o jsonpath="{.items[*].status.syncStatus}{'\n'}"
Example output
Succeeded
Verify that the expected number and configuration of virtual functions (Vfs) under each interface configured for SR-IOV are present and correct in the .status.interfaces field. For example:
$ oc get SriovNetworkNodeStates -n openshift-sriov-network-operator -o yaml
Example output
apiVersion: v1
items:
- apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodeState
  ...
  status:
    interfaces:
    ...
    - Vfs:
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.0
        vendor: "8086"
        vfID: 0
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.1
        vendor: "8086"
        vfID: 1
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.2
        vendor: "8086"
        vfID: 2
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.3
        vendor: "8086"
        vfID: 3
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.4
        vendor: "8086"
        vfID: 4
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.5
        vendor: "8086"
        vfID: 5
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.6
        vendor: "8086"
        vfID: 6
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:3b:0a.7
        vendor: "8086"
        vfID: 7
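If you only want a quick per-interface VF count rather than the full YAML, a jsonpath query similar to the following can help. It assumes that a numVfs field is reported under .status.interfaces in your cluster version; adjust the field names if your output differs:
$ oc get SriovNetworkNodeStates -n openshift-sriov-network-operator -o jsonpath='{range .items[*].status.interfaces[*]}{.name}{" -- "}{.numVfs}{"\n"}{end}'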
Check that the cluster performance profile is correct. The cpu and hugepages sections vary depending on your hardware configuration. Run the following command:
$ oc get PerformanceProfile openshift-node-performance-profile -o yaml
Example output
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2022-07-19T21:51:31Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: openshift-node-performance-profile
  resourceVersion: "33558"
  uid: 217958c0-9122-4c62-9d4d-fdc27c31118c
spec:
  additionalKernelArgs:
  - idle=poll
  - rcupdate.rcu_normal_after_boot=0
  - efi=runtime
  cpu:
    isolated: 2-51,54-103
    reserved: 0-1,52-53
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
status:
  conditions:
  - lastHeartbeatTime: "2022-07-19T21:51:31Z"
    lastTransitionTime: "2022-07-19T21:51:31Z"
    status: "True"
    type: Available
  - lastHeartbeatTime: "2022-07-19T21:51:31Z"
    lastTransitionTime: "2022-07-19T21:51:31Z"
    status: "True"
    type: Upgradeable
  - lastHeartbeatTime: "2022-07-19T21:51:31Z"
    lastTransitionTime: "2022-07-19T21:51:31Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2022-07-19T21:51:31Z"
    lastTransitionTime: "2022-07-19T21:51:31Z"
    status: "False"
    type: Degraded
  runtimeClass: performance-openshift-node-performance-profile
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-openshift-node-performance-profile
Note: The CPU settings depend on the number of cores available on the server and should align with the workload partitioning settings. The hugepages configuration depends on the server and the application.
Check that the PerformanceProfile was successfully applied to the cluster by running the following command:
$ oc get performanceprofile openshift-node-performance-profile -o jsonpath="{range .status.conditions[*]}{ @.type }{' -- '}{@.status}{'\n'}{end}"
Example output
Available -- True
Upgradeable -- True
Progressing -- False
Degraded -- False
Check the Tuned performance patch settings by running the following command:
$ oc get tuneds.tuned.openshift.io -n openshift-cluster-node-tuning-operator performance-patch -o yaml
Example output
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  creationTimestamp: "2022-07-18T10:33:52Z"
  generation: 1
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
  resourceVersion: "34024"
  uid: f9799811-f744-4179-bf00-32d4436c08fd
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-openshift-node-performance-profile
      [bootloader]
      cmdline_crash=nohz_full=2-23,26-47 1
      [sysctl]
      kernel.timer_migration=1
      [scheduler]
      group.ice-ptp=0:f:10:*:ice-ptp.*
      [service]
      service.stalld=start,enable
      service.chronyd=stop,disable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: master
    priority: 19
    profile: performance-patch
- 1
- The CPU list in cmdline=nohz_full= varies depending on your hardware configuration.
Check that cluster networking diagnostics are disabled by running the following command:
$ oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.disableNetworkDiagnostics}'
Example output
true
Check that the Kubelet housekeeping interval is tuned to a slower rate. This is set in the containerMountNS machine config. Run the following command:
$ oc describe machineconfig container-mount-namespace-and-kubelet-conf-master | grep OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION
Example output
Environment="OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION=60s"
Check that Grafana and alertManagerMain are disabled and that the Prometheus retention period is set to 24h by running the following command:
$ oc get configmap cluster-monitoring-config -n openshift-monitoring -o jsonpath="{ .data.config\.yaml }"
Example output
grafana:
  enabled: false
alertmanagerMain:
  enabled: false
prometheusK8s:
  retention: 24h
Check that Grafana and alertManagerMain routes are not found in the cluster by using the following commands:
$ oc get route -n openshift-monitoring alertmanager-main
$ oc get route -n openshift-monitoring grafana
Both queries should return an Error from server (NotFound) message.
Check that there is a minimum of 4 reserved CPUs allocated for each of the PerformanceProfile, the Tuned performance patch, workload partitioning, and the kernel command line arguments by running the following command; a sketch for counting the CPUs in the returned range follows the note below:
$ oc get performanceprofile -o jsonpath="{ .items[0].spec.cpu.reserved }"
Example output
0-3
Note: Depending on your workload requirements, you might need to allocate additional reserved CPUs.
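If you want to count the reserved CPUs instead of reading the range by eye, a small shell sketch such as the following expands the list. It assumes the reserved field uses the usual comma-separated range syntax (for example, 0-1,52-53):
$ oc get performanceprofile -o jsonpath="{ .items[0].spec.cpu.reserved }" | tr ',' '\n' | awk -F- '{ n += ($2 == "" ? 1 : $2 - $1 + 1) } END { print n }'
The reference configuration expects a count of at least 4.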