19.6. 推荐的 vDU 应用程序工作负载的单节点 OpenShift 集群配置
使用以下引用信息,了解在集群中部署虚拟分布式单元 (vDU) 应用程序所需的单节点 OpenShift 配置。配置包括用于高性能工作负载的集群优化、启用工作负载分区以及最大程度减少安装后所需的重启数量。
其他资源
- 要手动部署单个集群,请参阅使用 ZTP 手动安装单节点 OpenShift 集群。
- 要使用 GitOps 零接触置备 (ZTP) 部署集群集合,请参阅使用 ZTP 部署边缘站点。
19.6.1. 在 OpenShift Container Platform 上运行低延迟应用程序
OpenShift Container Platform 通过使用几个技术和专用硬件设备,为在商业现成 (COTS) 硬件上运行的应用程序启用低延迟处理:
- RHCOS 的实时内核
- 确保以高度的进程确定性处理工作负载。
- CPU 隔离
- 避免 CPU 调度延迟并确保 CPU 容量一致可用。
- NUMA 感知拓扑管理
- 将内存和巨页与 CPU 和 PCI 设备对齐,以将容器内存和巨页固定到非统一内存访问(NUMA)节点。所有服务质量 (QoS) 类的 Pod 资源保留在同一个 NUMA 节点上。这可降低延迟并提高节点的性能。
- 巨页内存管理
- 使用巨页大小可减少访问页表所需的系统资源量,从而提高系统性能。
- 使用 PTP 进行精确计时同步
- 允许以子微秒的准确性在网络中的节点之间进行同步。
19.6.2. vDU 应用程序工作负载的推荐集群主机要求
运行 vDU 应用程序工作负载需要一个具有足够资源的裸机主机来运行 OpenShift Container Platform 服务和生产工作负载。
profile | vCPU | memory | Storage |
---|---|---|---|
最小值 | 4 到 8 个 vCPU 内核 | 32GB RAM | 120GB |
当未启用并发多线程 (SMT) 或超线程时,一个 vCPU 相当于一个物理内核。启用后,使用以下公式来计算对应的比率:
- (每个内核的线程数 x 内核数)x 插槽数 = vCPU
使用虚拟介质引导时,服务器必须具有基板管理控制器(BMC)。
19.6.3. 为低延迟和高性能配置主机固件
裸机主机需要在置备主机前配置固件。固件配置取决于您的特定硬件和安装的具体要求。
流程
-
将 UEFI/BIOS Boot Mode 设置为
UEFI
。 - 在主机引导顺序中,设置 Hard drive first。
为您的硬件应用特定的固件配置。下表描述了 Intel Xeon Skylake 或 Intel Cascade Lake 服务器的代表固件配置,它基于 Intel FlexRAN 4G 和 5G 基带 PHY 参考设计。
重要确切的固件配置取决于您的特定硬件和网络要求。以下示例配置仅用于说明目的。
表 19.6. Intel Xeon Skylake 或 Cascade Lake 服务器的固件配置示例 固件设置 Configuration CPU Power 和性能策略
性能
非核心频率扩展
Disabled
性能限制
Disabled
增强的 Intel SpeedStep ® Tech
Enabled
Intel 配置的 TDP
Enabled
可配置 TDP 级别
2 级
Intel® Turbo Boost Technology
Enabled
节能 Turbo
Disabled
硬件 P-State
Disabled
软件包 C-State
C0/C1 状态
C1E
Disabled
处理器 C6
Disabled
在主机的固件中启用全局 SR-IOV 和 VT-d 设置。这些设置与裸机环境相关。
19.6.4. 受管集群网络的连接先决条件
在安装并置备带有零接触置备 (ZTP) GitOps 管道的受管集群前,受管集群主机必须满足以下网络先决条件:
- hub 集群中的 ZTP GitOps 容器和目标裸机主机的 Baseboard Management Controller (BMC) 之间必须有双向连接。
受管集群必须能够解析和访问 hub 主机名和
*.apps
主机名的 API 主机名。以下是 hub 和*.apps
主机名的 API 主机名示例:-
api.hub-cluster.internal.domain.com
-
console-openshift-console.apps.hub-cluster.internal.domain.com
-
hub 集群必须能够解析并访问受管集群的 API 和
*.app
主机名。以下是受管集群的 API 主机名和*.apps
主机名示例:-
api.sno-managed-cluster-1.internal.domain.com
-
console-openshift-console.apps.sno-managed-cluster-1.internal.domain.com
-
19.6.5. 推荐的安装时集群配置
ZTP 管道在集群安装过程中应用以下自定义资源 (CR)。这些配置 CR 确保集群满足运行 vDU 应用程序所需的功能和性能要求。
当将 ZTP GitOps 插件和 SiteConfig
CR 用于集群部署时,默认包含以下 MachineConfig
CR。
使用 SiteConfig
extraManifests
过滤器更改默认包括的 CR。如需更多信息,请参阅使用 SiteConfig CR 的高级受管集群配置。
19.6.5.1. 工作负载分区
运行 DU 工作负载的单节点 OpenShift 集群需要工作负载分区。这限制了运行平台服务的内核数,从而最大程度提高应用程序有效负载的 CPU 内核。
工作负载分区只能在集群安装过程中启用。您不能在安装后禁用工作负载分区。但是,您可以通过更新您在性能配置集中定义的 cpu
值以及相关的 MachineConfig
自定义资源 (CR) 来重新配置工作负载分区。
启用工作负载分区的 base64 编码的 CR,它包含管理工作负载受限制的 CPU 集。为 base64 中的
crio.conf
和kubelet.conf
对特定于主机的值进行编码。调整内容以匹配集群性能配置集中指定的 CPU 集。它必须与集群主机中的内核数匹配。推荐的工作负载分区配置
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 02-master-workload-partitioning spec: config: ignition: version: 3.2.0 storage: files: - contents: source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMubWFuYWdlbWVudF0KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gInRhcmdldC53b3JrbG9hZC5vcGVuc2hpZnQuaW8vbWFuYWdlbWVudCIKYW5ub3RhdGlvbl9wcmVmaXggPSAicmVzb3VyY2VzLndvcmtsb2FkLm9wZW5zaGlmdC5pbyIKcmVzb3VyY2VzID0geyAiY3B1c2hhcmVzIiA9IDAsICJjcHVzZXQiID0gIjAtMSw1Mi01MyIgfQo= mode: 420 overwrite: true path: /etc/crio/crio.conf.d/01-workload-partitioning user: name: root - contents: source: data:text/plain;charset=utf-8;base64,ewogICJtYW5hZ2VtZW50IjogewogICAgImNwdXNldCI6ICIwLTEsNTItNTMiCiAgfQp9Cg== mode: 420 overwrite: true path: /etc/kubernetes/openshift-workload-pinning user: name: root
在集群主机上配置时,
/etc/crio/crio.conf.d/01-workload-partitioning
的内容应该类似如下:[crio.runtime.workloads.management] activation_annotation = "target.workload.openshift.io/management" annotation_prefix = "resources.workload.openshift.io" resources = { "cpushares" = 0, "cpuset" = "0-1,52-53" } 1
- 1
CPU
值因安装而异。
如果启用了超线程,请为每个内核指定两个线程。
CPU
值必须与性能配置集中指定的保留 CPU 设置匹配。在集群中配置时,
/etc/kubernetes/openshift-workload-pinning
的内容应如下所示:{ "management": { "cpuset": "0-1,52-53" 1 } }
- 1
cpuset
必须与/etc/crio/crio.conf.d/01-workload-partitioning
中的CPU
值匹配。
19.6.5.2. 减少平台管理占用空间
要减少平台的整体管理空间,需要一个 MachineConfig
自定义资源 (CR),它将所有特定于 Kubernetes 的挂载点放在独立于主机操作系统的新命名空间中。以下 base64 编码的示例 MachineConfig
CR 演示了此配置。
推荐的容器挂载命名空间配置
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: container-mount-namespace-and-kubelet-conf-master spec: config: ignition: version: 3.2.0 storage: files: - contents: source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKCmRlYnVnKCkgewogIGVjaG8gJEAgPiYyCn0KCnVzYWdlKCkgewogIGVjaG8gVXNhZ2U6ICQoYmFzZW5hbWUgJDApIFVOSVQgW2VudmZpbGUgW3Zhcm5hbWVdXQogIGVjaG8KICBlY2hvIEV4dHJhY3QgdGhlIGNvbnRlbnRzIG9mIHRoZSBmaXJzdCBFeGVjU3RhcnQgc3RhbnphIGZyb20gdGhlIGdpdmVuIHN5c3RlbWQgdW5pdCBhbmQgcmV0dXJuIGl0IHRvIHN0ZG91dAogIGVjaG8KICBlY2hvICJJZiAnZW52ZmlsZScgaXMgcHJvdmlkZWQsIHB1dCBpdCBpbiB0aGVyZSBpbnN0ZWFkLCBhcyBhbiBlbnZpcm9ubWVudCB2YXJpYWJsZSBuYW1lZCAndmFybmFtZSciCiAgZWNobyAiRGVmYXVsdCAndmFybmFtZScgaXMgRVhFQ1NUQVJUIGlmIG5vdCBzcGVjaWZpZWQiCiAgZXhpdCAxCn0KClVOSVQ9JDEKRU5WRklMRT0kMgpWQVJOQU1FPSQzCmlmIFtbIC16ICRVTklUIHx8ICRVTklUID09ICItLWhlbHAiIHx8ICRVTklUID09ICItaCIgXV07IHRoZW4KICB1c2FnZQpmaQpkZWJ1ZyAiRXh0cmFjdGluZyBFeGVjU3RhcnQgZnJvbSAkVU5JVCIKRklMRT0kKHN5c3RlbWN0bCBjYXQgJFVOSVQgfCBoZWFkIC1uIDEpCkZJTEU9JHtGSUxFI1wjIH0KaWYgW1sgISAtZiAkRklMRSBdXTsgdGhlbgogIGRlYnVnICJGYWlsZWQgdG8gZmluZCByb290IGZpbGUgZm9yIHVuaXQgJFVOSVQgKCRGSUxFKSIKICBleGl0CmZpCmRlYnVnICJTZXJ2aWNlIGRlZmluaXRpb24gaXMgaW4gJEZJTEUiCkVYRUNTVEFSVD0kKHNlZCAtbiAtZSAnL15FeGVjU3RhcnQ9LipcXCQvLC9bXlxcXSQvIHsgcy9eRXhlY1N0YXJ0PS8vOyBwIH0nIC1lICcvXkV4ZWNTdGFydD0uKlteXFxdJC8geyBzL15FeGVjU3RhcnQ9Ly87IHAgfScgJEZJTEUpCgppZiBbWyAkRU5WRklMRSBdXTsgdGhlbgogIFZBUk5BTUU9JHtWQVJOQU1FOi1FWEVDU1RBUlR9CiAgZWNobyAiJHtWQVJOQU1FfT0ke0VYRUNTVEFSVH0iID4gJEVOVkZJTEUKZWxzZQogIGVjaG8gJEVYRUNTVEFSVApmaQo= mode: 493 path: /usr/local/bin/extractExecStart - contents: source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKbnNlbnRlciAtLW1vdW50PS9ydW4vY29udGFpbmVyLW1vdW50LW5hbWVzcGFjZS9tbnQgIiRAIgo= mode: 493 path: /usr/local/bin/nsenterCmns systemd: units: - contents: | [Unit] Description=Manages a mount namespace that both kubelet and crio can use to share their container-specific mounts [Service] Type=oneshot RemainAfterExit=yes RuntimeDirectory=container-mount-namespace Environment=RUNTIME_DIRECTORY=%t/container-mount-namespace Environment=BIND_POINT=%t/container-mount-namespace/mnt ExecStartPre=bash -c "findmnt ${RUNTIME_DIRECTORY} || mount --make-unbindable --bind ${RUNTIME_DIRECTORY} ${RUNTIME_DIRECTORY}" ExecStartPre=touch ${BIND_POINT} ExecStart=unshare --mount=${BIND_POINT} --propagation slave mount --make-rshared / ExecStop=umount -R ${RUNTIME_DIRECTORY} enabled: true name: container-mount-namespace.service - dropins: - contents: | [Unit] Wants=container-mount-namespace.service After=container-mount-namespace.service [Service] ExecStartPre=/usr/local/bin/extractExecStart %n /%t/%N-execstart.env ORIG_EXECSTART EnvironmentFile=-/%t/%N-execstart.env ExecStart= ExecStart=bash -c "nsenter --mount=%t/container-mount-namespace/mnt \ ${ORIG_EXECSTART}" name: 90-container-mount-namespace.conf name: crio.service - dropins: - contents: | [Unit] Wants=container-mount-namespace.service After=container-mount-namespace.service [Service] ExecStartPre=/usr/local/bin/extractExecStart %n /%t/%N-execstart.env ORIG_EXECSTART EnvironmentFile=-/%t/%N-execstart.env ExecStart= ExecStart=bash -c "nsenter --mount=%t/container-mount-namespace/mnt \ ${ORIG_EXECSTART} --housekeeping-interval=30s" name: 90-container-mount-namespace.conf - contents: | [Service] Environment="OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION=60s" Environment="OPENSHIFT_EVICTION_MONITORING_PERIOD_DURATION=30s" name: 30-kubelet-interval-tuning.conf name: kubelet.service
19.6.5.3. SCTP
流控制传输协议 (SCTP) 是在 RAN 应用程序中使用的密钥协议。此 MachineConfig
对象向节点添加 SCTP 内核模块以启用此协议。
推荐的 SCTP 配置
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: load-sctp-module spec: config: ignition: version: 2.2.0 storage: files: - contents: source: data:, verification: {} filesystem: root mode: 420 path: /etc/modprobe.d/sctp-blacklist.conf - contents: source: data:text/plain;charset=utf-8,sctp filesystem: root mode: 420 path: /etc/modules-load.d/sctp-load.conf
19.6.5.4. 加速容器启动
以下 MachineConfig
CR 配置 OpenShift 核心进程和容器,以便在系统启动和关闭过程中使用所有可用的 CPU 内核。这会加快初始引导过程和重启过程中的系统恢复。
推荐的容器启动配置
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 04-accelerated-container-startup-master spec: config: ignition: version: 3.2.0 storage: files: - contents: source: data:text/plain;charset=utf-8;base64,#!/bin/bash
#
# Temporarily reset the core system processes's CPU affinity to be unrestricted to accelerate startup and shutdown
#
# The defaults below can be overridden via environment variables
#

# The default set of critical processes whose affinity should be temporarily unbound:
CRITICAL_PROCESSES=${CRITICAL_PROCESSES:-"systemd ovs crio kubelet NetworkManager conmon dbus"}

# Default wait time is 600s = 10m:
MAXIMUM_WAIT_TIME=${MAXIMUM_WAIT_TIME:-600}

# Default steady-state threshold = 2%
# Allowed values:
#  4  - absolute pod count (+/-)
#  4% - percent change (+/-)
#  -1 - disable the steady-state check
STEADY_STATE_THRESHOLD=${STEADY_STATE_THRESHOLD:-2%}

# Default steady-state window = 60s
# If the running pod count stays within the given threshold for this time
# period, return CPU utilization to normal before the maximum wait time has
# expires
STEADY_STATE_WINDOW=${STEADY_STATE_WINDOW:-60}

# Default steady-state allows any pod count to be "steady state"
# Increasing this will skip any steady-state checks until the count rises above
# this number to avoid false positives if there are some periods where the
# count doesn't increase but we know we can't be at steady-state yet.
STEADY_STATE_MINIMUM=${STEADY_STATE_MINIMUM:-0}

#######################################################

KUBELET_CPU_STATE=/var/lib/kubelet/cpu_manager_state
FULL_CPU_STATE=/sys/fs/cgroup/cpuset/cpuset.cpus
unrestrictedCpuset() {
  local cpus
  if [[ -e $KUBELET_CPU_STATE ]]; then
      cpus=$(jq -r '.defaultCpuSet' <$KUBELET_CPU_STATE)
  fi
  if [[ -z $cpus ]]; then
    # fall back to using all cpus if the kubelet state is not configured yet
    [[ -e $FULL_CPU_STATE ]] || return 1
    cpus=$(<$FULL_CPU_STATE)
  fi
  echo $cpus
}

restrictedCpuset() {
  for arg in $(</proc/cmdline); do
    if [[ $arg =~ ^systemd.cpu_affinity= ]]; then
      echo ${arg#*=}
      return 0
    fi
  done
  return 1
}

getCPUCount () {
  local cpuset="$1"
  local cpulist=()
  local cpus=0
  local mincpus=2

  if [[ -z $cpuset || $cpuset =~ [^0-9,-] ]]; then
    echo $mincpus
    return 1
  fi

  IFS=',' read -ra cpulist <<< $cpuset

  for elm in "${cpulist[@]}"; do
    if [[ $elm =~ ^[0-9]+$ ]]; then
      (( cpus++ ))
    elif [[ $elm =~ ^[0-9]+-[0-9]+$ ]]; then
      local low=0 high=0
      IFS='-' read low high <<< $elm
      (( cpus += high - low + 1 ))
    else
      echo $mincpus
      return 1
    fi
  done

  # Return a minimum of 2 cpus
  echo $(( cpus > $mincpus ? cpus : $mincpus ))
  return 0
}

resetOVSthreads () {
  local cpucount="$1"
  local curRevalidators=0
  local curHandlers=0
  local desiredRevalidators=0
  local desiredHandlers=0
  local rc=0

  curRevalidators=$(ps -Teo pid,tid,comm,cmd | grep -e revalidator | grep -c ovs-vswitchd)
  curHandlers=$(ps -Teo pid,tid,comm,cmd | grep -e handler | grep -c ovs-vswitchd)

  # Calculate the desired number of threads the same way OVS does.
  # OVS will set these thread count as a one shot process on startup, so we
  # have to adjust up or down during the boot up process. The desired outcome is
  # to not restrict the number of thread at startup until we reach a steady
  # state.  At which point we need to reset these based on our restricted  set
  # of cores.
  # See OVS function that calculates these thread counts:
  # https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L635
  (( desiredRevalidators=$cpucount / 4 + 1 ))
  (( desiredHandlers=$cpucount - $desiredRevalidators ))


  if [[ $curRevalidators -ne $desiredRevalidators || $curHandlers -ne $desiredHandlers ]]; then

    logger "Recovery: Re-setting OVS revalidator threads: ${curRevalidators} -> ${desiredRevalidators}"
    logger "Recovery: Re-setting OVS handler threads: ${curHandlers} -> ${desiredHandlers}"

    ovs-vsctl set \
      Open_vSwitch . \
      other-config:n-handler-threads=${desiredHandlers} \
      other-config:n-revalidator-threads=${desiredRevalidators}
    rc=$?
  fi

  return $rc
}

resetAffinity() {
  local cpuset="$1"
  local failcount=0
  local successcount=0
  logger "Recovery: Setting CPU affinity for critical processes \"$CRITICAL_PROCESSES\" to $cpuset"
  for proc in $CRITICAL_PROCESSES; do
    local pids="$(pgrep $proc)"
    for pid in $pids; do
      local tasksetOutput
      tasksetOutput="$(taskset -apc "$cpuset" $pid 2>&1)"
      if [[ $? -ne 0 ]]; then
        echo "ERROR: $tasksetOutput"
        ((failcount++))
      else
        ((successcount++))
      fi
    done
  done

  resetOVSthreads "$(getCPUCount ${cpuset})"
  if [[ $? -ne 0 ]]; then
    ((failcount++))
  else
    ((successcount++))
  fi

  logger "Recovery: Re-affined $successcount pids successfully"
  if [[ $failcount -gt 0 ]]; then
    logger "Recovery: Failed to re-affine $failcount processes"
    return 1
  fi
}

setUnrestricted() {
  logger "Recovery: Setting critical system processes to have unrestricted CPU access"
  resetAffinity "$(unrestrictedCpuset)"
}

setRestricted() {
  logger "Recovery: Resetting critical system processes back to normally restricted access"
  resetAffinity "$(restrictedCpuset)"
}

currentAffinity() {
  local pid="$1"
  taskset -pc $pid | awk -F': ' '{print $2}'
}

within() {
  local last=$1 current=$2 threshold=$3
  local delta=0 pchange
  delta=$(( current - last ))
  if [[ $current -eq $last ]]; then
    pchange=0
  elif [[ $last -eq 0 ]]; then
    pchange=1000000
  else
    pchange=$(( ( $delta * 100) / last ))
  fi
  echo -n "last:$last current:$current delta:$delta pchange:${pchange}%: "
  local absolute limit
  case $threshold in
    *%)
      absolute=${pchange##-} # absolute value
      limit=${threshold%%%}
      ;;
    *)
      absolute=${delta##-} # absolute value
      limit=$threshold
      ;;
  esac
  if [[ $absolute -le $limit ]]; then
    echo "within (+/-)$threshold"
    return 0
  else
    echo "outside (+/-)$threshold"
    return 1
  fi
}

steadystate() {
  local last=$1 current=$2
  if [[ $last -lt $STEADY_STATE_MINIMUM ]]; then
    echo "last:$last current:$current Waiting to reach $STEADY_STATE_MINIMUM before checking for steady-state"
    return 1
  fi
  within $last $current $STEADY_STATE_THRESHOLD
}

waitForReady() {
  logger "Recovery: Waiting ${MAXIMUM_WAIT_TIME}s for the initialization to complete"
  local lastSystemdCpuset="$(currentAffinity 1)"
  local lastDesiredCpuset="$(unrestrictedCpuset)"
  local t=0 s=10
  local lastCcount=0 ccount=0 steadyStateTime=0
  while [[ $t -lt $MAXIMUM_WAIT_TIME ]]; do
    sleep $s
    ((t += s))
    # Re-check the current affinity of systemd, in case some other process has changed it
    local systemdCpuset="$(currentAffinity 1)"
    # Re-check the unrestricted Cpuset, as the allowed set of unreserved cores may change as pods are assigned to cores
    local desiredCpuset="$(unrestrictedCpuset)"
    if [[ $systemdCpuset != $lastSystemdCpuset || $lastDesiredCpuset != $desiredCpuset ]]; then
      resetAffinity "$desiredCpuset"
      lastSystemdCpuset="$(currentAffinity 1)"
      lastDesiredCpuset="$desiredCpuset"
    fi

    # Detect steady-state pod count
    ccount=$(crictl ps | wc -l)
    if steadystate $lastCcount $ccount; then
      ((steadyStateTime += s))
      echo "Steady-state for ${steadyStateTime}s/${STEADY_STATE_WINDOW}s"
      if [[ $steadyStateTime -ge $STEADY_STATE_WINDOW ]]; then
        logger "Recovery: Steady-state (+/- $STEADY_STATE_THRESHOLD) for ${STEADY_STATE_WINDOW}s: Done"
        return 0
      fi
    else
      if [[ $steadyStateTime -gt 0 ]]; then
        echo "Resetting steady-state timer"
        steadyStateTime=0
      fi
    fi
    lastCcount=$ccount
  done
  logger "Recovery: Recovery Complete Timeout"
}

main() {
  if ! unrestrictedCpuset >&/dev/null; then
    logger "Recovery: No unrestricted Cpuset could be detected"
    return 1
  fi

  if ! restrictedCpuset >&/dev/null; then
    logger "Recovery: No restricted Cpuset has been configured.  We are already running unrestricted."
    return 0
  fi

  # Ensure we reset the CPU affinity when we exit this script for any reason
  # This way either after the timer expires or after the process is interrupted
  # via ^C or SIGTERM, we return things back to the way they should be.
  trap setRestricted EXIT

  logger "Recovery: Recovery Mode Starting"
  setUnrestricted
  waitForReady
}

if [[ "${BASH_SOURCE[0]}" = "${0}" ]]; then
  main "${@}"
  exit $?
fi
 mode: 493 path: /usr/local/bin/accelerated-container-startup.sh systemd: units: - contents: | [Unit] Description=Unlocks more CPUs for critical system processes during container startup [Service] Type=simple ExecStart=/usr/local/bin/accelerated-container-startup.sh # Maximum wait time is 600s = 10m: Environment=MAXIMUM_WAIT_TIME=600 # Steady-state threshold = 2% # Allowed values: # 4 - absolute pod count (+/-) # 4% - percent change (+/-) # -1 - disable the steady-state check # Note: '%' must be escaped as '%%' in systemd unit files Environment=STEADY_STATE_THRESHOLD=2%% # Steady-state window = 120s # If the running pod count stays within the given threshold for this time # period, return CPU utilization to normal before the maximum wait time has # expires Environment=STEADY_STATE_WINDOW=120 # Steady-state minimum = 40 # Increasing this will skip any steady-state checks until the count rises above # this number to avoid false positives if there are some periods where the # count doesn't increase but we know we can't be at steady-state yet. Environment=STEADY_STATE_MINIMUM=40 [Install] WantedBy=multi-user.target enabled: true name: accelerated-container-startup.service - contents: | [Unit] Description=Unlocks more CPUs for critical system processes during container shutdown DefaultDependencies=no [Service] Type=simple ExecStart=/usr/local/bin/accelerated-container-startup.sh # Maximum wait time is 600s = 10m: Environment=MAXIMUM_WAIT_TIME=600 # Steady-state threshold # Allowed values: # 4 - absolute pod count (+/-) # 4% - percent change (+/-) # -1 - disable the steady-state check # Note: '%' must be escaped as '%%' in systemd unit files Environment=STEADY_STATE_THRESHOLD=-1 # Steady-state window = 60s # If the running pod count stays within the given threshold for this time # period, return CPU utilization to normal before the maximum wait time has # expires Environment=STEADY_STATE_WINDOW=60 [Install] WantedBy=shutdown.target reboot.target halt.target enabled: true name: accelerated-container-shutdown.service
19.6.5.5. 使用 kdump 自动内核崩溃转储
kdump
是一个 Linux 内核功能,可在内核崩溃时创建内核崩溃转储。kdump
使用以下 MachineConfig
CR 启用:
推荐的 kdump 配置
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: master name: 06-kdump-enable-master spec: config: ignition: version: 3.2.0 systemd: units: - enabled: true name: kdump.service kernelArguments: - crashkernel=512M
19.6.6. 推荐的安装后集群配置
当集群安装完成后,ZTP 管道会应用运行 DU 工作负载所需的以下自定义资源 (CR)。
在 GitOps ZTP v4.10 及更早版本中,您可以使用 MachineConfig
CR 配置 UEFI 安全引导。GitOps ZTP v4.11 及更新的版本中不再需要。在 v4.11 中,您可以使用 Performance 配置集 CR 为单节点 OpenShift 集群配置 UEFI 安全引导。如需更多信息,请参阅性能配置集。
19.6.6.1. Operator 命名空间和 Operator 组
运行 DU 工作负载的单节点 OpenShift 集群需要以下 OperatorGroup
和 Namespace
自定义资源 (CR):
- Local Storage Operator
- Logging Operator
- PTP Operator
- Cluster Network Operator
以下 YAML 总结了这些 CR:
推荐的 Operator 命名空间和 OperatorGroup 配置
apiVersion: v1 kind: Namespace metadata: annotations: workload.openshift.io/allowed: management name: openshift-local-storage --- apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: openshift-local-storage namespace: openshift-local-storage spec: targetNamespaces: - openshift-local-storage --- apiVersion: v1 kind: Namespace metadata: annotations: workload.openshift.io/allowed: management name: openshift-logging --- apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: cluster-logging namespace: openshift-logging spec: targetNamespaces: - openshift-logging --- apiVersion: v1 kind: Namespace metadata: annotations: workload.openshift.io/allowed: management labels: openshift.io/cluster-monitoring: "true" name: openshift-ptp --- apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: ptp-operators namespace: openshift-ptp spec: targetNamespaces: - openshift-ptp --- apiVersion: v1 kind: Namespace metadata: annotations: workload.openshift.io/allowed: management name: openshift-sriov-network-operator --- apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: sriov-network-operators namespace: openshift-sriov-network-operator spec: targetNamespaces: - openshift-sriov-network-operator
19.6.6.2. Operator 订阅
运行 DU 工作负载的单节点 OpenShift 集群需要以下 Subscription
CR。订阅提供下载以下 Operator 的位置:
- Local Storage Operator
- Logging Operator
- PTP Operator
- Cluster Network Operator
推荐的 Operator 订阅
apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: cluster-logging namespace: openshift-logging spec: channel: "stable" 1 name: cluster-logging source: redhat-operators sourceNamespace: openshift-marketplace installPlanApproval: Manual 2 --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: local-storage-operator namespace: openshift-local-storage spec: channel: "stable" installPlanApproval: Automatic name: local-storage-operator source: redhat-operators sourceNamespace: openshift-marketplace installPlanApproval: Manual --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: ptp-operator-subscription namespace: openshift-ptp spec: channel: "stable" name: ptp-operator source: redhat-operators sourceNamespace: openshift-marketplace installPlanApproval: Manual --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: sriov-network-operator-subscription namespace: openshift-sriov-network-operator spec: channel: "stable" name: sriov-network-operator source: redhat-operators sourceNamespace: openshift-marketplace installPlanApproval: Manual
19.6.6.3. 集群日志记录和日志转发
运行 DU 工作负载的单节点 OpenShift 集群需要日志记录和日志转发以进行调试。以下示例 YAML 演示了所需的 ClusterLogging
和 ClusterLogForwarder
CR。
推荐的集群日志记录和日志转发配置
apiVersion: logging.openshift.io/v1 kind: ClusterLogging 1 metadata: name: instance namespace: openshift-logging spec: collection: logs: fluentd: {} type: fluentd curation: type: "curator" curator: schedule: "30 3 * * *" managementState: Managed --- apiVersion: logging.openshift.io/v1 kind: ClusterLogForwarder 2 metadata: name: instance namespace: openshift-logging spec: inputs: - infrastructure: {} name: infra-logs outputs: - name: kafka-open type: kafka url: tcp://10.46.55.190:9092/test 3 pipelines: - inputRefs: - audit name: audit-logs outputRefs: - kafka-open - inputRefs: - infrastructure name: infrastructure-logs outputRefs: - kafka-open
19.6.6.4. 性能配置集
运行 DU 工作负载的单节点 OpenShift 集群需要 Node Tuning Operator 性能配置集才能使用实时主机功能和服务。
在早期版本的 OpenShift Container Platform 中,Performance Addon Operator 用来实现自动性能优化,以便为 OpenShift 应用程序实现低延迟性能。在 OpenShift Container Platform 4.11 中,这些功能是 Node Tuning Operator 的一部分。
以下示例 PerformanceProfile
CR 演示了所需的集群配置。
推荐的性能配置集配置
apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: openshift-node-performance-profile 1 spec: additionalKernelArgs: - rcupdate.rcu_normal_after_boot=0 - "efi=runtime" 2 cpu: isolated: 2-51,54-103 3 reserved: 0-1,52-53 4 hugepages: defaultHugepagesSize: 1G pages: - count: 32 5 size: 1G 6 node: 1 7 machineConfigPoolSelector: pools.operator.machineconfiguration.openshift.io/master: "" nodeSelector: node-role.kubernetes.io/master: "" numa: topologyPolicy: "restricted" realTimeKernel: enabled: true 8
- 1
- 确保
name
的值与Tuned PerformancePatch.yaml
的spec.profile.data
字段中指定的值和validatorCR/informDuValidator.yaml
的status.configuration.source.name
字段匹配。 - 2 3
- 为集群主机配置 UEFI 安全引导。
- 4
- 设置隔离的 CPU。确保所有 Hyper-Threading 对都匹配。
- 5
- 设置保留的 CPU。启用工作负载分区时,系统进程、内核线程和系统容器线程仅限于这些 CPU。所有不是隔离的 CPU 都应保留。
- 6
- 设置巨页数量。
- 7
- 设置巨页大小。
- 8
- 将
enabled
设置为true
以安装实时 Linux 内核。
19.6.6.5. PTP
单节点 OpenShift 集群使用 Precision Time Protocol (PTP) 进行网络时间同步。以下示例 PtpConfig
CR 演示了所需的 PTP slave 配置。
推荐的 PTP 配置
apiVersion: ptp.openshift.io/v1
kind: PtpConfig
metadata:
name: du-ptp-slave
namespace: openshift-ptp
spec:
profile:
- interface: ens5f0 1
name: slave
phc2sysOpts: -a -r -n 24
ptp4lConf: |
[global]
#
# Default Data Set
#
twoStepFlag 1
slaveOnly 0
priority1 128
priority2 128
domainNumber 24
#utc_offset 37
clockClass 248
clockAccuracy 0xFE
offsetScaledLogVariance 0xFFFF
free_running 0
freq_est_interval 1
dscp_event 0
dscp_general 0
dataset_comparison ieee1588
G.8275.defaultDS.localPriority 128
#
# Port Data Set
#
logAnnounceInterval -3
logSyncInterval -4
logMinDelayReqInterval -4
logMinPdelayReqInterval -4
announceReceiptTimeout 3
syncReceiptTimeout 0
delayAsymmetry 0
fault_reset_interval 4
neighborPropDelayThresh 20000000
masterOnly 0
G.8275.portDS.localPriority 128
#
# Run time options
#
assume_two_step 0
logging_level 6
path_trace_enabled 0
follow_up_info 0
hybrid_e2e 0
inhibit_multicast_service 0
net_sync_monitor 0
tc_spanning_tree 0
tx_timestamp_timeout 1
unicast_listen 0
unicast_master_table 0
unicast_req_duration 3600
use_syslog 1
verbose 0
summary_interval 0
kernel_leap 1
check_fup_sync 0
#
# Servo Options
#
pi_proportional_const 0.0
pi_integral_const 0.0
pi_proportional_scale 0.0
pi_proportional_exponent -0.3
pi_proportional_norm_max 0.7
pi_integral_scale 0.0
pi_integral_exponent 0.4
pi_integral_norm_max 0.3
step_threshold 2.0
first_step_threshold 0.00002
max_frequency 900000000
clock_servo pi
sanity_freq_limit 200000000
ntpshm_segment 0
#
# Transport options
#
transportSpecific 0x0
ptp_dst_mac 01:1B:19:00:00:00
p2p_dst_mac 01:80:C2:00:00:0E
udp_ttl 1
udp6_scope 0x0E
uds_address /var/run/ptp4l
#
# Default interface options
#
clock_type OC
network_transport L2
delay_mechanism E2E
time_stamping hardware
tsproc_mode filter
delay_filter moving_median
delay_filter_length 10
egressLatency 0
ingressLatency 0
boundary_clock_jbod 0
#
# Clock description
#
productDescription ;;
revisionData ;;
manufacturerIdentity 00:00:00
userDescription ;
timeSource 0xA0
ptp4lOpts: -2 -s --summary_interval -4
recommend:
- match:
- nodeLabel: node-role.kubernetes.io/master
priority: 4
profile: slave
- 1
- 设置用于接收 PTP 时钟信号的接口。
19.6.6.6. 扩展的 Tuned 配置集
运行 DU 工作负载的单节点 OpenShift 集群需要额外的高性能工作负载所需的性能调优配置。以下 Tuned
CR 示例扩展了 Tuned
配置集:
推荐的扩展 Tuned 配置集配置
apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: performance-patch namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=Configuration changes profile inherited from performance created tuned include=openshift-node-performance-openshift-node-performance-profile [bootloader] cmdline_crash=nohz_full=2-51,54-103 [sysctl] kernel.timer_migration=1 [scheduler] group.ice-ptp=0:f:10:*:ice-ptp.* [service] service.stalld=start,enable service.chronyd=stop,disable name: performance-patch recommend: - machineConfigLabels: machineconfiguration.openshift.io/role: master priority: 19 profile: performance-patch
19.6.6.7. SR-IOV
单根 I/O 虚拟化 (SR-IOV) 通常用于启用前端和中间网络。以下 YAML 示例为单节点 OpenShift 集群配置 SR-IOV。
推荐的 SR-IOV 配置
apiVersion: sriovnetwork.openshift.io/v1 kind: SriovOperatorConfig metadata: name: default namespace: openshift-sriov-network-operator spec: configDaemonNodeSelector: node-role.kubernetes.io/master: "" disableDrain: true enableInjector: true enableOperatorWebhook: true --- apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetwork metadata: name: sriov-nw-du-mh namespace: openshift-sriov-network-operator spec: networkNamespace: openshift-sriov-network-operator resourceName: du_mh vlan: 150 1 --- apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetworkNodePolicy metadata: name: sriov-nnp-du-mh namespace: openshift-sriov-network-operator spec: deviceType: vfio-pci 2 isRdma: false nicSelector: pfNames: - ens7f0 3 nodeSelector: node-role.kubernetes.io/master: "" numVfs: 8 4 priority: 10 resourceName: du_mh --- apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetwork metadata: name: sriov-nw-du-fh namespace: openshift-sriov-network-operator spec: networkNamespace: openshift-sriov-network-operator resourceName: du_fh vlan: 140 5 --- apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetworkNodePolicy metadata: name: sriov-nnp-du-fh namespace: openshift-sriov-network-operator spec: deviceType: netdevice 6 isRdma: true nicSelector: pfNames: - ens5f0 7 nodeSelector: node-role.kubernetes.io/master: "" numVfs: 8 8 priority: 10 resourceName: du_fh
19.6.6.8. Console Operator
console-operator 在集群中安装并维护 web 控制台。当节点被集中管理时,不需要 Operator,并为应用程序工作负载腾出空间。以下 Console
自定义资源 (CR) 示例禁用控制台。
推荐的控制台配置
apiVersion: operator.openshift.io/v1 kind: Console metadata: annotations: include.release.openshift.io/ibm-cloud-managed: "false" include.release.openshift.io/self-managed-high-availability: "false" include.release.openshift.io/single-node-developer: "false" release.openshift.io/create-only: "true" name: cluster spec: logLevel: Normal managementState: Removed operatorLogLevel: Normal
19.6.6.9. Grafana 和 Alertmanager
运行 DU 工作负载的单节点 OpenShift 集群需要减少 OpenShift Container Platform 监控组件所消耗的 CPU 资源。以下 ConfigMap
自定义资源 (CR) 禁用 Grafana 和 Alertmanager。
推荐的集群监控配置
apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | grafana: enabled: false alertmanagerMain: enabled: false prometheusK8s: retention: 24h
19.6.6.10. 网络诊断
运行 DU 工作负载的单节点 OpenShift 集群需要较少的 pod 网络连接检查,以减少这些 pod 创建的额外负载。以下自定义资源 (CR) 禁用这些检查。
推荐的网络诊断配置
apiVersion: operator.openshift.io/v1 kind: Network metadata: name: cluster spec: disableNetworkDiagnostics: true
其他资源