19.6. Recommended single-node OpenShift cluster configuration for vDU application workloads


Use the following reference information to understand the single-node OpenShift configurations required to deploy virtual distributed unit (vDU) applications in the cluster. Configurations include cluster optimizations for high performance workloads, enabling workload partitioning, and minimizing the number of reboots required post-installation.

19.6.1. Running low latency applications on OpenShift Container Platform

OpenShift Container Platform enables low latency processing for applications running on commercial off-the-shelf (COTS) hardware by using several technologies and specialized hardware devices:

Real-time kernel for RHCOS
Ensures workloads are handled with a high degree of process determinism.
CPU isolation
Avoids CPU scheduling delays and ensures CPU capacity is available consistently.
NUMA-aware topology management
Aligns memory and huge pages with CPU and PCI devices to pin container memory and huge pages to the non-uniform memory access (NUMA) node. Pod resources for all Quality of Service (QoS) classes stay on the same NUMA node. This decreases latency and improves performance of the node.
Huge pages memory management
Using huge page sizes improves system performance by reducing the amount of system resources required to access page tables.
Precision timing synchronization using PTP
Allows synchronization between nodes in the network with sub-microsecond accuracy.

19.6.2. Recommended cluster host requirements for vDU application workloads

Running vDU application workloads requires a bare-metal host with sufficient resources to run OpenShift Container Platform services and production workloads.

Table 19.5. Minimum resource requirements

Profile | vCPU              | Memory      | Storage
Minimum | 4 to 8 vCPU cores | 32GB of RAM | 120GB

Note

One vCPU is equivalent to one physical core when simultaneous multithreading (SMT), or hyperthreading, is not enabled. When enabled, use the following formula to calculate the corresponding ratio:

  • (threads per core x cores) x sockets = vCPUs
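
For example, a server with two threads per core, 26 cores per socket, and two sockets provides (2 x 26) x 2 = 104 vCPUs. The following sketch computes the count on a running host, assuming the English-locale field names printed by lscpu:

# Compute the vCPU count from SMT threads, cores, and sockets.
threads=$(lscpu | awk -F: '/^Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')
cores=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')
sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /,"",$2); print $2}')
echo "vCPUs: $(( threads * cores * sockets ))"
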
Important

The server must have a Baseboard Management Controller (BMC) when booting with virtual media.

19.6.3. Configuring host firmware for low latency and high performance

Bare-metal hosts require the firmware to be configured before the host can be provisioned. The firmware configuration depends on your specific hardware and the particular requirements of your installation.

Procedure

  1. Set the UEFI/BIOS Boot Mode to UEFI.
  2. In the host boot sequence order, set Hard drive first.
  3. Apply the specific firmware configuration for your hardware. The following table describes a representative firmware configuration for an Intel Xeon Skylake or Intel Cascade Lake server, based on the Intel FlexRAN 4G and 5G baseband PHY reference design.

    Important

    The exact firmware configuration depends on your specific hardware and network requirements. The following sample configuration is for illustrative purposes only.

    Table 19.6. Sample firmware configuration for an Intel Xeon Skylake or Cascade Lake server

    Firmware setting                  | Configuration
    CPU Power and Performance Policy  | Performance
    Uncore Frequency Scaling          | Disabled
    Performance P-limit               | Disabled
    Enhanced Intel SpeedStep ® Tech   | Enabled
    Intel Configurable TDP            | Enabled
    Configurable TDP Level            | Level 2
    Intel® Turbo Boost Technology     | Enabled
    Energy Efficient Turbo            | Disabled
    Hardware P-States                 | Disabled
    Package C-State                   | C0/C1 state
    C1E                               | Disabled
    Processor C6                      | Disabled

Note

Enable global SR-IOV and VT-d settings in the firmware for the host. These settings are relevant to bare-metal environments.

19.6.4. Connectivity prerequisites for managed cluster networks

Before you can install and provision a managed cluster with the zero touch provisioning (ZTP) GitOps pipeline, the managed cluster host must meet the following networking prerequisites:

  • There must be bi-directional connectivity between the ZTP GitOps container in the hub cluster and the Baseboard Management Controller (BMC) of the target bare-metal host.
  • The managed cluster must be able to resolve and reach the API hostname of the hub and the *.apps hostname of the hub. Here is an example of the API hostname of the hub and the *.apps hostname:

    • api.hub-cluster.internal.domain.com
    • console-openshift-console.apps.hub-cluster.internal.domain.com
  • The hub cluster must be able to resolve and reach the API and *.apps hostname of the managed cluster. Here is an example of the API hostname of the managed cluster and the *.apps hostname:

    • api.sno-managed-cluster-1.internal.domain.com
    • console-openshift-console.apps.sno-managed-cluster-1.internal.domain.com

19.6.5. Recommended cluster configurations at installation time

The ZTP pipeline applies the following custom resources (CRs) during cluster installation. These configuration CRs ensure that the cluster meets the feature and performance requirements necessary for running a vDU application.

Note

When you use the ZTP GitOps plugin and SiteConfig CRs for cluster deployment, the following MachineConfig CRs are included by default.

Use the SiteConfig extraManifests filter to alter the CRs that are included by default. For more information, see Advanced managed cluster configuration with SiteConfig CRs.

19.6.5.1. Workload partitioning

Single-node OpenShift clusters that run DU workloads require workload partitioning. This limits the cores allowed to run platform services, maximizing the CPU cores available for application payloads.

Note

Workload partitioning can be enabled only during cluster installation. You cannot disable workload partitioning post-installation. However, you can reconfigure workload partitioning by updating the cpu value that you define in the performance profile, and in the related MachineConfig custom resource (CR).

  • The base64-encoded CR that enables workload partitioning contains the CPU set that the management workloads are constrained to. Encode host-specific values for crio.conf and kubelet.conf in base64. Adjust the content to match the CPU set that is specified in the cluster performance profile. It must match the number of cores in the cluster host.

    Recommended workload partitioning configuration

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: master
      name: 02-master-workload-partitioning
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
          - contents:
              source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMubWFuYWdlbWVudF0KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gInRhcmdldC53b3JrbG9hZC5vcGVuc2hpZnQuaW8vbWFuYWdlbWVudCIKYW5ub3RhdGlvbl9wcmVmaXggPSAicmVzb3VyY2VzLndvcmtsb2FkLm9wZW5zaGlmdC5pbyIKcmVzb3VyY2VzID0geyAiY3B1c2hhcmVzIiA9IDAsICJjcHVzZXQiID0gIjAtMSw1Mi01MyIgfQo=
            mode: 420
            overwrite: true
            path: /etc/crio/crio.conf.d/01-workload-partitioning
            user:
              name: root
          - contents:
              source: data:text/plain;charset=utf-8;base64,ewogICJtYW5hZ2VtZW50IjogewogICAgImNwdXNldCI6ICIwLTEsNTItNTMiCiAgfQp9Cg==
            mode: 420
            overwrite: true
            path: /etc/kubernetes/openshift-workload-pinning
            user:
              name: root

  • When configured in the cluster host, the contents of /etc/crio/crio.conf.d/01-workload-partitioning should look like this:

    [crio.runtime.workloads.management]
    activation_annotation = "target.workload.openshift.io/management"
    annotation_prefix = "resources.workload.openshift.io"
    resources = { "cpushares" = 0, "cpuset" = "0-1,52-53" } 1
    1
    The cpus value varies based on your installation.

    If hyperthreading is enabled, specify both threads of each core. The cpus value must match the reserved CPU set specified in the performance profile.

  • When configured in the cluster, the contents of /etc/kubernetes/openshift-workload-pinning should look like this:

    {
      "management": {
        "cpuset": "0-1,52-53" 1
      }
    }
    1
    The cpuset must match the cpus value specified in /etc/crio/crio.conf.d/01-workload-partitioning.
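
  • To generate the base64-encoded payloads for a different reserved CPU set, encode the plain-text file contents directly. The following is a minimal sketch, assuming the two snippets shown above are saved locally under hypothetical file names:

    # Encode host-specific contents for the MachineConfig source fields.
    # The file names are placeholders for the snippets shown above.
    base64 -w0 01-workload-partitioning
    base64 -w0 openshift-workload-pinning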

19.6.5.2. Reduced platform management footprint

To reduce the overall management footprint of the platform, a MachineConfig custom resource (CR) is required that places all Kubernetes-specific mount points in a new namespace separate from the host operating system. The following base64-encoded example MachineConfig CR illustrates this configuration.

Recommended container mount namespace configuration

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: container-mount-namespace-and-kubelet-conf-master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKCmRlYnVnKCkgewogIGVjaG8gJEAgPiYyCn0KCnVzYWdlKCkgewogIGVjaG8gVXNhZ2U6ICQoYmFzZW5hbWUgJDApIFVOSVQgW2VudmZpbGUgW3Zhcm5hbWVdXQogIGVjaG8KICBlY2hvIEV4dHJhY3QgdGhlIGNvbnRlbnRzIG9mIHRoZSBmaXJzdCBFeGVjU3RhcnQgc3RhbnphIGZyb20gdGhlIGdpdmVuIHN5c3RlbWQgdW5pdCBhbmQgcmV0dXJuIGl0IHRvIHN0ZG91dAogIGVjaG8KICBlY2hvICJJZiAnZW52ZmlsZScgaXMgcHJvdmlkZWQsIHB1dCBpdCBpbiB0aGVyZSBpbnN0ZWFkLCBhcyBhbiBlbnZpcm9ubWVudCB2YXJpYWJsZSBuYW1lZCAndmFybmFtZSciCiAgZWNobyAiRGVmYXVsdCAndmFybmFtZScgaXMgRVhFQ1NUQVJUIGlmIG5vdCBzcGVjaWZpZWQiCiAgZXhpdCAxCn0KClVOSVQ9JDEKRU5WRklMRT0kMgpWQVJOQU1FPSQzCmlmIFtbIC16ICRVTklUIHx8ICRVTklUID09ICItLWhlbHAiIHx8ICRVTklUID09ICItaCIgXV07IHRoZW4KICB1c2FnZQpmaQpkZWJ1ZyAiRXh0cmFjdGluZyBFeGVjU3RhcnQgZnJvbSAkVU5JVCIKRklMRT0kKHN5c3RlbWN0bCBjYXQgJFVOSVQgfCBoZWFkIC1uIDEpCkZJTEU9JHtGSUxFI1wjIH0KaWYgW1sgISAtZiAkRklMRSBdXTsgdGhlbgogIGRlYnVnICJGYWlsZWQgdG8gZmluZCByb290IGZpbGUgZm9yIHVuaXQgJFVOSVQgKCRGSUxFKSIKICBleGl0CmZpCmRlYnVnICJTZXJ2aWNlIGRlZmluaXRpb24gaXMgaW4gJEZJTEUiCkVYRUNTVEFSVD0kKHNlZCAtbiAtZSAnL15FeGVjU3RhcnQ9LipcXCQvLC9bXlxcXSQvIHsgcy9eRXhlY1N0YXJ0PS8vOyBwIH0nIC1lICcvXkV4ZWNTdGFydD0uKlteXFxdJC8geyBzL15FeGVjU3RhcnQ9Ly87IHAgfScgJEZJTEUpCgppZiBbWyAkRU5WRklMRSBdXTsgdGhlbgogIFZBUk5BTUU9JHtWQVJOQU1FOi1FWEVDU1RBUlR9CiAgZWNobyAiJHtWQVJOQU1FfT0ke0VYRUNTVEFSVH0iID4gJEVOVkZJTEUKZWxzZQogIGVjaG8gJEVYRUNTVEFSVApmaQo=
        mode: 493
        path: /usr/local/bin/extractExecStart
      - contents:
          source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKbnNlbnRlciAtLW1vdW50PS9ydW4vY29udGFpbmVyLW1vdW50LW5hbWVzcGFjZS9tbnQgIiRAIgo=
        mode: 493
        path: /usr/local/bin/nsenterCmns
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Manages a mount namespace that both kubelet and crio can use to share their container-specific mounts

          [Service]
          Type=oneshot
          RemainAfterExit=yes
          RuntimeDirectory=container-mount-namespace
          Environment=RUNTIME_DIRECTORY=%t/container-mount-namespace
          Environment=BIND_POINT=%t/container-mount-namespace/mnt
          ExecStartPre=bash -c "findmnt ${RUNTIME_DIRECTORY} || mount --make-unbindable --bind ${RUNTIME_DIRECTORY} ${RUNTIME_DIRECTORY}"
          ExecStartPre=touch ${BIND_POINT}
          ExecStart=unshare --mount=${BIND_POINT} --propagation slave mount --make-rshared /
          ExecStop=umount -R ${RUNTIME_DIRECTORY}
        enabled: true
        name: container-mount-namespace.service
      - dropins:
        - contents: |
            [Unit]
            Wants=container-mount-namespace.service
            After=container-mount-namespace.service

            [Service]
            ExecStartPre=/usr/local/bin/extractExecStart %n /%t/%N-execstart.env ORIG_EXECSTART
            EnvironmentFile=-/%t/%N-execstart.env
            ExecStart=
            ExecStart=bash -c "nsenter --mount=%t/container-mount-namespace/mnt \
                ${ORIG_EXECSTART}"
          name: 90-container-mount-namespace.conf
        name: crio.service
      - dropins:
        - contents: |
            [Unit]
            Wants=container-mount-namespace.service
            After=container-mount-namespace.service

            [Service]
            ExecStartPre=/usr/local/bin/extractExecStart %n /%t/%N-execstart.env ORIG_EXECSTART
            EnvironmentFile=-/%t/%N-execstart.env
            ExecStart=
            ExecStart=bash -c "nsenter --mount=%t/container-mount-namespace/mnt \
                ${ORIG_EXECSTART} --housekeeping-interval=30s"
          name: 90-container-mount-namespace.conf
        - contents: |
            [Service]
            Environment="OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION=60s"
            Environment="OPENSHIFT_EVICTION_MONITORING_PERIOD_DURATION=30s"
          name: 30-kubelet-interval-tuning.conf
        name: kubelet.service
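
After this MachineConfig is applied, you can spot-check that kubelet and CRI-O run in the shared mount namespace created by the service. A minimal verification sketch, assuming oc debug access to the node (the node name is a placeholder):

# Compare the mount namespaces of PID 1, kubelet, and crio: the two
# container runtimes should share a namespace distinct from PID 1.
oc debug node/<node> -- chroot /host sh -c '
  systemctl is-active container-mount-namespace.service
  readlink /proc/1/ns/mnt
  readlink /proc/$(pgrep -x kubelet)/ns/mnt
  readlink /proc/$(pgrep -x crio)/ns/mnt'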

19.6.5.3. SCTP

Stream Control Transmission Protocol (SCTP) is a key protocol used in RAN applications. This MachineConfig object adds the SCTP kernel module to the node to enable this protocol.

Recommended SCTP configuration

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: load-sctp-module
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:,
            verification: {}
          filesystem: root
          mode: 420
          path: /etc/modprobe.d/sctp-blacklist.conf
        - contents:
            source: data:text/plain;charset=utf-8,sctp
          filesystem: root
          mode: 420
          path: /etc/modules-load.d/sctp-load.conf
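
After the host reboots with this configuration, you can confirm that the SCTP module is available. A minimal sketch, assuming oc debug access to the node (the node name is a placeholder):

# Load the sctp module and confirm it is present in the kernel.
oc debug node/<node> -- chroot /host sh -c 'modprobe sctp && lsmod | grep sctp'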

19.6.5.4. Accelerated container startup

The following MachineConfig CR configures core OpenShift processes and containers to use all available CPU cores during system startup and shutdown. This accelerates the system recovery during initial boot and reboots.

Recommended accelerated container startup configuration

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 04-accelerated-container-startup-master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,#!/bin/bash
#
# Temporarily reset the core system processes' CPU affinity to be unrestricted to accelerate startup and shutdown
#
# The defaults below can be overridden via environment variables
#

# The default set of critical processes whose affinity should be temporarily unbound:
CRITICAL_PROCESSES=${CRITICAL_PROCESSES:-"systemd ovs crio kubelet NetworkManager conmon dbus"}

# Default wait time is 600s = 10m:
MAXIMUM_WAIT_TIME=${MAXIMUM_WAIT_TIME:-600}

# Default steady-state threshold = 2%
# Allowed values:
#  4  - absolute pod count (+/-)
#  4% - percent change (+/-)
#  -1 - disable the steady-state check
STEADY_STATE_THRESHOLD=${STEADY_STATE_THRESHOLD:-2%}

# Default steady-state window = 60s
# If the running pod count stays within the given threshold for this time
# period, return CPU utilization to normal before the maximum wait time
# expires
STEADY_STATE_WINDOW=${STEADY_STATE_WINDOW:-60}

# Default steady-state allows any pod count to be "steady state"
# Increasing this will skip any steady-state checks until the count rises above
# this number to avoid false positives if there are some periods where the
# count doesn't increase but we know we can't be at steady-state yet.
STEADY_STATE_MINIMUM=${STEADY_STATE_MINIMUM:-0}

#######################################################

KUBELET_CPU_STATE=/var/lib/kubelet/cpu_manager_state
FULL_CPU_STATE=/sys/fs/cgroup/cpuset/cpuset.cpus
unrestrictedCpuset() {
  local cpus
  if [[ -e $KUBELET_CPU_STATE ]]; then
      cpus=$(jq -r '.defaultCpuSet' <$KUBELET_CPU_STATE)
  fi
  if [[ -z $cpus ]]; then
    # fall back to using all cpus if the kubelet state is not configured yet
    [[ -e $FULL_CPU_STATE ]] || return 1
    cpus=$(<$FULL_CPU_STATE)
  fi
  echo $cpus
}

restrictedCpuset() {
  for arg in $(</proc/cmdline); do
    if [[ $arg =~ ^systemd.cpu_affinity= ]]; then
      echo ${arg#*=}
      return 0
    fi
  done
  return 1
}

getCPUCount () {
  local cpuset="$1"
  local cpulist=()
  local cpus=0
  local mincpus=2

  if [[ -z $cpuset || $cpuset =~ [^0-9,-] ]]; then
    echo $mincpus
    return 1
  fi

  IFS=',' read -ra cpulist <<< $cpuset

  for elm in "${cpulist[@]}"; do
    if [[ $elm =~ ^[0-9]+$ ]]; then
      (( cpus++ ))
    elif [[ $elm =~ ^[0-9]+-[0-9]+$ ]]; then
      local low=0 high=0
      IFS='-' read low high <<< $elm
      (( cpus += high - low + 1 ))
    else
      echo $mincpus
      return 1
    fi
  done

  # Return a minimum of 2 cpus
  echo $(( cpus > $mincpus ? cpus : $mincpus ))
  return 0
}

resetOVSthreads () {
  local cpucount="$1"
  local curRevalidators=0
  local curHandlers=0
  local desiredRevalidators=0
  local desiredHandlers=0
  local rc=0

  curRevalidators=$(ps -Teo pid,tid,comm,cmd | grep -e revalidator | grep -c ovs-vswitchd)
  curHandlers=$(ps -Teo pid,tid,comm,cmd | grep -e handler | grep -c ovs-vswitchd)

  # Calculate the desired number of threads the same way OVS does.
# OVS sets these thread counts as a one-shot process on startup, so we
# have to adjust up or down during the boot-up process. The desired outcome is
# to not restrict the number of threads at startup until we reach a steady
# state, at which point we need to reset these based on our restricted set
# of cores.
  # See OVS function that calculates these thread counts:
  # https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L635
  (( desiredRevalidators=$cpucount / 4 + 1 ))
  (( desiredHandlers=$cpucount - $desiredRevalidators ))


  if [[ $curRevalidators -ne $desiredRevalidators || $curHandlers -ne $desiredHandlers ]]; then

    logger "Recovery: Re-setting OVS revalidator threads: ${curRevalidators} -> ${desiredRevalidators}"
    logger "Recovery: Re-setting OVS handler threads: ${curHandlers} -> ${desiredHandlers}"

    ovs-vsctl set \
      Open_vSwitch . \
      other-config:n-handler-threads=${desiredHandlers} \
      other-config:n-revalidator-threads=${desiredRevalidators}
    rc=$?
  fi

  return $rc
}

resetAffinity() {
  local cpuset="$1"
  local failcount=0
  local successcount=0
  logger "Recovery: Setting CPU affinity for critical processes \"$CRITICAL_PROCESSES\" to $cpuset"
  for proc in $CRITICAL_PROCESSES; do
    local pids="$(pgrep $proc)"
    for pid in $pids; do
      local tasksetOutput
      tasksetOutput="$(taskset -apc "$cpuset" $pid 2>&1)"
      if [[ $? -ne 0 ]]; then
        echo "ERROR: $tasksetOutput"
        ((failcount++))
      else
        ((successcount++))
      fi
    done
  done

  resetOVSthreads "$(getCPUCount ${cpuset})"
  if [[ $? -ne 0 ]]; then
    ((failcount++))
  else
    ((successcount++))
  fi

  logger "Recovery: Re-affined $successcount pids successfully"
  if [[ $failcount -gt 0 ]]; then
    logger "Recovery: Failed to re-affine $failcount processes"
    return 1
  fi
}

setUnrestricted() {
  logger "Recovery: Setting critical system processes to have unrestricted CPU access"
  resetAffinity "$(unrestrictedCpuset)"
}

setRestricted() {
  logger "Recovery: Resetting critical system processes back to normally restricted access"
  resetAffinity "$(restrictedCpuset)"
}

currentAffinity() {
  local pid="$1"
  taskset -pc $pid | awk -F': ' '{print $2}'
}

within() {
  local last=$1 current=$2 threshold=$3
  local delta=0 pchange
  delta=$(( current - last ))
  if [[ $current -eq $last ]]; then
    pchange=0
  elif [[ $last -eq 0 ]]; then
    pchange=1000000
  else
    pchange=$(( ( $delta * 100) / last ))
  fi
  echo -n "last:$last current:$current delta:$delta pchange:${pchange}%: "
  local absolute limit
  case $threshold in
    *%)
      absolute=${pchange##-} # absolute value
      limit=${threshold%%%}
      ;;
    *)
      absolute=${delta##-} # absolute value
      limit=$threshold
      ;;
  esac
  if [[ $absolute -le $limit ]]; then
    echo "within (+/-)$threshold"
    return 0
  else
    echo "outside (+/-)$threshold"
    return 1
  fi
}

steadystate() {
  local last=$1 current=$2
  if [[ $last -lt $STEADY_STATE_MINIMUM ]]; then
    echo "last:$last current:$current Waiting to reach $STEADY_STATE_MINIMUM before checking for steady-state"
    return 1
  fi
  within $last $current $STEADY_STATE_THRESHOLD
}

waitForReady() {
  logger "Recovery: Waiting ${MAXIMUM_WAIT_TIME}s for the initialization to complete"
  local lastSystemdCpuset="$(currentAffinity 1)"
  local lastDesiredCpuset="$(unrestrictedCpuset)"
  local t=0 s=10
  local lastCcount=0 ccount=0 steadyStateTime=0
  while [[ $t -lt $MAXIMUM_WAIT_TIME ]]; do
    sleep $s
    ((t += s))
    # Re-check the current affinity of systemd, in case some other process has changed it
    local systemdCpuset="$(currentAffinity 1)"
    # Re-check the unrestricted Cpuset, as the allowed set of unreserved cores may change as pods are assigned to cores
    local desiredCpuset="$(unrestrictedCpuset)"
    if [[ $systemdCpuset != $lastSystemdCpuset || $lastDesiredCpuset != $desiredCpuset ]]; then
      resetAffinity "$desiredCpuset"
      lastSystemdCpuset="$(currentAffinity 1)"
      lastDesiredCpuset="$desiredCpuset"
    fi

    # Detect steady-state pod count
    ccount=$(crictl ps | wc -l)
    if steadystate $lastCcount $ccount; then
      ((steadyStateTime += s))
      echo "Steady-state for ${steadyStateTime}s/${STEADY_STATE_WINDOW}s"
      if [[ $steadyStateTime -ge $STEADY_STATE_WINDOW ]]; then
        logger "Recovery: Steady-state (+/- $STEADY_STATE_THRESHOLD) for ${STEADY_STATE_WINDOW}s: Done"
        return 0
      fi
    else
      if [[ $steadyStateTime -gt 0 ]]; then
        echo "Resetting steady-state timer"
        steadyStateTime=0
      fi
    fi
    lastCcount=$ccount
  done
  logger "Recovery: Recovery Complete Timeout"
}

main() {
  if ! unrestrictedCpuset >&/dev/null; then
    logger "Recovery: No unrestricted Cpuset could be detected"
    return 1
  fi

  if ! restrictedCpuset >&/dev/null; then
    logger "Recovery: No restricted Cpuset has been configured.  We are already running unrestricted."
    return 0
  fi

  # Ensure we reset the CPU affinity when we exit this script for any reason
  # This way either after the timer expires or after the process is interrupted
  # via ^C or SIGTERM, we return things back to the way they should be.
  trap setRestricted EXIT

  logger "Recovery: Recovery Mode Starting"
  setUnrestricted
  waitForReady
}

if [[ "${BASH_SOURCE[0]}" = "${0}" ]]; then
  main "${@}"
  exit $?
fi

        mode: 493
        path: /usr/local/bin/accelerated-container-startup.sh
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Unlocks more CPUs for critical system processes during container startup

          [Service]
          Type=simple
          ExecStart=/usr/local/bin/accelerated-container-startup.sh

          # Maximum wait time is 600s = 10m:
          Environment=MAXIMUM_WAIT_TIME=600

          # Steady-state threshold = 2%
          # Allowed values:
          #  4  - absolute pod count (+/-)
          #  4% - percent change (+/-)
          #  -1 - disable the steady-state check
          # Note: '%' must be escaped as '%%' in systemd unit files
          Environment=STEADY_STATE_THRESHOLD=2%%

          # Steady-state window = 120s
          # If the running pod count stays within the given threshold for this time
          # period, return CPU utilization to normal before the maximum wait time
          # expires
          Environment=STEADY_STATE_WINDOW=120

          # Steady-state minimum = 40
          # Increasing this will skip any steady-state checks until the count rises above
          # this number to avoid false positives if there are some periods where the
          # count doesn't increase but we know we can't be at steady-state yet.
          Environment=STEADY_STATE_MINIMUM=40

          [Install]
          WantedBy=multi-user.target
        enabled: true
        name: accelerated-container-startup.service
      - contents: |
          [Unit]
          Description=Unlocks more CPUs for critical system processes during container shutdown
          DefaultDependencies=no

          [Service]
          Type=simple
          ExecStart=/usr/local/bin/accelerated-container-startup.sh

          # Maximum wait time is 600s = 10m:
          Environment=MAXIMUM_WAIT_TIME=600

          # Steady-state threshold
          # Allowed values:
          #  4  - absolute pod count (+/-)
          #  4% - percent change (+/-)
          #  -1 - disable the steady-state check
          # Note: '%' must be escaped as '%%' in systemd unit files
          Environment=STEADY_STATE_THRESHOLD=-1

          # Steady-state window = 60s
          # If the running pod count stays within the given threshold for this time
          # period, return CPU utilization to normal before the maximum wait time
          # expires
          Environment=STEADY_STATE_WINDOW=60

          [Install]
          WantedBy=shutdown.target reboot.target halt.target
        enabled: true
        name: accelerated-container-shutdown.service
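
During and after boot, you can observe the recovery script at work. A verification sketch, assuming oc debug access to the node (the node name is a placeholder); taskset shows the CPU affinity of PID 1, which widens during startup and returns to the reserved set afterwards:

# Inspect the startup service and the CPU affinity of systemd (PID 1).
oc debug node/<node> -- chroot /host sh -c '
  systemctl status accelerated-container-startup.service --no-pager
  taskset -pc 1'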

19.6.5.5. Automatic kernel crash dumps with kdump

kdump is a Linux kernel feature that creates a kernel crash dump when the kernel crashes. kdump is enabled with the following MachineConfig CR:

Recommended kdump configuration

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 06-kdump-enable-master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kdump.service
  kernelArguments:
    - crashkernel=512M
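
After the host reboots, you can confirm that kdump is active and that the crash kernel memory is reserved. A minimal sketch, assuming oc debug access to the node (the node name is a placeholder):

# Verify the kdump service and the crashkernel kernel argument.
oc debug node/<node> -- chroot /host sh -c '
  systemctl is-active kdump.service
  grep -o "crashkernel=[^ ]*" /proc/cmdline'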

19.6.6. Recommended post-installation cluster configurations

When cluster installation is complete, the ZTP pipeline applies the following custom resources (CRs) that are required to run DU workloads.

Note

In GitOps ZTP v4.10 and earlier, you configure UEFI secure boot with a MachineConfig CR. This is no longer required in GitOps ZTP v4.11 and later. In v4.11, you configure UEFI secure boot for single-node OpenShift clusters by updating the performance profile CR. For more information, see Performance profile.

19.6.6.1. Operator namespaces and Operator groups

Single-node OpenShift clusters that run DU workloads require the following OperatorGroup and Namespace custom resources (CRs):

  • Local Storage Operator
  • Logging Operator
  • PTP Operator
  • SR-IOV Network Operator

The following YAML summarizes these CRs:

Recommended Operator Namespace and OperatorGroup configuration

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    workload.openshift.io/allowed: management
  name: openshift-local-storage
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-local-storage
  namespace: openshift-local-storage
spec:
  targetNamespaces:
    - openshift-local-storage
---
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    workload.openshift.io/allowed: management
  name: openshift-logging
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  targetNamespaces:
    - openshift-logging
---
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    workload.openshift.io/allowed: management
  labels:
    openshift.io/cluster-monitoring: "true"
  name: openshift-ptp
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: ptp-operators
  namespace: openshift-ptp
spec:
  targetNamespaces:
    - openshift-ptp
---
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    workload.openshift.io/allowed: management
  name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
    - openshift-sriov-network-operator

19.6.6.2. Operator subscriptions

Single-node OpenShift clusters that run DU workloads require the following Subscription CRs. The subscription provides the location to download the following Operators:

  • Local Storage Operator
  • Logging Operator
  • PTP Operator
  • SR-IOV Network Operator

Recommended Operator subscriptions

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: "stable" 1
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual 2
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  channel: "stable"
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ptp-operator-subscription
  namespace: openshift-ptp
spec:
  channel: "stable"
  name: ptp-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: "stable"
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual

1
Specify the channel to get the Operator from. stable is the recommended channel.
2
Specify Manual or Automatic. In Automatic mode, the Operator automatically updates to the latest versions in the channel as they become available in the registry. In Manual mode, new Operator versions are installed only after they are explicitly approved.
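
Because installPlanApproval is set to Manual, each new Operator version waits for explicit approval before it installs. A sketch of the standard OLM approval flow, shown here for the openshift-logging namespace (the install plan name is a placeholder):

# List pending install plans, then approve one explicitly.
oc get installplan -n openshift-logging
oc patch installplan <install-plan-name> -n openshift-logging \
  --type merge --patch '{"spec":{"approved":true}}'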

19.6.6.3. Cluster logging and log forwarding

Single-node OpenShift clusters that run DU workloads require logging and log forwarding for debugging. The following example YAML illustrates the required ClusterLogging and ClusterLogForwarder CRs.

Recommended cluster logging and log forwarding configuration

apiVersion: logging.openshift.io/v1
kind: ClusterLogging 1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  curation:
    type: "curator"
    curator:
      schedule: "30 3 * * *"
  managementState: Managed
---
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder 2
metadata:
  name: instance
  namespace: openshift-logging
spec:
  inputs:
    - infrastructure: {}
      name: infra-logs
  outputs:
    - name: kafka-open
      type: kafka
      url: tcp://10.46.55.190:9092/test    3
  pipelines:
    - inputRefs:
      - audit
      name: audit-logs
      outputRefs:
      - kafka-open
    - inputRefs:
      - infrastructure
      name: infrastructure-logs
      outputRefs:
      - kafka-open

1
Updates the existing ClusterLogging instance or creates the instance if it does not exist.
2
Updates the existing ClusterLogForwarder instance or creates the instance if it does not exist.
3
Specifies the URL of the Kafka server where the logs are forwarded to.
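
After the CRs are applied, you can verify that the collector pods are running and that the forwarder configuration was accepted. A minimal sketch, assuming the default openshift-logging deployment:

# Check the logging pods and inspect the forwarder status conditions.
oc get pods -n openshift-logging
oc get clusterlogforwarder instance -n openshift-logging -o yaml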

19.6.6.4. Performance profile

Single-node OpenShift clusters that run DU workloads require Node Tuning Operator performance profiles to use real-time host capabilities and services.

Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11, these functions are part of the Node Tuning Operator.

The following example PerformanceProfile CR illustrates the required cluster configuration.

Recommended performance profile configuration

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile 1
spec:
  additionalKernelArgs:
    - rcupdate.rcu_normal_after_boot=0
    - "efi=runtime" 2
  cpu:
    isolated: 2-51,54-103 3
    reserved: 0-1,52-53   4
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 32 5
        size: 1G  6
        node: 1 7
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: "restricted"
  realTimeKernel:
    enabled: true    8

1
Ensure that the value for name matches the value specified in the spec.profile.data field of TunedPerformancePatch.yaml and the status.configuration.source.name field of validatorCRs/informDuValidator.yaml.
2
Configures UEFI secure boot for the cluster host.
3
Set the isolated CPUs. Ensure all of the Hyper-Threading pairs match.
4
Set the reserved CPUs. When workload partitioning is enabled, system processes, kernel threads, and system container threads are restricted to these CPUs. All CPUs that are not isolated should be reserved.
5
Set the number of huge pages.
6
Set the huge page size.
7
Set node to the NUMA node where the huge pages are allocated.
8
Set enabled to true to install the real-time Linux kernel.

19.6.6.5. PTP

Single-node OpenShift clusters use Precision Time Protocol (PTP) for network time synchronization. The following example PtpConfig CR illustrates the required PTP slave configuration.

Recommended PTP configuration

apiVersion: ptp.openshift.io/v1
kind: PtpConfig
metadata:
  name: du-ptp-slave
  namespace: openshift-ptp
spec:
  profile:
    - interface: ens5f0     1
      name: slave
      phc2sysOpts: -a -r -n 24
      ptp4lConf: |
        [global]
        #
        # Default Data Set
        #
        twoStepFlag 1
        slaveOnly 0
        priority1 128
        priority2 128
        domainNumber 24
        #utc_offset 37
        clockClass 248
        clockAccuracy 0xFE
        offsetScaledLogVariance 0xFFFF
        free_running 0
        freq_est_interval 1
        dscp_event 0
        dscp_general 0
        dataset_comparison ieee1588
        G.8275.defaultDS.localPriority 128
        #
        # Port Data Set
        #
        logAnnounceInterval -3
        logSyncInterval -4
        logMinDelayReqInterval -4
        logMinPdelayReqInterval -4
        announceReceiptTimeout 3
        syncReceiptTimeout 0
        delayAsymmetry 0
        fault_reset_interval 4
        neighborPropDelayThresh 20000000
        masterOnly 0
        G.8275.portDS.localPriority 128
        #
        # Run time options
        #
        assume_two_step 0
        logging_level 6
        path_trace_enabled 0
        follow_up_info 0
        hybrid_e2e 0
        inhibit_multicast_service 0
        net_sync_monitor 0
        tc_spanning_tree 0
        tx_timestamp_timeout 1
        unicast_listen 0
        unicast_master_table 0
        unicast_req_duration 3600
        use_syslog 1
        verbose 0
        summary_interval 0
        kernel_leap 1
        check_fup_sync 0
        #
        # Servo Options
        #
        pi_proportional_const 0.0
        pi_integral_const 0.0
        pi_proportional_scale 0.0
        pi_proportional_exponent -0.3
        pi_proportional_norm_max 0.7
        pi_integral_scale 0.0
        pi_integral_exponent 0.4
        pi_integral_norm_max 0.3
        step_threshold 2.0
        first_step_threshold 0.00002
        max_frequency 900000000
        clock_servo pi
        sanity_freq_limit 200000000
        ntpshm_segment 0
        #
        # Transport options
        #
        transportSpecific 0x0
        ptp_dst_mac 01:1B:19:00:00:00
        p2p_dst_mac 01:80:C2:00:00:0E
        udp_ttl 1
        udp6_scope 0x0E
        uds_address /var/run/ptp4l
        #
        # Default interface options
        #
        clock_type OC
        network_transport L2
        delay_mechanism E2E
        time_stamping hardware
        tsproc_mode filter
        delay_filter moving_median
        delay_filter_length 10
        egressLatency 0
        ingressLatency 0
        boundary_clock_jbod 0
        #
        # Clock description
        #
        productDescription ;;
        revisionData ;;
        manufacturerIdentity 00:00:00
        userDescription ;
        timeSource 0xA0
      ptp4lOpts: -2 -s --summary_interval -4
  recommend:
    - match:
        - nodeLabel: node-role.kubernetes.io/master
      priority: 4
      profile: slave

1
Sets the interface used to receive the PTP clock signal.
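
Once the configuration is applied, you can watch the slave clock synchronize. A sketch, assuming the default labels and container name of the PTP Operator deployment:

# Follow the ptp4l and phc2sys synchronization logs from the linuxptp daemon.
oc get pods -n openshift-ptp
oc logs -n openshift-ptp -l app=linuxptp-daemon \
  -c linuxptp-daemon-container --tail=20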

19.6.6.6. Extended Tuned profile

Single-node OpenShift clusters that run DU workloads require additional performance tuning configurations necessary for high-performance workloads. The following example Tuned CR extends the Tuned profile:

Recommended extended Tuned profile configuration

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
    - data: |
        [main]
        summary=Configuration changes profile inherited from performance created tuned
        include=openshift-node-performance-openshift-node-performance-profile
        [bootloader]
        cmdline_crash=nohz_full=2-51,54-103
        [sysctl]
        kernel.timer_migration=1
        [scheduler]
        group.ice-ptp=0:f:10:*:ice-ptp.*
        [service]
        service.stalld=start,enable
        service.chronyd=stop,disable
      name: performance-patch
  recommend:
    - machineConfigLabels:
        machineconfiguration.openshift.io/role: master
      priority: 19
      profile: performance-patch
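
You can verify that the Node Tuning Operator picked up the patch. A minimal sketch; the Profile object reports the Tuned profile applied to each node:

# The TUNED column should show performance-patch for the node.
oc get profile -n openshift-cluster-node-tuning-operator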

19.6.6.7. SR-IOV

Single root I/O virtualization (SR-IOV) is commonly used to enable the fronthaul and the midhaul networks. The following YAML example configures SR-IOV for a single-node OpenShift cluster.

Recommended SR-IOV configuration

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  configDaemonNodeSelector:
    node-role.kubernetes.io/master: ""
  disableDrain: true
  enableInjector: true
  enableOperatorWebhook: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-nw-du-mh
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: openshift-sriov-network-operator
  resourceName: du_mh
  vlan: 150 1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nnp-du-mh
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci 2
  isRdma: false
  nicSelector:
    pfNames:
      - ens7f0 3
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 8 4
  priority: 10
  resourceName: du_mh
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-nw-du-fh
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: openshift-sriov-network-operator
  resourceName: du_fh
  vlan: 140 5
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nnp-du-fh
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice 6
  isRdma: true
  nicSelector:
    pfNames:
      - ens5f0 7
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 8 8
  priority: 10
  resourceName: du_fh

1
Specifies the VLAN for the midhaul network.
2
Select either vfio-pci or netdevice, as needed.
3
Specifies the interface connected to the midhaul network.
4
Specifies the number of VFs for the midhaul network.
5
The VLAN for the fronthaul network.
6
Select either vfio-pci or netdevice, as needed.
7
Specifies the interface connected to the fronthaul network.
8
Specifies the number of VFs for the fronthaul network.
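
After the policies are synced, you can confirm that the virtual functions were created and that the resources are advertised to the scheduler. A sketch, assuming the resource names above (the node name is a placeholder):

# Check VF provisioning status and the advertised du_fh and du_mh resources.
oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
oc get node <node> -o jsonpath='{.status.allocatable}'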

19.6.6.8. Console Operator

The console-operator installs and maintains the web console on a cluster. When the node is centrally managed, the Operator is not needed and makes space for application workloads. The following Console custom resource (CR) example disables the console.

Recommended console configuration

apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "false"
    include.release.openshift.io/self-managed-high-availability: "false"
    include.release.openshift.io/single-node-developer: "false"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  logLevel: Normal
  managementState: Removed
  operatorLogLevel: Normal

19.6.6.9. Grafana and Alertmanager

Single-node OpenShift clusters that run DU workloads require reduced CPU resources consumed by the OpenShift Container Platform monitoring components. The following ConfigMap custom resource (CR) disables Grafana and Alertmanager.

Recommended cluster monitoring configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    grafana:
      enabled: false
    alertmanagerMain:
      enabled: false
    prometheusK8s:
      retention: 24h
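
After the ConfigMap is applied, the Grafana and Alertmanager pods are removed from the openshift-monitoring namespace. A quick check:

# No grafana or alertmanager pods should remain in the namespace.
oc get pods -n openshift-monitoring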

19.6.6.10. Network diagnostics

Single-node OpenShift clusters that run DU workloads require less inter-pod network connectivity checks to reduce the additional load created by these pods. The following custom resource (CR) disables these checks.

Recommended network diagnostics configuration

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  disableNetworkDiagnostics: true
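
With diagnostics disabled, the connectivity check pods are no longer scheduled. A quick check, assuming the default diagnostics namespace:

# The openshift-network-diagnostics namespace should contain no
# network-check-source or network-check-target pods.
oc get pods -n openshift-network-diagnostics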
