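2.2. Creating a compute machine set on Azure

2.2.1. Sample YAML for a compute machine set custom resource on Azure

This sample YAML defines a compute machine set that runs in the 1 Microsoft Azure zone in a region and creates nodes that are labeled with node-role.kubernetes.io/<role>: "".

In this sample, <infrastructure_id> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.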

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructure_id> 
    machine.openshift.io/cluster-api-machine-role: <role> 
    machine.openshift.io/cluster-api-machine-type: <role>
  name: <infrastructure_id>-<role>-<region> 
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructure_id>
      machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role>-<region>
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructure_id>
        machine.openshift.io/cluster-api-machine-role: <role>
        machine.openshift.io/cluster-api-machine-type: <role>
        machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role>-<region>
    spec:
      metadata:
        creationTimestamp: null
        labels:
          machine.openshift.io/cluster-api-machineset: <machineset_name>
          node-role.kubernetes.io/<role>: ""
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image: 
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/<infrastructure_id>-rg/providers/Microsoft.Compute/galleries/gallery_<infrastructure_id>/images/<infrastructure_id>-gen2/versions/latest 
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: <region> 
          managedIdentity: <infrastructure_id>-identity
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: ""
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: <infrastructure_id>-rg
          sshPrivateKey: ""
          sshPublicKey: ""
          tags:
            - name: <custom_tag_name> 
              value: <custom_tag_value>
          subnet: <infrastructure_id>-<role>-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4s_v3
          vnet: <infrastructure_id>-vnet
          zone: "1"

1: For the machine.openshift.io/cluster-api-cluster label, specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI installed, you can obtain the infrastructure ID by running the following command:

$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster

You can obtain the subnet by running the following command:

$ oc -n openshift-machine-api \
    -o jsonpath='{.spec.template.spec.providerSpec.value.subnet}{"\n"}' \
    get machineset/<infrastructure_id>-worker-centralus1

You can obtain the vnet by running the following command:

$ oc -n openshift-machine-api \
    -o jsonpath='{.spec.template.spec.providerSpec.value.vnet}{"\n"}' \
    get machineset/<infrastructure_id>-worker-centralus1

2: For the machine.openshift.io/cluster-api-machine-role label, specify the node label to add.
3: For name, specify the infrastructure ID, node label, and region.
4: For image, specify the image details for your compute machine set. If you want to use an Azure Marketplace image, see "Using the Azure Marketplace offering".
5: For resourceID, specify an image that is compatible with your instance type. The Hyper-V generation V2 images created by the installation program have a -gen2 suffix, whereas V1 images have the same name without the suffix.
6: For location, specify the region to place machines on.
7: For tags, which is optional, specify custom tags for your machine set. Provide the tag name in the <custom_tag_name> field and the corresponding tag value in the <custom_tag_value> field.
8: For zone, specify the zone within your region to place machines on. Ensure that your region supports the zone that you specify.

Important

If your region supports availability zones, you must specify the zone. Specifying the zone avoids volume node affinity failure when a pod requires a persistent volume. To do this, you can create a compute machine set for each zone in the same region.
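
Creating one compute machine set per zone lends itself to scripting. The following Python sketch is illustrative only: it fills the placeholders of the sample for each zone and prints one JSON manifest stub per zone. The helper name machine_set_stub, the infrastructure ID mycluster-abc12, and the naming pattern that appends the zone to the region (as in the <infrastructure_id>-worker-centralus1 example) are assumptions. The stub sets only the zone-specific providerSpec fields; the remaining fields would need to be copied from the full sample.

```python
import json


def machine_set_stub(infra_id: str, role: str, region: str, zone: str) -> dict:
    """Return a partial MachineSet manifest for one zone (hypothetical helper)."""
    # Assumed naming pattern: <infrastructure_id>-<role>-<region><zone>
    name = f"{infra_id}-{role}-{region}{zone}"
    selector_labels = {
        "machine.openshift.io/cluster-api-cluster": infra_id,
        "machine.openshift.io/cluster-api-machineset": name,
    }
    role_labels = {
        "machine.openshift.io/cluster-api-machine-role": role,
        "machine.openshift.io/cluster-api-machine-type": role,
    }
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineSet",
        "metadata": {
            "name": name,
            "namespace": "openshift-machine-api",
            "labels": {
                "machine.openshift.io/cluster-api-cluster": infra_id,
                **role_labels,
            },
        },
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": selector_labels},
            "template": {
                "metadata": {"labels": {**selector_labels, **role_labels}},
                "spec": {
                    "metadata": {"labels": {f"node-role.kubernetes.io/{role}": ""}},
                    # Only the zone-specific fields are set here; copy the rest
                    # of providerSpec.value from the full sample.
                    "providerSpec": {"value": {"location": region, "zone": zone}},
                },
            },
        },
    }


if __name__ == "__main__":
    # "mycluster-abc12" is a made-up infrastructure ID.
    for zone in ("1", "2", "3"):
        stub = machine_set_stub("mycluster-abc12", "worker", "centralus", zone)
        print(json.dumps(stub, indent=2))
```

JSON is used for the output because it needs no third-party YAML library, and a JSON document is also valid YAML.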

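2.2.2. Creating a compute machine set

In addition to the compute machine sets created by the installation program, you can create your own to dynamically manage the machine compute resources for specific workloads of your choice.

Prerequisites

Deploy an OpenShift Container Platform cluster.
Install the OpenShift CLI (oc).
Log in to oc as a user with cluster-admin permission.

Procedure

Create a new YAML file that contains the compute machine set custom resource (CR) sample and is named <file_name>.yaml.
Ensure that you set the <clusterID> and <role> parameter values.

Optional: If you are not sure which value to set for a specific field, you can check an existing compute machine set from your cluster:

To list the compute machine sets in your cluster, run the following command:

$ oc get machinesets -n openshift-machine-api

Example output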

NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1b   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1c   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1d   0         0                             55m
agl030519-vplxk-worker-us-east-1e   0         0                             55m
agl030519-vplxk-worker-us-east-1f   0         0                             55m

To view values of a specific compute machine set custom resource (CR), run the following command:

$ oc get machineset <machineset_name> \
  -n openshift-machine-api -o yaml

Example output

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructure_id> 
  name: <infrastructure_id>-<role> 
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructure_id>
      machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role>
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructure_id>
        machine.openshift.io/cluster-api-machine-role: <role>
        machine.openshift.io/cluster-api-machine-type: <role>
        machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role>
    spec:
      providerSpec: 
        ...

1: For the machine.openshift.io/cluster-api-cluster label, the cluster infrastructure ID.
2: For name, a default node label in place of <role>.
Note
For clusters that have user-provisioned infrastructure, a compute machine set can only create worker and infra type machines.
3: The values in the <providerSpec> section of the compute machine set CR are platform-specific. For more information about <providerSpec> parameters in the CR, see the sample compute machine set CR configuration for your provider.

Create a MachineSet CR by running the following command:
```
$ oc create -f <file_name>.yaml
```

Verification

View the list of compute machine sets by running the following command:

$ oc get machineset -n openshift-machine-api

Example output

NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
agl030519-vplxk-infra-us-east-1a    1         1         1       1           11m
agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1b   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1c   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1d   0         0                             55m
agl030519-vplxk-worker-us-east-1e   0         0                             55m
agl030519-vplxk-worker-us-east-1f   0         0                             55m

When the new compute machine set is available, the DESIRED and CURRENT values match. If the compute machine set is not available, wait a few minutes and run the command again.
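
The DESIRED versus CURRENT comparison can also be done by a script. The following Python sketch is a minimal illustration, assuming that you paste or pipe the text output of the oc get machineset command into it; it does not call oc itself. The function name mismatched_machine_sets and the shortened sample table are hypothetical.

```python
SAMPLE_OUTPUT = """\
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
agl030519-vplxk-infra-us-east-1a    1         1         1       1           11m
agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
agl030519-vplxk-worker-us-east-1d   0         0                             55m
"""


def mismatched_machine_sets(table: str) -> list[str]:
    """Return the names of machine sets whose DESIRED and CURRENT values differ."""
    rows = [line.split() for line in table.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    # DESIRED and CURRENT come before the columns that can be empty (READY,
    # AVAILABLE), so whitespace splitting keeps their positions stable.
    desired = header.index("DESIRED")
    current = header.index("CURRENT")
    return [row[0] for row in body if row[desired] != row[current]]


if __name__ == "__main__":
    pending = mismatched_machine_sets(SAMPLE_OUTPUT)
    if pending:
        print("still waiting on:", ", ".join(pending))
    else:
        print("DESIRED and CURRENT match for every machine set")
```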

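2.2.3. Labeling GPU machine sets for the cluster autoscaler

You can use a machine set label to indicate which machines the cluster autoscaler can use to deploy GPU-enabled nodes.

Prerequisites

Your cluster uses a cluster autoscaler.

Procedure

On the machine set that you want the cluster autoscaler to use for creating machines that deploy GPU-enabled nodes, add a cluster-api/accelerator label: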
```
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: machine-set-name
spec:
  template:
    spec:
      metadata:
        labels:
          cluster-api/accelerator: nvidia-t4 
```
1: For the cluster-api/accelerator label, specify a label of your choice that consists of alphanumeric characters, -, _, or . and that starts and ends with an alphanumeric character. For example, you might use nvidia-t4 to represent Nvidia T4 GPUs, or nvidia-a10g for A10G GPUs.
Note
You must specify the value of this label for the spec.resourceLimits.gpus.type parameter in your ClusterAutoscaler CR. For more information, see "Cluster autoscaler resource definition".
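
The character rule for the label value can be checked before you edit the machine set. The following Python sketch is one possible check, not part of any OpenShift tooling; the function name is hypothetical. The 63-character limit is the general Kubernetes limit for label values and is added here as an extra assumption, because the text above does not state it.

```python
import re

# Alphanumeric at both ends; alphanumerics, '-', '_' and '.' in between.
_LABEL_VALUE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_accelerator_label(value: str) -> bool:
    """Check a candidate value for the cluster-api/accelerator label."""
    # Kubernetes limits label values to 63 characters (not stated in the text above).
    return len(value) <= 63 and _LABEL_VALUE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("nvidia-t4", "nvidia-a10g", "-nvidia", "nvidia t4"):
        print(candidate, is_valid_accelerator_label(candidate))
```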

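2.2.4. Using the Azure Marketplace offering

You can create a machine set running on Azure that deploys machines that use the Azure Marketplace offering. To use this offering, you must first obtain the Azure Marketplace image. When obtaining your image, consider the following:

While the images are the same, the Azure Marketplace publisher differs depending on your region. If you are located in North America, specify redhat as the publisher. If you are located in EMEA, specify redhat-limited as the publisher.
The offer includes a rh-ocp-worker SKU and a rh-ocp-worker-gen1 SKU. The rh-ocp-worker SKU represents a Hyper-V generation version 2 VM image. The default instance types used in OpenShift Container Platform are version 2 compatible. If you plan to use an instance type that is only version 1 compatible, use the image associated with the rh-ocp-worker-gen1 SKU. The rh-ocp-worker-gen1 SKU represents a Hyper-V version 1 VM image.

Important

Installing images with the Azure Marketplace is not supported on clusters with 64-bit ARM instances.

You should only modify the RHCOS image for compute machines to use an Azure Marketplace image. Control plane machines and infrastructure nodes do not require an OpenShift Container Platform subscription and use the public RHCOS default image by default, which does not incur subscription costs on your Azure bill. Therefore, you should not modify the cluster default boot image or the control plane boot images. Applying the Azure Marketplace image to them incurs additional licensing costs that cannot be recovered.

Prerequisites

You have installed the Azure CLI client (az).
Your Azure account is entitled for the offer and you have logged in to this account with the Azure CLI client.

Procedure

Display all of the available OpenShift Container Platform images by running one of the following commands:

North America:

$ az vm image list --all --offer rh-ocp-worker --publisher redhat -o table

Example output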

Offer          Publisher       Sku                 Urn                                                             Version
-------------  --------------  ------------------  --------------------------------------------------------------  -----------------
rh-ocp-worker  RedHat          rh-ocp-worker       RedHat:rh-ocp-worker:rh-ocp-worker:4.15.2024072409              4.15.2024072409
rh-ocp-worker  RedHat          rh-ocp-worker-gen1  RedHat:rh-ocp-worker:rh-ocp-worker-gen1:4.15.2024072409         4.15.2024072409

EMEA:

$ az vm image list --all --offer rh-ocp-worker --publisher redhat-limited -o table

Example output

Offer          Publisher       Sku                 Urn                                                                     Version
-------------  --------------  ------------------  --------------------------------------------------------------          -----------------
rh-ocp-worker  redhat-limited  rh-ocp-worker       redhat-limited:rh-ocp-worker:rh-ocp-worker:4.15.2024072409              4.15.2024072409
rh-ocp-worker  redhat-limited  rh-ocp-worker-gen1  redhat-limited:rh-ocp-worker:rh-ocp-worker-gen1:4.15.2024072409         4.15.2024072409

Note

Use the latest image that is available for compute and control plane nodes. If required, your VMs are automatically upgraded as part of the installation process.

Inspect the image for your offer by running one of the following commands:

North America:

$ az vm image show --urn redhat:rh-ocp-worker:rh-ocp-worker:<version>

EMEA:

$ az vm image show --urn redhat-limited:rh-ocp-worker:rh-ocp-worker:<version>

Review the terms of the offer by running one of the following commands:

North America:

$ az vm image terms show --urn redhat:rh-ocp-worker:rh-ocp-worker:<version>

EMEA:

$ az vm image terms show --urn redhat-limited:rh-ocp-worker:rh-ocp-worker:<version>

Accept the terms of the offering by running one of the following commands:

North America:

$ az vm image terms accept --urn redhat:rh-ocp-worker:rh-ocp-worker:<version>

EMEA:

$ az vm image terms accept --urn redhat-limited:rh-ocp-worker:rh-ocp-worker:<version>

Record the image details of your offer, specifically the values for publisher, offer, sku, and version.

Add the following parameters to the providerSpec section of your machine set YAML file using the image details for your offer:

Sample providerSpec image values for Azure Marketplace machines

providerSpec:
  value:
    image:
      offer: rh-ocp-worker
      publisher: redhat
      resourceID: ""
      sku: rh-ocp-worker
      type: MarketplaceWithPlan
      version: 413.92.2023101700
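
The four recorded values are the colon-separated parts of the image URN, in the order publisher, offer, SKU, version. The following Python sketch shows one way to turn a URN into the image fields used in the sample; the function name image_from_urn is hypothetical, and the URN in the demo reuses the version string from the example output.

```python
import json


def image_from_urn(urn: str) -> dict:
    """Split an Azure image URN into the providerSpec image fields."""
    parts = urn.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected publisher:offer:sku:version, got {urn!r}")
    publisher, offer, sku, version = parts
    return {
        "offer": offer,
        "publisher": publisher,
        "resourceID": "",
        "sku": sku,
        "type": "MarketplaceWithPlan",
        "version": version,
    }


if __name__ == "__main__":
    image = image_from_urn("redhat:rh-ocp-worker:rh-ocp-worker:4.15.2024072409")
    print(json.dumps({"providerSpec": {"value": {"image": image}}}, indent=2))
```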

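2.2.5. Enabling Azure boot diagnostics

You can enable boot diagnostics on Azure machines that your machine set creates.

Prerequisites

Have an existing Microsoft Azure cluster.

Procedure

Add the diagnostics configuration that is applicable to your storage type to the providerSpec field in your machine set YAML file:

For an Azure Managed storage account: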

providerSpec:
  diagnostics:
    boot:
      storageAccountType: AzureManaged

1: For storageAccountType, specifies an Azure Managed storage account.

For an Azure Unmanaged storage account:

providerSpec:
  diagnostics:
    boot:
      storageAccountType: CustomerManaged 
      customerManaged:
        storageAccountURI: https://<storage-account>.blob.core.windows.net

1: For storageAccountType, specifies an Azure Unmanaged storage account.
2: For storageAccountURI, replace <storage-account> with the name of your storage account.

Note

Only the Azure Blob Storage data service is supported.

Verification

On the Microsoft Azure portal, review the Boot diagnostics page for a machine deployed by the machine set, and verify that you can see the serial logs for the machine.

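2.2.6. Machine sets that deploy machines as Spot VMs

You can save on costs by creating a compute machine set running on Azure that deploys machines as non-guaranteed Spot VMs. Spot VMs use unused Azure capacity and are less expensive than standard VMs. You can use Spot VMs for workloads that can tolerate interruptions, such as batch or stateless, horizontally scalable workloads.

Azure can terminate a Spot VM at any time. Azure gives a 30-second warning to the user when an interruption occurs. OpenShift Container Platform begins to remove the workloads from the affected instances when Azure issues the termination warning.

Interruptions can occur when using Spot VMs for the following reasons:

The instance price exceeds your maximum price
The supply of Spot VMs decreases
Azure needs capacity back

When Azure terminates an instance, a termination handler running on the Spot VM node deletes the machine resource. To satisfy the compute machine set replicas quantity, the compute machine set creates a machine that requests a Spot VM.

2.2.6.1. Creating Spot VMs by using compute machine sets

You can launch a Spot VM on Azure by adding spotVMOptions to your compute machine set YAML file.

Procedure

Add the following line under the providerSpec field: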
```
providerSpec:
  value:
    spotVMOptions: {}
```
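You can optionally set the spotVMOptions.maxPrice field to limit the cost of the Spot VM. For example, you can set maxPrice: '0.98765'. If maxPrice is set, this value is used as the hourly maximum spot price. If it is not set, the maximum price defaults to -1 and charges up to the standard VM price.
Azure caps Spot VM prices at the standard price. Azure does not evict an instance because of pricing if the instance uses the default maxPrice setting. However, an instance can still be evicted because of capacity restrictions.

Note

It is strongly recommended to use the default standard VM price as the maxPrice value and to not set the maximum price for Spot VMs.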

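2.2.7. Machine sets that deploy machines on Ephemeral OS disks

You can create a compute machine set running on Azure that deploys machines on Ephemeral OS disks. Ephemeral OS disks use local VM capacity rather than remote Azure Storage. This configuration therefore incurs no additional cost and provides lower latency for reading, writing, and reimaging.

2.2.7.1. Creating machines on Ephemeral OS disks by using compute machine sets

You can launch machines on Ephemeral OS disks on Azure by editing your compute machine set YAML file.

Prerequisites

Have an existing Microsoft Azure cluster.

Procedure

Edit the custom resource (CR) by running the following command:
```
$ oc edit machineset <machine-set-name>
```
where <machine-set-name> is the compute machine set that you want to provision machines on Ephemeral OS disks.

Add the following to the providerSpec field: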

providerSpec:
  value:
    ...
    osDisk:
       ...
       diskSettings: 
         ephemeralStorageLocation: Local 
       cachingType: ReadOnly 
       managedDisk:
         storageAccountType: Standard_LRS 
       ...

1 2 3: The diskSettings, ephemeralStorageLocation, and cachingType lines enable the use of Ephemeral OS disks.
4: For storageAccountType, Ephemeral OS disks are only supported for VMs or scale set instances that use the Standard LRS storage account type.

Important

The implementation of Ephemeral OS disk support in OpenShift Container Platform only supports the CacheDisk placement type. Do not change the placement configuration setting.

Create a compute machine set using the updated configuration:
```
$ oc create -f <machine-set-config>.yaml
```

Verification

On the Microsoft Azure portal, review the Overview page for a machine deployed by the compute machine set, and verify that the Ephemeral OS disk field is set to OS cache placement.

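2.2.8. Machine sets that deploy machines with ultra disks as data disks

You can create a machine set running on Azure that deploys machines with ultra disks. Ultra disks are high-performance storage intended for the most demanding data workloads.

You can also create a persistent volume claim (PVC) that dynamically binds to a storage class backed by Azure ultra disks and mounts them to pods.

Note

Data disks do not support specifying disk throughput or disk IOPS. You can configure these properties by using PVCs.

2.2.8.1. Creating machines with ultra disks by using machine sets

You can deploy machines with ultra disks on Azure by editing your machine set YAML file.

Prerequisites

Have an existing Microsoft Azure cluster.

Procedure

Create a custom secret in the openshift-machine-api namespace using the worker data secret by running the following command:

$ oc -n openshift-machine-api \
get secret <role>-user-data \
--template='{{index .data.userData | base64decode}}' | jq > userData.txt

1: In <role>-user-data, replace <role> with worker.
2: Specify userData.txt as the name of the file that stores the decoded user data for the new custom secret.

In a text editor, open the userData.txt file and locate the final } character in the file.

On the immediately preceding line, add a ,

Create a new line after the , and add the following configuration details: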

"storage": {
  "disks": [ 
    {
      "device": "/dev/disk/azure/scsi1/lun0", 
      "partitions": [ 
        {
          "label": "lun0p1", 
          "sizeMiB": 1024, 
          "startMiB": 0
        }
      ]
    }
  ],
  "filesystems": [ 
    {
      "device": "/dev/disk/by-partlabel/lun0p1",
      "format": "xfs",
      "path": "/var/lib/lun0p1"
    }
  ]
},
"systemd": {
  "units": [ 
    {
      "contents": "[Unit]\nBefore=local-fs.target\n[Mount]\nWhere=/var/lib/lun0p1\nWhat=/dev/disk/by-partlabel/lun0p1\nOptions=defaults,pquota\n[Install]\nWantedBy=local-fs.target\n", 
      "enabled": true,
      "name": "var-lib-lun0p1.mount"
    }
  ]
}

1: For disks, the configuration details for the disk that you want to attach to a node as an ultra disk.
2: For device, specify the lun value that is defined in the dataDisks stanza of the machine set you are using. For example, if the machine set contains lun: 0, specify lun0. You can initialize multiple data disks by specifying multiple "disks" entries in this configuration file. If you specify multiple "disks" entries, ensure that the lun value for each matches the value in the machine set.
3: For partitions, the configuration details for a new partition on the disk.
4: For label, specify a label for the partition. You might find it helpful to use hierarchical names, such as lun0p1 for the first partition of lun0.
5: For sizeMiB, specify the total size in MiB of the partition.
6: For filesystems, specify the filesystem to use when formatting a partition. Use the partition label to specify the partition.
7: For units, specify a systemd unit to mount the partition at boot. Use the partition label to specify the partition. You can create multiple partitions by specifying multiple "partitions" entries in this configuration file. If you specify multiple "partitions" entries, you must specify a systemd unit for each.
8: In contents, for Where, specify the value of storage.filesystems.path. For What, specify the value of storage.filesystems.device.
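
As an alternative to adding the comma and the new stanzas by hand, the same result can be produced by loading the user data as JSON and adding the two keys programmatically. The following Python sketch is a minimal illustration under several assumptions: the helper names are hypothetical, the partition layout mirrors the sample above (one partition per disk), the user data does not already contain storage or systemd keys, and the mount unit name is derived with a simplified rule that matches systemd naming only for paths without characters that need escaping.

```python
import json


def mount_unit_name(path: str) -> str:
    """Derive a mount unit name from a mount path (simplified, no escaping)."""
    return path.strip("/").replace("/", "-") + ".mount"


def add_data_disk(user_data: dict, lun: int, size_mib: int) -> dict:
    """Return a copy of the user data with one ultra disk partition and mount."""
    if "storage" in user_data or "systemd" in user_data:
        raise ValueError("user data already defines storage or systemd")
    label = f"lun{lun}p1"
    device = f"/dev/disk/by-partlabel/{label}"
    path = f"/var/lib/{label}"
    unit = (
        "[Unit]\nBefore=local-fs.target\n"
        f"[Mount]\nWhere={path}\nWhat={device}\nOptions=defaults,pquota\n"
        "[Install]\nWantedBy=local-fs.target\n"
    )
    merged = dict(user_data)
    merged["storage"] = {
        "disks": [
            {
                # The lun number must match the dataDisks entry in the machine set.
                "device": f"/dev/disk/azure/scsi1/lun{lun}",
                "partitions": [{"label": label, "sizeMiB": size_mib, "startMiB": 0}],
            }
        ],
        "filesystems": [{"device": device, "format": "xfs", "path": path}],
    }
    merged["systemd"] = {
        "units": [{"contents": unit, "enabled": True, "name": mount_unit_name(path)}]
    }
    return merged


if __name__ == "__main__":
    # Made-up stand-in for the content of userData.txt.
    stub = {"ignition": {"version": "3.2.0"}}
    print(json.dumps(add_data_disk(stub, lun=0, size_mib=1024), indent=2))
```

Writing the result back with json.dumps keeps the file valid JSON, which a manual edit around the final } character can easily break.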

Extract the disabling template value to a file called disableTemplating.txt by running the following command:

$ oc -n openshift-machine-api get secret <role>-user-data \
--template='{{index .data.disableTemplating | base64decode}}' | jq > disableTemplating.txt

1: In <role>-user-data, replace <role> with worker.

Combine the userData.txt file and disableTemplating.txt file to create a data secret file by running the following command:

$ oc -n openshift-machine-api create secret generic <role>-user-data-x5 \
--from-file=userData=userData.txt \
--from-file=disableTemplating=disableTemplating.txt

1: For <role>-user-data-x5, specify the name of the secret. Replace <role> with worker.

Copy an existing Azure MachineSet custom resource (CR) and edit it by running the following command:
```
$ oc edit machineset <machine_set_name>
```
where <machine_set_name> is the machine set that you want to provision machines with ultra disks.

Add the following lines in the positions indicated:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      metadata:
        labels:
          disk: ultrassd 
      providerSpec:
        value:
          ultraSSDCapability: Enabled 
          dataDisks: 
          - nameSuffix: ultrassd
            lun: 0
            diskSizeGB: 4
            deletionPolicy: Delete
            cachingType: None
            managedDisk:
              storageAccountType: UltraSSD_LRS
          userDataSecret:
            name: <role>-user-data-x5

1: For the disk label, specify a label to use to select a node that is created by this machine set. This procedure uses disk: ultrassd for this value.
2 3: The ultraSSDCapability and dataDisks lines enable the use of ultra disks. For dataDisks, include the entire stanza.
4: For userDataSecret, specify the user data secret created earlier. Replace <role> with worker.

Create a machine set using the updated configuration by running the following command:
```
$ oc create -f <machine_set_name>.yaml
```

Verification

Validate that the machines are created by running the following command:
```
$ oc get machines
```
The machines should be in the Running state.
For a machine that is running and has a node attached, validate the partition by running the following command:
```
$ oc debug node/<node_name> -- chroot /host lsblk
```
In this command, oc debug node/<node_name> starts a debugging shell on the node <node_name> and passes a command with --. The passed command chroot /host provides access to the underlying host OS binaries, and lsblk shows the block devices that are attached to the host OS machine.

Next steps

To use an ultra disk from within a pod, create a workload that uses the mount point. Create a YAML file similar to the following example:

apiVersion: v1
kind: Pod
metadata:
  name: ssd-benchmark1
spec:
  containers:
  - name: ssd-benchmark1
    image: nginx
    ports:
      - containerPort: 80
        name: "http-server"
    volumeMounts:
    - name: lun0p1
      mountPath: "/tmp"
  volumes:
    - name: lun0p1
      hostPath:
        path: /var/lib/lun0p1
        type: DirectoryOrCreate
  nodeSelector:
    disk: ultrassd

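2.2.8.2. Troubleshooting resources for machine sets that enable ultra disks

Use the information in this section to understand and recover from issues you might encounter.

2.2.8.2.1. Incorrect ultra disk configuration

If an incorrect configuration of the ultraSSDCapability parameter is specified in the machine set, the machine provisioning fails.

For example, if the ultraSSDCapability parameter is set to Disabled, but an ultra disk is specified in the dataDisks parameter, the following error message appears: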

StorageAccountType UltraSSD_LRS can be used only when additionalCapabilities.ultraSSDEnabled is set.

To resolve this issue, verify that your machine set configuration is correct.

2.2.8.2.2. Unsupported disk parameters

If a region, availability zone, or instance size that is not compatible with ultra disks is specified in the machine set, the machine provisioning fails. Check the logs for the following error message:

failed to create vm <machine_name>: failure sending request for machine <machine_name>: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="Storage Account type 'UltraSSD_LRS' is not supported <more_information_about_why>."

To resolve this issue, verify that you are using this feature in a supported environment and that your machine set configuration is correct.

2.2.8.2.3. Unable to delete disks

If the deletion of ultra disks as data disks is not working as expected, the machines are deleted and the data disks are orphaned. You must delete the orphaned disks manually if desired.

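2.2.9. Enabling customer-managed encryption keys for a machine set

You can supply an encryption key to Azure to encrypt data on managed disks at rest. You can enable server-side encryption with customer-managed keys by using the Machine API.

An Azure Key Vault, a disk encryption set, and an encryption key are required to use a customer-managed key. The disk encryption set must be in a resource group where the Cloud Credential Operator (CCO) has granted permissions. If not, an additional reader role is required to be granted on the disk encryption set.

Prerequisites

You have an Azure Key Vault, a disk encryption set, and an encryption key.

Procedure

Configure the disk encryption set under the providerSpec field in your machine set YAML file. For example: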

providerSpec:
  value:
    osDisk:
      diskSizeGB: 128
      managedDisk:
        diskEncryptionSet:
          id: /subscriptions/<subscription_id>/resourceGroups/<resource_group_name>/providers/Microsoft.Compute/diskEncryptionSets/<disk_encryption_set_name>
        storageAccountType: Premium_LRS

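2.2.10. Configuring trusted launch for Azure virtual machines by using machine sets

Important

Using trusted launch for Azure virtual machines is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

OpenShift Container Platform 4.16 supports trusted launch for Azure virtual machines (VMs). By editing the machine set YAML file, you can configure the trusted launch options that a machine set uses for machines that it deploys. For example, you can configure these machines to use UEFI security features such as Secure Boot or a dedicated virtual Trusted Platform Module (vTPM) instance.

Note

Some feature combinations result in an invalid configuration.

Table 2.1. UEFI feature combination compatibility
Secure Boot [1]   vTPM [2]    Valid configuration
Enabled           Enabled     Yes
Enabled           Disabled    Yes
Enabled           Omitted     Yes
Disabled          Enabled     Yes
Omitted           Enabled     Yes
Disabled          Disabled    No
Omitted           Disabled    No
Omitted           Omitted     No

[1] Using the secureBoot field.
[2] Using the virtualizedTrustedPlatformModule field.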
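
The combinations in Table 2.1 can be encoded as a lookup so that a configuration is checked before it is applied. The following Python sketch is illustrative only; the function name is hypothetical, None stands for an omitted field, and a combination that the table does not list returns None rather than a guess.

```python
from typing import Optional

# Rows of Table 2.1 as (secureBoot, virtualizedTrustedPlatformModule) pairs.
# None stands for an omitted field.
_TABLE_2_1 = {
    ("Enabled", "Enabled"): True,
    ("Enabled", "Disabled"): True,
    ("Enabled", None): True,
    ("Disabled", "Enabled"): True,
    (None, "Enabled"): True,
    ("Disabled", "Disabled"): False,
    (None, "Disabled"): False,
    (None, None): False,
}


def is_valid_uefi_combination(
    secure_boot: Optional[str], vtpm: Optional[str]
) -> Optional[bool]:
    """Return True or False per Table 2.1, or None if the table has no such row."""
    return _TABLE_2_1.get((secure_boot, vtpm))


if __name__ == "__main__":
    print(is_valid_uefi_combination("Enabled", None))
    print(is_valid_uefi_combination(None, "Disabled"))
```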

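For more information about related features and functionality, see the Microsoft Azure documentation about Trusted launch for Azure virtual machines.

Procedure

In a text editor, open the YAML file for an existing machine set or create a new one.

Edit the following section under the providerSpec field to provide a valid configuration:

Sample valid configuration with UEFI Secure Boot and vTPM enabled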

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
# ...
spec:
  template:
    spec:
      providerSpec:
        value:
          securityProfile:
            settings:
              securityType: TrustedLaunch
              trustedLaunch:
                uefiSettings:
                  secureBoot: Enabled
                  virtualizedTrustedPlatformModule: Enabled
# ...

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
# ...
spec:
  template:
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            securityProfile:
              settings:
                securityType: TrustedLaunch

1


                trustedLaunch:
                  uefiSettings:

2


                    secureBoot: Enabled

3


                    virtualizedTrustedPlatformModule: Enabled

4


# ...

Copy to Clipboard

Toggle word wrap

1: 为 Azure 虚拟机启用可信启动。所有有效配置都需要这个值。
2: 指定要使用的 UEFI 安全功能。所有有效配置都需要这个部分。
3: 启用 UEFI 安全引导。
4: 启用使用 vTPM。

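Verification

On the Azure portal, review the details for a machine deployed by the machine set and verify that the trusted launch options match the values that you configured.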

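2.2.11. Configuring Azure confidential virtual machines by using machine sets

Important

Using Azure confidential virtual machines is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, so that customers can test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

OpenShift Container Platform 4.16 supports Azure confidential virtual machines (VMs).

Note

Confidential VMs are currently not supported on 64-bit ARM architectures.

By editing the machine set YAML file, you can configure the confidential VM options that a machine set uses for the machines that it deploys. For example, you can configure these machines to use UEFI security features such as Secure Boot or a dedicated virtual Trusted Platform Module (vTPM) instance.

For more information about related features and functionality, see the Microsoft Azure documentation about Confidential virtual machines.

Procedure

In a text editor, open the YAML file for an existing machine set or create a new one.

Edit the following section under the providerSpec field:

Sample configuration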

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
# ...
spec:
  template:
    spec:
      providerSpec:
        value:
          osDisk:
            # ...
            managedDisk:
              securityProfile: # 1
                securityEncryptionType: VMGuestStateOnly # 2
            # ...
          securityProfile: # 3
            settings:
              securityType: ConfidentialVM # 4
              confidentialVM:
                uefiSettings: # 5
                  secureBoot: Disabled # 6
                  virtualizedTrustedPlatformModule: Enabled # 7
          vmSize: Standard_DC16ads_v5 # 8
# ...

1: Specifies security profile settings for the managed disk when using a confidential VM.
2: Enables encryption of the Azure VM Guest State (VMGS) blob. This setting requires the use of a vTPM.
3: Specifies security profile settings for the confidential VM.
4: Enables the use of confidential VMs. This value is required for all valid configurations.
5: Specifies which UEFI security features to use. This section is required for all valid configurations.
6: Disables UEFI Secure Boot.
7: Enables the use of a vTPM.
8: Specifies an instance type that supports confidential VMs.
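
The constraints in the callouts above can be checked before the machine set is applied. The following Python sketch assumes that the `providerSpec.value` section has already been loaded into a plain dictionary (for example, from JSON). The function name is hypothetical, and the checks cover only the three requirements stated in the callouts.

```python
def check_confidential_vm(value: dict) -> list:
    """Return a list of problems found in a providerSpec.value mapping."""
    problems = []
    settings = value.get("securityProfile", {}).get("settings", {})

    # Callout 4: securityType must be ConfidentialVM.
    if settings.get("securityType") != "ConfidentialVM":
        problems.append("securityType must be ConfidentialVM")

    # Callout 5: the uefiSettings section is required.
    uefi = settings.get("confidentialVM", {}).get("uefiSettings")
    if not uefi:
        problems.append("confidentialVM.uefiSettings is required")
        uefi = {}

    # Callout 2: VMGS blob encryption requires a vTPM.
    disk_profile = (
        value.get("osDisk", {}).get("managedDisk", {}).get("securityProfile", {})
    )
    if (disk_profile.get("securityEncryptionType") == "VMGuestStateOnly"
            and uefi.get("virtualizedTrustedPlatformModule") != "Enabled"):
        problems.append(
            "VMGuestStateOnly requires virtualizedTrustedPlatformModule: Enabled"
        )
    return problems
```

An empty list means that none of these three checks failed. It does not guarantee that Azure accepts the configuration.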

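Verification

On the Azure portal, review the details for a machine deployed by the machine set and verify that the confidential VM options match the values that you configured.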

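2.2.12. Accelerated Networking for Microsoft Azure VMs

Accelerated Networking uses single root I/O virtualization (SR-IOV) to provide Microsoft Azure VMs with a more direct path to the switch. This improves network performance. You can enable this feature during or after installation.

2.2.12.1. Limitations

Consider the following limitations when deciding whether to use Accelerated Networking:

Accelerated Networking is supported only on clusters where the Machine API is operational.
Although the minimum requirement for an Azure worker node is two vCPUs, Accelerated Networking requires an Azure VM size that includes at least four vCPUs. To satisfy this requirement, you can change the value of vmSize in your machine set. For information about Azure VM sizes, see the Microsoft Azure documentation.

When this feature is enabled on an existing Azure cluster, only newly provisioned nodes are affected. Currently running nodes are not reconciled. To enable the feature on all nodes, you must replace each existing machine. You can do this for each machine individually, or by scaling the replicas down to zero and then scaling back up to the desired number of replicas.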
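
The vCPU requirement above can be expressed as a small pre-flight check. The following Python sketch is illustrative only: the function and constant names are hypothetical, and it checks nothing beyond the four-vCPU minimum stated in this section. Whether a particular Azure VM size supports Accelerated Networking must still be confirmed in the Microsoft Azure documentation.

```python
MIN_VCPUS_FOR_ACCELERATED_NETWORKING = 4


def check_accelerated_networking(accelerated_networking: bool, vcpus: int) -> str:
    """Return 'ok' or a short message describing the vCPU problem."""
    if not accelerated_networking:
        # Nothing to check when the feature is not requested.
        return "ok"
    if vcpus < MIN_VCPUS_FOR_ACCELERATED_NETWORKING:
        return (
            f"vmSize has {vcpus} vCPUs; Accelerated Networking needs at least "
            f"{MIN_VCPUS_FOR_ACCELERATED_NETWORKING}"
        )
    return "ok"
```

The vCPU count could be taken from the VM size documentation or from the machine.openshift.io/vCPU annotation on an existing machine set.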

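2.2.13. Configuring Capacity Reservation by using machine sets

OpenShift Container Platform version 4.16.3 and later supports on-demand Capacity Reservation with Capacity Reservation groups on Microsoft Azure clusters.

You can configure a machine set to deploy machines on any available resources that match the parameters of a capacity request that you define. These parameters specify the VM size, region, and number of instances that you want to reserve. If your Azure subscription quota can accommodate the capacity request, the deployment succeeds.

For more information, including limitations and suggested use cases for this Azure instance type, see the Microsoft Azure documentation about On-demand Capacity Reservation.

Note

You cannot change an existing Capacity Reservation configuration for a machine set. To use a different Capacity Reservation group, you must replace the machine set and the machines that the previous machine set deployed.

Prerequisites

You have access to the cluster with cluster-admin privileges.
You installed the OpenShift CLI (oc).
You created a Capacity Reservation group.
For more information, see the Microsoft Azure documentation about creating a Capacity Reservation.

Procedure

In a text editor, open the YAML file for an existing machine set or create a new one.

Edit the following section under the providerSpec field:

Sample configuration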

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
# ...
spec:
  template:
    spec:
      providerSpec:
        value:
          capacityReservationGroupID: <capacity_reservation_group> # 1
# ...

1: Specify the ID of the Capacity Reservation group that you want the machine set to deploy machines on.

Verification

To verify machine deployment, list the machines that the machine set created by running the following command:
```
$ oc get machines.machine.openshift.io \
  -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machineset=<machine_set_name>
```
where <machine_set_name> is the name of the compute machine set.
In the output, verify that the characteristics of the listed machines match the parameters of your Capacity Reservation.
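
If the machine list is long, the comparison can be scripted. The following Python sketch assumes that the same command was run with `-o json`, and that the `items` array of the result was loaded into a list of dictionaries. It reports the machines whose `vmSize` or `location` differs from the values reserved in the Capacity Reservation. The function name is hypothetical, and the field paths follow the `providerSpec.value` layout used in the machine set samples in this document.

```python
def mismatched_machines(machines: list, expected_vm_size: str,
                        expected_location: str) -> list:
    """Return the names of machines that do not match the reservation."""
    mismatched = []
    for machine in machines:
        value = machine.get("spec", {}).get("providerSpec", {}).get("value", {})
        if (value.get("vmSize") != expected_vm_size
                or value.get("location") != expected_location):
            name = machine.get("metadata", {}).get("name", "<unnamed>")
            mismatched.append(name)
    return mismatched
```

An empty result means that every listed machine uses the expected VM size and region. It does not confirm that Azure placed the machine in the reservation.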

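2.2.14. Adding a GPU node to an existing OpenShift Container Platform cluster

You can copy and modify a default compute machine set configuration to create a GPU-enabled machine set and machines for the Azure cloud provider.

The following table lists the validated instance types:

vmSize	NVIDIA GPU accelerator	Maximum number of GPUs	Architecture
`Standard_NC24s_v3`	V100	4	x86
`Standard_NC4as_T4_v3`	T4	1	x86
`ND A100 v4`	A100	8	x86

Note

By default, Azure subscriptions do not have a quota for the Azure instance types with GPUs. Customers must request a quota increase for the Azure instance families listed above.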

Procedure

View the machine sets that exist in the openshift-machine-api namespace by running the following command. Each compute machine set is associated with a different availability zone within the Azure region. The installation program automatically load balances compute machines across availability zones.

$ oc get machineset -n openshift-machine-api

Example output

NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
myclustername-worker-centralus1   1         1         1       1           6h9m
myclustername-worker-centralus2   1         1         1       1           6h9m
myclustername-worker-centralus3   1         1         1       1           6h9m

Make a copy of one of the existing compute MachineSet definitions and output the result to a YAML file by running the following command. This file is the basis for the GPU-enabled compute machine set definition.

$ oc get machineset -n openshift-machine-api myclustername-worker-centralus1 -o yaml > machineset-azure.yaml

View the content of the machine set:

$ cat machineset-azure.yaml

Example machineset-azure.yaml file

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-06T14:08:19Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: myclustername
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: myclustername-worker-centralus1
  namespace: openshift-machine-api
  resourceVersion: "23601"
  uid: acd56e0c-7612-473a-ae37-8704f34b80de
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: myclustername
      machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: myclustername
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/myclustername-rg/providers/Microsoft.Compute/galleries/gallery_myclustername_n6n4r/images/myclustername-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: myclustername-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: myclustername-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: myclustername
          resourceGroup: myclustername-rg
          spotVMOptions: {}
          subnet: myclustername-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4s_v3
          vnet: myclustername-vnet
          zone: "1"
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

Make a copy of the machineset-azure.yaml file by running the following command:
```
$ cp machineset-azure.yaml machineset-azure-gpu.yaml
```

Update the following fields in machineset-azure-gpu.yaml:

Change .metadata.name to a name that contains gpu.
Change .spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"] to match the new .metadata.name.
Change .spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"] to match the new .metadata.name.
Change .spec.template.spec.providerSpec.value.vmSize to Standard_NC4as_T4_v3.
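
The four field updates listed above can also be applied programmatically. The following Python sketch assumes that the machine set definition has been loaded into a dictionary (for example, from `oc get machineset ... -o json`). The function name is hypothetical, and the sketch changes only the four fields named in this step.

```python
import copy

MACHINESET_LABEL = "machine.openshift.io/cluster-api-machineset"


def derive_gpu_machineset(machineset: dict, new_name: str, vm_size: str) -> dict:
    """Return a modified copy of a machine set; the input is left untouched."""
    gpu = copy.deepcopy(machineset)
    # 1. Rename the machine set.
    gpu["metadata"]["name"] = new_name
    # 2. Keep the selector in sync with the new name.
    gpu["spec"]["selector"]["matchLabels"][MACHINESET_LABEL] = new_name
    # 3. Keep the template label in sync with the new name.
    gpu["spec"]["template"]["metadata"]["labels"][MACHINESET_LABEL] = new_name
    # 4. Switch to the GPU-enabled VM size.
    gpu["spec"]["template"]["spec"]["providerSpec"]["value"]["vmSize"] = vm_size
    return gpu
```

Updating the selector and the template label together matters because the machine set selects its machines through that label.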

Example machineset-azure-gpu.yaml file

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "1"
    machine.openshift.io/memoryMb: "28672"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-06T20:27:12Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: myclustername
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: myclustername-nc4ast4-gpu-worker-centralus1
  namespace: openshift-machine-api
  resourceVersion: "166285"
  uid: 4eedce7f-6a57-4abe-b529-031140f02ffa
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: myclustername
      machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: myclustername
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/myclustername-rg/providers/Microsoft.Compute/galleries/gallery_myclustername_n6n4r/images/myclustername-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: myclustername-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: myclustername-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: myclustername
          resourceGroup: myclustername-rg
          spotVMOptions: {}
          subnet: myclustername-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_NC4as_T4_v3
          vnet: myclustername-vnet
          zone: "1"
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

To verify your changes, perform a diff of the original compute definition and the new GPU-enabled node definition by running the following command:

$ diff machineset-azure.yaml machineset-azure-gpu.yaml

Example output

14c14
<   name: myclustername-worker-centralus1
---
>   name: myclustername-nc4ast4-gpu-worker-centralus1
23c23
<       machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
---
>       machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
30c30
<         machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
---
>         machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
67c67
<           vmSize: Standard_D4s_v3
---
>           vmSize: Standard_NC4as_T4_v3

Create the GPU-enabled compute machine set from the definition file by running the following command:

$ oc create -f machineset-azure-gpu.yaml

Example output

machineset.machine.openshift.io/myclustername-nc4ast4-gpu-worker-centralus1 created

View the machine sets that exist in the openshift-machine-api namespace by running the following command. Each compute machine set is associated with a different availability zone within the Azure region. The installation program automatically load balances compute machines across availability zones.

$ oc get machineset -n openshift-machine-api

Example output

NAME                                               DESIRED   CURRENT   READY   AVAILABLE   AGE
clustername-n6n4r-nc4ast4-gpu-worker-centralus1    1         1         1       1           122m
clustername-n6n4r-worker-centralus1                1         1         1       1           8h
clustername-n6n4r-worker-centralus2                1         1         1       1           8h
clustername-n6n4r-worker-centralus3                1         1         1       1           8h

View the machines that exist in the openshift-machine-api namespace by running the following command. You can configure only one compute machine per set, although you can scale a compute machine set to add a node in a particular region and zone.

$ oc get machines -n openshift-machine-api

Example output

NAME                                                PHASE     TYPE                   REGION      ZONE   AGE
myclustername-master-0                              Running   Standard_D8s_v3        centralus   2      6h40m
myclustername-master-1                              Running   Standard_D8s_v3        centralus   1      6h40m
myclustername-master-2                              Running   Standard_D8s_v3        centralus   3      6h40m
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn   Running   Standard_NC4as_T4_v3   centralus   1      21m
myclustername-worker-centralus1-rbh6b               Running   Standard_D4s_v3        centralus   1      6h38m
myclustername-worker-centralus2-dbz7w               Running   Standard_D4s_v3        centralus   2      6h38m
myclustername-worker-centralus3-p9b8c               Running   Standard_D4s_v3        centralus   3      6h38m

View the existing nodes by running the following command. Note that each node is an instance of a machine definition with a specific Azure region and OpenShift Container Platform role.

$ oc get nodes

Example output

NAME                                                STATUS   ROLES                  AGE     VERSION
myclustername-master-0                              Ready    control-plane,master   6h39m   v1.29.4
myclustername-master-1                              Ready    control-plane,master   6h41m   v1.29.4
myclustername-master-2                              Ready    control-plane,master   6h39m   v1.29.4
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn   Ready    worker                 14m     v1.29.4
myclustername-worker-centralus1-rbh6b               Ready    worker                 6h29m   v1.29.4
myclustername-worker-centralus2-dbz7w               Ready    worker                 6h29m   v1.29.4
myclustername-worker-centralus3-p9b8c               Ready    worker                 6h31m   v1.29.4

View the list of compute machine sets:

$ oc get machineset -n openshift-machine-api

Example output

NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
myclustername-worker-centralus1        1         1         1       1           8h
myclustername-worker-centralus2        1         1         1       1           8h
myclustername-worker-centralus3        1         1         1       1           8h

Create the GPU-enabled compute machine set from the definition file by running the following command:
```
$ oc create -f machineset-azure-gpu.yaml
```

View the list of compute machine sets:

$ oc get machineset -n openshift-machine-api

Example output

NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
myclustername-nc4ast4-gpu-worker-centralus1   1         1         1       1           121m
myclustername-worker-centralus1               1         1         1       1           8h
myclustername-worker-centralus2               1         1         1       1           8h
myclustername-worker-centralus3               1         1         1       1           8h

Verification

View the machine set that you created by running the following command:

$ oc get machineset -n openshift-machine-api | grep gpu

The MachineSet replica count is set to 1, so a new Machine object is created automatically.

Example output

myclustername-nc4ast4-gpu-worker-centralus1   1         1         1       1           121m

View the Machine object that the machine set created by running the following command:

$ oc -n openshift-machine-api get machines | grep gpu

Example output

myclustername-nc4ast4-gpu-worker-centralus1-w9bqn   Running   Standard_NC4as_T4_v3   centralus   1      21m

Note

You do not need to specify a namespace for the node. The node definition is cluster scoped.

2.2.15. Deploying the Node Feature Discovery Operator

After the GPU-enabled node is created, you must discover the node so that workloads can be scheduled on it. To do this, install the Node Feature Discovery (NFD) Operator. The NFD Operator identifies hardware device features in nodes. It solves the general problem of identifying and cataloging hardware resources in the infrastructure nodes so that they can be made available to OpenShift Container Platform.

Procedure

Install the Node Feature Discovery Operator from OperatorHub in the OpenShift Container Platform console.
After installing the NFD Operator from OperatorHub, select Node Feature Discovery from the list of installed Operators and select Create instance. This installs the nfd-master and nfd-worker pods in the openshift-nfd namespace, with one nfd-worker pod for each compute node.

Verify that the Operator is installed and running by running the following command:

$ oc get pods -n openshift-nfd

Example output

NAME                                       READY   STATUS    RESTARTS     AGE
nfd-controller-manager-8646fcbb65-x5qgk    2/2     Running   7 (8h ago)   1d

Browse to the installed Operator in the console and select Create Node Feature Discovery.
Select Create to build an NFD custom resource. This creates NFD pods in the openshift-nfd namespace that poll the OpenShift Container Platform nodes for hardware resources and catalog them.

Verification

After a successful build, verify that an NFD pod is running on each node by running the following command:

$ oc get pods -n openshift-nfd

Example output

NAME                                       READY   STATUS      RESTARTS        AGE
nfd-controller-manager-8646fcbb65-x5qgk    2/2     Running     7 (8h ago)      12d
nfd-master-769656c4cb-w9vrv                1/1     Running     0               12d
nfd-worker-qjxb2                           1/1     Running     3 (3d14h ago)   12d
nfd-worker-xtz9b                           1/1     Running     5 (3d14h ago)   12d

The NFD Operator uses vendor PCI IDs to identify hardware in a node. NVIDIA uses the PCI ID 10de.

View the NVIDIA GPU discovered by the NFD Operator by running the following command:

$ oc describe node ip-10-0-132-138.us-east-2.compute.internal | egrep 'Roles|pci'

Example output

Roles: worker
feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true

10de appears in the node feature list for the GPU-enabled node. This means that the NFD Operator correctly identified the node from the GPU-enabled MachineSet.
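
The same check can be scripted against node labels. The following Python sketch assumes that the labels of a node have been loaded into a dictionary (for example, from the `metadata.labels` field of `oc get node <node_name> -o json`). The function name is hypothetical. The label key follows the pattern shown in the example output above.

```python
NVIDIA_PCI_VENDOR_ID = "10de"


def has_nvidia_gpu(labels: dict) -> bool:
    """Return True when NFD has labeled the node with the NVIDIA PCI vendor ID."""
    key = f"feature.node.kubernetes.io/pci-{NVIDIA_PCI_VENDOR_ID}.present"
    # Label values are strings, so compare against "true", not True.
    return labels.get(key) == "true"
```

A node without the label, or with any value other than "true", is reported as having no NVIDIA GPU.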

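2.2.15.1. Enabling Accelerated Networking on an existing Microsoft Azure cluster

You can enable Accelerated Networking on Azure by adding acceleratedNetworking to your machine set YAML file.

Prerequisites

You have an existing Microsoft Azure cluster where the Machine API is operational.

Procedure

Add the following to the providerSpec field:
```
providerSpec:
  value:
    acceleratedNetworking: true # 1
    vmSize: <azure-vm-size> # 2
```
1: This line enables Accelerated Networking.
2: Specify an Azure VM size that includes at least four vCPUs. For information about VM sizes, see the Microsoft Azure documentation.

Next steps

To enable the feature on currently running nodes, you must replace each existing machine. You can do this for each machine individually, or by scaling the replicas down to zero and then scaling back up to the desired number of replicas.

Verification

On the Microsoft Azure portal, review the Networking settings page for a machine provisioned by the machine set, and verify that the Accelerated networking field is set to Enabled.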
