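5.6. Configuring the NVIDIA Network Operator
The NVIDIA Network Operator manages NVIDIA networking resources and networking-related components, such as drivers and device plugins, to enable NVIDIA GPUDirect RDMA workloads.
Prerequisites
- You have installed the NVIDIA Network Operator.
Procedure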
Verify that the Network Operator is installed and running by confirming that the controller is running in the nvidia-network-operator namespace:
$ oc get pods -n nvidia-network-operator
Example output
NAME                                                          READY   STATUS    RESTARTS   AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg   1/1     Running   0          5m
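For scripted checks, the same verification can be automated instead of read by eye. The following is a minimal sketch, not part of the Operator tooling: the helper name all_pods_ready is hypothetical, and it assumes the default column layout of oc get pods (NAME, READY, STATUS, and so on) with the header row suppressed by --no-headers.

```shell
# Hypothetical helper: reads pod lines (no header) on stdin and succeeds
# only if at least one pod is listed and every pod is in the Running state
# with all of its containers ready (READY column of the form n/n).
all_pods_ready() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      split($2, ready, "/")               # READY column, for example 1/1
      if ($3 != "Running" || ready[1] != ready[2]) { bad = 1 }
      seen = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Example use against a cluster (requires a logged-in oc session):
#   oc get pods -n nvidia-network-operator --no-headers | all_pods_ready \
#     && echo "Network Operator pods are ready"
```

The function returns a non-zero status when no pods are listed, so an empty namespace is not mistaken for a healthy one.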
With the Operator running, create the NicClusterPolicy custom resource file. The device that you select depends on your system configuration. In this example, the InfiniBand interface ibs2f0 is hard coded and is used as the shared NVIDIA GPUDirect RDMA device.
Create the NicClusterPolicy custom resource on the cluster by running the following command:
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
Example output
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Validate the NicClusterPolicy by confirming that the DOCA/MOFED pods are running in the nvidia-network-operator namespace:
$ oc get pods -n nvidia-network-operator
Open a remote shell (rsh) in the mofed container and check the status by running the following commands:
$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
Example output
OFED-internal-24.07-0.6.1:
sh-5.1# ibdev2netdev -v
Example output
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
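Because interface names differ between systems, it can help to reduce the ibdev2netdev output to a plain mapping of RDMA device to network interface before writing the custom resources. The following is a minimal sketch, assuming one device per line with the interface name directly after the ==> marker, as in the example output; the function name rdma_netdev_map is hypothetical.

```shell
# Hypothetical helper: reads `ibdev2netdev -v` output on stdin and prints
# one "<rdma-device> <interface>" pair per line, for example "mlx5_0 ibs2f0".
rdma_netdev_map() {
  awk '
    {
      dev = ""; ifname = ""
      for (i = 1; i <= NF; i++) {
        # First field that looks like an mlx5 device name
        if (dev == "" && $i ~ /^mlx5_[0-9]+$/) dev = $i
        # The interface name is the field that follows the ==> marker
        if ($i == "==>" && i < NF) ifname = $(i + 1)
      }
      if (dev != "" && ifname != "") print dev, ifname
    }
  '
}

# Example use against a cluster (requires MOFED_POD to be set as in the step above):
#   oc rsh -n nvidia-network-operator -c mofed-container "${MOFED_POD}" \
#     ibdev2netdev -v | rdma_netdev_map
```

Lines that do not contain both a device name and a ==> marker are skipped, so extra banner or blank lines do not produce output.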
Create an IPoIBNetwork custom resource file.
Create the IPoIBNetwork resource on the cluster by running the following command:
$ oc create -f ipoib-network.yaml
Example output
ipoibnetwork.mellanox.com/example-ipoibnetwork created
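For orientation, an IPoIBNetwork custom resource file such as ipoib-network.yaml could look like the following sketch. The resource name example-ipoibnetwork and the interface ibs2f0 come from this procedure. The whereabouts IPAM type, the address range, and the default target namespace are illustrative assumptions, not values from this procedure, and must be adapted to your environment.

```shell
# Sketch only: write a hypothetical ipoib-network.yaml to the current directory.
# Assumed values: whereabouts IPAM, the 192.168.6.224/28 range, and the
# "default" namespace are placeholders chosen for illustration.
cat > ipoib-network.yaml <<'EOF'
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  networkNamespace: default
  master: ibs2f0
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.224/28"
    }
EOF
```

The master field names the InfiniBand interface on the node, so it should match an interface reported by ibdev2netdev on your hardware.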
Create a MacvlanNetwork custom resource file for the other interface.
Create the resource on the cluster by running the following command:
$ oc create -f macvlan-network.yaml
Example output
macvlannetwork.mellanox.com/rdmashared-net created
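As a rough illustration, a MacvlanNetwork custom resource file such as macvlan-network.yaml could be written as follows. The resource name rdmashared-net comes from this procedure. Using ens8f0np0 as the master interface is an assumption based on the second device in the ibdev2netdev example output. The bridge mode, MTU, whereabouts IPAM settings, address range, and default namespace are also illustrative assumptions.

```shell
# Sketch only: write a hypothetical macvlan-network.yaml to the current directory.
# Assumed values: master interface ens8f0np0, bridge mode, MTU 1500,
# whereabouts IPAM with the 192.168.2.0/24 range, and the "default" namespace.
cat > macvlan-network.yaml <<'EOF'
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.2.0/24"
    }
EOF
```

Choose an address range that does not overlap with the range used for the IPoIBNetwork resource or with other networks reachable from the nodes.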