4.5. 被配置为使用 RDMA 的 Training Operator PyTorchJob 资源示例


本例演示了如何创建配置为使用 Remote Direct Memory Access (RDMA)运行的 Training Operator PyTorch training 作业。

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: "example-net"
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - "your container command"
            env:
            - name: NCCL_SOCKET_IFNAME
              value: "net1"
            - name: NCCL_IB_HCA
              value: "mlx5_1"
            image: quay.io/modh/training:py311-cuda121-torch241
            name: pytorch
            resources:
              limits:
                nvidia.com/gpu: "1"
                rdma/rdma_shared_device_eth: "1"
              requests:
                nvidia.com/gpu: "1"
                rdma/rdma_shared_device_eth: "1"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: "example-net"
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - "your container command"
            env:
            - name: NCCL_SOCKET_IFNAME
              value: "net1"
            - name: NCCL_IB_HCA
              value: "mlx5_1"
            image: quay.io/modh/training:py311-cuda121-torch241
            name: pytorch
            resources:
              limits:
                nvidia.com/gpu: "1"
                rdma/rdma_shared_device_eth: "1"
              requests:
                nvidia.com/gpu: "1"
                rdma/rdma_shared_device_eth: "1"
Copy to Clipboard Toggle word wrap
返回顶部
Red Hat logoGithubredditYoutubeTwitter

学习

尝试、购买和销售

社区

关于红帽文档

通过我们的产品和服务,以及可以信赖的内容,帮助红帽用户创新并实现他们的目标。 了解我们当前的更新.

让开源更具包容性

红帽致力于替换我们的代码、文档和 Web 属性中存在问题的语言。欲了解更多详情,请参阅红帽博客.

關於紅帽

我们提供强化的解决方案,使企业能够更轻松地跨平台和环境(从核心数据中心到网络边缘)工作。

Theme

© 2025 Red Hat