9.2. Example Kueue resource configurations for distributed workloads
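You can use these example configurations as a starting point for creating Kueue resources to manage your distributed training workloads.
These examples show how to configure Kueue resource flavors and cluster queues for common distributed training scenarios.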
Note
In OpenShift AI, Red Hat does not support shared cohorts.
9.2.1. NVIDIA GPUs without shared cohort
9.2.1.1. NVIDIA RTX A400 GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "a400node"
spec:
  nodeLabels:
    instance-type: nvidia-a400-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"
9.2.1.2. NVIDIA RTX A1000 GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "a1000node"
spec:
  nodeLabels:
    instance-type: nvidia-a1000-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"
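The nodeLabels in a resource flavor identify the nodes that the flavor represents, and the tolerations allow admitted workloads to run on nodes that carry the matching taint. As an illustrative sketch only, a node that the a1000node flavor targets might look like the following fragment. The node name gpu-worker-0 is hypothetical, and in practice labels and taints are usually applied to existing nodes rather than created from a manifest like this one.

```yaml
# Illustrative fragment of a Node object, not a complete manifest.
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-0                      # hypothetical node name
  labels:
    instance-type: nvidia-a1000-node      # matches spec.nodeLabels in the a1000node flavor
spec:
  taints:
  - key: "HasGPU"                         # tolerated by the flavor's "Exists" toleration
    effect: "NoSchedule"
```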
9.2.1.3. NVIDIA RTX A400 GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a400queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "a400node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
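Workloads are normally submitted to a namespaced LocalQueue that points to a cluster queue, rather than to the cluster queue itself. The following is a minimal sketch of such a local queue for a400queue. The namespace example-project and the name a400-local-queue are hypothetical and should be replaced with values from your own environment.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: a400-local-queue        # hypothetical local queue name
  namespace: example-project    # hypothetical project namespace
spec:
  clusterQueue: a400queue       # the cluster queue defined above
```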
9.2.1.4. NVIDIA RTX A1000 GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a1000queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "a1000node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
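To show how a workload consumes this quota, the following sketch submits a plain Kubernetes Job through a local queue by using the kueue.x-k8s.io/queue-name label. It assumes that a LocalQueue named a1000-local-queue already exists in the hypothetical namespace example-project and points to a1000queue, and that Kueue is configured to manage batch Jobs. The container image is a placeholder, and a distributed training workload would typically use a different workload kind.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-gpu-job                              # hypothetical job name
  namespace: example-project                        # hypothetical project namespace
  labels:
    kueue.x-k8s.io/queue-name: a1000-local-queue    # assumed LocalQueue that points to a1000queue
spec:
  suspend: true                                     # Kueue unsuspends the job once it is admitted
  parallelism: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/training/sample:latest   # placeholder image
        command: ["sleep", "60"]
        resources:
          requests:                                 # counted against the cluster queue's nominalQuota
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
```

The requests in this sketch (4 CPUs, 16Gi of memory, 1 GPU) fit within the nominal quota of 16 CPUs, 64Gi of memory, and 2 GPUs that a1000queue defines.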
9.2.2. NVIDIA GPUs and AMD GPUs without shared cohort
9.2.2.1. AMD GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "amd-node"
spec:
  nodeLabels:
    instance-type: amd-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"
9.2.2.2. NVIDIA GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "nvidia-node"
spec:
  nodeLabels:
    instance-type: nvidia-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"
9.2.2.3. AMD GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-amd-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: "amd-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "amd.com/gpu"
        nominalQuota: 2
9.2.2.4. NVIDIA GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-nvidia-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "nvidia-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
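The empty namespaceSelector in these cluster queues accepts workloads from local queues in every namespace. If a queue is meant to serve a single team, the selector can be narrowed. The following variant of the NVIDIA GPU cluster queue is a sketch that assumes a hypothetical namespace named team-a. It relies on the kubernetes.io/metadata.name label, which Kubernetes sets automatically on every namespace.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-nvidia-queue"
spec:
  namespaceSelector:                          # only local queues in matching namespaces can use this queue
    matchLabels:
      kubernetes.io/metadata.name: team-a     # hypothetical namespace name
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "nvidia-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
```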