4.4. Managing workloads with the JobSet Operator

Use the JobSet Operator on OpenShift Container Platform to manage and run large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. Features such as multi-template job support and stable networking can help workloads recover quickly from failures and use cluster resources efficiently.

4.4.1. Deploying a JobSet

You can use the JobSet Operator to deploy a JobSet to manage and run large-scale, coordinated workloads.

Prerequisites

You have installed the JobSet Operator.
You have a cluster with available NVIDIA GPUs.

Procedure

Create a new project by running the following command:
```
$ oc new-project <my_namespace>
```

Create a file named jobset.yaml with the following content:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 3
        completions: 3
        backoffLimit: 0
        template:
          spec:
            imagePullSecrets:
              - name: my-registry-secret
            initContainers:
              - name: prepare
                image: docker.io/alpine/git:v2.52.0
                args: ['clone', 'https://github.com/pytorch/examples']
                volumeMounts:
                  - name: workdir
                    mountPath: /git
            containers:
              - name: pytorch
                image: docker.io/pytorch/pytorch:2.10.0-cuda13.0-cudnn9-runtime
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                  requests:
                    nvidia.com/gpu: "1"
                ports:
                - containerPort: 4321
                env:
                - name: MASTER_ADDR
                  value: "pytorch-workers-0-0.pytorch"
                - name: MASTER_PORT
                  value: "4321"
                - name: RANK
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
                - name: PYTHONUNBUFFERED
                  value: "0"
                command:
                - /bin/sh
                - -c
                - |
                  cd examples/distributed/ddp-tutorial-series
                  torchrun --nproc_per_node=1 --nnodes=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT multinode.py 1000 100
                volumeMounts:
                  - name: workdir
                    mountPath: /workspace
            volumes:
              - name: workdir
                emptyDir: {}

where:

spec.replicatedJobs.template.spec.parallelism: Specifies the maximum number of pods that run at the same time. In this example, the value is 3 to match the --nnodes=3 argument passed to torchrun.
spec.replicatedJobs.template.spec.completions: Specifies the total number of pods that must finish successfully for the job to be marked complete.
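
The MASTER_ADDR value in the example follows the hostname pattern that JobSet gives to each pod: the JobSet name, the replicated job name, the job index, and the pod index, followed by the headless service name, which defaults to the JobSet name. The following is a minimal sketch of that pattern, not part of the Operator; the jobset_pod_hosts function is a hypothetical helper, and it assumes the default headless service name.

```shell
#!/bin/sh
# Hypothetical helper (not part of the JobSet Operator or the oc CLI).
# Prints the hostname each pod of one replicated job is expected to get,
# assuming the headless service uses its default name (the JobSet name).
jobset_pod_hosts() {
  jobset="$1"      # JobSet name, for example: pytorch
  rjob="$2"        # replicated job name, for example: workers
  job_index="$3"   # index of the job replica, for example: 0
  pods="$4"        # number of pods in the job (parallelism)
  i=0
  while [ "$i" -lt "$pods" ]; do
    # Pattern: <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<service>
    echo "${jobset}-${rjob}-${job_index}-${i}.${jobset}"
    i=$((i + 1))
  done
}

# The first line printed is the value used for MASTER_ADDR in jobset.yaml.
jobset_pod_hosts pytorch workers 0 3
```

If you rename the JobSet or the replicated job, the MASTER_ADDR value must change to match.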

Apply the JobSet configuration by running the following command:
```
$ oc apply -f jobset.yaml
```
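
After the JobSet is applied, one possible way to block until the training run finishes is to wait on the Completed condition of the JobSet. The following sketch wraps oc wait in a function; the wait_for_jobset name and the 30-minute default timeout are assumptions chosen for illustration, not values from this procedure.

```shell
#!/bin/sh
# Hypothetical helper (not part of the JobSet Operator or the oc CLI).
# Blocks until the JobSet reports the Completed condition or the timeout expires.
wait_for_jobset() {
  name="$1"             # JobSet name, for example: pytorch
  namespace="$2"        # project that contains the JobSet
  timeout="${3:-30m}"   # assumed default; adjust to the expected run time
  oc wait --for=condition=Completed "jobset/${name}" -n "${namespace}" --timeout="${timeout}"
}
```

For example, wait_for_jobset pytorch <my_namespace> 1h returns a zero exit status when the JobSet completes. If the JobSet fails instead, the Completed condition is never set and the command exits with an error only when the timeout expires.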

Verification

Verify that pods were started by running the following command:

$ oc get pods -n <my_namespace>

Example output

NAME                        READY   STATUS    RESTARTS   AGE
pytorch-workers-0-0-2lzwt   1/1     Running   0          2m17s
pytorch-workers-0-1-g2lrv   1/1     Running   0          2m17s
pytorch-workers-0-2-dpljq   1/1     Running   0          2m17s
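
To check the result in a script rather than by eye, you could count the pods whose STATUS column reads Running and compare the count with the parallelism value. The following is a rough sketch; count_running is a hypothetical helper, and it assumes the default column layout of oc get pods shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get pods` output on stdin and prints
# how many pods have the status Running. Skips the header row (NR > 1)
# and assumes STATUS is the third column.
count_running() {
  awk 'NR > 1 && $3 == "Running" { n++ } END { print n + 0 }'
}

# Sample input copied from the example output; prints 3.
count_running <<'EOF'
NAME                        READY   STATUS    RESTARTS   AGE
pytorch-workers-0-0-2lzwt   1/1     Running   0          2m17s
pytorch-workers-0-1-g2lrv   1/1     Running   0          2m17s
pytorch-workers-0-2-dpljq   1/1     Running   0          2m17s
EOF
```

Against a live cluster, pipe the command output into the function: oc get pods -n <my_namespace> | count_running.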
