4.4. Managing workloads with the JobSet Operator


Use the JobSet Operator on OpenShift Container Platform to manage and run large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. Features such as multi-template job support and stable networking help these workloads recover quickly and use resources efficiently.

4.4.1. Deploying a JobSet

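After you install the JobSet Operator, you can deploy a JobSet to run a group of coordinated jobs as a single unit. The example in this procedure runs a distributed PyTorch training workload on three GPU worker pods.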

Prerequisites

  • You have installed the JobSet Operator.
  • You have a cluster with available NVIDIA GPUs.

Procedure

  1. Create a new project by running the following command:

    $ oc new-project <my_namespace>

  2. Create a file named jobset.yaml with the following content:

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: pytorch
    spec:
      replicatedJobs:
      - name: workers
        template:
          spec:
            parallelism: 3
            completions: 3
            backoffLimit: 0
            template:
              spec:
                imagePullSecrets:
                  - name: my-registry-secret
                initContainers:
                  - name: prepare
                    image: docker.io/alpine/git:v2.52.0
                    args: ['clone', 'https://github.com/pytorch/examples']
                    volumeMounts:
                      - name: workdir
                        mountPath: /git
                containers:
                  - name: pytorch
                    image: docker.io/pytorch/pytorch:2.10.0-cuda13.0-cudnn9-runtime
                    resources:
                      limits:
                        nvidia.com/gpu: "1"
                      requests:
                        nvidia.com/gpu: "1"
                    ports:
                    - containerPort: 4321
                    env:
                    - name: MASTER_ADDR
                      value: "pytorch-workers-0-0.pytorch"
                    - name: MASTER_PORT
                      value: "4321"
                    - name: RANK
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
                    - name: PYTHONUNBUFFERED
                      value: "0"
                    command:
                    - /bin/sh
                    - -c
                    - |
                      cd examples/distributed/ddp-tutorial-series
                      torchrun --nproc_per_node=1 --nnodes=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT multinode.py 1000 100
                    volumeMounts:
                      - name: workdir
                        mountPath: /workspace
                volumes:
                  - name: workdir
                    emptyDir: {}

    where:

    spec.replicatedJobs.template.spec.parallelism
    Specifies the maximum number of pods that the job runs at the same time.
    spec.replicatedJobs.template.spec.completions
    Specifies the total number of pods that must finish successfully for the job to be marked complete.

  3. Apply the JobSet configuration by running the following command:

    $ oc apply -f jobset.yaml
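
The MASTER_ADDR value in jobset.yaml depends on each worker pod having a predictable hostname. The following shell sketch prints the hostnames you can expect for the three workers. It assumes the upstream JobSet naming convention of <jobset_name>-<replicated_job_name>-<job_index>-<pod_index>.<subdomain>, with a subdomain that defaults to the JobSet name. The worker_hosts function and its variables are illustrative only and are not part of the JobSet Operator.

```shell
# Illustrative helper, not part of the JobSet Operator.
# Assumes the upstream JobSet hostname convention:
#   <jobset>-<replicated_job>-<job_index>-<pod_index>.<subdomain>
# where the subdomain defaults to the JobSet name.
jobset=pytorch          # metadata.name in jobset.yaml
replicated_job=workers  # spec.replicatedJobs[0].name
job_index=0             # the example output shows a single workers job, index 0
completions=3           # spec.replicatedJobs[0].template.spec.completions

# Print one hostname per worker pod, in completion-index order.
worker_hosts() {
  pod_index=0
  while [ "$pod_index" -lt "$completions" ]; do
    echo "${jobset}-${replicated_job}-${job_index}-${pod_index}.${jobset}"
    pod_index=$((pod_index + 1))
  done
}

worker_hosts
```

The first hostname that the sketch prints, pytorch-workers-0-0.pytorch, matches the MASTER_ADDR value in the example, so the worker with completion index 0 serves as the rendezvous endpoint for the other workers.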

Verification

  • Verify that the worker pods started by running the following command:

    $ oc get pods -n <my_namespace>

    Example output

    NAME                        READY   STATUS    RESTARTS   AGE
    pytorch-workers-0-0-2lzwt   1/1     Running   0          2m17s
    pytorch-workers-0-1-g2lrv   1/1     Running   0          2m17s
    pytorch-workers-0-2-dpljq   1/1     Running   0          2m17s
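
Beyond listing pods, you might also want to inspect the JobSet resource and the worker logs. The following shell functions are a minimal sketch of that follow-up check. The function names are hypothetical, and the label selector assumes that the JobSet controller applies the upstream jobset.sigs.k8s.io/jobset-name label to the pods it creates. Confirm the label on your cluster before relying on it.

```shell
# Illustrative helpers, not part of the JobSet Operator.
# Each function takes: <namespace> <jobset_name>
# Assumes pods carry the label jobset.sigs.k8s.io/jobset-name=<jobset_name>.

# Show the JobSet resource itself.
jobset_status() {
  oc get jobset "$2" -n "$1"
}

# List only the pods that belong to the JobSet.
jobset_pods() {
  oc get pods -n "$1" -l "jobset.sigs.k8s.io/jobset-name=$2"
}

# Print the last 20 log lines from each pod, prefixed with the pod name.
jobset_logs() {
  oc logs -n "$1" -l "jobset.sigs.k8s.io/jobset-name=$2" --prefix --tail=20
}
```

For example, running jobset_pods <my_namespace> pytorch lists only the pods created for the pytorch JobSet, and jobset_logs <my_namespace> pytorch shows recent training output from all three workers in one stream.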
