Chapter 4. JobSet Operator


4.1. JobSet Operator overview

Use the JobSet Operator on OpenShift Container Platform to manage and run large-scale, coordinated workloads like high-performance computing (HPC) and AI training. Features like multi-template job support and stable networking can help you recover quickly and use resources efficiently.

4.1.1. About the JobSet Operator

Use the JobSet Operator on OpenShift Container Platform to manage large, distributed, and coordinated computing workloads, such as high-performance computing (HPC) or artificial intelligence (AI) training, and gain automatic stability, coordination, and failure recovery.

The JobSet Operator is based on the JobSet open source project.

The JobSet Operator is designed to manage a group of jobs as a single, coordinated unit. This is especially useful in fields like HPC and large-scale AI model training, where a team of machines must run together for hours or days.

You can use the JobSet Operator to solve problems that are too big or too complex for a standard OpenShift Container Platform job. The JobSet Operator provides coordination, stability, and recovery.

The JobSet Operator automatically creates a stable headless service so that each worker pod gets a predictable DNS name and workers can find and communicate with each other, even after a failure and restart. The Operator also provides automatic failure recovery: if one small part of a large training job fails, the Operator can be configured to restart the entire group of workers from a saved checkpoint, which saves time and computing costs.
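For example, the following minimal sketch enables failure recovery by allowing the whole group of child jobs to be recreated up to three times. The resource name, image, and command are placeholders; the failurePolicy and replicatedJobs fields are the same ones used in the configuration examples later in this chapter.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: resilient-example
spec:
  failurePolicy:
    maxRestarts: 3            # recreate all child jobs up to three times after a failure
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: worker
              image: docker.io/bash:latest         # placeholder image
              command: ["bash", "-c", "sleep 300"]

Each pod in the group also gets a stable DNS name in the format <jobset_name>-<replicated_job_name>-<job_index>-<pod_index>.<jobset_name>, for example resilient-example-workers-0-0.resilient-example, which is how workers can address each other across restarts.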

The JobSet Operator also offers startup control, allowing you to define a specific startup sequence to ensure that dependencies are met, for example, that the leader is running before any workers attempt to connect.
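The following sketch illustrates this kind of ordering with the startupPolicy field from the upstream JobSet API. The startupPolicy field names are an assumption based on the upstream project rather than on the procedures in this chapter, and the resource name, images, and commands are placeholders.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: ordered-startup-example
spec:
  startupPolicy:
    startupPolicyOrder: InOrder    # start replicated jobs in the order they are listed
  replicatedJobs:
  - name: leader                   # started first
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            containers:
            - name: leader
              image: docker.io/bash:latest       # placeholder image
              command: ["bash", "-c", "sleep 600"]
  - name: workers                  # created only after the leader job is ready
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            containers:
            - name: worker
              image: docker.io/bash:latest       # placeholder image
              command: ["bash", "-c", "sleep 600"]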

The JobSet Operator makes managing large, distributed, and coordinated computing tasks on OpenShift Container Platform easier by turning many individual components into one resilient and manageable system.

4.2. JobSet Operator release notes

Track the development, features, and fixes for the JobSet Operator, which manages coordinated, large-scale computing workloads on OpenShift Container Platform.

For more information, see About the JobSet Operator.

4.2.1. Release notes for JobSet Operator 1.0

Review the new features and advisories for the initial release of JobSet Operator 1.0.

Issued: 12 February 2026

The following advisories are available for the JobSet Operator 1.0:

4.2.1.1. New features and enhancements

  • This is the initial Generally Available release of the JobSet Operator.

4.3. Installing the JobSet Operator

Install the JobSet Operator on OpenShift Container Platform to enable management of large-scale, coordinated computing workloads, giving your applications a unified API and failure recovery.

4.3.1. Installing the JobSet Operator

Install the JobSet Operator on OpenShift Container Platform using the web console to begin managing large-scale, coordinated computing workloads.

Prerequisites

  • You have access to the cluster with cluster-admin privileges.
  • You have access to the OpenShift Container Platform web console.
  • You have installed the cert-manager Operator for Red Hat OpenShift.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Verify that the cert-manager Operator for Red Hat OpenShift is installed.
  3. Install the JobSet Operator.

    1. Navigate to Ecosystem → Software Catalog.
    2. Search for and select the openshift-operators project.
    3. Enter JobSet Operator into the filter box.
    4. Select the JobSet Operator and click Install.
    5. On the Install Operator page:

      1. The Update channel is set to stable-v1.0, which installs the latest stable release of JobSet Operator.
      2. Under Installation mode, select A specific namespace on the cluster.
      3. Under Installed Namespace, select Operator recommended Namespace: openshift-jobset-operator.
      4. Under Update approval, select one of the following update strategies:

        • The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.
        • The Manual strategy requires a user with appropriate credentials to approve the Operator update.
      5. Click Install.
  4. Create the custom resource (CR) for the JobSet Operator:

    1. Navigate to Installed Operators → JobSet Operator.
    2. Under Provided APIs, click Create instance in the JobSetOperator pane.
    3. Set the name to cluster.
    4. Set the managementState to Managed.
    5. Click Create.

Verification

  • Check that the JobSet Operator and operand pods are running by entering the following command:

    $ oc get pod -n openshift-jobset-operator

    Example output

    NAME                                        READY   STATUS    RESTARTS   AGE
    jobset-controller-manager-5595547fb-b4g2x   1/1     Running   0          48s
    jobset-operator-596cb848c6-q2dmp            1/1     Running   0          2m33s
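  • As an optional additional check, confirm that the ClusterServiceVersion (CSV) for the JobSet Operator reports the Succeeded phase by running the following command:

    $ oc get csv -n openshift-jobset-operator

    The PHASE column of the output should show Succeeded for the JobSet Operator.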

4.4. Managing workloads with the JobSet Operator

Use the JobSet Operator on OpenShift Container Platform to deploy and manage JobSet resources for large-scale, coordinated workloads such as high-performance computing (HPC) and AI training. Features like coordinators, failure policies, and shared persistent volume claims help you recover quickly and use resources efficiently.

4.4.1. Deploying a JobSet

You can use the JobSet Operator to deploy a JobSet to manage and run large-scale, coordinated workloads.

Prerequisites

  • You have installed the JobSet Operator.
  • You have a cluster with available NVIDIA GPUs.

Procedure

  1. Create a new project by running the following command:

    $ oc new-project <my_namespace>
  2. Create a file named jobset.yaml:

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: pytorch
    spec:
      replicatedJobs:
      - name: workers
        template:
          spec:
            parallelism: <pods_running_number>
            completions: <pods_finish_number>
            backoffLimit: 0
            template:
              spec:
                imagePullSecrets:
                  - name: my-registry-secret
                initContainers:
                  - name: prepare
                    image: docker.io/alpine/git:v2.52.0
                    args: ['clone', 'https://github.com/pytorch/examples']
                    volumeMounts:
                      - name: workdir
                        mountPath: /git
                containers:
                  - name: pytorch
                    image: docker.io/pytorch/pytorch:2.10.0-cuda13.0-cudnn9-runtime
                    resources:
                      limits:
                        nvidia.com/gpu: "1"
                      requests:
                        nvidia.com/gpu: "1"
                    ports:
                    - containerPort: 4321
                    env:
                    - name: MASTER_ADDR
                      value: "pytorch-workers-0-0.pytorch"
                    - name: MASTER_PORT
                      value: "4321"
                    - name: RANK
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
                    - name: PYTHONUNBUFFERED
                      value: "0"
                    command:
                    - /bin/sh
                    - -c
                    - |
                      cd examples/distributed/ddp-tutorial-series
                      torchrun --nproc_per_node=1 --nnodes=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT multinode.py 1000 100
                    volumeMounts:
                      - name: workdir
                        mountPath: /workspace
                volumes:
                  - name: workdir
                    emptyDir: {}

    where:

    <pods_running_number>
    Specifies the number of pods running at the same time.
    <pods_finish_number>
    Specifies the total number of pods that must finish successfully for the job to be marked complete.
  3. Apply the JobSet configuration by running the following command:

    $ oc apply -f jobset.yaml

Verification

  • Verify that pods were started by running the following command:

    $ oc get pods -n <my_namespace>

    Example output

    NAME                        READY   STATUS    RESTARTS   AGE
    pytorch-workers-0-0-2lzwt   1/1     Running   0          2m17s
    pytorch-workers-0-1-g2lrv   1/1     Running   0          2m17s
    pytorch-workers-0-2-dpljq   1/1     Running   0          2m17s
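  • You can also check the overall status of the JobSet resource by running the following command. The name pytorch matches the metadata.name value in the jobset.yaml file:

    $ oc get jobset pytorch -n <my_namespace>

    The output lists the JobSet and its current status.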

4.4.2. Specifying a JobSet coordinator

To manage communication between JobSet pods, you can assign a specific JobSet coordinator pod. This ensures that your distributed workloads can reference a stable network endpoint as a central point of coordination for task synchronization and data exchange.

Prerequisites

  • You have installed the JobSet Operator.

Procedure

  1. Create a new project by running the following command:

    $ oc new-project <new_namespace>
  2. Create a YAML file called jobset-coordinator.yaml:

    Example YAML file

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: coordinator
    spec:
      coordinator:
        replicatedJob: driver
        jobIndex: 0
        podIndex: 0
      replicatedJobs:
      - name: workers
        template:
          spec:
            parallelism: <pods_running_number>
            completions: <pods_finish_number>
            backoffLimit: 0
            template:
              spec:
                containers:
                - name: worker
                  env:
                    - name: COORDINATOR_ENDPOINT
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.labels['jobset.sigs.k8s.io/coordinator']
                  image: quay.io/nginx/nginx-unprivileged:1.29-alpine
                  command: [ "/bin/sh", "-c" ]
                  args:
                    - |
                      while ! curl -s "${COORDINATOR_ENDPOINT}:8080" | grep Welcome; do
                        sleep 3
                      done
                      sleep 100
      - name: driver
        template:
          spec:
            parallelism: <pods_running_number>
            completions: <pods_finish_number>
            backoffLimit: 0
            template:
              spec:
                containers:
                - name: driver
                  image: quay.io/nginx/nginx-unprivileged:1.29-alpine
                  ports:
                  - containerPort: 8080
                    protocol: TCP

    where:

    <pods_running_number>
    Specifies the number of pods running at the same time.
    <pods_finish_number>
    Specifies the total number of pods that must finish successfully for the job to be marked complete.
  3. Apply the jobset-coordinator.yaml file by running the following command:

    $ oc apply -f jobset-coordinator.yaml

Verification

  • Verify that pods were created by running the following command:

    $ oc get pods -n <new_namespace>

    Example output

    NAME                            READY   STATUS              RESTARTS   AGE
    coordinator-driver-0-0-svgk7    1/1     Running             0          67s
    coordinator-workers-0-0-57jvg   1/1     Running             0          67s
    coordinator-workers-0-1-mghvx   1/1     Running             0          67s
    coordinator-workers-0-2-7cnvv   1/1     Running             0          67s
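  • Optionally, you can display the coordinator endpoint that is recorded in the jobset.sigs.k8s.io/coordinator label on each pod, which is the same value that the worker containers read through the COORDINATOR_ENDPOINT environment variable. Replace <worker_pod_name> with one of the worker pod names from the previous output:

    $ oc get pod <worker_pod_name> -n <new_namespace> -o jsonpath='{.metadata.labels.jobset\.sigs\.k8s\.io/coordinator}'

    The value is the stable DNS name of the driver pod, for example coordinator-driver-0-0.coordinator.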

4.4.3. Configuring a JobSet failure policy

To control workload behavior in response to child job failures, you can configure a JobSet failure policy. This enables you to define specific actions, such as restarting or failing the entire JobSet, based on the failure reason or the specific replicated job affected.

4.4.3.1. Failure policy actions

These actions are available when a job failure matches a defined rule.


FailJobSet

Marks the entire JobSet as failed immediately.

RestartJobSet

Restarts the JobSet by recreating all child jobs. This action counts toward the maxRestarts limit. This is the default action if no rules match.

RestartJobSetAndIgnoreMaxRestarts

Restarts the JobSet without counting toward the maxRestarts limit.

4.4.3.2. Rule-targeting attributes

Use the following attributes to define failure rules.


targetReplicatedJobs

Specifies which replicated jobs trigger the rule.

onJobFailureReasons

Triggers the rule based on the specific job failure reason. Valid values include BackoffLimitExceeded, DeadlineExceeded, and PodFailurePolicy.
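For example, the following sketch combines both attributes in a single rule so that the JobSet restarts only when the workers replicated job fails because its backoff limit is exceeded. The resource name is a placeholder and the replicatedJobs section is abbreviated; a complete manifest is shown in the configuration example that follows.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: rule-targeting-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      - action: RestartJobSet
        onJobFailureReasons:
        - BackoffLimitExceeded       # match only this failure reason
        targetReplicatedJobs:
        - workers                    # match failures from the workers job only
  replicatedJobs:
  - name: workers
#...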

4.4.3.3. Configuration example

This configuration marks the JobSet as failed if the leader job fails.

Example YAML file that marks the JobSet as failed if the leader job fails

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: failjobset-action-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      - action: FailJobSet
        targetReplicatedJobs:
        - leader
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: leader
              image: docker.io/bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: docker.io/bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000

4.4.4. Configuring shared persistent volume claims for a JobSet

You can configure a JobSet to automatically create and manage shared persistent volume claims (PVCs) across multiple replicated jobs. This is useful for workloads that require shared access to datasets, models, or checkpoints.

Prerequisites

  • You have the JobSet Operator installed in your cluster.
  • You have set a default storage class or chosen a storage class for your workload.

Procedure

  1. Define the volume templates in the spec.volumeClaimPolicies section of your JobSet YAML file.

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: <job_name>
    spec:
      volumeClaimPolicies:
        - templates:
            - metadata:
                name: <persistent_volume_claim_name>
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: mystorageclass
                resources:
                  requests:
                    storage: 1Gi
          retentionPolicy:
            whenDeleted: Retain

    where:

    <job_name>
    Specifies a unique identifier for the JobSet within your namespace.
    <persistent_volume_claim_name>
    Specifies the name for the PVC. The name that you specify here is also used as the volumeMounts name. A volume is automatically added to each pod, which mounts a PVC created with a name in the format <persistent_volume_claim_name>-<job_name>.
    whenDeleted
    Specifies the retention policy that applies when the JobSet is deleted. Optionally, you can keep the data after the JobSet is deleted by setting this value to Retain.
  2. In your replicatedJobs configuration, add a volumeMount that matches the template name you defined.

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: <job_name>
    spec:
      replicatedJobs:
      - name: workers
        template:
          spec:
            parallelism: 2
            completions: 2
            backoffLimit: 0
            template:
              spec:
                imagePullSecrets:
                  - name: my-registry-secret
                initContainers:
                  - name: prepare
                    image: docker.io/alpine/git:v2.52.0
                    args: ['clone', 'https://github.com/pytorch/examples']
                    volumeMounts:
                      - name: <persistent_volume_claim_name>
                        mountPath: /git/checkpoint
    #...
  3. Apply the JobSet configuration by running the following command:

    $ oc apply -f <jobset_yaml>

Verification

  • Verify that the PVCs were created with names in the format <persistent_volume_claim_name>-<job_name> by running the following command:

    $ oc get pvc

    Example output

    NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pvc-1       Bound    pvc-385996a0-70af-4791-aa8e-9e6459e6b123   3Gi        RWO            file-storage   3d
    pvc-2       Bound    pvc-8aeddd4d-aad5-4039-8d04-640a71c9a72d   12Gi       RWO            file-storage   3d
    pvc-3       Bound    pvc-0050144d-940c-4c4e-a23a-2a660a5490eb   12Gi       RWO            file-storage   3d
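  • Optionally, confirm that the shared volume is mounted in one of the JobSet pods. Replace <pod_name> with a pod name from your JobSet:

    $ oc describe pod <pod_name> | grep -A 5 'Mounts:'

    The output includes the mount path that you configured, such as /git/checkpoint in the example above.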

4.5. Uninstalling the JobSet Operator

Uninstall the JobSet Operator by using the OpenShift Container Platform web console to remove the Operator instance and its resources from your cluster.

4.5.1. Uninstalling the JobSet Operator

Uninstall the JobSet Operator by using the OpenShift Container Platform web console to remove the Operator instance.

Prerequisites

  • You have access to the cluster with cluster-admin privileges.
  • You have access to the OpenShift Container Platform web console.
  • You have installed the JobSet Operator.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Navigate to Operators → Installed Operators.
  3. Select openshift-jobset-operator from the Project dropdown list.
  4. Delete the JobSetOperator instance.

    1. Click JobSet Operator and select the JobSetOperator tab.
    2. Click the Options menu next to the cluster entry and select Delete JobSetOperator.
    3. In the confirmation dialog, click Delete.
  5. Uninstall the JobSet Operator.

    1. Navigate to Operators → Installed Operators.
    2. Click the Options menu next to the JobSet Operator entry and click Uninstall Operator.
    3. In the confirmation dialog, click Uninstall.

4.5.2. Uninstalling JobSet Operator resources

Optionally, after uninstalling the JobSet Operator, you can remove its related resources from your cluster.

Prerequisites

  • You have access to the cluster with cluster-admin privileges.
  • You have access to the OpenShift Container Platform web console.
  • You have uninstalled the JobSet Operator.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Remove CRDs that were created when the JobSet Operator was installed:

    1. Navigate to Administration → CustomResourceDefinitions.
    2. Enter JobSetOperator in the Name field to filter the CRDs.
    3. Click the Options menu next to the JobSetOperator CRD and select Delete CustomResourceDefinition.
    4. In the confirmation dialog, click Delete.
  3. Delete the openshift-jobset-operator namespace.

    1. Navigate to Administration → Namespaces.
    2. Find openshift-jobset-operator in the list of namespaces.
    3. Click the Options menu next to the openshift-jobset-operator entry and select Delete Namespace.
    4. In the confirmation dialog, enter openshift-jobset-operator and click Delete.