Chapter 4. JobSet Operator
4.1. JobSet Operator overview
Use the JobSet Operator on OpenShift Container Platform to manage and run large-scale, coordinated workloads like high-performance computing (HPC) and AI training. Features like multi-template job support and stable networking can help you recover quickly and use resources efficiently.
4.1.1. About the JobSet Operator
Use the JobSet Operator on OpenShift Container Platform to manage large, distributed, and coordinated computing workloads, such as high-performance computing (HPC) or artificial intelligence (AI) training, and gain automatic stability, coordination, and failure recovery.
The JobSet Operator is based on the JobSet open source project.
The JobSet Operator is designed to manage a group of jobs as a single, coordinated unit. This is especially useful for HPC and for training very large AI models, where a group of machines must run together for hours or days.
You can use the JobSet Operator to solve problems that are too big or too complex for a standard OpenShift Container Platform job. The JobSet Operator provides coordination, stability, and recovery.
The JobSet Operator automatically sets up a stable headless service, which gives each worker a stable network identity so that workers can find and communicate with each other, even after a failure and restart. It also provides automatic failure recovery. If one small part of a large training job fails, the Operator can be configured to restart the entire group of workers from a saved checkpoint. This saves time and computing costs.
The JobSet Operator also offers startup control: you can define a specific startup sequence to ensure that dependencies are met, for example by making sure that the leader is running before any workers attempt to connect.
The JobSet Operator makes managing large, distributed, and coordinated computing tasks on OpenShift Container Platform easier by turning many individual components into one resilient and manageable system.
4.2. JobSet Operator release notes
Track the development, features, and fixes for the JobSet Operator, which manages coordinated, large-scale computing workloads on OpenShift Container Platform.
For more information, see About the JobSet Operator.
4.2.1. Release notes for JobSet Operator 1.0
Review the new features and advisories for the initial release of JobSet Operator 1.0.
Issued: 12 February 2026
The following advisories are available for the JobSet Operator 1.0:
4.2.1.1. New features and enhancements
- This is the initial Generally Available release of the JobSet Operator.
4.3. Installing the JobSet Operator
Install the JobSet Operator on OpenShift Container Platform to enable management of large-scale, coordinated computing workloads, giving your applications a unified API and failure recovery.
4.3.1. Installing the JobSet Operator
Install the JobSet Operator on OpenShift Container Platform using the web console to begin managing large-scale, coordinated computing workloads.
Prerequisites
- You have access to the cluster with `cluster-admin` privileges.
- You have access to the OpenShift Container Platform web console.
- You have installed the cert-manager Operator for Red Hat OpenShift.
Procedure
- Log in to the OpenShift Container Platform web console.
- Verify that the cert-manager Operator for Red Hat OpenShift is installed.
- Install the JobSet Operator:
  - Navigate to Ecosystem → Software Catalog.
  - Search for and select the `openshift-operators` project.
  - Enter JobSet Operator into the filter box.
  - Select the JobSet Operator and click Install.
  - On the Install Operator page:
    - The Update channel is set to stable-v1.0, which installs the latest stable release of JobSet Operator.
    - Under Installation mode, select A specific namespace on the cluster.
    - Under Installed Namespace, select Operator recommended Namespace: openshift-jobset-operator.
    - Under Update approval, select one of the following update strategies:
      - The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.
      - The Manual strategy requires a user with appropriate credentials to approve the Operator update.
    - Click Install.
- Create the custom resource (CR) for the JobSet Operator:
  - Navigate to Installed Operators → JobSet Operator.
  - Under Provided APIs, click Create instance in the JobSetOperator pane.
  - Set the name to cluster.
  - Set the managementState to Managed.
  - Click Create.
Verification
Check that the JobSet Operator and operand pods are running by entering the following command:
    $ oc get pod -n openshift-jobset-operator

Example output

    NAME                                        READY   STATUS    RESTARTS   AGE
    jobset-controller-manager-5595547fb-b4g2x   1/1     Running   0          48s
    jobset-operator-596cb848c6-q2dmp            1/1     Running   0          2m33s
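The custom resource created in the web console procedure above could also be written as a manifest. The following is a minimal sketch rather than a definitive definition: the `name` and `managementState` values come from the procedure, while the `apiVersion` is an assumption that should be confirmed on your cluster, for example with `oc explain jobsetoperator`.

```yaml
# Sketch of the JobSetOperator custom resource from the procedure above.
# Assumption: the API group and version are operator.openshift.io/v1.
apiVersion: operator.openshift.io/v1
kind: JobSetOperator
metadata:
  name: cluster              # name given in the procedure
spec:
  managementState: Managed   # value given in the procedure
```

If the API version matches your cluster, a file like this could be applied with `oc apply -f <file_name>` in place of the Create instance form.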
4.4. Managing workloads with the JobSet Operator
Use the JobSet Operator on OpenShift Container Platform to manage and run large-scale, coordinated workloads like high-performance computing (HPC) and AI training. Features like multi-template job support and stable networking can help you recover quickly and use resources efficiently.
4.4.1. Deploying a JobSet
You can use the JobSet Operator to deploy a JobSet to manage and run large-scale, coordinated workloads.
Prerequisites
- You have installed the JobSet Operator.
- You have a cluster with available NVIDIA GPUs.
Procedure
Create a new project by running the following command:
    $ oc new-project <my_namespace>

Create a file named `jobset.yaml`:
where:
- `<pods_running_number>`: Specifies the number of pods running at the same time.
- `<pods_finish_number>`: Specifies the total number of pods that must finish successfully for the job to be marked complete.
Apply the JobSet configuration by running the following command:

    $ oc apply -f jobset.yaml
Verification
Verify that pods were started by running the following command:
    $ oc get pods -n <my_namespace>

Example output

    NAME                        READY   STATUS    RESTARTS   AGE
    pytorch-workers-0-0-2lzwt   1/1     Running   0          2m17s
    pytorch-workers-0-1-g2lrv   1/1     Running   0          2m17s
    pytorch-workers-0-2-dpljq   1/1     Running   0          2m17s
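A `jobset.yaml` file for this procedure might look like the following minimal sketch. It assumes the upstream `jobset.x-k8s.io/v1alpha2` API and maps `<pods_running_number>` to the Job `parallelism` field and `<pods_finish_number>` to `completions`, which is an interpretation of the placeholder descriptions above. The names `pytorch` and `workers` are inferred from the example pod names. The image and command are hypothetical placeholders.

```yaml
# Sketch only: field mapping and image are assumptions, not the original file.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch                    # inferred from the example pod names
spec:
  replicatedJobs:
  - name: workers                  # inferred from the example pod names
    replicas: 1                    # one child Job: pytorch-workers-0
    template:
      spec:
        parallelism: 3             # <pods_running_number>
        completions: 3             # <pods_finish_number>
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never   # Job pods require Never or OnFailure
            containers:
            - name: pytorch
              image: <training_image>          # hypothetical placeholder
              command: ["<training_command>"]  # hypothetical placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1            # one NVIDIA GPU per pod
```

With `parallelism` and `completions` both set to 3, the single child job would start three pods, which is consistent with the three `pytorch-workers-0-*` pods in the example output.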
4.4.2. Specifying a JobSet coordinator
To manage communication between JobSet pods, you can assign a specific JobSet coordinator pod. This ensures that your distributed workloads can reference a stable network endpoint as a central point of coordination for task synchronization and data exchange.
Prerequisites
- You have installed the JobSet Operator.
Procedure
Create a new namespace by running the following command:

    $ oc new-project <new_namespace>

Create a YAML file called `jobset-coordinator.yaml`:
Example YAML file
where:
- `<pods_running_number>`: Specifies the number of pods running at the same time.
- `<pods_finish_number>`: Specifies the total number of pods that must finish successfully for the job to be marked complete.
Apply the `jobset-coordinator.yaml` file by running the following command:

    $ oc apply -f jobset-coordinator.yaml
Verification
Verify that pods were created by running the following command:
    $ oc get pods -n <new_namespace>

Example output

    NAME                            READY   STATUS    RESTARTS   AGE
    coordinator-driver-0-0-svgk7    1/1     Running   0          67s
    coordinator-workers-0-0-57jvg   1/1     Running   0          67s
    coordinator-workers-0-1-mghvx   1/1     Running   0          67s
    coordinator-workers-0-2-7cnvv   1/1     Running   0          67s
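A `jobset-coordinator.yaml` file for this procedure might look like the following minimal sketch. It assumes the upstream `jobset.x-k8s.io/v1alpha2` API, where `spec.coordinator` selects one pod by replicated job name, job index, and pod index. The names `coordinator`, `driver`, and `workers` are inferred from the example pod names. The image and command are hypothetical placeholders.

```yaml
# Sketch only: images and commands are assumptions, not the original file.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: coordinator
spec:
  coordinator:
    replicatedJob: driver   # replicated job that contains the coordinator pod
    jobIndex: 0             # first child job of that replicated job
    podIndex: 0             # first pod of that child job
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 3      # <pods_running_number>
        completions: 3      # <pods_finish_number>
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: <workload_image>   # hypothetical placeholder
              command: ["sleep", "300"] # stand-in workload
  - name: driver
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: <workload_image>   # hypothetical placeholder
              command: ["sleep", "300"] # stand-in workload
```

Under these assumptions, the coordinator would be the pod shown as `coordinator-driver-0-0-svgk7` in the example output.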
4.4.3. Failure policy configuration for JobSet Operator
To control workload behavior in response to child job failures, you can configure a JobSet failure policy. This enables you to define specific actions, such as restarting or failing the entire JobSet, based on the failure reason or the specific replicated job affected.
4.4.3.1. Failure policy actions
These actions are available when a job failure matches a defined rule.
| Action | Description |
|---|---|
| | Marks the entire JobSet as failed immediately. |
| | Restarts the JobSet by recreating all child jobs. This action counts toward the |
| | Restarts the JobSet without counting toward the |
4.4.3.2. Rule-targeting attributes
Use the following attributes to define failure rules.
| Attribute | Description |
|---|---|
| | Specifies which replicated jobs trigger the rule. |
| | Triggers the rule based on the specific job failure reason. Valid values include |
4.4.3.3. Configuration example
This configuration marks the JobSet as failed if the leader job fails.
Example of a YAML file to mark the job set failed if the leader fails
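A failure policy of this kind might look like the following minimal sketch. It assumes the upstream `jobset.x-k8s.io/v1alpha2` API, in which `spec.failurePolicy.rules` entries carry an `action` and an optional `targetReplicatedJobs` list, and in which `FailJobSet` is the action that fails the whole JobSet. The JobSet name, the `maxRestarts` value, the images, and the commands are hypothetical.

```yaml
# Sketch only: names, limits, and images are assumptions.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: failjobset-example      # hypothetical name
spec:
  failurePolicy:
    maxRestarts: 3              # hypothetical restart limit
    rules:
    - action: FailJobSet        # fail the whole JobSet immediately
      targetReplicatedJobs:
      - leader                  # rule applies only to the leader job
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        backoffLimit: 0         # a single pod failure fails the leader job
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: <workload_image>        # hypothetical placeholder
              command: ["<leader_command>"]  # hypothetical placeholder
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: <workload_image>        # hypothetical placeholder
              command: ["<worker_command>"]  # hypothetical placeholder
```

In this sketch the rule names only `leader`, so a failure of a `workers` job would not match it and would fall through to the JobSet default behavior.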
4.4.4. Configuring volume claim policies for JobSet Operator
You can configure a JobSet to automatically create and manage shared persistent volume claims (PVCs) across multiple replicated jobs. This is useful for workloads that require shared access to datasets, models, or checkpoints.
Prerequisites
- You have the JobSet Operator installed in your cluster.
- You have set a default storage class or chosen a storage class for your workload.
Procedure
Define the volume templates in the `spec.volumeClaimPolicies` section of your JobSet YAML file.
where:
- `<job_name>`: Specifies a unique identifier for your jobs within your namespace.
- `<persistent_volume_claim_name>`: Specifies the name for the PVC. The name used here is also used as the `volumeMounts` name. A volume is automatically added to the pod, and it mounts a PVC created with a name in the format `<persistent_volume_claim_name>-<job_name>`.
- `<deletion_retention_policy>`: Specifies the deletion retention policy. Optionally, you can keep data after the JobSet is deleted by setting this value to `Retain`.
In your `replicatedJobs` configuration, add a `volumeMount` that matches the template name you defined.
Apply the JobSet configuration by running the following command:

    $ oc apply -f <jobset_yaml>
Verification
Verify that the PVCs were created using the naming convention `<claim_name>-<jobset_name>`:

    $ oc get pvc

Example output

    NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pvc-1   Bound    pvc-385996a0-70af-4791-aa8e-9e6459e6b123   3Gi        RWO            file-storage   3d
    pvc-2   Bound    pvc-8aeddd4d-aad5-4039-8d04-640a71c9a72d   12Gi       RWO            file-storage   3d
    pvc-3   Bound    pvc-0050144d-940c-4c4e-a23a-2a660a5490eb   12Gi       RWO            file-storage   3d
4.5. Uninstalling the JobSet Operator
Uninstall the JobSet Operator by using the OpenShift Container Platform web console to remove the Operator instance and its resources from your cluster.
4.5.1. Uninstalling the JobSet Operator
Uninstall the JobSet Operator by using the OpenShift Container Platform web console to remove the Operator instance.
Prerequisites
- You have access to the cluster with `cluster-admin` privileges.
- You have access to the OpenShift Container Platform web console.
- You have installed the JobSet Operator.
Procedure
- Log in to the OpenShift Container Platform web console.
- Navigate to Operators → Installed Operators.
- Select `openshift-js-operator` from the Project dropdown list.
- Delete the `JobSetOperator` instance:
  - Click JobSet Operator and select the JobSetOperator tab.
  - Click the Options menu next to the cluster entry and select Delete JobSetOperator.
  - In the confirmation dialog, click Delete.
- Uninstall the JobSet Operator:
  - Navigate to Operators → Installed Operators.
  - Click the Options menu next to the JobSet Operator entry and click Uninstall Operator.
  - In the confirmation dialog, click Uninstall.
4.5.2. Uninstalling JobSet Operator resources
Optionally, after uninstalling the JobSet Operator, you can remove its related resources from your cluster.
Prerequisites
- You have access to the cluster with `cluster-admin` privileges.
- You have access to the OpenShift Container Platform web console.
- You have uninstalled the JobSet Operator.
Procedure
- Log in to the OpenShift Container Platform web console.
- Remove the CRDs that were created when the JobSet Operator was installed:
  - Navigate to Administration → CustomResourceDefinitions.
  - Enter `JobSetOperator` in the Name field to filter the CRDs.
  - Click the Options menu next to the JobSetOperator CRD and select Delete CustomResourceDefinition.
  - In the confirmation dialog, click Delete.
- Delete the `openshift-jobset-operator` namespace:
  - Navigate to Administration → Namespaces.
  - Find `openshift-jobset-operator` in the list of namespaces.
  - Click the Options menu next to the openshift-jobset-operator entry and select Delete Namespace.
  - In the confirmation dialog, enter `openshift-jobset-operator` and click Delete.