AI workloads
Running AI workloads on OpenShift Container Platform
Abstract
Chapter 1. Overview of AI workloads on OpenShift Container Platform
OpenShift Container Platform provides a secure, scalable foundation for running artificial intelligence (AI) workloads across training, inference, and data science workflows.
1.1. Operators for running AI workloads
You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on OpenShift Container Platform. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use OpenShift Container Platform as the core platform for your applications.
OpenShift Container Platform provides several Operators that can help you run AI workloads:
- Leader Worker Set Operator
You can use the Leader Worker Set Operator to enable large-scale AI inference workloads to run reliably across nodes with synchronization between leader and worker processes. Without proper coordination, large training runs might fail or stall.
For more information, see "Leader Worker Set Operator overview".
- Red Hat build of Kueue
You can use Red Hat build of Kueue to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources.
For more information, see Introduction to Red Hat build of Kueue in the Red Hat build of Kueue documentation.
Chapter 2. Leader Worker Set Operator
2.1. Leader Worker Set Operator overview
Using large language models (LLMs) for AI/ML inference often requires significant compute resources, and workloads typically must be sharded across multiple nodes. This can make deployments complex, creating challenges around scaling, recovery from failures, and efficient pod placement.
The Leader Worker Set Operator simplifies these multi-node deployments by treating a group of pods as a single, coordinated unit. It manages the lifecycle of each pod in the group, scales the entire group together, and performs updates and failure recovery at the group level to ensure consistency.
2.1.1. About the Leader Worker Set Operator
The Leader Worker Set Operator is based on the LeaderWorkerSet open source project. LeaderWorkerSet is a custom Kubernetes API that can be used to deploy a group of pods as a unit. This is useful for artificial intelligence (AI) and machine learning (ML) inference workloads, where large language models (LLMs) are sharded across multiple nodes.
With the LeaderWorkerSet API, pods are grouped into units consisting of one leader and multiple workers, all managed together as a single entity. Each pod in a group has a unique pod identity. Pods within a group are created in parallel and share identical lifecycle stages. Rollouts, rolling updates, and pod failure restarts are performed as a group.
In the LeaderWorkerSet configuration, you define the size of the groups and the number of group replicas. If necessary, you can define separate templates for leader and worker pods, allowing for role-specific customization. You can also configure topology-aware placement, so that pods in the same group are co-located in the same topology.
Before you install the Leader Worker Set Operator, you must install the cert-manager Operator for Red Hat OpenShift because it is required to configure services and manage metrics collection.
Monitoring for the Leader Worker Set Operator is provided by default with OpenShift Container Platform through Prometheus.
2.1.1.1. LeaderWorkerSet architecture
The following diagram shows how the LeaderWorkerSet API organizes groups of pods into a single unit, with one pod as the leader and the rest as the workers, to coordinate distributed workloads:
Figure 2.1. Leader worker set architecture
The LeaderWorkerSet API uses a leader stateful set to manage the deployment and lifecycle of the groups of pods. For each replica defined, a leader-worker group is created.
Each leader-worker group contains a leader pod and a worker stateful set. The worker stateful set is owned by the leader pod and manages the set of worker pods associated with that leader pod. The specified size defines the total number of pods in each leader-worker group, with the leader pod included in that number.
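The relationship between replicas, size, and the resulting stateful sets and pods can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the naming scheme (leader pods named <name>-<group>, worker pods named <name>-<group>-<index>, and a worker stateful set that shares its leader pod's name) is an assumption based on the upstream LeaderWorkerSet project rather than something this document states.

```python
def lws_layout(name: str, replicas: int, size: int) -> dict:
    """Return the stateful sets and pods expected for a leader worker set.

    Assumed naming (from the upstream project, not from this document):
    leader pods are "<name>-<group>", worker pods are
    "<name>-<group>-<index>", and each worker stateful set shares the
    name of the leader pod that owns it.
    """
    if replicas < 0 or size < 1:
        raise ValueError("replicas must be >= 0 and size must be >= 1")

    layout = {"leader_statefulset": name, "groups": []}
    for group in range(replicas):
        leader = f"{name}-{group}"
        # size counts the leader pod, so each group has size - 1 workers.
        workers = [f"{leader}-{index}" for index in range(1, size)]
        layout["groups"].append(
            {
                "leader_pod": leader,
                "worker_statefulset": leader,
                "worker_pods": workers,
            }
        )
    return layout


def total_pods(replicas: int, size: int) -> int:
    """Total pod count: one leader plus (size - 1) workers per group."""
    return replicas * size


if __name__ == "__main__":
    for group in lws_layout("my-lws", replicas=2, size=3)["groups"]:
        print(group["leader_pod"], group["worker_pods"])
```

For example, two replicas with a size of 3 give two groups of one leader and two workers each, or six pods in total. The sketch does not model what happens to the worker stateful set when size is 1.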
2.2. Leader Worker Set Operator release notes
You can use the Leader Worker Set Operator to manage distributed inference workloads and process large-scale inference requests efficiently.
These release notes track the development of the Leader Worker Set Operator.
For more information, see About the Leader Worker Set Operator.
2.2.1. Release notes for Leader Worker Set Operator 1.0.0
Issued: 18 September 2025
The following advisories are available for the Leader Worker Set Operator 1.0.0:
2.2.1.1. New features and enhancements
- This is the initial release of the Leader Worker Set Operator.
2.3. Managing distributed workloads with the Leader Worker Set Operator
You can use the Leader Worker Set Operator to manage distributed inference workloads and process large-scale inference requests efficiently.
2.3.1. Installing the Leader Worker Set Operator
You can use the web console to install the Leader Worker Set Operator.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
- You have installed the cert-manager Operator for Red Hat OpenShift.
Procedure
- Log in to the OpenShift Container Platform web console.
- Verify that the cert-manager Operator for Red Hat OpenShift is installed.
- Install the Leader Worker Set Operator.
  - Navigate to Operators → OperatorHub.
  - Enter Leader Worker Set Operator into the filter box.
  - Select the Leader Worker Set Operator and click Install.
  - On the Install Operator page:
    - The Update channel is set to stable-v1.0, which installs the latest stable release of Leader Worker Set Operator 1.0.
    - Under Installation mode, select A specific namespace on the cluster.
    - Under Installed Namespace, select Operator recommended Namespace: openshift-lws-operator.
    - Under Update approval, select one of the following update strategies:
      - The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.
      - The Manual strategy requires a user with appropriate credentials to approve the Operator update.
    - Click Install.
- Create the custom resource (CR) for the Leader Worker Set Operator:
  - Navigate to Installed Operators → Leader Worker Set Operator.
  - Under Provided APIs, click Create instance in the LeaderWorkerSetOperator pane.
  - Click Create.
2.3.2. Deploying a leader worker set
You can use the Leader Worker Set Operator to deploy a leader worker set to assist with managing distributed workloads across nodes.
Prerequisites
- You have installed the Leader Worker Set Operator.
Procedure
Create a new project by running the following command:
$ oc new-project my-namespace
Create a file named leader-worker-set.yaml. The numbered callouts describe the settings to specify in the file:
1. Specify the name of the leader worker set resource.
2. Specify the namespace for the leader worker set to run in.
3. Specify the pod template for the leader pods.
4. Specify the restart policy for when pod failures occur. Allowed values are RecreateGroupOnPodRestart to restart the whole group or None to not restart the group.
5. Specify the number of pods to create for each group, including the leader pod. For example, a value of 3 creates 1 leader pod and 2 worker pods. The default value is 1.
6. Specify the pod template for the worker pods.
7. Specify the policy to use when creating the headless service. Allowed values are UniquePerReplica or Shared. The default value is Shared.
8. Specify the number of replicas, or leader-worker groups. The default value is 1.
9. Specify the maximum number of replicas that can be scheduled above the replicas value during rolling updates. The value can be specified as an integer or a percentage.
For more information on all available fields to configure, see LeaderWorkerSet API upstream documentation.
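As a rough illustration of how the nine settings fit together, the following Python sketch builds a manifest and prints it as JSON, which is also valid YAML and could be saved as leader-worker-set.yaml. Treat it as a sketch under assumptions, not as the reference configuration: the API version and the field names (leaderWorkerTemplate, networkConfig.subdomainPolicy, rolloutStrategy.rollingUpdateConfiguration.maxSurge) reflect the upstream LeaderWorkerSet API as best understood here and should be checked against the upstream documentation. The helper names, container names, and image reference are hypothetical placeholders.

```python
import json


def pod_template(container_name: str, image: str) -> dict:
    """Build a minimal pod template with a single container."""
    return {"spec": {"containers": [{"name": container_name, "image": image}]}}


def build_leader_worker_set(name: str, namespace: str, image: str,
                            size: int = 3, replicas: int = 2) -> dict:
    """Assemble a LeaderWorkerSet manifest; comments map to callouts 1-9.

    Field names are assumptions based on the upstream API and are not
    confirmed by this document.
    """
    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",  # assumed API group/version
        "kind": "LeaderWorkerSet",
        "metadata": {
            "name": name,            # 1: resource name
            "namespace": namespace,  # 2: namespace to run in
        },
        "spec": {
            "leaderWorkerTemplate": {
                "leaderTemplate": pod_template("leader", image),  # 3
                "restartPolicy": "RecreateGroupOnPodRestart",     # 4
                "size": size,                                     # 5: includes the leader
                "workerTemplate": pod_template("worker", image),  # 6
            },
            "networkConfig": {"subdomainPolicy": "Shared"},       # 7
            "replicas": replicas,                                 # 8: leader-worker groups
            "rolloutStrategy": {
                "type": "RollingUpdate",
                "rollingUpdateConfiguration": {"maxSurge": 1},    # 9
            },
        },
    }


if __name__ == "__main__":
    manifest = build_leader_worker_set(
        name="my-lws",
        namespace="my-namespace",
        image="registry.example.com/my-image:latest",  # placeholder image
    )
    print(json.dumps(manifest, indent=2))
```

With size set to 3 and replicas set to 2, this describes two groups of one leader pod and two worker pods each.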
Apply the leader worker set configuration by running the following command:
$ oc apply -f leader-worker-set.yaml
Verification
Verify that pods were created by running the following command:
$ oc get pods -n my-namespace
Review the stateful sets by running the following command:
$ oc get statefulsets
Example output
NAME       READY   AGE
my-lws     4/4     111s
my-lws-0   2/2     57s
my-lws-1   2/2     60s
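If you want to check the READY column programmatically, a small parser is enough. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the three-column NAME, READY, AGE layout shown in the example output, with the table text supplied as a string rather than fetched from the cluster.

```python
def parse_ready(table: str) -> dict:
    """Map each stateful set name to a (ready, desired) tuple.

    Expects text in the NAME / READY / AGE layout of the example output.
    """
    result = {}
    for line in table.strip().splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        name, ready, _age = line.split()[:3]
        current, desired = ready.split("/")
        result[name] = (int(current), int(desired))
    return result


def all_ready(table: str) -> bool:
    """Return True when every stateful set reports ready == desired."""
    return all(current == desired for current, desired in parse_ready(table).values())


EXAMPLE_OUTPUT = """\
NAME       READY   AGE
my-lws     4/4     111s
my-lws-0   2/2     57s
my-lws-1   2/2     60s
"""

if __name__ == "__main__":
    print(parse_ready(EXAMPLE_OUTPUT))
    print(all_ready(EXAMPLE_OUTPUT))
```

A stateful set whose READY value shows fewer ready pods than desired, such as 1/2, makes all_ready return False, which can be a signal to wait and check again.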
2.4. Uninstalling the Leader Worker Set Operator
You can remove the Leader Worker Set Operator from OpenShift Container Platform by uninstalling the Operator and removing its related resources.
2.4.1. Uninstalling the Leader Worker Set Operator
You can use the web console to uninstall the Leader Worker Set Operator.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
- You have installed the Leader Worker Set Operator.
Procedure
- Log in to the OpenShift Container Platform web console.
- Navigate to Operators → Installed Operators.
- Select openshift-lws-operator from the Project dropdown list.
- Delete the LeaderWorkerSetOperator instance.
  - Click Leader Worker Set Operator and select the LeaderWorkerSetOperator tab.
  - Click the Options menu next to the cluster entry and select Delete LeaderWorkerSetOperator.
  - In the confirmation dialog, click Delete.
- Uninstall the Leader Worker Set Operator.
  - Navigate to Operators → Installed Operators.
  - Click the Options menu next to the Leader Worker Set Operator entry and click Uninstall Operator.
  - In the confirmation dialog, click Uninstall.
2.4.2. Uninstalling Leader Worker Set Operator resources
Optionally, after uninstalling the Leader Worker Set Operator, you can remove its related resources from your cluster.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
- You have uninstalled the Leader Worker Set Operator.
Procedure
- Log in to the OpenShift Container Platform web console.
- Remove CRDs that were created when the Leader Worker Set Operator was installed:
  - Navigate to Administration → CustomResourceDefinitions.
  - Enter LeaderWorkerSetOperator in the Name field to filter the CRDs.
  - Click the Options menu next to the LeaderWorkerSetOperator CRD and select Delete CustomResourceDefinition.
  - In the confirmation dialog, click Delete.
- Delete the openshift-lws-operator namespace.
  - Navigate to Administration → Namespaces.
  - Enter openshift-lws-operator into the filter box.
  - Click the Options menu next to the openshift-lws-operator entry and select Delete Namespace.
  - In the confirmation dialog, enter openshift-lws-operator and click Delete.
Legal Notice
Copyright © 2025 Red Hat
OpenShift documentation is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0).
Modified versions must remove all Red Hat trademarks.
Portions adapted from https://github.com/kubernetes-incubator/service-catalog/ with modifications by Red Hat.
Red Hat, Red Hat Enterprise Linux, the Red Hat logo, the Shadowman logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
Java® is a registered trademark of Oracle and/or its affiliates.
XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat Software Collections is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation’s permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.