AI workloads


OpenShift Container Platform 4.19

Running AI workloads on OpenShift Container Platform

Red Hat OpenShift Documentation Team

Abstract

This document provides information about running artificial intelligence (AI) workloads on an OpenShift Container Platform cluster. It includes details on how to enable large-scale AI workloads, such as distributed training and inference, to run reliably across nodes.

Chapter 1. Overview of AI workloads on OpenShift Container Platform

OpenShift Container Platform provides a secure, scalable foundation for running artificial intelligence (AI) workloads across training, inference, and data science workflows.

1.1. Operators for running AI workloads

You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on OpenShift Container Platform. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use OpenShift Container Platform as the core platform for your applications.

OpenShift Container Platform provides several Operators that can help you run AI workloads:

Leader Worker Set Operator

You can use the Leader Worker Set Operator to enable large-scale AI inference workloads to run reliably across nodes, with synchronization between leader and worker processes. Without this coordination, large distributed runs might fail or stall.

For more information, see "Leader Worker Set Operator overview".

Red Hat build of Kueue

You can use Red Hat build of Kueue to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources.

For more information, see Introduction to Red Hat build of Kueue in the Red Hat build of Kueue documentation.
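
For example, with Kueue-style queueing, a workload is submitted to a namespaced queue that draws its quota from a shared cluster-wide queue. The following minimal sketch uses the upstream Kueue API; the queue names are illustrative, and the exact API version available in Red Hat build of Kueue might differ:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: team-a-queue                    # illustrative queue name
      namespace: team-a                     # namespace where jobs are submitted
    spec:
      clusterQueue: shared-cluster-queue    # shared quota pool (illustrative)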

Chapter 2. Leader Worker Set Operator

2.1. Leader Worker Set Operator overview

Using large language models (LLMs) for AI/ML inference often requires significant compute resources, and workloads typically must be sharded across multiple nodes. This can make deployments complex, creating challenges around scaling, recovery from failures, and efficient pod placement.

The Leader Worker Set Operator simplifies these multi-node deployments by treating a group of pods as a single, coordinated unit. It manages the lifecycle of each pod in the group, scales the entire group together, and performs updates and failure recovery at the group level to ensure consistency.

2.1.1. About the Leader Worker Set Operator

The Leader Worker Set Operator is based on the LeaderWorkerSet open source project. LeaderWorkerSet is a custom Kubernetes API that can be used to deploy a group of pods as a unit. This is useful for artificial intelligence (AI) and machine learning (ML) inference workloads, where large language models (LLMs) are sharded across multiple nodes.

With the LeaderWorkerSet API, pods are grouped into units consisting of one leader and multiple workers, all managed together as a single entity. Each pod in a group has a unique pod identity. Pods within a group are created in parallel and share identical lifecycle stages. Rollouts, rolling updates, and pod failure restarts are performed as a group.

In the LeaderWorkerSet configuration, you define the size of the groups and the number of group replicas. If necessary, you can define separate templates for leader and worker pods, allowing for role-specific customization. You can also configure topology-aware placement, so that pods in the same group are co-located in the same topology domain.
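
For example, the following minimal sketch, based on the upstream LeaderWorkerSet API, defines separate leader and worker templates and uses the upstream exclusive-topology annotation to co-locate each group in one zone. The resource name and images are illustrative; verify the annotation against the upstream documentation for your version:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: sharded-llm          # illustrative name
      annotations:
        # Upstream annotation for topology-aware placement: schedule all pods
        # of a group into the same topology domain (here, one zone).
        leaderworkerset.sigs.k8s.io/exclusive-topology: topology.kubernetes.io/zone
    spec:
      replicas: 2                # two leader-worker groups
      leaderWorkerTemplate:
        size: 4                  # 1 leader + 3 workers per group
        leaderTemplate:          # role-specific template for leader pods
          spec:
            containers:
            - name: leader
              image: nginxinc/nginx-unprivileged:1.27
        workerTemplate:          # template for worker pods
          spec:
            containers:
            - name: worker
              image: nginxinc/nginx-unprivileged:1.27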

Important

Before you install the Leader Worker Set Operator, you must install the cert-manager Operator for Red Hat OpenShift because it is required to configure services and manage metrics collection.

Monitoring for the Leader Worker Set Operator is provided by default through the Prometheus-based monitoring stack in OpenShift Container Platform.

2.1.1.1. LeaderWorkerSet architecture

The following diagram shows how the LeaderWorkerSet API organizes groups of pods into a single unit, with one pod as the leader and the rest as the workers, to coordinate distributed workloads:

Figure 2.1. Leader worker set architecture

The LeaderWorkerSet API uses a leader stateful set to manage the deployment and lifecycle of the groups of pods. For each replica defined, a leader-worker group is created.

Each leader-worker group contains a leader pod and a worker stateful set. The worker stateful set is owned by the leader pod and manages the set of worker pods associated with that leader pod. The specified size defines the total number of pods in each leader-worker group, with the leader pod included in that number.
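
For example, with the Shared subdomain policy (shown in the deployment procedure later in this chapter), all pods in a leader worker set share one headless service, so each group's leader is reachable at a stable DNS name. A hypothetical sketch, assuming a leader worker set named my-lws with two groups in my-namespace; verify the exact naming against the upstream documentation:

    # The leader pod of group 0 is named my-lws-0. Under the shared headless
    # service, it should resolve at a stable in-cluster DNS name:
    $ oc exec my-lws-0-1 -n my-namespace -- \
        getent hosts my-lws-0.my-lws.my-namespace.svc.cluster.local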

2.2. Leader Worker Set Operator release notes

You can use the Leader Worker Set Operator to manage distributed inference workloads and process large-scale inference requests efficiently.

These release notes track the development of the Leader Worker Set Operator.

For more information, see About the Leader Worker Set Operator.

2.2.1. Release notes for Leader Worker Set Operator 1.0.0

Issued: 18 September 2025

The following advisories are available for the Leader Worker Set Operator 1.0.0:

2.2.1.1. New features and enhancements
  • This is the initial release of the Leader Worker Set Operator.

2.3. Managing distributed workloads with the Leader Worker Set Operator

You can use the Leader Worker Set Operator to manage distributed inference workloads and process large-scale inference requests efficiently.

2.3.1. Installing the Leader Worker Set Operator

You can use the web console to install the Leader Worker Set Operator.

Prerequisites

  • You have access to the cluster with cluster-admin privileges.
  • You have access to the OpenShift Container Platform web console.
  • You have installed the cert-manager Operator for Red Hat OpenShift.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Verify that the cert-manager Operator for Red Hat OpenShift is installed.
  3. Install the Leader Worker Set Operator.

    1. Navigate to Operators → OperatorHub.
    2. Enter Leader Worker Set Operator into the filter box.
    3. Select the Leader Worker Set Operator and click Install.
    4. On the Install Operator page:

      1. The Update channel is set to stable-v1.0, which installs the latest stable release of Leader Worker Set Operator 1.0.
      2. Under Installation mode, select A specific namespace on the cluster.
      3. Under Installed Namespace, select Operator recommended Namespace: openshift-lws-operator.
      4. Under Update approval, select one of the following update strategies:

        • The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.
        • The Manual strategy requires a user with appropriate credentials to approve the Operator update.
      5. Click Install.
  4. Create the custom resource (CR) for the Leader Worker Set Operator:

    1. Navigate to Installed Operators → Leader Worker Set Operator.
    2. Under Provided APIs, click Create instance in the LeaderWorkerSetOperator pane.
    3. Click Create.
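
Optionally, you can confirm from the CLI that the Operator installed successfully. The following is a minimal sketch; it assumes the openshift-lws-operator namespace selected during installation:

    $ oc get csv -n openshift-lws-operator
    $ oc get pods -n openshift-lws-operator

The ClusterServiceVersion (CSV) should report the Succeeded phase, and the Operator pod should be in the Running state.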

2.3.2. Deploying a leader worker set

You can use the Leader Worker Set Operator to deploy a leader worker set to assist with managing distributed workloads across nodes.

Prerequisites

  • You have installed the Leader Worker Set Operator.

Procedure

  1. Create a new project by running the following command:

    $ oc new-project my-namespace
  2. Create a file named leader-worker-set.yaml with the following content:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: my-lws # 1
      namespace: my-namespace # 2
    spec:
      leaderWorkerTemplate:
        leaderTemplate: # 3
          metadata: {}
          spec:
            containers:
            - image: nginxinc/nginx-unprivileged:1.27
              name: leader
              resources: {}
        restartPolicy: RecreateGroupOnPodRestart # 4
        size: 3 # 5
        workerTemplate: # 6
          metadata: {}
          spec:
            containers:
            - image: nginxinc/nginx-unprivileged:1.27
              name: worker
              ports:
              - containerPort: 8080
                protocol: TCP
              resources: {}
      networkConfig:
        subdomainPolicy: Shared # 7
      replicas: 2 # 8
      rolloutStrategy:
        rollingUpdateConfiguration:
          maxSurge: 1 # 9
          maxUnavailable: 1
        type: RollingUpdate
      startupPolicy: LeaderCreated

    1 Specify the name of the leader worker set resource.
    2 Specify the namespace for the leader worker set to run in.
    3 Specify the pod template for the leader pods.
    4 Specify the restart policy for when pod failures occur. Allowed values are RecreateGroupOnPodRestart to restart the whole group or None to not restart the group.
    5 Specify the number of pods to create for each group, including the leader pod. For example, a value of 3 creates 1 leader pod and 2 worker pods. The default value is 1.
    6 Specify the pod template for the worker pods.
    7 Specify the policy to use when creating the headless service. Allowed values are UniquePerReplica or Shared. The default value is Shared.
    8 Specify the number of replicas, or leader-worker groups. The default value is 1.
    9 Specify the maximum number of replicas that can be scheduled above the replicas value during rolling updates. The value can be specified as an integer or a percentage.

    For more information on all available fields to configure, see the upstream LeaderWorkerSet API documentation.

  3. Apply the leader worker set configuration by running the following command:

    $ oc apply -f leader-worker-set.yaml

Verification

  1. Verify that pods were created by running the following command:

    $ oc get pods -n my-namespace

    Example output

    NAME         READY   STATUS    RESTARTS   AGE
    my-lws-0     1/1     Running   0          4s   # 1
    my-lws-0-1   1/1     Running   0          3s
    my-lws-0-2   1/1     Running   0          3s
    my-lws-1     1/1     Running   0          7s   # 2
    my-lws-1-1   1/1     Running   0          6s
    my-lws-1-2   1/1     Running   0          6s

    1 The leader pod for the first group.
    2 The leader pod for the second group.
  2. Review the stateful sets by running the following command:

    $ oc get statefulsets

    Example output

    NAME       READY   AGE
    my-lws     2/2     111s   # 1
    my-lws-0   2/2     57s    # 2
    my-lws-1   2/2     60s    # 3

    1 The leader stateful set for all leader-worker groups.
    2 The worker stateful set for the first group.
    3 The worker stateful set for the second group.
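
You can also inspect the leader worker set resource itself. A brief sketch, assuming the upstream API registers the leaderworkersets resource type (short name lws):

    $ oc get leaderworkersets -n my-namespace

The output should list the my-lws resource and indicate whether all of its replicas are ready.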

2.4. Uninstalling the Leader Worker Set Operator

You can remove the Leader Worker Set Operator from OpenShift Container Platform by uninstalling the Operator and removing its related resources.

2.4.1. Uninstalling the Leader Worker Set Operator

You can use the web console to uninstall the Leader Worker Set Operator.

Prerequisites

  • You have access to the cluster with cluster-admin privileges.
  • You have access to the OpenShift Container Platform web console.
  • You have installed the Leader Worker Set Operator.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Navigate to Operators → Installed Operators.
  3. Select openshift-lws-operator from the Project dropdown list.
  4. Delete the LeaderWorkerSetOperator instance.

    1. Click Leader Worker Set Operator and select the LeaderWorkerSetOperator tab.
    2. Click the Options menu next to the cluster entry and select Delete LeaderWorkerSetOperator.
    3. In the confirmation dialog, click Delete.
  5. Uninstall the Leader Worker Set Operator.

    1. Navigate to Operators → Installed Operators.
    2. Click the Options menu next to the Leader Worker Set Operator entry and click Uninstall Operator.
    3. In the confirmation dialog, click Uninstall.

2.4.2. Uninstalling Leader Worker Set Operator resources

Optionally, after uninstalling the Leader Worker Set Operator, you can remove its related resources from your cluster.

Prerequisites

  • You have access to the cluster with cluster-admin privileges.
  • You have access to the OpenShift Container Platform web console.
  • You have uninstalled the Leader Worker Set Operator.

Procedure

  1. Log in to the OpenShift Container Platform web console.
  2. Remove CRDs that were created when the Leader Worker Set Operator was installed:

    1. Navigate to Administration → CustomResourceDefinitions.
    2. Enter LeaderWorkerSetOperator in the Name field to filter the CRDs.
    3. Click the Options menu next to the LeaderWorkerSetOperator CRD and select Delete CustomResourceDefinition.
    4. In the confirmation dialog, click Delete.
  3. Delete the openshift-lws-operator namespace.

    1. Navigate to Administration → Namespaces.
    2. Enter openshift-lws-operator into the filter box.
    3. Click the Options menu next to the openshift-lws-operator entry and select Delete Namespace.
    4. In the confirmation dialog, enter openshift-lws-operator and click Delete.
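
If you prefer the CLI for this cleanup, a rough equivalent is sketched below. The grep pattern is illustrative and the exact CRD name depends on your installation, so verify it before deleting anything:

    # List any CRDs related to the Leader Worker Set Operator that remain:
    $ oc get crd | grep -i leaderworkerset

    # Delete a remaining CRD by the exact name reported above:
    $ oc delete crd <crd_name>

    # Delete the Operator namespace:
    $ oc delete namespace openshift-lws-operator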

Legal Notice

Copyright © 2025 Red Hat

OpenShift documentation is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

Modified versions must remove all Red Hat trademarks.

Portions adapted from https://github.com/kubernetes-incubator/service-catalog/ with modifications by Red Hat.

Red Hat, Red Hat Enterprise Linux, the Red Hat logo, the Shadowman logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.

Linux® is the registered trademark of Linus Torvalds in the United States and other countries.

Java® is a registered trademark of Oracle and/or its affiliates.

XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.

MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.

Node.js® is an official trademark of Joyent. Red Hat Software Collections is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.

The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation’s permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.

All other trademarks are the property of their respective owners.
