
Chapter 9. Managing distributed workloads


In OpenShift AI, distributed workloads like PyTorchJob, RayJob, and RayCluster are created and managed by their respective workload operators. Kueue provides queueing and admission control and integrates with these operators to decide when workloads can run based on cluster-wide quotas.

You can perform advanced configuration for your distributed workloads environment, such as configuring quota management or setting up a cluster for RDMA.

9.1. Configuring quota management for distributed workloads

Configure quotas for distributed workloads by creating Kueue resources. Quotas enable you to share cluster resources among multiple projects.

Prerequisites

  • You have logged in to OpenShift with the cluster-admin role.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

  • You have installed and activated the Red Hat build of Kueue Operator as described in Configuring workload management with Kueue.
  • You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
  • You have created a project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about how to create a project, see Creating a project.
  • You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
  • The resources are physically available in the cluster. For more information about Kueue resources, see the Red Hat build of Kueue documentation.
  • If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.

    Note

    In OpenShift AI 3.4, Red Hat supports only NVIDIA GPU accelerators and AMD GPU accelerators for distributed workloads.

Procedure

  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI (oc) as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Verify that a resource flavor exists or create a custom one, as follows:

    1. Check whether a ResourceFlavor already exists:

      $ oc get resourceflavors
    2. If a ResourceFlavor already exists and you need to modify it, edit it in place:

      $ oc edit resourceflavor <existing_resourceflavor_name>
    3. If a ResourceFlavor does not exist or you want a custom one, create a file called default_flavor.yaml and populate it with the following content:

      Empty Kueue resource flavor

      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ResourceFlavor
      metadata:
        name: <example_resource_flavor>

      For more examples, see Example Kueue resource configurations.

    4. Perform one of the following actions:

      • If you are modifying the existing resource flavor, save the changes.
      • If you are creating a new resource flavor, apply the configuration to create the ResourceFlavor object:

        $ oc apply -f default_flavor.yaml
  3. Verify that a default cluster queue exists or create a custom one, as follows:

    Note

    OpenShift AI automatically creates a default cluster queue when the Kueue integration is activated. You can verify and modify the default cluster queue, or create a custom one.

    1. Check whether a ClusterQueue already exists:

      $ oc get clusterqueues
    2. If a ClusterQueue already exists and you need to modify it (for example, to change the resources), edit it in place:

      $ oc edit clusterqueue <existing_clusterqueue_name>
    3. If a ClusterQueue does not exist or you want a custom one, create a file called cluster_queue.yaml and populate it with the following content:

      Example cluster queue

      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ClusterQueue
      metadata:
        name: <example_cluster_queue>
      spec:
        namespaceSelector: {}  # 1
        resourceGroups:
        - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]  # 2
          flavors:
          - name: "<resource_flavor_name>"  # 3
            resources:  # 4
            - name: "cpu"
              nominalQuota: 9
            - name: "memory"
              nominalQuota: 36Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 5

      1 Defines which namespaces can use the resources governed by this cluster queue. An empty namespaceSelector, as shown in the example, means that all namespaces can use these resources.
      2 Defines the resource types governed by the cluster queue. This example ClusterQueue object governs CPU, memory, and GPU resources. If you use AMD GPUs, replace nvidia.com/gpu with amd.com/gpu in the example code.
      3 Defines the resource flavor that is applied to the resource types listed. In this example, the <resource_flavor_name> resource flavor is applied to CPU, memory, and GPU resources.
      4 Defines the resource requirements for admitting jobs. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.
    4. Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. If you use AMD GPUs, replace nvidia.com/gpu with amd.com/gpu in the example code. For more examples, see Example Kueue resource configurations.

      You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the spec.resourceGroups section as follows:

      • Include the resource name in the coveredResources list.
      • Specify the resource name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.
    5. Perform one of the following actions:

      • If you are modifying the existing cluster queue, save the changes.
      • If you are creating a new cluster queue, apply the configuration to create the ClusterQueue object:

        $ oc apply -f cluster_queue.yaml
  4. Verify that a local queue that points to your cluster queue exists for your project namespace, or create a custom one, as follows:

    Note

    If Kueue is enabled in the OpenShift AI dashboard, new projects created from the dashboard are automatically configured for Kueue management. In those namespaces, a default local queue might already exist. You can verify and modify the local queue, or create a custom one.

    1. Check whether a LocalQueue already exists for your project namespace:

      $ oc get localqueues -n <project_namespace>
    2. If a LocalQueue already exists and you need to modify it (for example, to point to a different ClusterQueue), edit it in place:

      $ oc edit localqueue <existing_localqueue_name> -n <project_namespace>
    3. If a LocalQueue does not exist or you want a custom one, create a file called local_queue.yaml and populate it with the following content:

      Example local queue

      apiVersion: kueue.x-k8s.io/v1beta1
      kind: LocalQueue
      metadata:
        name: <example_local_queue>
        namespace: <project_namespace>
      spec:
        clusterQueue: <cluster_queue_name>

    4. Replace the name, namespace, and clusterQueue values accordingly.
    5. Perform one of the following actions:

      • If you are modifying an existing local queue, save the changes.
      • If you are creating a new local queue, apply the configuration to create the LocalQueue object:

        $ oc apply -f local_queue.yaml

Verification

Check the status of the local queue in a project, as follows:

$ oc get localqueues -n <project_namespace>
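The quota-coverage rule from the cluster queue step (every resource in coveredResources needs a nominalQuota entry in each flavor, even if the value is 0) can be sketched as a quick validation script. This is a minimal sketch, not part of the product; the flavor and resource names are illustrative, and the spec is assumed to have already been loaded from YAML into a Python dict:

```python
# Minimal sketch: check that a ClusterQueue resource group satisfies the
# Kueue rule that every covered resource has a quota entry in each flavor.

def find_missing_quotas(resource_group):
    """Return {flavor_name: [covered resources with no nominalQuota entry]}."""
    covered = set(resource_group["coveredResources"])
    missing = {}
    for flavor in resource_group["flavors"]:
        quoted = {r["name"] for r in flavor["resources"]}
        gaps = sorted(covered - quoted)
        if gaps:
            missing[flavor["name"]] = gaps
    return missing

# Example resource group modeled on the cluster queue above, but with the
# GPU quota accidentally omitted from the flavor.
group = {
    "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
    "flavors": [
        {
            "name": "default-flavor",
            "resources": [
                {"name": "cpu", "nominalQuota": 9},
                {"name": "memory", "nominalQuota": "36Gi"},
            ],
        }
    ],
}

print(find_missing_quotas(group))
# A quota entry is required even when its value is 0, so this reports
# {'default-flavor': ['nvidia.com/gpu']}
```

Adding a quota entry with nominalQuota: 0 for the missing resource is enough to make the group valid again.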

9.2. Example Kueue resource configurations

You can use these example configurations as a starting point for creating Kueue resources to manage your distributed training workloads. These examples show how to configure Kueue resource flavors and cluster queues for common distributed training scenarios.

Note

In OpenShift AI 3.4, Red Hat does not support shared cohorts.

9.2.1. NVIDIA GPUs without shared cohort

9.2.1.1. NVIDIA RTX A400 GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "a400node"
spec:
  nodeLabels:
    instance-type: nvidia-a400-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

9.2.1.2. NVIDIA RTX A1000 GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "a1000node"
spec:
  nodeLabels:
    instance-type: nvidia-a1000-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

9.2.1.3. NVIDIA RTX A400 GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a400queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "a400node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

9.2.1.4. NVIDIA RTX A1000 GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a1000queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "a1000node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
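To route workloads to these cluster queues, each project also needs a local queue. The following is a hypothetical example (the queue name and namespace are illustrative, not from the product documentation) of a LocalQueue that points at the a400queue cluster queue; workloads then reference it through the standard kueue.x-k8s.io/queue-name label:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: a400-local-queue   # illustrative name
  namespace: team-a        # illustrative project namespace
spec:
  clusterQueue: a400queue
```

A second LocalQueue pointing at a1000queue would let users in the same project choose between the two GPU types per workload.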

9.2.2. NVIDIA GPUs and AMD GPUs without shared cohort

9.2.2.1. AMD GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "amd-node"
spec:
  nodeLabels:
    instance-type: amd-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

9.2.2.2. NVIDIA GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "nvidia-node"
spec:
  nodeLabels:
    instance-type: nvidia-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

9.2.2.3. AMD GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-amd-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: "amd-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "amd.com/gpu"
        nominalQuota: 2

9.2.2.4. NVIDIA GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-nvidia-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "nvidia-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

9.3. Configuring a cluster for RDMA

NVIDIA GPUDirect RDMA uses Remote Direct Memory Access (RDMA) to provide direct GPU interconnect. To configure a cluster for RDMA, a cluster administrator must install and configure several Operators.

Prerequisites

Procedure

  1. Log in to the OpenShift Console as a cluster administrator.
  2. Enable NVIDIA GPU support in OpenShift AI.

    This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information, see Enabling NVIDIA GPUs.

    Note

    After the NVIDIA GPU Operator is installed, ensure that rdma is set to enabled in your ClusterPolicy custom resource instance.
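    For reference, RDMA is typically enabled through the driver section of the ClusterPolicy custom resource. The fragment below is a sketch based on the NVIDIA GPU Operator documentation, not a complete ClusterPolicy; verify the exact fields against the version of the Operator you installed:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    rdma:
      enabled: true
```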

  3. To simplify the management of NVIDIA networking resources, install and configure the NVIDIA Network Operator, as follows:

    1. Install the NVIDIA Network Operator, as described in Adding Operators to a cluster in the OpenShift documentation.
    2. Configure the NVIDIA Network Operator, as described in the deployment examples in the Network Operator Application Notes in the NVIDIA documentation.
  4. [Optional] To use Single Root I/O Virtualization (SR-IOV) deployment modes, complete the following steps:

    1. Install the SR-IOV Network Operator, as described in the Installing the SR-IOV Network Operator section in the OpenShift documentation.
    2. Configure the SR-IOV Network Operator, as described in the Configuring the SR-IOV Network Operator section in the OpenShift documentation.
  5. Use the Machine Configuration Operator to increase the limit of pinned memory for non-root users in the container engine (CRI-O) configuration, as follows:

    1. In the OpenShift Console, in the Administrator perspective, click Compute → MachineConfigs.
    2. Click Create MachineConfig.
    3. Replace the placeholder text with the following content:

      Example machine configuration

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 02-worker-container-runtime
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
              - contents:
                  inline: |
                    [crio.runtime]
                    default_ulimits = [
                      "memlock=-1:-1"
                    ]
                mode: 420
                overwrite: true
                path: /etc/crio/crio.conf.d/10-custom

    4. Edit the default_ulimits entry to specify an appropriate value for your configuration. For more information about default limits, see the Set default ulimits on CRIO Using machine config Knowledgebase solution.
    5. Click Create.
    6. Restart the worker nodes to apply the machine configuration.

    This configuration enables non-root users to run the training job with RDMA in the most restrictive OpenShift default security context.
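    After the nodes restart, you can spot-check the new limit from inside a running container. A sketch; <pod_name> is a placeholder for one of your workload pods:

```shell
# The max locked memory limit should report "unlimited" inside containers
# after the machine config (memlock=-1:-1) is applied.
oc exec <pod_name> -- bash -c 'ulimit -l'
```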

Verification

  1. Verify that the Operators are installed correctly, as follows:

    1. In the OpenShift Console, in the Administrator perspective, click Workloads → Pods.
    2. Select your project from the Project list.
    3. Verify that a pod is running for each of the newly installed Operators.
  2. Verify that RDMA is being used, as follows:

    1. Edit the PyTorchJob resource to set the NCCL_DEBUG environment variable to INFO, as shown in the following example:

      Setting the NCCL debug level to INFO

      spec:
        containers:
        - command:
          - /bin/bash
          - -c
          - "your container command"
          env:
          - name: NCCL_SOCKET_IFNAME
            value: "net1"
          - name: NCCL_IB_HCA
            value: "mlx5_1"
          - name: NCCL_DEBUG
            value: "INFO"

    2. Run the PyTorch job.
    3. Check that the pod logs include an entry similar to the following text:

      Example pod log entry

      NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]

9.4. Troubleshooting common problems with distributed workloads

If your users are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve it.

If the problem is not documented here or in the release notes, contact Red Hat Support.

9.4.1. A user’s Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Diagnosis

The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:

status:
  conditions:
    - lastTransitionTime: '2024-05-29T13:05:09Z'
      message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
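Conditions like the one above can also be surfaced programmatically. A minimal sketch, not part of the CodeFlare SDK: it scans a Workload status dict (as returned by a Kubernetes client) for admission-related conditions that are not satisfied. The condition type and status fields follow the Kueue Workload API; the example data is modeled on the message shown above:

```python
# Minimal sketch: extract the reason a Kueue Workload is not admitted
# from its status conditions.

def unadmitted_reasons(status):
    """Return messages from 'Admitted' or 'QuotaReserved' conditions set to False."""
    interesting = {"Admitted", "QuotaReserved"}
    return [
        c.get("message", "")
        for c in status.get("conditions", [])
        if c.get("type") in interesting and c.get("status") == "False"
    ]

# Example status modeled on the condition shown above.
status = {
    "conditions": [
        {
            "type": "QuotaReserved",
            "status": "False",
            "lastTransitionTime": "2024-05-29T13:05:09Z",
            "message": "couldn't assign flavors to pod set small-group-jobtest12: "
                       "insufficient quota for nvidia.com/gpu in flavor "
                       "default-flavor in ClusterQueue",
        }
    ]
}

for msg in unadmitted_reasons(status):
    print(msg)
```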

Resolution

  1. Check whether the resource flavor is created, as follows:

    1. In the OpenShift console, select the user’s project from the Project list.
    2. Click Home → Search, and from the Resources list, select ResourceFlavor.
    3. If necessary, create the resource flavor.
  2. Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
  3. If necessary, increase the resource quota.

For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.

9.4.2. A user’s Ray cluster is in a failed state

Problem

The user might have insufficient resources.

Diagnosis

The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

Resolution

If the failed state persists, complete the following steps:

  1. In the OpenShift console, select the user’s project from the Project list.
  2. Click Workloads → Pods.
  3. Click the user’s pod name to open the pod details page.
  4. Click the Events tab, and review the pod events to identify the cause of the problem.
  5. Check the status of the Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.

9.4.3. A user’s Ray cluster does not start

Problem

After the user runs the cluster.apply() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.

Diagnosis

Check the status of the Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.

Resolution

  1. In the OpenShift console, select the user’s project from the Project list.
  2. Click Workloads → Pods.
  3. Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
  4. Review the logs for the KubeRay pod to identify errors.

9.4.4. A user cannot create a Ray cluster or submit jobs

Problem

After the user runs the cluster.apply() command, an error similar to the following text is shown:

RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}

Diagnosis

The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.
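A quick way to confirm the RBAC side of this diagnosis is to ask OpenShift directly whether the service account can list Ray clusters. A sketch, using the namespace and service-account names from the example error message above:

```shell
# Check whether the workbench service account may list RayCluster resources.
# Names are taken from the example error message above.
oc auth can-i list rayclusters.ray.io \
  --as=system:serviceaccount:regularuser-project:regularuser-workbench \
  -n regularuser-project
# Prints "yes" or "no".
```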

Resolution

  1. Advise the user to identify and specify the correct OpenShift login credentials as follows:

    1. In the OpenShift console header, click your username and click Copy login command.
    2. In the new tab that opens, log in as the user whose credentials you want to use.
    3. Click Display Token.
    4. From the Log in with this token section, copy the token and server values.
    5. Specify the copied token and server values in your notebook code as follows:

      auth = TokenAuthentication(
          token = "<token>",
          server = "<server>",
          skip_tls=False
      )
      auth.login()
  2. Verify that the user has the correct permissions and is part of the rhods-users group.