Chapter 9. Managing distributed workloads
In OpenShift AI, distributed workloads like PyTorchJob, RayJob, and RayCluster are created and managed by their respective workload operators. Kueue provides queueing and admission control and integrates with these operators to decide when workloads can run based on cluster-wide quotas.
You can perform advanced configuration for your distributed workloads environment, such as changing the default behavior of the CodeFlare Operator or setting up a cluster for RDMA.
9.1. Configuring quota management for distributed workloads
Configure quotas for distributed workloads by creating Kueue resources. Quotas enable you to share cluster resources across several data science projects.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Dedicated
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
- You have installed and activated the Red Hat build of Kueue Operator as described in Configuring workload management with Kueue.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components.
- You have created a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about how to create a project, see Creating a data science project.
- You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
- The resources are physically available in the cluster. For more information about Kueue resources, see the Red Hat build of Kueue documentation.
- If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Note: In OpenShift AI, Red Hat supports only NVIDIA GPU accelerators and AMD GPU accelerators for distributed workloads.
Procedure
- In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI (oc) as shown in the following example:

  $ oc login <openshift_cluster_url> -u <admin_username> -p <password>

- Verify that a resource flavor exists or create a custom one, as follows:
  - Check whether a ResourceFlavor already exists:

    $ oc get resourceflavors

  - If a ResourceFlavor already exists and you need to modify it, edit it in place:

    $ oc edit resourceflavor <existing_resourceflavor_name>

  - If a ResourceFlavor does not exist or you want a custom one, create a file called default_flavor.yaml and populate it with the following content:

    Empty Kueue resource flavor

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: <example_resource_flavor>

    For more examples, see Example Kueue resource configurations.
Perform one of the following actions:
- If you are modifying the existing resource flavor, save the changes.
- If you are creating a new resource flavor, apply the configuration to create the ResourceFlavor object:

  $ oc apply -f default_flavor.yaml
- Verify that a default cluster queue exists or create a custom one, as follows:
Note: OpenShift AI automatically creates a default cluster queue when the Kueue integration is activated. You can verify and modify the default cluster queue, or create a custom one.
  - Check whether a ClusterQueue already exists:

    $ oc get clusterqueues

  - If a ClusterQueue already exists and you need to modify it (for example, to change the resources), edit it in place:

    $ oc edit clusterqueue <existing_clusterqueue_name>

  - If a ClusterQueue does not exist or you want a custom one, create a file called cluster_queue.yaml and populate it with the following content:

    Example cluster queue
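    A minimal sketch of such a ClusterQueue, assuming a single resource group that covers CPU, memory, and NVIDIA GPU resources with the quota values referenced in the callouts; <cluster_queue_name> and <resource_flavor_name> are placeholders, and the numbered comments map to the callouts below:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: <cluster_queue_name>
    spec:
      namespaceSelector: {}  # (1) an empty selector admits workloads from all namespaces
      resourceGroups:
      - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]  # (2)
        flavors:
        - name: <resource_flavor_name>  # (3)
          resources:  # (4) quotas used to admit jobs
          - name: "cpu"
            nominalQuota: 9
          - name: "memory"
            nominalQuota: 36Gi
          - name: "nvidia.com/gpu"
            nominalQuota: 5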
1. Defines which namespaces can use the resources governed by this cluster queue. An empty namespaceSelector, as shown in the example, means that all namespaces can use these resources.
2. Defines the resource types governed by the cluster queue. This example ClusterQueue object governs CPU, memory, and GPU resources. If you use AMD GPUs, replace nvidia.com/gpu with amd.com/gpu in the example code.
3. Defines the resource flavor that is applied to the resource types listed. In this example, the <resource_flavor_name> resource flavor is applied to CPU, memory, and GPU resources.
4. Defines the resource requirements for admitting jobs. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.

Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. If you use AMD GPUs, replace nvidia.com/gpu with amd.com/gpu in the example code. For more examples, see Example Kueue resource configurations.

You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the spec.resourceGroups section as follows:

- Include the resource name in the coveredResources list.
- Specify the resource name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.
Perform one of the following actions:
- If you are modifying the existing cluster queue, save the changes.
- If you are creating a new cluster queue, apply the configuration to create the ClusterQueue object:

  $ oc apply -f cluster_queue.yaml
- Verify that a local queue that points to your cluster queue exists for your project namespace, or create a custom one, as follows:
Note: If Kueue is enabled in the OpenShift AI dashboard, new projects created from the dashboard are automatically configured for Kueue management. In those namespaces, a default local queue might already exist. You can verify and modify the local queue, or create a custom one.
  - Check whether a LocalQueue already exists for your project namespace:

    $ oc get localqueues -n <project_namespace>

  - If a LocalQueue already exists and you need to modify it (for example, to point to a different ClusterQueue), edit it in place:

    $ oc edit localqueue <existing_localqueue_name> -n <project_namespace>

  - If a LocalQueue does not exist or you want a custom one, create a file called local_queue.yaml and populate it with the following content:

    Example local queue
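    A minimal sketch of such a LocalQueue; all names are placeholders, and the kueue.x-k8s.io/default-queue annotation (an assumption here) marks the queue as the namespace default:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: <project_namespace>
      name: <local_queue_name>
      annotations:
        kueue.x-k8s.io/default-queue: "true"  # assumed annotation; optional
    spec:
      clusterQueue: <cluster_queue_name>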
    Replace the name, namespace, and clusterQueue values accordingly.

Perform one of the following actions:
- If you are modifying an existing local queue, save the changes.
- If you are creating a new local queue, apply the configuration to create the LocalQueue object:

  $ oc apply -f local_queue.yaml
Verification
Check the status of the local queue in a project, as follows:
$ oc get localqueues -n <project_namespace>
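The output lists each local queue together with the cluster queue it points to and its current workload counts; it typically resembles the following (names and counts are illustrative):

NAME                 CLUSTERQUEUE           PENDING WORKLOADS   ADMITTED WORKLOADS
<local_queue_name>   <cluster_queue_name>   0                   1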
9.2. Example Kueue resource configurations for distributed workloads
You can use these example configurations as a starting point for creating Kueue resources to manage your distributed training workloads.
These examples show how to configure Kueue resource flavors and cluster queues for common distributed training scenarios.
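For instance, a resource flavor can dedicate GPU nodes to a queue by matching node labels and tolerating GPU taints. The following sketch assumes an nvidia.com/gpu.present node label and an nvidia.com/gpu taint, which are common NVIDIA GPU Operator conventions rather than values taken from this document:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"   # assumed label; adjust to your cluster
  tolerations:
  - key: "nvidia.com/gpu"            # assumed taint key on GPU nodes
    operator: "Exists"
    effect: "NoSchedule"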
In OpenShift AI, Red Hat does not support shared cohorts.
9.3. Configuring the CodeFlare Operator
If you want to change the default configuration of the CodeFlare Operator for distributed workloads in OpenShift AI, you can edit the associated config map.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components.
Procedure
- In the OpenShift console, click Workloads → ConfigMaps.
- From the Project list, select redhat-ods-applications.
- Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.
- Click the YAML tab to show the config map specifications.
- In the data:config.yaml:kuberay section, you can edit the following entries:

  ingressDomain
    This configuration option is null (ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. OpenShift AI uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:

    Example dashboard and client routes

    ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
    ray-client-<clustername>-<namespace>.<your.ingress.domain>

  mTLSEnabled
    This configuration option is enabled (mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. When this option is enabled, Ray clients cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and then set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:

    Example Ray client authentication code
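    A minimal sketch of that flow, assuming cluster is an existing codeflare_sdk Cluster object; the generate_cert helper names follow the CodeFlare SDK, but verify them against your SDK version:

    from codeflare_sdk import generate_cert
    import ray

    # Generate a TLS certificate and key for this client, signed by the
    # Ray cluster's CA (stored in the ca-secret-<cluster_name> secret).
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)

    # Export the Ray environment variables that point at the generated
    # certificates, so that the client connects over mTLS.
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)

    # Re-initialize the Ray client against the cluster's client endpoint.
    ray.init(address=cluster.cluster_uri())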
  rayDashboardOAuthEnabled
    This configuration option is enabled (rayDashboardOAuthEnabled: true) by default. When this option is enabled, OpenShift AI places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the Ray JobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:

    Example authorization header
    {Authorization: "Bearer <your-openshift-token>"}
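    For example, the header can be passed to Ray's JobSubmissionClient when submitting jobs against the dashboard route (a sketch; the URL is a placeholder):

    from ray.job_submission import JobSubmissionClient

    # The dashboard route that OpenShift AI generates for the Ray cluster.
    client = JobSubmissionClient(
        "https://ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>",
        headers={"Authorization": "Bearer <your-openshift-token>"},
    )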
- To save your changes, click Save.
- To apply your changes, delete the pod:
  - Click Workloads → Pods.
  - Find the codeflare-operator-manager-<pod-id> pod.
  - Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.
Verification
Check the status of the codeflare-operator-manager pod, as follows:
- In the OpenShift console, click Workloads → Deployments.
- Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.
- Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.
9.4. Configuring a cluster for RDMA
NVIDIA GPUDirect RDMA uses Remote Direct Memory Access (RDMA) to provide direct GPU interconnect. To configure a cluster for RDMA, a cluster administrator must install and configure several Operators.
Prerequisites
- You can access an OpenShift cluster as a cluster administrator.
- Your cluster has multiple worker nodes with supported NVIDIA GPUs, and can access a compatible NVIDIA accelerated networking platform.
- You have installed Red Hat OpenShift AI with the required distributed training components as described in Installing the distributed workloads components.
- You have configured the distributed training resources as described in Managing distributed workloads.
Procedure
- Log in to the OpenShift Console as a cluster administrator.
- Enable NVIDIA GPU support in OpenShift AI.
This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information, see Enabling NVIDIA GPUs.
Note: After the NVIDIA GPU Operator is installed, ensure that rdma is set to enabled in your ClusterPolicy custom resource instance, as shown in the following sketch.
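The following fragment is a sketch based on the NVIDIA GPU Operator's driver configuration; verify the exact field layout against your installed Operator version:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    rdma:
      enabled: true  # enables GPUDirect RDMA support in the GPU driver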
- To simplify the management of NVIDIA networking resources, install and configure the NVIDIA Network Operator, as follows:
- Install the NVIDIA Network Operator, as described in Adding Operators to a cluster in the OpenShift documentation.
- Configure the NVIDIA Network Operator, as described in the deployment examples in the Network Operator Application Notes in the NVIDIA documentation.
- Optional: To use Single Root I/O Virtualization (SR-IOV) deployment modes, complete the following steps:
- Install the SR-IOV Network Operator, as described in the Installing the SR-IOV Network Operator section in the OpenShift documentation.
- Configure the SR-IOV Network Operator, as described in the Configuring the SR-IOV Network Operator section in the OpenShift documentation.
- Use the Machine Configuration Operator to increase the limit of pinned memory for non-root users in the container engine (CRI-O) configuration, as follows:
  - In the OpenShift Console, in the Administrator perspective, click Compute → MachineConfigs.
  - Click Create MachineConfig.
  - Replace the placeholder text with the following content:

    Example machine configuration
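    A minimal sketch of such a MachineConfig; the object name and drop-in path are illustrative, and the data URL decodes to the CRI-O drop-in shown in the leading comment:

    # The data URL below decodes to this CRI-O drop-in file:
    #   [crio.runtime]
    #   default_ulimits = [
    #   "memlock=-1:-1"
    #   ]
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 99-worker-crio-memlock
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
          - path: /etc/crio/crio.conf.d/99-memlock.conf
            mode: 420
            overwrite: true
            contents:
              source: data:text/plain;charset=utf-8,%5Bcrio.runtime%5D%0Adefault_ulimits%20%3D%20%5B%0A%22memlock%3D-1%3A-1%22%0A%5D%0A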
  - Edit the default_ulimits entry to specify an appropriate value for your configuration. For more information about default limits, see the Set default ulimits on CRIO Using machine config Knowledgebase solution.
  - Click Create.
- Restart the worker nodes to apply the machine configuration.
This configuration enables non-root users to run the training job with RDMA in the most restrictive OpenShift default security context.
Verification
Verify that the Operators are installed correctly, as follows:
- In the OpenShift Console, in the Administrator perspective, click Workloads → Pods.
- Select your project from the Project list.
- Verify that a pod is running for each of the newly installed Operators.
- Verify that RDMA is being used, as follows:
  - Edit the PyTorchJob resource to set the NCCL_DEBUG environment variable to INFO, as shown in the following example:

    Setting the NCCL debug level to INFO
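    A sketch showing only the relevant fragment of a PyTorchJob spec; the job and container names are placeholders:

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: <pytorch_job_name>
    spec:
      pytorchReplicaSpecs:
        Worker:
          template:
            spec:
              containers:
              - name: pytorch
                env:
                - name: NCCL_DEBUG
                  value: "INFO"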
  - Run the PyTorch job.
  - Check that the pod logs include an entry similar to the following text:
Example pod log entry
NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]
9.5. Troubleshooting common problems with distributed workloads for administrators
If your users are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.
If the problem is not documented here or in the release notes, contact Red Hat Support.
9.5.1. A user’s Ray cluster is in a suspended state
Problem
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
Diagnosis
The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:
status:
conditions:
- lastTransitionTime: '2024-05-29T13:05:09Z'
message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
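One way to retrieve this status (a sketch; substitute your project namespace and the workload name that the first command reports):

$ oc get workloads -n <project_namespace>
$ oc describe workload <workload_name> -n <project_namespace>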
Resolution
Check whether the resource flavor is created, as follows:
- In the OpenShift console, select the user’s project from the Project list.
- Click Home → Search, and from the Resources list, select ResourceFlavor.
- If necessary, create the resource flavor.
- Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
- If necessary, increase the resource quota.
For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.
9.5.2. A user’s Ray cluster is in a failed state
Problem
The user might have insufficient resources.
Diagnosis
The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
Resolution
If the failed state persists, complete the following steps:
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to identify the cause of the problem.
- Check the status of the Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.
9.5.3. A user receives a "failed to call webhook" error message for the CodeFlare Operator
Problem
After the user runs the cluster.apply() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
Diagnosis
The CodeFlare Operator pod might not be running.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
- Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
9.5.4. A user’s Ray cluster does not start
Problem
After the user runs the cluster.apply() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.
Diagnosis
Check the status of the Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.
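For example, you can inspect both resources with commands similar to the following (a sketch; the names and namespace are placeholders):

$ oc describe workload <workload_name> -n <project_namespace>
$ oc describe raycluster <raycluster_name> -n <project_namespace>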
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
- Review the logs for the KubeRay pod to identify errors.
9.5.5. A user cannot create a Ray cluster or submit jobs
Problem
After the user runs the cluster.apply() command, an error similar to the following text is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
Diagnosis
The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.
Resolution
Advise the user to identify and specify the correct OpenShift login credentials as follows:
- In the OpenShift console header, click your username and click Copy login command.
- In the new tab that opens, log in as the user whose credentials you want to use.
- Click Display Token.
- From the Log in with this token section, copy the token and server values.
- Specify the copied token and server values in your notebook code as follows:
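  A minimal sketch using the CodeFlare SDK's TokenAuthentication class, with the token and server values copied in the previous steps:

  from codeflare_sdk import TokenAuthentication

  # Values copied from the OpenShift "Copy login command" page.
  auth = TokenAuthentication(
      token="<token>",
      server="<server>",
      skip_tls=False,  # set to True only if the cluster certificate cannot be verified
  )
  auth.login()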
- Verify that the user has the correct permissions and is part of the rhods-users group.