Chapter 8. Managing distributed workloads
In OpenShift AI, cluster administrators create Kueue resources to configure quota management for distributed workloads. Cluster administrators can optionally configure the CodeFlare Operator if they want to change its default behavior.
The minimum version of OpenShift that is compatible with Kueue 0.10.1 is 4.15. If you deploy Red Hat OpenShift AI 2.18 or later on OpenShift 4.14 or earlier, you must disable the API Priority and Fairness configuration for the Visibility API, as described in the Enabling/Disabling API Priority and Fairness section in the Kueue Cluster Administration guide. This modification is required because the PriorityLevelConfiguration API is incompatible with older OpenShift versions.
8.1. Overview of Kueue resources
Cluster administrators can configure Kueue objects (such as resource flavors, cluster queues, and local queues) to manage distributed workload resources across multiple nodes in an OpenShift cluster.
In OpenShift AI 2.21, Red Hat does not support shared cohorts.
8.1.1. Resource flavor
The Kueue ResourceFlavor object describes the resource variations that are available in a cluster.
Resources in a cluster can be homogenous or heterogeneous:
- Homogeneous resources are identical across the cluster: same node type, CPUs, memory, accelerators, and so on.
- Heterogeneous resources have variations across the cluster.
If a cluster has homogeneous resources, or if it is not necessary to manage separate quotas for different flavors of a resource, a cluster administrator can create an empty ResourceFlavor object named default-flavor, without any labels or taints, as follows:
Empty Kueue resource flavor for homogeneous resources
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: default-flavor
If a cluster has heterogeneous resources, cluster administrators can define a different resource flavor for each variation in the resources available. Example variations include different CPUs, different memory, or different accelerators. If a cluster has multiple types of accelerator, cluster administrators can set up a resource flavor for each accelerator type. Cluster administrators can then associate the resource flavors with cluster nodes by using labels, taints, and tolerations, as shown in the following example.
Example Kueue resource flavor for heterogeneous resources
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "spot"
spec:
nodeLabels:
instance-type: spot
nodeTaints:
- effect: NoSchedule
key: spot
value: "true"
tolerations:
- key: "spot-taint"
operator: "Exists"
effect: "NoSchedule"
Make sure that each resource flavor has the correct label selectors and taint tolerations so that workloads run on the expected nodes.
See the example configurations provided in Example Kueue resource configurations.
For more information about configuring resource flavors, see Resource Flavor in the Kueue documentation.
8.1.2. Cluster queue
The Kueue ClusterQueue object manages a pool of cluster resources such as pods, CPUs, memory, and accelerators. A cluster can have multiple cluster queues, and each cluster queue can reference multiple resource flavors.
Cluster administrators can configure a cluster queue to define the resource flavors that the queue manages, and assign a quota for each resource in each resource flavor.
The following example configures a cluster queue to assign a quota of 9 CPUs, 36 GiB memory, 5 pods, and 5 NVIDIA GPUs.
Example cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "cluster-queue"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
flavors:
- name: "default-flavor"
resources:
- name: "cpu"
nominalQuota: 9
- name: "memory"
nominalQuota: 36Gi
- name: "pods"
nominalQuota: 5
- name: "nvidia.com/gpu"
nominalQuota: '5'
A cluster administrator should notify the consumers of a cluster queue about the quota limits for that cluster queue. The cluster queue starts a distributed workload only if the total required resources are within these quota limits. If the sum of the requests for a resource in a distributed workload is greater than the specified quota for that resource in the cluster queue, the cluster queue does not admit the distributed workload.
See the example configurations provided in Example Kueue resource configurations.
For more information about configuring cluster queues, see Cluster Queue in the Kueue documentation.
8.1.3. Local queue
The Kueue LocalQueue object groups closely related distributed workloads in a project. Cluster administrators can configure local queues to specify the project name and the associated cluster queue. Each local queue then grants access to the resources that its specified cluster queue manages. A cluster administrator can optionally define one local queue in a project as the default local queue for that project.
When configuring a distributed workload, the user specifies the local queue name. If a cluster administrator configured a default local queue, the user can omit the local queue specification from the distributed workload code.
Kueue allocates the resources for a distributed workload from the cluster queue that is associated with the local queue, if the total requested resources are within the quota limits specified in that cluster queue.
The following example configures a local queue called team-a-queue for the team-a project, and specifies cluster-queue as the associated cluster queue.
Example local queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
namespace: team-a
name: team-a-queue
annotations:
kueue.x-k8s.io/default-queue: "true"
spec:
clusterQueue: cluster-queue
In this example, the kueue.x-k8s.io/default-queue: "true" annotation defines this local queue as the default local queue for the team-a project. If a user submits a distributed workload in the team-a project and that distributed workload does not specify a local queue in the cluster configuration, Kueue automatically routes the distributed workload to the team-a-queue local queue. The distributed workload can then access the resources that the cluster-queue cluster queue manages.
For more information about configuring local queues, see Local Queue in the Kueue documentation.
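To illustrate how a distributed workload targets this queue, the following sketch shows how a user might reference the team-a-queue local queue from the CodeFlare SDK. The num_workers value and other settings are illustrative, and the available ClusterConfiguration parameters can vary by SDK version.
Example Ray cluster configuration that references the local queue (illustrative)
from codeflare_sdk import Cluster, ClusterConfiguration

# The local_queue value must name a LocalQueue object in the same project.
# Omit it if a default local queue is configured for the project.
cluster = Cluster(ClusterConfiguration(
    name="raytest",              # illustrative Ray cluster name
    namespace="team-a",          # project that contains the team-a-queue local queue
    num_workers=1,               # illustrative worker count
    local_queue="team-a-queue",
))
cluster.up()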
8.2. Example Kueue resource configurations
These examples show how to configure Kueue resource flavors and cluster queues.
In OpenShift AI 2.21, Red Hat does not support shared cohorts.
8.3. Configuring quota management for distributed workloads
Configure quotas for distributed workloads on a cluster, so that you can share resources between several data science projects.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- You have created a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about how to create a project, see Creating a data science project.
- You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
- The resources are physically available in the cluster. For more information about Kueue resources, see Overview of Kueue resources.
- If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Note: In OpenShift AI 2.21, Red Hat supports only NVIDIA GPU accelerators and AMD GPU accelerators for distributed workloads.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
Create an empty Kueue resource flavor, as follows:
Create a file called default_flavor.yaml and populate it with the following content:
Empty Kueue resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
Apply the configuration to create the default-flavor object:
$ oc apply -f default_flavor.yaml
Create a cluster queue to manage the empty Kueue resource flavor, as follows:
Create a file called cluster_queue.yaml and populate it with the following content:
Example cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 5
Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. If you use AMD GPUs, replace nvidia.com/gpu with amd.com/gpu in the example code. The cluster queue starts a distributed workload only if the total required resources are within these quota limits.
You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the spec.resourceGroups section as follows:
- Include the resource name in the coveredResources list.
- Specify the resource name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.
Apply the configuration to create the cluster-queue object:
$ oc apply -f cluster_queue.yaml
Create a local queue that points to your cluster queue, as follows:
Create a file called local_queue.yaml and populate it with the following content:
Example local queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: test
  name: local-queue-test
  annotations:
    kueue.x-k8s.io/default-queue: 'true'
spec:
  clusterQueue: cluster-queue
The kueue.x-k8s.io/default-queue: 'true' annotation defines this queue as the default queue. Distributed workloads are submitted to this queue if no local_queue value is specified in the ClusterConfiguration section of the data science pipeline or Jupyter notebook or Microsoft Visual Studio Code file.
- Update the namespace value to specify the same namespace as in the ClusterConfiguration section that creates the Ray cluster.
- Optional: Update the name value accordingly.
Apply the configuration to create the local-queue object:
$ oc apply -f local_queue.yaml
The cluster queue allocates the resources to run distributed workloads in the local queue.
Verification
Check the status of the local queue in a project, as follows:
$ oc get -n <project-name> localqueues
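The command lists each local queue together with its associated cluster queue and workload counts. The following output is illustrative only; the exact columns depend on the installed Kueue version.
Example output (illustrative)
NAME               CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
local-queue-test   cluster-queue   0                   1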
8.4. Enforcing the use of local queues
Efficient workload orchestration in OpenShift clusters relies on strict management of resources and queues. Cluster administrators can use the Validating Admission Policy feature to enforce the mandatory labeling of RayCluster and PyTorchJob resources with Local Queue identifiers. This labeling ensures that workloads are properly categorized and routed based on queue management policies, which prevents resource contention and enhances operational efficiency.
The Validating Admission Policy feature is available in OpenShift v4.17 or later.
The Validating Admission Policy feature is currently available in Red Hat OpenShift AI 2.21 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
8.4.1. Enforcing the local-queue labeling policy for all projects
When the local-queue labeling policy is enforced, Ray clusters and PyTorchJobs are created only if they are configured to use a local queue, and the Ray cluster and PyTorchJob resources are then managed by Kueue.
The local-queue labeling policy is enforced for all projects by default. The Validating Admission Policy is enforced on both RayCluster and PyTorchJob resources.
If the original ValidatingAdmissionPolicyBinding resource is edited, you can use either of the following methods to undo the edits and enforce the policy for all projects:
- Delete the kueue-validating-admission-policy-binding resource. The resource is automatically re-created with the default values. No other action is required.
- Edit the existing resource as described in this procedure.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
Procedure
- In the OpenShift console, open the Administrator perspective.
- From the Project list, select All Projects.
- Click Home → Search.
- In the Resources list, search for ValidatingAdmissionPolicyBinding.
- Click the kueue-validating-admission-policy-binding entry to open the details page.
- Click the YAML tab to show the binding specifications.
Ensure that the following fields are set to the specified values:
Example to enforce local-queue labeling for all projects
kind: ValidatingAdmissionPolicyBinding
apiVersion: admissionregistration.k8s.io/v1
metadata:
  name: kueue-validating-admission-policy-binding
  uid: <Populated by the system. Read-only.>
  resourceVersion: <Populated by the system. Read-only.>
  generation: <Populated by the system. Read-only.>
  creationTimestamp: <Populated by the system. Read-only.>
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: kueue
  managedFields:
spec:
  policyName: kueue-validating-admission-policy
  matchResources:
    namespaceSelector: {}
    objectSelector: {}
    matchPolicy: Equivalent
  validationActions:
  - Deny
- If you made any changes, click Save.
Verification
To verify that the local-queue labeling policy is enforced for Ray clusters:
- Create a project.
Complete the following steps in the new project:
Before you configure a local queue, try to create a Ray cluster.
The Validating Admission Policy rejects the request, and the Ray cluster is not created, because no local queue is configured.
- Create a local queue without the default-queue annotation.
Try to create a Ray cluster, and specify the local-queue name in the local_queue field.
The Validating Admission Policy approves the request, and the Ray cluster is created.
Try to create a Ray cluster without specifying a value in the local_queue field.
The Validating Admission Policy rejects the request, and the Ray cluster is not created, because a local queue is not specified and a default local queue is not configured.
- Edit the local queue to add the kueue.x-k8s.io/default-queue: "true" annotation, which configures that queue as the default local queue.
Try to create a Ray cluster without specifying a value in the local_queue field.
The Validating Admission Policy approves the request, and the Ray cluster is created even though a local queue is not specified, because the default local queue is used.
To verify that the local-queue labeling policy is enforced for PyTorchJobs:
Complete the following steps in the new project:
Before you configure a local queue, try to create a PyTorchJob.
The Validating Admission Policy rejects the request, and the PyTorchJob is not created, because no local queue is configured.
- Create a local queue.
Try to create a PyTorchJob and add the kueue.x-k8s.io/queue-name: <local-queue-name> label in the labels field (see the example manifest after these verification steps).
The Validating Admission Policy approves the request, and the PyTorchJob is created.
Try to create a PyTorchJob without specifying the kueue.x-k8s.io/queue-name label in the labels field.
The Validating Admission Policy rejects the request, and the PyTorchJob is not created, because a local queue is not specified.
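For reference, the following sketch shows where the label goes on a PyTorchJob manifest. The metadata values, image, and command are placeholders, and the replica specification is reduced to a single master replica for brevity.
Example PyTorchJob manifest with the local-queue label (illustrative)
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-sample                     # placeholder name
  namespace: <project-name>
  labels:
    kueue.x-k8s.io/queue-name: <local-queue-name>
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: <training-image>        # placeholder training image
            command: ["python", "/workspace/train.py"]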
8.4.2. Disabling the local-queue labeling policy for all projects
The local-queue labeling policy is enforced for all projects by default. If the local-queue labeling policy is disabled, it is possible to create Ray clusters that do not use a local queue. However, the resources of such Ray clusters are not managed by Kueue.
You can disable the local-queue labeling policy for all projects by editing the ValidatingAdmissionPolicyBinding resource as described in this procedure.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
Procedure
- In the OpenShift console, open the Administrator perspective.
- From the Project list, select All Projects.
- Click Home → Search.
- In the Resources list, search for ValidatingAdmissionPolicyBinding.
- Click the kueue-validating-admission-policy-binding entry to open the details page.
- Click the YAML tab to show the binding specifications.
Edit the policyName field to change the value to disabled, as shown in the following example:
Example to disable local-queue labeling for all projects
kind: ValidatingAdmissionPolicyBinding
apiVersion: admissionregistration.k8s.io/v1
metadata:
  name: kueue-validating-admission-policy-binding
  uid: <Populated by the system. Read-only.>
  resourceVersion: <Populated by the system. Read-only.>
  generation: <Populated by the system. Read-only.>
  creationTimestamp: <Populated by the system. Read-only.>
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: kueue
  managedFields:
spec:
  policyName: disabled
  matchResources:
    namespaceSelector: {}
    objectSelector: {}
    matchPolicy: Equivalent
  validationActions:
  - Deny
- Click Save.
Verification
To verify that the local-queue labeling policy is disabled for Ray clusters:
- Create a project.
Complete the following steps in the new project:
Before you configure a local queue, try to create a Ray cluster.
The Ray cluster is created, even though no local queue is configured, because the Validating Admission Policy is not enforced. However, the Ray cluster resources are not managed by Kueue.
- Create a local queue without the default-queue annotation.
Try to create a Ray cluster, and specify the local-queue name in the local_queue field.
The Ray cluster is created, and the Ray cluster resources are managed by Kueue.
Try to create a Ray cluster without specifying a value in the local_queue field.
The Ray cluster is created, but the Ray cluster resources are not managed by Kueue.
- Edit the local queue to add the kueue.x-k8s.io/default-queue: "true" annotation, which configures that queue as the default local queue.
Try to create a Ray cluster without specifying a value in the local_queue field.
The Ray cluster is created, and the Ray cluster resources are managed by Kueue.
To verify that the local-queue labeling policy is disabled for PyTorchJobs:
- Create a project.
Complete the following steps in the new project:
Before you configure a local queue, try to create a PyTorchJob.
The PyTorchJob is created, even though no local queue is configured, because the Validating Admission Policy is not enforced. The PyTorchJob resources are not managed by Kueue.
- Create a local queue.
Try to create a PyTorchJob and add the kueue.x-k8s.io/queue-name: <local-queue-name> label in the labels field.
The PyTorchJob is created, and the PyTorchJob resources are managed by Kueue.
Try to create a PyTorchJob without adding the kueue.x-k8s.io/queue-name: <local-queue-name> label in the labels field.
The PyTorchJob is created, but the PyTorchJob resources are not managed by Kueue.
8.4.3. Enforcing the local-queue labeling policy for some projects only
When the local-queue labeling policy is enforced, Ray clusters and PyTorchJobs are created only if they are configured to use a local queue, and the Ray cluster and PyTorchJob resources are then managed by Kueue. Disabling the policy means that it is possible to create Ray clusters or PyTorchJobs that do not use a local queue, but the resources of such Ray clusters or PyTorchJobs are not managed by Kueue.
The local-queue labeling policy is enforced for all projects by default. The Validating Admission Policy is enforced on both RayCluster and PyTorchJob resources. To enforce the local-queue labeling policy for some projects only, follow these steps.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
Procedure
- In the OpenShift console, open the Administrator perspective.
- From the Project list, select All Projects.
- Click Home → Search.
- In the Resources list, search for ValidatingAdmissionPolicyBinding.
- Click the kueue-validating-admission-policy-binding entry to open the details page.
- Click the YAML tab to show the binding specifications.
Edit the namespaceSelector field to delete the {} value, and add the matchLabels and kueue.openshift.io/managed values as shown in the following example:
Important: The kueue.openshift.io/managed=true label is supported for OpenShift AI projects only.
Example to enforce local-queue labeling for some projects only
kind: ValidatingAdmissionPolicyBinding
apiVersion: admissionregistration.k8s.io/v1
metadata:
  name: kueue-validating-admission-policy-binding
  uid: <Populated by the system. Read-only.>
  resourceVersion: <Populated by the system. Read-only.>
  generation: <Populated by the system. Read-only.>
  creationTimestamp: <Populated by the system. Read-only.>
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: kueue
  managedFields:
spec:
  policyName: kueue-validating-admission-policy
  matchResources:
    namespaceSelector:
      matchLabels:
        kueue.openshift.io/managed: "true"
    objectSelector: {}
    matchPolicy: Equivalent
  validationActions:
  - Deny
- Click Save.
Add the kueue.openshift.io/managed label to each project for which you want to enforce this policy, by running the following command:
Example command to add the kueue.openshift.io/managed label to a project
oc label namespace <project-name> kueue.openshift.io/managed=true
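To confirm that a project carries the label, you can inspect its labels; the following command is a simple check that assumes the OpenShift CLI is installed.
Example command to confirm the label on a project
oc get namespace <project-name> --show-labels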
Verification
- Create two projects: Project A and Project B.
- Add the kueue.openshift.io/managed label to Project A only.
- In each project, try to create a Ray cluster or PyTorchJob.
- In Project A and all projects with the kueue.openshift.io/managed label, the behavior is as described in Enforcing the local-queue labeling policy for all projects.
- In Project B and all projects without the kueue.openshift.io/managed label, the behavior is as described in Disabling the local-queue labeling policy for all projects.
8.5. Configuring the CodeFlare Operator
If you want to change the default configuration of the CodeFlare Operator for distributed workloads in OpenShift AI, you can edit the associated config map.
Prerequisites
- You have logged in to OpenShift with the cluster-admin role.
- You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
Procedure
- In the OpenShift console, click Workloads → ConfigMaps.
- From the Project list, select redhat-ods-applications.
- Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.
- Click the YAML tab to show the config map specifications.
In the data:config.yaml:kuberay section, you can edit the following entries:
- ingressDomain
This configuration option is null (ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. OpenShift AI uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:
Example dashboard and client routes
ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
ray-client-<clustername>-<namespace>.<your.ingress.domain>
- mTLSEnabled
This configuration option is enabled (mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. When this option is enabled, Ray clients cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and then set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:
Example Ray client authentication code
from codeflare_sdk import generate_cert
import ray

# Generate and export the mTLS certificates, then connect the Ray client.
generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
generate_cert.export_env(cluster.config.name, cluster.config.namespace)

ray.init(cluster.cluster_uri())
- rayDashboardOauthEnabled
This configuration option is enabled (rayDashboardOAuthEnabled: true) by default. When this option is enabled, OpenShift AI places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the RayJobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:
Example authorization header
{Authorization: "Bearer <your-openshift-token>"}
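For example, the following sketch passes that header by using Ray's JobSubmissionClient. The dashboard URL, token, and entrypoint are placeholders, and the exact client class or wrapper can vary by SDK version.
Example Ray job submission with an authorization header (illustrative)
from ray.job_submission import JobSubmissionClient

# The dashboard route follows the pattern shown in the ingressDomain entry above.
client = JobSubmissionClient(
    "https://ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>",
    headers={"Authorization": "Bearer <your-openshift-token>"},
)

submission_id = client.submit_job(entrypoint="python train.py")  # placeholder entrypoint
print(submission_id)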
- To save your changes, click Save.
To apply your changes, delete the pod:
- Click Workloads → Pods.
- Find the codeflare-operator-manager-<pod-id> pod.
- Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.
Verification
Check the status of the codeflare-operator-manager pod, as follows:
- In the OpenShift console, click Workloads → Deployments.
- Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.
- Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.
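Alternatively, you can perform the same check from the command line; the following command is a simple sketch that assumes the OpenShift CLI is installed.
Example command to check the CodeFlare Operator pod
oc get pods -n redhat-ods-applications | grep codeflare-operator-manager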
8.6. Configuring a cluster for RDMA
NVIDIA GPUDirect RDMA uses Remote Direct Memory Access (RDMA) to provide direct GPU interconnect. To configure a cluster for RDMA, a cluster administrator must install and configure several Operators.
Prerequisites
- You can access an OpenShift cluster as a cluster administrator.
- Your cluster has multiple worker nodes with supported NVIDIA GPUs, and can access a compatible NVIDIA accelerated networking platform.
- You have installed Red Hat OpenShift AI with the required distributed training components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- You have configured the distributed training resources as described in Managing distributed workloads.
Procedure
- Log in to the OpenShift Console as a cluster administrator.
Enable NVIDIA GPU support in OpenShift AI.
This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information, see Enabling NVIDIA GPUs.
Note: After the NVIDIA GPU Operator is installed, ensure that rdma is set to enabled in your ClusterPolicy custom resource instance.
To simplify the management of NVIDIA networking resources, install and configure the NVIDIA Network Operator, as follows:
- Install the NVIDIA Network Operator, as described in Adding Operators to a cluster in the OpenShift documentation.
- Configure the NVIDIA Network Operator, as described in the deployment examples in the Network Operator Application Notes in the NVIDIA documentation.
Optional: To use Single Root I/O Virtualization (SR-IOV) deployment modes, complete the following steps:
- Install the SR-IOV Network Operator, as described in the Installing the SR-IOV Network Operator section in the OpenShift documentation.
- Configure the SR-IOV Network Operator, as described in the Configuring the SR-IOV Network Operator section in the OpenShift documentation.
Use the Machine Configuration Operator to increase the limit of pinned memory for non-root users in the container engine (CRI-O) configuration, as follows:
- In the OpenShift Console, in the Administrator perspective, click Compute → MachineConfigs.
- Click Create MachineConfig.
Replace the placeholder text with the following content:
Example machine configuration
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 02-worker-container-runtime
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          inline: |
            [crio.runtime]
            default_ulimits = [
              "memlock=-1:-1"
            ]
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/10-custom
- Edit the default_ulimits entry to specify an appropriate value for your configuration. For more information about default limits, see the Set default ulimits on CRIO Using machine config Knowledgebase solution.
- Click Create.
- Restart the worker nodes to apply the machine configuration.
This configuration enables non-root users to run the training job with RDMA in the most restrictive OpenShift default security context.
Verification
Verify that the Operators are installed correctly, as follows:
- In the OpenShift Console, in the Administrator perspective, click Workloads → Pods.
- Select your project from the Project list.
- Verify that a pod is running for each of the newly installed Operators.
Verify that RDMA is being used, as follows:
Edit the PyTorchJob resource to set the NCCL_DEBUG environment variable to INFO, as shown in the following example:
Setting the NCCL debug level to INFO
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - "your container command"
    env:
    - name: NCCL_SOCKET_IFNAME
      value: "net1"
    - name: NCCL_IB_HCA
      value: "mlx5_1"
    - name: NCCL_DEBUG
      value: "INFO"
Copy to Clipboard Copied! - Run the PyTorch job.
Check that the pod logs include an entry similar to the following text:
Example pod log entry
NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]
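If you also want to confirm the rdma setting mentioned in the note earlier in this procedure, the following command is a hedged sketch; the ClusterPolicy resource name is a placeholder and the field path assumes the NVIDIA GPU Operator's ClusterPolicy schema.
Example command to check the rdma setting (illustrative)
oc get clusterpolicy <clusterpolicy-name> -o jsonpath='{.spec.driver.rdma.enabled}'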
8.7. Troubleshooting common problems with distributed workloads for administrators
If your users are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.
If the problem is not documented here or in the release notes, contact Red Hat Support.
8.7.1. A user’s Ray cluster is in a suspended state
Problem
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
Diagnosis
The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:
status:
conditions:
- lastTransitionTime: '2024-05-29T13:05:09Z'
message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
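To view this status from the command line, you can list and describe the Kueue Workload resources in the user's project; the following commands are a sketch that assumes the OpenShift CLI is installed.
Example commands to inspect the Workload resource
oc get workloads.kueue.x-k8s.io -n <project-name>
oc describe workloads.kueue.x-k8s.io <workload-name> -n <project-name>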
Resolution
Check whether the resource flavor is created, as follows:
- In the OpenShift console, select the user’s project from the Project list.
- Click Home → Search, and from the Resources list, select ResourceFlavor.
- If necessary, create the resource flavor.
- Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
- If necessary, increase the resource quota.
For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.
8.7.2. A user’s Ray cluster is in a failed state
Problem
The user might have insufficient resources.
Diagnosis
The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
Resolution
If the failed state persists, complete the following steps:
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to identify the cause of the problem.
- Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.
8.7.3. A user receives a "failed to call webhook" error message for the CodeFlare Operator
Problem
After the user runs the cluster.up() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
Diagnosis
The CodeFlare Operator pod might not be running.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
8.7.4. A user receives a "failed to call webhook" error message for Kueue
Problem
After the user runs the cluster.up() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
Diagnosis
The Kueue pod might not be running.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:
{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
8.7.5. A user’s Ray cluster does not start
Problem
After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.
Diagnosis
Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
- Review the logs for the KubeRay pod to identify errors.
8.7.6. A user receives a Default Local Queue … not found error message
Problem
After the user runs the cluster.up() command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
Diagnosis
No default local queue is defined, and a local queue is not specified in the cluster configuration.
Resolution
Check whether a local queue exists in the user’s project, as follows:
- In the OpenShift console, select the user’s project from the Project list.
- Click Home → Search, and from the Resources list, select LocalQueue.
- If no local queues are found, create a local queue.
- Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
8.7.7. A user receives a local_queue provided does not exist error message
Problem
After the user runs the cluster.up() command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
Diagnosis
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Search, and from the Resources list, select LocalQueue.
Resolve the problem in one of the following ways:
- If no local queues are found, create a local queue.
- If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
8.7.8. A user cannot create a Ray cluster or submit jobs
Problem
After the user runs the cluster.up() command, an error similar to the following text is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
Diagnosis
The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.
Resolution
Advise the user to identify and specify the correct OpenShift login credentials as follows:
- In the OpenShift console header, click your username and click Copy login command.
- In the new tab that opens, log in as the user whose credentials you want to use.
- Click Display Token.
- From the Log in with this token section, copy the token and server values.
- Specify the copied token and server values in your notebook code as follows:
from codeflare_sdk import TokenAuthentication

auth = TokenAuthentication(
    token = "<token>",
    server = "<server>",
    skip_tls=False
)
auth.login()
- Verify that the user has the correct permissions and is part of the rhoai-users group.
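To check or update group membership from the command line, the following commands are a sketch; they assume the default rhoai-users group name and that the OpenShift CLI is installed.
Example commands to check and update group membership
oc get group rhoai-users -o yaml
oc adm groups add-users rhoai-users <username>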
8.7.9. The user’s pod provisioned by Kueue is terminated before the user’s image is pulled
Problem
Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
Diagnosis
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to check whether the image pull completed successfully.
Resolution
If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:
- Add an OnFailure restart policy for resources that are managed by Kueue.
- In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
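For reference, the following fragment is a sketch of the waitForPodsReady setting in the Kueue configuration that the ConfigMap embeds; the exact ConfigMap key and surrounding fields can differ, so follow the linked Kueue documentation for the authoritative format.
Example waitForPodsReady fragment (illustrative)
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m   # illustrative custom timeout; the default waiting period is 5 minutes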