Chapter 6. Working with distributed workloads
To train complex machine-learning models or process data more quickly, data scientists can use the distributed workloads feature to run their jobs on multiple OpenShift worker nodes in parallel. This approach significantly reduces task completion time and enables the use of larger datasets and more complex models.
The distributed workloads feature is currently available in Red Hat OpenShift AI 2.8 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
6.1. Overview of distributed workloads
You can use the distributed workloads feature to queue, scale, and manage the resources required to run data science workloads across multiple nodes in an OpenShift cluster simultaneously. Data science workloads typically include several types of artificial intelligence (AI) workloads, such as machine learning (ML) and Python workloads.
Distributed workloads provide the following benefits:
- You can iterate faster and experiment more frequently because of the reduced processing time.
- You can use larger datasets, which can lead to more accurate models.
- You can use complex models that could not be trained on a single node.
The distributed workloads infrastructure includes the following components:
- CodeFlare Operator: Manages the queuing of batch jobs
- CodeFlare SDK: Defines and controls the remote distributed compute jobs and infrastructure for any Python-based environment
- KubeRay: Manages remote Ray clusters on OpenShift for running distributed compute workloads
You can run distributed data science workloads from data science pipelines or from notebooks.
6.2. Configuring distributed workloads
To configure the distributed workloads feature for your data scientists to use in OpenShift AI, you must enable several components.
Prerequisites
- You have logged in to OpenShift Container Platform with the cluster-admin role.
- You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.1 vCPU and 1.6 GB memory to deploy the distributed workloads infrastructure.
- You have access to a Ray cluster image. For information about how to create a Ray cluster, see the Ray Clusters documentation.
- You have removed any previously installed instances of the CodeFlare Operator, as described in the Knowledgebase solution How to migrate from a separately installed CodeFlare Operator in your data science cluster.
- If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. See Enabling GPU support in OpenShift AI.
- If you want to use self-signed certificates, you have added them to a central Certificate Authority (CA) bundle as described in Working with certificates (for disconnected environments, see Working with certificates). No additional configuration is necessary to use those certificates with distributed workloads. The centrally configured self-signed certificates are automatically available in the workload pods at the following mount points:
  - Cluster-wide CA bundle:
    /etc/pki/tls/certs/odh-trusted-ca-bundle.crt
    /etc/ssl/certs/odh-trusted-ca-bundle.crt
  - Custom CA bundle:
    /etc/pki/tls/certs/odh-ca-bundle.crt
    /etc/ssl/certs/odh-ca-bundle.crt
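If your workload code needs to verify TLS connections explicitly, it can reference the mounted cluster-wide bundle directly. A minimal sketch in Python (the service URL is a placeholder for one of your internal services):

  import requests

  # Verify TLS against the centrally mounted cluster-wide CA bundle.
  # The URL below is a placeholder for your own internal service.
  response = requests.get(
      "https://my-internal-service.example.com/data",
      verify="/etc/pki/tls/certs/odh-trusted-ca-bundle.crt",
  )
  print(response.status_code)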
Procedure
- In the OpenShift Container Platform console, click Operators → Installed Operators.
- Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
- Click the Data Science Cluster tab. Click the default instance name to open the instance details page.
  Note: Starting from Red Hat OpenShift AI 2.4, the default instance name for new installations is default-dsc. The default instance name for earlier installations, rhods, is preserved during upgrade.
- Click the YAML tab to show the instance specifications.
- In the spec.components section, ensure that the managementState field is set correctly for the required components, depending on whether the distributed workload is run from a pipeline, a notebook, or both, as shown in the following table (an example snippet follows this procedure).

  Table 6.1. Components required for distributed workloads

  Component               Pipelines only    Notebooks only    Pipelines and notebooks
  codeflare               Managed           Managed           Managed
  dashboard               Managed           Managed           Managed
  datasciencepipelines    Managed           Removed           Managed
  ray                     Managed           Managed           Managed
  workbenches             Removed           Managed           Managed
- Click Save. After a short time, the components with a Managed state are ready.
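For example, to support running distributed workloads from both pipelines and notebooks, the relevant entries in the DataScienceCluster YAML might look like the following sketch (other components and fields are omitted):

  spec:
    components:
      codeflare:
        managementState: Managed
      dashboard:
        managementState: Managed
      datasciencepipelines:
        managementState: Managed
      ray:
        managementState: Managed
      workbenches:
        managementState: Managed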
Verification
Check the status of the codeflare-operator-manager pod, as follows:
- In the OpenShift Container Platform console, from the Project list, select redhat-ods-applications.
- Click Workloads → Deployments.
- Search for the codeflare-operator-manager deployment, and click the deployment name to open the deployment details page.
- Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and click the Logs tab.
6.3. Running distributed data science workloads from notebooks
To run a distributed data science workload from a notebook, you must first provide the link to your Ray cluster image.
Prerequisites
- You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
- You have created a data science project that contains a workbench that is running one of the default notebook images, for example, the Standard Data Science notebook. See the table in Notebook images for data scientists for a complete list of default notebook images.
- You have launched your notebook server and logged in to Jupyter.
Procedure
To access the demo notebooks, clone the codeflare-sdk repository as follows:
- In the JupyterLab interface, click Git > Clone a Repository.
- In the "Clone a repo" dialog, enter https://github.com/project-codeflare/codeflare-sdk.git and then click Clone. The codeflare-sdk repository is listed in the left navigation pane.
Run a distributed workload job as shown in the following example:
- In the JupyterLab interface, in the left navigation pane, double-click codeflare-sdk.
- Double-click demo-notebooks, and then double-click guided-demos.
Update each example demo notebook as follows:
- Replace the links to the example community image with a link to your Ray cluster image.
- Set instascale to False. InstaScale is not deployed in the Technology Preview version of the distributed workloads feature. (A sketch of the resulting configuration follows this procedure.)
- Run the notebooks.
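For example, after these updates, the notebook cell that creates the Ray cluster might look like the following sketch; the token, server URL, namespace, and image values are placeholders for your own environment:

  from codeflare_sdk.cluster.auth import TokenAuthentication
  from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

  # Authenticate to the OpenShift cluster (placeholder values).
  auth = TokenAuthentication(
      token="<openshift-token>",
      server="<openshift-api-server-url>",
      skip_tls=False,
  )
  auth.login()

  # Define the Ray cluster, pointing at your own Ray cluster image
  # and disabling InstaScale for the Technology Preview feature.
  cluster = Cluster(
      ClusterConfiguration(
          name="notebook-ray-test",
          namespace="my-data-science-project",  # must be an existing project
          num_workers=1,
          min_memory=1,
          max_memory=1,
          num_gpus=0,
          image="<your-ray-cluster-image>",
          instascale=False,
      )
  )

  cluster.up()
  cluster.wait_ready()
  print(cluster.status())  # the Ray cluster should be reported as Active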
Verification
The notebooks run to completion without errors. In the notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
6.4. Running distributed data science workloads from data science pipelines
To run a distributed data science workload from a data science pipeline, you must first update the pipeline to include a link to your Ray cluster image.
Prerequisites
- You have logged in to OpenShift Container Platform with the cluster-admin role.
- You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
- You have installed the Red Hat OpenShift Pipelines Operator, as described in Installing OpenShift Pipelines.
- You have access to S3-compatible object storage.
- You have logged in to Red Hat OpenShift AI.
- You have created a data science project.
Procedure
- Create a data connection to connect the object storage to your data science project, as described in Adding a data connection to your data science project.
- Configure a pipeline server to use the data connection, as described in Configuring a pipeline server.
Create the data science pipeline as follows:
- Install the kfp-tekton Python package, which is required for all pipelines:

  $ pip install kfp-tekton
- Install any other dependencies that are required for your pipeline.
- Build your data science pipeline in Python code. For example, create a file named compile_example.py with the following content:

  from kfp import components, dsl

  def ray_fn(openshift_server: str, openshift_token: str) -> int:  # (1)
      import ray
      from codeflare_sdk.cluster.auth import TokenAuthentication
      from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

      auth = TokenAuthentication(  # (2)
          token=openshift_token,
          server=openshift_server,
          skip_tls=True
      )
      auth_return = auth.login()

      cluster = Cluster(  # (3)
          ClusterConfiguration(
              name="raytest",
              # namespace must exist
              namespace="pipeline-example",
              num_workers=1,
              head_cpus="500m",
              min_memory=1,
              max_memory=1,
              num_gpus=0,
              image="quay.io/project-codeflare/ray:latest-py39-cu118",  # (4)
              instascale=False,  # (5)
          )
      )

      print(cluster.status())
      cluster.up()  # (6)
      cluster.wait_ready()  # (7)
      print(cluster.status())
      print(cluster.details())

      ray_dashboard_uri = cluster.cluster_dashboard_uri()
      ray_cluster_uri = cluster.cluster_uri()
      print(ray_dashboard_uri, ray_cluster_uri)

      # Before proceeding, ensure that the cluster exists and that its URI contains a value
      assert ray_cluster_uri, "Ray cluster must be started and set before proceeding"

      ray.init(address=ray_cluster_uri)
      print("Ray cluster is up and running: ", ray.is_initialized())

      @ray.remote
      def train_fn():  # (8)
          # complex training function
          return 100

      result = ray.get(train_fn.remote())
      assert 100 == result
      ray.shutdown()
      cluster.down()  # (9)
      auth.logout()
      return result

  @dsl.pipeline(  # (10)
      name="Ray Simple Example",
      description="Ray Simple Example",
  )
  def ray_integration(openshift_server, openshift_token):
      ray_op = components.create_component_from_func(
          ray_fn,
          base_image='registry.redhat.io/ubi8/python-39:latest',
          packages_to_install=["codeflare-sdk"],
      )
      ray_op(openshift_server, openshift_token)

  if __name__ == '__main__':  # (11)
      from kfp_tekton.compiler import TektonCompiler
      TektonCompiler().compile(ray_integration, 'compiled-example.yaml')
  (1) Imports from the CodeFlare SDK the packages that define the cluster functions.
  (2) Authenticates with the cluster by using values that you specify when creating the pipeline run.
  (3) Specifies the Ray cluster resources: replace these example values with the values for your Ray cluster.
  (4) Specifies the location of the Ray cluster image: if using a disconnected environment, replace the default value with the location for your environment.
  (5) InstaScale is not supported in this release.
  (6) Creates a Ray cluster by using the specified image and configuration.
  (7) Waits for the Ray cluster to be ready before proceeding.
  (8) Replace the example details in this section with the details for your workload.
  (9) Removes the Ray cluster when your workload is finished.
  (10) Replace the example name and description with the values for your workload.
  (11) Compiles the Python code and saves the output in a YAML file.
- Compile the Python file (in this example, the compile_example.py file):

  $ python compile_example.py

  This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.
- Import your data science pipeline, as described in Importing a data science pipeline.
- Schedule the pipeline run, as described in Scheduling a pipeline run.
- When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing triggered pipeline runs.
Verification
The YAML file is created and the pipeline run completes without errors. You can view the run details, as described in Viewing the details of a pipeline run.
6.5. Running distributed data science workloads in a disconnected environment
To run a distributed data science workload in a disconnected environment, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.
Prerequisites
- You have logged in to OpenShift Container Platform with the cluster-admin role.
- You have access to the disconnected data science cluster.
- You have installed Red Hat OpenShift AI and created a mirror image as described in Installing and uninstalling OpenShift AI Self-Managed in a disconnected environment.
- You can access the following software from the disconnected cluster:
- A Ray cluster image
- The data sets and models to be used by the workload
- The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server that is available from the disconnected cluster
- You have logged in to Red Hat OpenShift AI.
- You have created a data science project.
Procedure
- Configure the disconnected data science cluster to run distributed workloads as described in Configuring distributed workloads.
- In the ClusterConfiguration section of the notebook or pipeline, ensure that the image value specifies a Ray cluster image that can be accessed from the disconnected environment:
  - Notebooks use the Ray cluster image to create a Ray cluster when running the notebook.
  - Pipelines use the Ray cluster image to create a Ray cluster during the pipeline run.
- If any of the Python packages required by the workload are not available in the Ray cluster, configure the Ray cluster to download the Python packages from a private PyPI server.
  For example, set the PIP_INDEX_URL and PIP_TRUSTED_HOST environment variables for the Ray cluster to specify the location of the Python dependencies, as shown in the following example:

  PIP_INDEX_URL: https://pypi-notebook.apps.mylocation.com/simple
  PIP_TRUSTED_HOST: pypi-notebook.apps.mylocation.com
  where:
  - PIP_INDEX_URL specifies the base URL of your private PyPI server (the default value is https://pypi.org).
  - PIP_TRUSTED_HOST configures Python to mark the specified host as trusted, regardless of whether that host has a valid SSL certificate or is using a secure channel.
  A sketch of how to pass these variables through the CodeFlare SDK follows this procedure.
- Run the distributed data science workload, as described in Running distributed data science workloads from notebooks or Running distributed data science workloads from data science pipelines.
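The following sketch shows one way to pass the PIP_INDEX_URL and PIP_TRUSTED_HOST variables when defining the Ray cluster in a notebook or pipeline. It assumes that your version of the CodeFlare SDK accepts an envs dictionary in ClusterConfiguration (check the SDK reference for your release); the name, namespace, image, and URLs are placeholders:

  from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

  # Hypothetical sketch: pass the private PyPI settings to the Ray cluster
  # through the ClusterConfiguration envs dictionary (availability of this
  # parameter depends on your codeflare-sdk version).
  cluster = Cluster(
      ClusterConfiguration(
          name="disconnected-ray-test",
          namespace="my-data-science-project",        # must be an existing project
          num_workers=1,
          image="<your-mirrored-ray-cluster-image>",  # image reachable from the disconnected cluster
          instascale=False,
          envs={
              "PIP_INDEX_URL": "https://pypi-notebook.apps.mylocation.com/simple",
              "PIP_TRUSTED_HOST": "pypi-notebook.apps.mylocation.com",
          },
      )
  )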
Verification
The notebook or pipeline run completes without errors:
- For notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
- For pipeline runs, you can view the run details as described in Viewing the details of a pipeline run.