Chapter 6. Working with distributed workloads

To train complex machine-learning models or process data more quickly, data scientists can use the distributed workloads feature to run their jobs on multiple OpenShift worker nodes in parallel. This approach significantly reduces the task completion time, and enables the use of larger datasets and more complex models.

Important

The distributed workloads feature is currently available in Red Hat OpenShift AI 2.8 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

6.1. Overview of distributed workloads

You can use the distributed workloads feature to queue, scale, and manage the resources required to run data science workloads across multiple nodes in an OpenShift cluster simultaneously. Data science workloads typically include several types of artificial intelligence (AI) workloads, such as machine learning (ML) and Python workloads.

Distributed workloads provide the following benefits:

  • You can iterate faster and experiment more frequently because of the reduced processing time.
  • You can use larger datasets, which can lead to more accurate models.
  • You can use complex models that could not be trained on a single node.

The distributed workloads infrastructure includes the following components:

  • CodeFlare Operator: Manages the queuing of batch jobs.
  • CodeFlare SDK: Defines and controls the remote distributed compute jobs and infrastructure for any Python-based environment.
  • KubeRay: Manages remote Ray clusters on OpenShift for running distributed compute workloads.

You can run distributed data science workloads from data science pipelines or from notebooks.

6.2. Configuring distributed workloads

To configure the distributed workloads feature for your data scientists to use in OpenShift AI, you must enable several components.

Prerequisites

  • You have logged in to OpenShift Container Platform with the cluster-admin role.
  • You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.1 vCPU and 1.6 GB memory to deploy the distributed workloads infrastructure.
  • You have access to a Ray cluster image. For information about how to create a Ray cluster, see the Ray Clusters documentation.
  • You have removed any previously installed instances of the CodeFlare Operator, as described in the Knowledgebase solution How to migrate from a separately installed CodeFlare Operator in your data science cluster.
  • If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. See Enabling GPU support in OpenShift AI.
  • If you want to use self-signed certificates, you have added them to a central Certificate Authority (CA) bundle as described in Working with certificates (for disconnected environments, see Working with certificates). No additional configuration is necessary to use those certificates with distributed workloads; a short verification sketch follows this list. The centrally configured self-signed certificates are automatically available in the workload pods at the following mount points:

    • Cluster-wide CA bundle:

      /etc/pki/tls/certs/odh-trusted-ca-bundle.crt
      /etc/ssl/certs/odh-trusted-ca-bundle.crt
    • Custom CA bundle:

      /etc/pki/tls/certs/odh-ca-bundle.crt
      /etc/ssl/certs/odh-ca-bundle.crt
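
If you want to confirm from inside a running workload pod that these bundles are present, a minimal check is shown below. This is a sketch that assumes a Python environment in the pod; the paths are the mount points listed above.

  import os

  # Mount points for the centrally configured CA bundles in workload pods
  bundle_paths = [
      "/etc/pki/tls/certs/odh-trusted-ca-bundle.crt",
      "/etc/ssl/certs/odh-trusted-ca-bundle.crt",
  ]

  for path in bundle_paths:
      print(path, "exists:", os.path.exists(path))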

Procedure

  1. In the OpenShift Container Platform console, click Operators > Installed Operators.
  2. Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
  3. Click the Data Science Cluster tab.
  4. Click the default instance name to open the instance details page.

    Note

    Starting from Red Hat OpenShift AI 2.4, the default instance name for new installations is default-dsc. The default instance name for earlier installations, rhods, is preserved during upgrade.

  5. Click the YAML tab to show the instance specifications.
  6. In the spec.components section, ensure that the managementState field is set correctly for each required component, depending on whether the distributed workload is run from a pipeline, a notebook, or both, as shown in the following table. A sample spec.components excerpt follows this procedure.

    Table 6.1. Components required for distributed workloads

    Component               Pipelines only    Notebooks only    Pipelines and notebooks
    codeflare               Managed           Managed           Managed
    dashboard               Managed           Managed           Managed
    datasciencepipelines    Managed           Removed           Managed
    ray                     Managed           Managed           Managed
    workbenches             Removed           Managed           Managed

  7. Click Save. After a short time, the components with a Managed state are ready.
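
For reference, when distributed workloads are run from both pipelines and notebooks, the spec.components section of the DataScienceCluster instance might look similar to the following excerpt. This is an illustrative sketch that shows only the components listed in the table; leave any other components in your instance unchanged.

  spec:
    components:
      codeflare:
        managementState: Managed
      dashboard:
        managementState: Managed
      datasciencepipelines:
        managementState: Managed
      ray:
        managementState: Managed
      workbenches:
        managementState: Managed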

Verification

Check the status of the codeflare-operator-manager pod, as follows:

  1. In the OpenShift Container Platform console, from the Project list, select redhat-ods-applications.
  2. Click Workloads > Deployments.
  3. Search for the codeflare-operator-manager deployment, and click the deployment name to open the deployment details page.
  4. Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and click the Logs tab.
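
Alternatively, if the OpenShift CLI (oc) is installed, you can perform an equivalent check from a terminal, for example:

  $ oc get pods -n redhat-ods-applications | grep codeflare-operator-manager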

6.3. Running distributed data science workloads from notebooks

To run a distributed data science workload from a notebook, you must first provide the link to your Ray cluster image.

Prerequisites

  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You have created a data science project that contains a workbench that is running one of the default notebook images, for example, the Standard Data Science notebook. See the table in Notebook images for data scientists for a complete list of default notebook images.
  • You have launched your notebook server and logged in to Jupyter.

Procedure

  1. To access the demo notebooks, clone the codeflare-sdk repository as follows:

    1. In the JupyterLab interface, click Git > Clone a Repository.
    2. In the "Clone a repo" dialog, enter https://github.com/project-codeflare/codeflare-sdk.git and then click Clone. The codeflare-sdk repository is listed in the left navigation pane.
  2. Run a distributed workload job as shown in the following example:

    1. In the JupyterLab interface, in the left navigation pane, double-click codeflare-sdk.
    2. Double-click demo-notebooks, and then double-click guided-demos.
    3. Update each example demo notebook as follows:

      • Replace the links to the example community image with a link to your Ray cluster image.
      • Set instascale to False, because InstaScale is not deployed in the Technology Preview version of the distributed workloads feature. (A sketch of both edits follows this procedure.)
    4. Run the notebooks.
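
For reference, both edits typically land in the ClusterConfiguration object that each demo notebook creates. The following minimal sketch shows only the relevant fields; the name, namespace, and image values are placeholders for your own environment, and the remaining parameters follow the pipeline example later in this chapter.

  from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

  cluster = Cluster(
      ClusterConfiguration(
          name="demo-cluster",          # placeholder name
          namespace="my-project",       # namespace must already exist
          num_workers=1,
          # Replace the example community image with your own Ray cluster image
          image="my-registry.example.com/my-org/ray:latest-py39-cu118",
          # InstaScale is not deployed in the Technology Preview version
          instascale=False,
      )
  )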

Verification

The notebooks run to completion without errors. In the notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.

6.4. Running distributed data science workloads from data science pipelines

To run a distributed data science workload from a data science pipeline, you must first update the pipeline to include a link to your Ray cluster image.

Prerequisites

  • You have logged in to OpenShift Container Platform with the cluster-admin role.
  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You have installed the Red Hat OpenShift Pipelines Operator, as described in Installing OpenShift Pipelines.
  • You have access to S3-compatible object storage.
  • You have logged in to Red Hat OpenShift AI.
  • You have created a data science project.

Procedure

  1. Create a data connection to connect the object storage to your data science project, as described in Adding a data connection to your data science project.
  2. Configure a pipeline server to use the data connection, as described in Configuring a pipeline server.
  3. Create the data science pipeline as follows:

    1. Install the kfp-tekton Python package, which is required for all pipelines:

      $ pip install kfp-tekton
    2. Install any other dependencies that are required for your pipeline.
    3. Build your data science pipeline in Python code. For example, create a file named compile_example.py with the following content:

       from kfp import components, dsl


       def ray_fn(openshift_server: str, openshift_token: str) -> int:  # (1)
           import ray
           from codeflare_sdk.cluster.auth import TokenAuthentication
           from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

           auth = TokenAuthentication(  # (2)
               token=openshift_token, server=openshift_server, skip_tls=True
           )
           auth_return = auth.login()
           cluster = Cluster(  # (3)
               ClusterConfiguration(
                   name="raytest",
                   # namespace must exist
                   namespace="pipeline-example",
                   num_workers=1,
                   head_cpus="500m",
                   min_memory=1,
                   max_memory=1,
                   num_gpus=0,
                   image="quay.io/project-codeflare/ray:latest-py39-cu118",  # (4)
                   instascale=False,  # (5)
               )
           )

           print(cluster.status())
           cluster.up()  # (6)
           cluster.wait_ready()  # (7)
           print(cluster.status())
           print(cluster.details())

           ray_dashboard_uri = cluster.cluster_dashboard_uri()
           ray_cluster_uri = cluster.cluster_uri()
           print(ray_dashboard_uri, ray_cluster_uri)

           # Before proceeding, ensure that the cluster exists and that its URI contains a value
           assert ray_cluster_uri, "Ray cluster must be started and set before proceeding"

           ray.init(address=ray_cluster_uri)
           print("Ray cluster is up and running: ", ray.is_initialized())

           @ray.remote
           def train_fn():  # (8)
               # complex training function
               return 100

           result = ray.get(train_fn.remote())
           assert 100 == result
           ray.shutdown()
           cluster.down()  # (9)
           auth.logout()
           return result


       @dsl.pipeline(  # (10)
           name="Ray Simple Example",
           description="Ray Simple Example",
       )
       def ray_integration(openshift_server, openshift_token):
           ray_op = components.create_component_from_func(
               ray_fn,
               base_image='registry.redhat.io/ubi8/python-39:latest',
               packages_to_install=["codeflare-sdk"],
           )
           ray_op(openshift_server, openshift_token)


       if __name__ == '__main__':  # (11)
           from kfp_tekton.compiler import TektonCompiler
           TektonCompiler().compile(ray_integration, 'compiled-example.yaml')

       (1) Imports from the CodeFlare SDK the packages that define the cluster functions.
       (2) Authenticates with the cluster by using values that you specify when creating the pipeline run.
       (3) Specifies the Ray cluster resources: replace these example values with the values for your Ray cluster.
       (4) Specifies the location of the Ray cluster image: if using a disconnected environment, replace the default value with the location for your environment.
       (5) InstaScale is not supported in this release.
       (6) Creates a Ray cluster using the specified image and configuration.
       (7) Waits for the Ray cluster to be ready before proceeding.
       (8) Replace the example details in this section with the details for your workload.
       (9) Removes the Ray cluster when your workload is finished.
       (10) Replace the example name and description with the values for your workload.
       (11) Compiles the Python code and saves the output in a YAML file.
    4. Compile the Python file (in this example, the compile_example.py file):

      $ python compile_example.py

      This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.

  4. Import your data science pipeline, as described in Importing a data science pipeline.
  5. Schedule the pipeline run, as described in Scheduling a pipeline run. When you create the run, provide values for the openshift_server and openshift_token pipeline parameters (see the note after this procedure).
  6. When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing triggered pipeline runs.
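
The example pipeline authenticates with the values that you supply for the openshift_server and openshift_token parameters when you create the run (see callout 2 in the example code). One way to obtain values for your own user session is with the OpenShift CLI, for example:

  $ oc whoami --show-server
  $ oc whoami --show-token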

Verification

The YAML file is created and the pipeline run completes without errors. You can view the run details, as described in Viewing the details of a pipeline run.

6.5. Running distributed data science workloads in a disconnected environment

To run a distributed data science workload in a disconnected environment, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.

Prerequisites

  • You have logged in to OpenShift Container Platform with the cluster-admin role.
  • You have access to the disconnected data science cluster.
  • You have installed Red Hat OpenShift AI and created a mirror image as described in Installing and uninstalling OpenShift AI Self-Managed in a disconnected environment.
  • You can access the following software from the disconnected cluster:

    • A Ray cluster image
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server that is available from the disconnected cluster
  • You have logged in to Red Hat OpenShift AI.
  • You have created a data science project.

Procedure

  1. Configure the disconnected data science cluster to run distributed workloads as described in Configuring distributed workloads.
  2. In the ClusterConfiguration section of the notebook or pipeline, ensure that the image value specifies a Ray cluster image that can be accessed from the disconnected environment:

    • Notebooks use the Ray cluster image to create a Ray cluster when running the notebook.
    • Pipelines use the Ray cluster image to create a Ray cluster during the pipeline run.
  3. If any of the Python packages required by the workload are not available in the Ray cluster, configure the Ray cluster to download the Python packages from a private PyPI server.

    For example, set the PIP_INDEX_URL and PIP_TRUSTED_HOST environment variables for the Ray cluster to specify the location of the Python dependencies, as shown in the following example. A Python sketch that passes these values to the Ray cluster configuration follows this procedure.

    PIP_INDEX_URL: https://pypi-notebook.apps.mylocation.com/simple
    PIP_TRUSTED_HOST: pypi-notebook.apps.mylocation.com

    where

    • PIP_INDEX_URL specifies the base URL of your private PyPI server (the default value is https://pypi.org).
    • PIP_TRUSTED_HOST configures Python to mark the specified host as trusted, regardless of whether that host has a valid SSL certificate or is using a secure channel.
  4. Run the distributed data science workload, as described in Running distributed data science workloads from notebooks or Running distributed data science workloads from data science pipelines.
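
The following Python sketch shows one way these settings can come together in the ClusterConfiguration that your notebook or pipeline creates. The registry and PyPI host names are placeholders, and the envs parameter is an assumption about your codeflare-sdk version; if your version does not accept it, set the variables in your Ray cluster image instead.

  from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

  cluster = Cluster(
      ClusterConfiguration(
          name="raytest",
          namespace="pipeline-example",
          num_workers=1,
          # Ray cluster image mirrored to a registry that the disconnected cluster can reach
          image="my-registry.mylocation.com/project-codeflare/ray:latest-py39-cu118",
          instascale=False,
          # Assumption: this codeflare-sdk version accepts an envs dictionary;
          # otherwise, set these variables in your Ray cluster image
          envs={
              "PIP_INDEX_URL": "https://pypi-notebook.apps.mylocation.com/simple",
              "PIP_TRUSTED_HOST": "pypi-notebook.apps.mylocation.com",
          },
      )
  )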

Verification

The notebook or pipeline run completes without errors:

  • For notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
  • For pipeline runs, you can view the run details as described in Viewing the details of a pipeline run.