Chapter 3. Running distributed workloads

In OpenShift AI, you can run a distributed workload from a notebook or from a pipeline. You can also run distributed workloads in a disconnected environment if you have access to all of the required software.

3.1. Running distributed data science workloads from notebooks

To run a distributed data science workload from a notebook, you must first provide the link to your Ray cluster image.

Prerequisites

  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • Your cluster administrator has created the required Kueue resources as described in Configuring quota management for distributed workloads.
  • Optional: Your cluster administrator has defined a default local queue for the Ray cluster by creating a LocalQueue resource and adding the following annotation to its configuration details, as described in Configuring quota management for distributed workloads:

    "kueue.x-k8s.io/default-queue": "true"
    Note

    If your cluster administrator does not define a default local queue, you must specify a local queue in each notebook.

  • You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about how to create a project, see Creating a data science project.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have launched your notebook server and logged in to your notebook editor. The examples in this procedure refer to the JupyterLab integrated development environment (IDE).

Procedure

  1. Download the demo notebooks provided by the CodeFlare SDK. The demo notebooks provide guidelines for how to use the CodeFlare stack in your own notebooks.

    To access the demo notebooks, clone the codeflare-sdk repository as follows:

    1. In the JupyterLab interface, click Git > Clone a Repository.
    2. In the "Clone a repo" dialog, enter https://github.com/project-codeflare/codeflare-sdk.git and then click Clone. The codeflare-sdk repository is listed in the left navigation pane.
  2. Locate the downloaded demo notebooks as follows:

    1. In the JupyterLab interface, in the left navigation pane, double-click codeflare-sdk.
    2. Double-click demo-notebooks, and then double-click guided-demos.
  3. Update each example demo notebook as described in the following steps. A consolidated sketch of the updated cluster configuration is shown after these steps:

    1. If the generate_cert component is not already imported, update the import section to import it:

      Updated import section

      from codeflare_sdk import generate_cert

    2. Replace the default namespace value with the name of your data science project.
    3. In the TokenAuthentication section of your notebook code, provide the token and server details to authenticate to the OpenShift cluster by using the CodeFlare SDK.
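
      For example, the TokenAuthentication section typically looks similar to the following sketch. The token and server values are placeholders that you must replace with the details for your own OpenShift cluster:

      from codeflare_sdk import TokenAuthentication

      auth = TokenAuthentication(
          token="sha256~XXXXX",                       # replace with your OpenShift API token
          server="https://api.example-cluster:6443",  # replace with your OpenShift API server URL
          skip_tls=False,                             # set to True only if you must skip TLS verification
      )
      auth.login()
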
    4. Replace the link to the example community image with a link to your Ray cluster image.
    5. Ensure that the following Ray cluster authentication code is included after the Ray cluster creation section.

      Ray cluster authentication code

      generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
      generate_cert.export_env(cluster.config.name, cluster.config.namespace)

      Note

      Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in OpenShift AI. You must include the Ray cluster authentication code to enable the Ray client that runs within a notebook to connect to a secure Ray cluster that has mTLS enabled.

    6. If you have not configured a default local queue by including the kueue.x-k8s.io/default-queue: 'true' annotation as described in Configuring quota management for distributed workloads, update the ClusterConfiguration section to specify the local queue for the Ray cluster, as shown in the following example:

      Example local queue assignment

      local_queue="your_local_queue_name"

    7. Optional: In the ClusterConfiguration section, use the labels parameter to assign a dictionary of labels to the Ray cluster for identification and management purposes, as shown in the following example:

      Example labels assignment

      labels = {"exampleLabel1": "exampleLabel1Value", "exampleLabel2": "exampleLabel2Value"}
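
      Taken together, the updated cluster creation section of a demo notebook might look similar to the following sketch. The project name, image location, queue name, and label values are placeholders that you must replace with your own:

      from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert

      cluster = Cluster(
          ClusterConfiguration(
              name="raytest",
              namespace="my-data-science-project",             # your data science project
              num_workers=1,
              image="my-registry.example.com/ray:custom",      # your Ray cluster image
              local_queue="my-local-queue",                    # omit if a default local queue is defined
              labels={"exampleLabel1": "exampleLabel1Value"},  # optional identification labels
          )
      )

      cluster.up()
      cluster.wait_ready()

      # Required because mTLS is enabled by default in the CodeFlare component
      generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
      generate_cert.export_env(cluster.config.name, cluster.config.namespace)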

  4. Run the notebooks.

Verification

The notebooks run to completion without errors. In the notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
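
For example, a minimal check in a notebook cell, assuming the cluster object created in the demo notebook, looks like this:

    cluster.status()    # summary view; reports whether the Ray cluster is Active
    cluster.details()   # more detailed view of the Ray cluster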

3.2. Running distributed data science workloads from data science pipelines

To run a distributed data science workload from a data science pipeline, you must first update the pipeline to include a link to your Ray cluster image.

Prerequisites

  • You have logged in to OpenShift Container Platform with the cluster-admin role.
  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • Your cluster administrator has created the required Kueue resources as described in Configuring quota management for distributed workloads.
  • Optional: Your cluster administrator has defined a default local queue for the Ray cluster by creating a LocalQueue resource and adding the following annotation to the configuration details for that LocalQueue resource, as described in Configuring quota management for distributed workloads:

    "kueue.x-k8s.io/default-queue": "true"
    Note

    If your cluster administrator does not define a default local queue, you must specify a local queue in each pipeline.

  • You have access to S3-compatible object storage.
  • You have logged in to Red Hat OpenShift AI.
  • You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about how to create a project, see Creating a data science project.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.

Procedure

  1. Create a data connection to connect the object storage to your data science project, as described in Adding a data connection to your data science project.
  2. Configure a pipeline server to use the data connection, as described in Configuring a pipeline server.
  3. Create the data science pipeline as follows:

    1. Install the kfp Python package, which is required for all pipelines:

      $ pip install kfp
    2. Install any other dependencies that are required for your pipeline.
    3. Build your data science pipeline in Python code.

      For example, create a file named compile_example.py with the following content:

      from kfp import dsl

      @dsl.component(
          base_image="registry.redhat.io/ubi8/python-39:latest",
          packages_to_install=['codeflare-sdk']
      )
      def ray_fn():
          import ray  # (1)
          import time  # (2)
          from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert  # (3)

          cluster = Cluster(  # (4)
              ClusterConfiguration(
                  namespace="my_project",  # (5)
                  name="raytest",
                  num_workers=1,
                  head_cpus="500m",
                  min_memory=1,
                  max_memory=1,
                  num_gpus=0,
                  image="quay.io/project-codeflare/ray:latest-py39-cu118",  # (6)
                  local_queue="local_queue_name",  # (7)
              )
          )

          print(cluster.status())
          cluster.up()  # (8)
          # cluster.wait_ready()
          time.sleep(180)  # (9)
          print(cluster.status())
          print(cluster.details())

          ray_dashboard_uri = cluster.cluster_dashboard_uri()
          ray_cluster_uri = cluster.cluster_uri()
          print(ray_dashboard_uri, ray_cluster_uri)

          # Enable the Ray client to connect to a secure Ray cluster that has mTLS enabled
          generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)  # (10)
          generate_cert.export_env(cluster.config.name, cluster.config.namespace)

          ray.init(address=ray_cluster_uri)
          print("Ray cluster is up and running: ", ray.is_initialized())

          @ray.remote
          def train_fn():  # (11)
              # complex training function
              return 100

          result = ray.get(train_fn.remote())
          assert result == 100
          ray.shutdown()
          cluster.down()  # (12)
          return result

      @dsl.pipeline(  # (13)
          name="Ray Simple Example",
          description="Ray Simple Example",
      )
      def ray_integration():
          ray_fn()

      if __name__ == '__main__':  # (14)
          from kfp.compiler import Compiler
          Compiler().compile(ray_integration, 'compiled-example.yaml')
      (1) Imports Ray.
      (2) Imports the time package so that you can use the sleep function to wait during code execution, as a workaround for RHOAIENG-7346.
      (3) Imports packages from the CodeFlare SDK to define the cluster functions.
      (4) Specifies the Ray cluster configuration: replace these example values with the values for your Ray cluster.
      (5) Optional: Specifies the project where the Ray cluster is created. Replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.
      (6) Specifies the location of the Ray cluster image. If you are running this code in a disconnected environment, replace the default value with the location for your environment.
      (7) Specifies the local queue to which the Ray cluster will be submitted. If a default local queue is configured, you can omit this line.
      (8) Creates a Ray cluster by using the specified image and configuration.
      (9) Waits until the Ray cluster is ready before proceeding. As a workaround for RHOAIENG-7346, use time.sleep(180) instead of cluster.wait_ready().
      (10) Enables the Ray client to connect to a secure Ray cluster that has mutual Transport Layer Security (mTLS) enabled. mTLS is enabled by default in the CodeFlare component in OpenShift AI.
      (11) Replace the example details in this section with the details for your workload.
      (12) Removes the Ray cluster when your workload is finished.
      (13) Replace the example name and description with the values for your workload.
      (14) Compiles the Python code and saves the output in a YAML file.
    4. Compile the Python file (in this example, the compile_example.py file):

      $ python compile_example.py

      This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.

  4. Import your data science pipeline, as described in Importing a data science pipeline.
  5. Schedule the pipeline run, as described in Scheduling a pipeline run.
  6. When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing the details of a pipeline run.

Verification

The YAML file is created and the pipeline run completes without errors.

You can view the run details, as described in Viewing the details of a pipeline run.

3.3. Running distributed data science workloads in a disconnected environment

To run a distributed data science workload in a disconnected environment, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.

Prerequisites

  • You have logged in to OpenShift Container Platform with the cluster-admin role.
  • You have access to the disconnected data science cluster.
  • You have installed Red Hat OpenShift AI and created a mirror image as described in Installing and uninstalling OpenShift AI Self-Managed in a disconnected environment.
  • You can access the following software from the disconnected cluster:

    • A Ray cluster image
    • An image that includes the openssl package, for the creation of TLS certificates when creating Ray clusters
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server that is available from the disconnected cluster
  • You have logged in to Red Hat OpenShift AI.
  • You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about how to create a project, see Creating a data science project.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.

Procedure

  1. Configure the disconnected data science cluster to run distributed workloads as described in Configuring distributed workloads.
  2. In the ClusterConfiguration section of the notebook or pipeline, ensure that the image value specifies a Ray cluster image that you can access from the disconnected environment:

    • Notebooks use the Ray cluster image to create a Ray cluster when running the notebook.
    • Pipelines use the Ray cluster image to create a Ray cluster during the pipeline run.
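
    For example, in the notebook or pipeline code, the image value might reference a mirrored copy of the Ray cluster image. The registry host and image path in the following sketch are hypothetical placeholders:

    from codeflare_sdk import Cluster, ClusterConfiguration

    cluster = Cluster(
        ClusterConfiguration(
            name="raytest",
            num_workers=1,
            # Hypothetical mirrored Ray cluster image, reachable from the disconnected cluster
            image="my-mirror-registry.example.com:5000/project-codeflare/ray:latest-py39-cu118",
        )
    )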
  3. In the CodeFlare Operator config map, ensure that the kuberay:certGeneratorImage value specifies an image that contains the openssl package, and that you can access the image from the disconnected environment. The following example shows the default value provided by OpenShift AI:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: codeflare-operator-config
      namespace: redhat-ods-applications
    data:
      config.yaml: |
        kuberay:
          certGeneratorImage: "registry.redhat.io/ubi9@sha256:770cf07083e1c85ae69c25181a205b7cdef63c11b794c89b3b487d4670b4c328"
  4. If any of the Python packages required by the workload are not available in the Ray cluster, configure the Ray cluster to download the Python packages from a private PyPI server.

    For example, set the PIP_INDEX_URL and PIP_TRUSTED_HOST environment variables for the Ray cluster to specify the location of the Python dependencies. A sketch of setting these variables through the CodeFlare SDK follows this list.

    PIP_INDEX_URL: https://pypi-notebook.apps.mylocation.com/simple
    PIP_TRUSTED_HOST: pypi-notebook.apps.mylocation.com

    where

    • PIP_INDEX_URL specifies the base URL of your private PyPI server (the default value is https://pypi.org).
    • PIP_TRUSTED_HOST configures Python to mark the specified host as trusted, regardless of whether that host has a valid SSL certificate or is using a secure channel.
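
    One way to set these environment variables is in the cluster configuration. The following sketch assumes that your version of the CodeFlare SDK exposes an envs parameter on ClusterConfiguration; the image location is also a hypothetical placeholder:

    from codeflare_sdk import Cluster, ClusterConfiguration

    cluster = Cluster(
        ClusterConfiguration(
            name="raytest",
            num_workers=1,
            image="my-mirror-registry.example.com:5000/project-codeflare/ray:latest-py39-cu118",  # hypothetical mirrored image
            # Assumption: this SDK version supports passing environment variables to the Ray cluster
            envs={
                "PIP_INDEX_URL": "https://pypi-notebook.apps.mylocation.com/simple",
                "PIP_TRUSTED_HOST": "pypi-notebook.apps.mylocation.com",
            },
        )
    )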
  5. Run the distributed data science workload, as described in Running distributed data science workloads from notebooks or Running distributed data science workloads from data science pipelines.

Verification

The notebook or pipeline run completes without errors:

  • For notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
  • For pipeline runs, you can view the run details as described in Viewing the details of a pipeline run.