Chapter 3. Running distributed workloads

In OpenShift AI, you can run a distributed workload from a notebook or from a pipeline.

You can run distributed workloads in a disconnected environment if you can access all of the required software from that environment. For example, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.

3.1. Downloading the demo notebooks from the CodeFlare SDK

If you want to run distributed workloads from notebooks, the demo notebooks from the CodeFlare SDK provide guidelines on how to use the CodeFlare stack in your own notebooks.

If you do not want to run distributed workloads from notebooks, you can skip this section.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have logged in to Red Hat OpenShift AI.
  • You have launched your notebook server and logged in to your notebook editor. The examples in this procedure refer to the JupyterLab integrated development environment (IDE).

Procedure

  1. In the JupyterLab interface, click File > New > Notebook, and then click Select.

    A new notebook is created in a .ipynb file.

  2. Add the following code to a cell in the new notebook:

    Code to download the demo notebooks

    from codeflare_sdk import copy_demo_nbs
    copy_demo_nbs()

  3. Select the cell, and click Run > Run selected cell.

    After a few seconds, the copy_demo_nbs() function copies the demo notebooks that are packaged with the currently installed version of the CodeFlare SDK into the demo-notebooks folder.

  4. In the left navigation pane, right-click the new notebook and click Delete.
  5. Click Delete to confirm.

Verification

Locate the downloaded demo notebooks in the JupyterLab interface, as follows:

  1. In the left navigation pane, double-click demo-notebooks.
  2. Double-click additional-demos and verify that the folder contains several demo notebooks.
  3. Click demo-notebooks to return to the demo-notebooks folder.
  4. Double-click guided-demos and verify that the folder contains several demo notebooks.

You can run these demo notebooks as described in Running distributed data science workloads from notebooks.

3.2. Running distributed data science workloads from notebooks

To run a distributed workload from a notebook, you must configure a Ray cluster. You must also provide environment-specific information such as cluster authentication details.

In the examples in this procedure, you edit the demo notebooks to provide the required information.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • Your cluster administrator has created the required Kueue resources as described in Configuring quota management for distributed workloads.
  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have logged in to Red Hat OpenShift AI.
  • You have launched your notebook server and logged in to your notebook editor. The examples in this procedure refer to the JupyterLab integrated development environment (IDE).
  • You have downloaded the demo notebooks provided by the CodeFlare SDK, as described in Downloading the demo notebooks from the CodeFlare SDK.

Procedure

  1. Check whether your cluster administrator has defined a default local queue for the Ray cluster:

    1. In the OpenShift web console, select your project from the Project list.
    2. Click Search, and from the Resources list, select LocalQueue to show the list of local queues for your project.

      If no local queue is listed, contact your cluster administrator.

    3. Review the details of each local queue:

      1. Click the local queue name.
      2. Click the YAML tab, and review the metadata.annotations section.

        If the kueue.x-k8s.io/default-queue annotation is set to 'true', the queue is configured as the default local queue.

        Note

        If your cluster administrator does not define a default local queue, you must specify a local queue in each notebook.

  2. In the JupyterLab interface, open the demo-notebooks > guided-demos folder.
  3. Open all of the notebooks by double-clicking each notebook file.

    Notebook files have the .ipynb file name extension.

  4. In each notebook, ensure that the import section imports the required components from the CodeFlare SDK, as follows:

    Example import section

    from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

  5. In each notebook, update the TokenAuthentication section to provide the token and server details to authenticate to the OpenShift cluster by using the CodeFlare SDK.

    You can find your token and server details as follows:

    1. In the OpenShift AI top navigation bar, click the application launcher icon and then click OpenShift Console to open the OpenShift web console.
    2. In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
    3. After you have logged in, click Display Token.
    4. In the Log in with this token section, find the required values as follows:

      • The token value is the text after the --token= prefix.
      • The server value is the text after the --server= prefix.
    Note

    The token and server values are security credentials; treat them with care.

    • Do not save the token and server details in a notebook.
    • Do not store the token and server details in Git.

    The token expires after 24 hours.
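
    One way to follow this guidance is to read the values from environment variables at run time instead of hard-coding them in the notebook. The following is a minimal sketch; the OCP_TOKEN and OCP_SERVER variable names are arbitrary examples that you set in the workbench environment, not values defined by OpenShift AI:

    Example authentication section that reads credentials from environment variables

    import os
    from codeflare_sdk import TokenAuthentication

    auth = TokenAuthentication(
        token=os.environ["OCP_TOKEN"],    # example variable name, set in the workbench environment
        server=os.environ["OCP_SERVER"],  # example variable name, set in the workbench environment
        skip_tls=False,
    )
    auth.login()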

  6. Optional: If you want to use custom certificates, update the TokenAuthentication section to add the ca_cert_path parameter to specify the location of the custom certificates, as shown in the following example:

    Example authentication section

    auth = TokenAuthentication(
        token = "XXXXX",
        server = "XXXXX",
        skip_tls=False,
        ca_cert_path="/path/to/cert"
    )
    auth.login()

    Alternatively, you can set the CF_SDK_CA_CERT_PATH environment variable to specify the location of the custom certificates.
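
    For example, the following minimal sketch sets the variable from within the notebook before the authentication code runs; the path is a placeholder that you replace with the location of your certificates:

    Example environment variable assignment

    import os
    os.environ["CF_SDK_CA_CERT_PATH"] = "/path/to/cert"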

  7. In each notebook, update the cluster configuration section as follows:

    1. If the namespace value is specified, replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.
    2. If the image value is specified, replace the example value with a link to your Ray cluster image. If you omit this line, the default community Ray image quay.io/rhoai/ray:2.23.0-py39-cu121 is used.

      Note

      The default Ray image is an AMD64 image, which might not work on other architectures.

    3. If your cluster administrator has not configured a default local queue, specify the local queue for the Ray cluster, as shown in the following example:

      Example local queue assignment

      local_queue="your_local_queue_name"

    4. Optional: Use the labels parameter to assign a dictionary of labels to the Ray cluster for identification and management purposes, as shown in the following example:

      Example labels assignment

      labels = {"exampleLabel1": "exampleLabel1Value", "exampleLabel2": "exampleLabel2Value"}
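
    Combining the preceding sub-steps, the following is a minimal sketch of a cluster configuration section. The values are placeholders that you replace with values for your environment, and the optional lines can be omitted as described above:

    Example cluster configuration

    # Cluster and ClusterConfiguration are imported in the import section shown earlier
    cluster = Cluster(ClusterConfiguration(
        name="raytest",  # example cluster name
        namespace="your_project",  # omit to create the Ray cluster in the current project
        num_workers=1,
        image="quay.io/rhoai/ray:2.23.0-py39-cu121",  # omit to use the default community Ray image
        local_queue="your_local_queue_name",  # omit if a default local queue is configured
        labels={"exampleLabel1": "exampleLabel1Value"},
    ))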

  8. In the 2_basic_interactive.ipynb notebook, ensure that the following Ray cluster authentication code is included after the Ray cluster creation section.

    Ray cluster authentication code

    from codeflare_sdk import generate_cert
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)

    Note

    Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in OpenShift AI. You must include the Ray cluster authentication code to enable the Ray client that runs within a notebook to connect to a secure Ray cluster that has mTLS enabled.
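
    After this authentication code runs, the Ray client in the notebook can connect to the Ray cluster. The following is a minimal sketch; the cluster_uri() function returns the Ray cluster address, as also shown in the pipeline example later in this chapter:

    Example Ray client connection

    import ray
    ray.init(address=cluster.cluster_uri())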

  9. Run the notebooks in the order indicated by the file-name prefix (0_, 1_, and so on).

    1. In each notebook, run each cell in turn, and review the cell output.
    2. If an error is shown, review the output to find information about the problem and the required corrective action. For example, replace any deprecated parameters as instructed. See also Troubleshooting common problems with distributed workloads for users.

Verification

  1. The notebooks run to completion without errors.
  2. In the notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
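
For example, you can run the following cells (a minimal sketch that uses the functions named above) and confirm that the reported status is Active:

    print(cluster.status())
    print(cluster.details())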

3.3. Running distributed data science workloads from data science pipelines

To run a distributed workload from a pipeline, you must first update the pipeline to include a link to your Ray cluster image.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • Your cluster administrator has created the required Kueue resources as described in Configuring quota management for distributed workloads.
  • Optional: Your cluster administrator has defined a default local queue for the Ray cluster by creating a LocalQueue resource and adding the following annotation to the configuration details for that LocalQueue resource, as described in Configuring quota management for distributed workloads:

    "kueue.x-k8s.io/default-queue": "true"
    Note

    If your cluster administrator does not define a default local queue, you must specify a local queue in each pipeline.

  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have access to S3-compatible object storage.
  • You have logged in to Red Hat OpenShift AI.

Procedure

  1. Create a data connection to connect the object storage to your data science project, as described in Adding a data connection to your data science project.
  2. Configure a pipeline server to use the data connection, as described in Configuring a pipeline server.
  3. Create the data science pipeline as follows:

    1. Install the kfp Python package, which is required for all pipelines:

      $ pip install kfp
    2. Install any other dependencies that are required for your pipeline.
    3. Build your data science pipeline in Python code.

      For example, create a file named compile_example.py with the following content:

      from kfp import dsl
      
      
      @dsl.component(
          base_image="registry.redhat.io/ubi8/python-39:latest",
          packages_to_install=['codeflare-sdk']
      )
      
      
      def ray_fn():
         import ray 1
         from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert 2
      
      
         cluster = Cluster( 3
             ClusterConfiguration(
                 namespace="my_project", 4
                 name="raytest",
                 num_workers=1,
                 head_cpus="500m",
                 min_memory=1,
                 max_memory=1,
                 worker_extended_resource_requests={"nvidia.com/gpu": 1}, 5
                 image="quay.io/rhoai/ray:2.23.0-py39-cu121", 6
                 local_queue="local_queue_name", 7
             )
         )
      
      
         print(cluster.status())
         cluster.up() 8
         cluster.wait_ready() 9
         print(cluster.status())
         print(cluster.details())
      
      
         ray_dashboard_uri = cluster.cluster_dashboard_uri()
         ray_cluster_uri = cluster.cluster_uri()
         print(ray_dashboard_uri, ray_cluster_uri)
      
         # Enable Ray client to connect to secure Ray cluster that has mTLS enabled
         generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace) 10
         generate_cert.export_env(cluster.config.name, cluster.config.namespace)
      
      
         ray.init(address=ray_cluster_uri)
         print("Ray cluster is up and running: ", ray.is_initialized())
      
      
         @ray.remote
         def train_fn(): 11
             # complex training function
             return 100
      
      
         result = ray.get(train_fn.remote())
         assert 100 == result
         ray.shutdown()
         cluster.down() 12
         return result
      
      
      @dsl.pipeline( 13
         name="Ray Simple Example",
         description="Ray Simple Example",
      )
      
      
      def ray_integration():
         ray_fn()
      
      
      if __name__ == '__main__': 14
          from kfp.compiler import Compiler
          Compiler().compile(ray_integration, 'compiled-example.yaml')

       1. Imports Ray.
       2. Imports packages from the CodeFlare SDK to define the cluster functions.
       3. Specifies the Ray cluster configuration: replace these example values with the values for your Ray cluster.
       4. Optional: Specifies the project where the Ray cluster is created. Replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.
       5. Optional: Specifies the requested accelerators for the Ray cluster (in this example, 1 NVIDIA GPU). If no accelerators are required, set the value to 0 or omit the line. Note: To specify the requested accelerators for the Ray cluster, use the worker_extended_resource_requests parameter instead of the deprecated num_gpus parameter. For more details, see the CodeFlare SDK documentation.
       6. Specifies the location of the Ray cluster image. The default Ray image is an AMD64 image, which might not work on other architectures. If you are running this code in a disconnected environment, replace the default value with the location for your environment.
       7. Specifies the local queue to which the Ray cluster will be submitted. If a default local queue is configured, you can omit this line.
       8. Creates a Ray cluster by using the specified image and configuration.
       9. Waits until the Ray cluster is ready before proceeding.
       10. Enables the Ray client to connect to a secure Ray cluster that has mutual Transport Layer Security (mTLS) enabled. mTLS is enabled by default in the CodeFlare component in OpenShift AI.
       11. Replace the example details in this section with the details for your workload.
       12. Removes the Ray cluster when your workload is finished.
       13. Replace the example name and description with the values for your workload.
       14. Compiles the Python code and saves the output in a YAML file.
    4. Compile the Python file (in this example, the compile_example.py file):

      $ python compile_example.py

      This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.

  4. Import your data science pipeline, as described in Importing a data science pipeline.
  5. Schedule the pipeline run, as described in Scheduling a pipeline run.
  6. When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing the details of a pipeline run.

Verification

The YAML file is created and the pipeline run completes without errors.

You can view the run details, as described in Viewing the details of a pipeline run.
