
Chapter 3. Running Ray-based distributed workloads


In OpenShift AI, you can run a Ray-based distributed workload from a Jupyter notebook or from a pipeline.

You can run Ray-based distributed workloads in a disconnected environment if you can access all of the required software from that environment. For example, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.

3.1. Running distributed data science workloads from Jupyter notebooks

To run a distributed workload from a Jupyter notebook, you must configure a Ray cluster. You must also provide environment-specific information such as cluster authentication details.

The examples in this section refer to the JupyterLab integrated development environment (IDE).

3.1.1. Downloading the demo Jupyter notebooks from the CodeFlare SDK

The demo Jupyter notebooks from the CodeFlare SDK provide guidelines on how to use the CodeFlare stack in your own Jupyter notebooks. Download the demo Jupyter notebooks so that you can learn how to run Jupyter notebooks locally.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
  • You can access a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about projects and workbenches, see Working on data science projects.
  • You have administrator access for the data science project.

    • If you created the project, you automatically have administrator access.
    • If you did not create the project, your cluster administrator must give you administrator access.
  • You have logged in to Red Hat OpenShift AI, started your workbench, and logged in to JupyterLab.

Procedure

  1. In the JupyterLab interface, click File > New > Notebook. Specify your preferred Python version, and then click Select.

    A new Jupyter notebook file is created with the .ipynb file name extension.

  2. Add the following code to a cell in the new notebook:

    Code to download the demo Jupyter notebooks

    from codeflare_sdk import copy_demo_nbs
    copy_demo_nbs()

  3. Select the cell, and click Run > Run selected cell.

    After a few seconds, the copy_demo_nbs() function copies the demo Jupyter notebooks that are packaged with the currently installed version of the CodeFlare SDK into the demo-notebooks folder.
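    Conceptually, this step is just a directory copy: the SDK ships the demo notebooks inside the installed package and copies them into your workspace. The following stdlib sketch illustrates that idea with a hypothetical helper and placeholder paths; it is not the SDK's actual implementation:

    ```python
    import shutil
    from pathlib import Path

    def copy_demo_notebooks(src: str, dest: str = "demo-notebooks") -> Path:
        """Copy a packaged notebook directory into the workspace.

        Hypothetical helper that mirrors what copy_demo_nbs() does conceptually:
        copy the demo notebooks shipped with the installed SDK into a local
        folder, refusing to overwrite an existing copy.
        """
        dest_path = Path(dest)
        if dest_path.exists():
            raise FileExistsError(f"{dest} already exists; delete it first")
        shutil.copytree(src, dest_path)
        return dest_path
    ```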

  4. In the left navigation pane, right-click the new notebook and click Delete.
  5. Click Delete to confirm.

Verification

Locate the downloaded demo Jupyter notebooks in the JupyterLab interface, as follows:

  1. In the left navigation pane, double-click demo-notebooks.
  2. Double-click additional-demos and verify that the folder contains several demo Jupyter notebooks.
  3. In the breadcrumb path, click demo-notebooks to return to the demo-notebooks folder.
  4. Double-click guided-demos and verify that the folder contains several demo Jupyter notebooks.

You can run these demo Jupyter notebooks as described in Running the demo Jupyter notebooks from the CodeFlare SDK.

3.1.2. Running the demo Jupyter notebooks from the CodeFlare SDK

To run the demo Jupyter notebooks from the CodeFlare SDK, you must provide environment-specific information.

In the examples in this procedure, you edit the demo Jupyter notebooks in JupyterLab to provide the required information, and then run the Jupyter notebooks.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
  • You can access a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about projects and workbenches, see Working on data science projects.
  • You have administrator access for the data science project.

    • If you created the project, you automatically have administrator access.
    • If you did not create the project, your cluster administrator must give you administrator access.
  • You have logged in to Red Hat OpenShift AI, started your workbench, and logged in to JupyterLab.
  • You have downloaded the demo Jupyter notebooks provided by the CodeFlare SDK, as described in Downloading the demo Jupyter notebooks from the CodeFlare SDK.

Procedure

  1. Check whether your cluster administrator has defined a default local queue for the Ray cluster.

    You can use the codeflare_sdk.list_local_queues() function to view all local queues in your current namespace, and the resource flavors associated with each local queue.

    Alternatively, you can use the OpenShift web console as follows:

    1. In the OpenShift web console, select your project from the Project list.
    2. Click Search, and from the Resources list, select LocalQueue to show the list of local queues for your project.

      If no local queue is listed, contact your cluster administrator.

    3. Review the details of each local queue:

      1. Click the local queue name.
      2. Click the YAML tab, and review the metadata.annotations section.

        If the kueue.x-k8s.io/default-queue annotation is set to 'true', the queue is configured as the default local queue.

        Note

        If your cluster administrator does not define a default local queue, you must specify a local queue in each Jupyter notebook.
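        The console check described in this step can also be expressed in code: fetch each LocalQueue manifest and inspect its metadata.annotations section. A minimal sketch, assuming the manifest is already loaded as a Python dictionary (the queue name is a placeholder):

        ```python
        def is_default_local_queue(local_queue: dict) -> bool:
            """Return True if a LocalQueue manifest carries the default-queue annotation."""
            annotations = local_queue.get("metadata", {}).get("annotations", {})
            return annotations.get("kueue.x-k8s.io/default-queue") == "true"

        # Example manifest fragment, as shown in the YAML tab of the console
        queue = {
            "metadata": {
                "name": "my-local-queue",
                "annotations": {"kueue.x-k8s.io/default-queue": "true"},
            }
        }
        print(is_default_local_queue(queue))  # True for this manifest
        ```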

  2. In the JupyterLab interface, open the demo-notebooks > guided-demos folder.
  3. Open all of the Jupyter notebooks by double-clicking each Jupyter notebook file.

    Jupyter notebook files have the .ipynb file name extension.

  4. In each Jupyter notebook, ensure that the import section imports the required components from the CodeFlare SDK, as follows:

    Example import section

    from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

  5. In each Jupyter notebook, update the TokenAuthentication section to provide the token and server details to authenticate to the OpenShift cluster by using the CodeFlare SDK.

    For information about how to find the server and token details, see Using the cluster server and token to authenticate.

  6. Optional: If you want to use custom certificates, update the TokenAuthentication section to add the ca_cert_path parameter to specify the location of the custom certificates, as shown in the following example:

    Example authentication section

    auth = TokenAuthentication(
        token = "XXXXX",
        server = "XXXXX",
        skip_tls=False,
        ca_cert_path="/path/to/cert"
    )
    auth.login()

    Alternatively, you can set the CF_SDK_CA_CERT_PATH environment variable to specify the location of the custom certificates.
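    If you prefer the environment-variable approach, you can set CF_SDK_CA_CERT_PATH from within the notebook before you create the TokenAuthentication object; the certificate path below is a placeholder:

    ```python
    import os

    # Point the CodeFlare SDK at the custom CA bundle (placeholder path).
    # Set this before constructing TokenAuthentication so the SDK picks it up.
    os.environ["CF_SDK_CA_CERT_PATH"] = "/path/to/cert"
    ```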

  7. In each Jupyter notebook, update the cluster configuration section as follows:

    1. If the namespace value is specified, replace the example value with the name of your project.

      If you omit this line, the Ray cluster is created in the current project.

    2. If the image value is specified, replace the example value with a link to a suitable Ray cluster image. The Python version in the Ray cluster image must be the same as the Python version in the workbench.

      If you omit this line, one of the following Ray cluster images is used by default, based on the Python version detected in the workbench:

      • Python 3.9: quay.io/modh/ray:2.35.0-py39-cu121
      • Python 3.11: quay.io/modh/ray:2.47.1-py311-cu121

      The default Ray images are compatible with NVIDIA GPUs that are supported by the specified CUDA version. The default images are AMD64 images, which might not work on other architectures.

      Additional ROCm-compatible Ray cluster images are available, which are compatible with AMD accelerators that are supported by the specified ROCm version. These images are AMD64 images, which might not work on other architectures.

      For information about the latest available training images and their preinstalled packages, including the CUDA and ROCm versions, see Red Hat OpenShift AI: Supported Configurations.

    3. If your cluster administrator has not configured a default local queue, specify the local queue for the Ray cluster, as shown in the following example:

      Example local queue assignment

      local_queue="your_local_queue_name"

    4. Optional: Assign a dictionary to the labels parameter to identify and manage the Ray cluster, as shown in the following example:

      Example labels assignment

      labels = {"exampleLabel1": "exampleLabel1Value", "exampleLabel2": "exampleLabel2Value"}
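    If you are unsure which default image applies to your workbench, the version-based selection described in this step can be sketched as a simple lookup on the interpreter version. This is an illustration of the selection rule using the image tags listed above, not the SDK's actual code:

    ```python
    import sys

    # Default CUDA-compatible Ray images by workbench Python version,
    # as listed in the cluster configuration step.
    DEFAULT_RAY_IMAGES = {
        (3, 9): "quay.io/modh/ray:2.35.0-py39-cu121",
        (3, 11): "quay.io/modh/ray:2.47.1-py311-cu121",
    }

    def default_ray_image(version=tuple(sys.version_info[:2])) -> str:
        """Return the default Ray image for a Python version, if one exists."""
        try:
            return DEFAULT_RAY_IMAGES[tuple(version)]
        except KeyError:
            raise ValueError(
                f"No default Ray image for Python {version[0]}.{version[1]}; "
                "set the image value explicitly"
            )
    ```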

  8. In the 2_basic_interactive.ipynb Jupyter notebook, ensure that the following Ray cluster authentication code is included after the Ray cluster creation section:

    Ray cluster authentication code

    from codeflare_sdk import generate_cert
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)

    Note

    Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in OpenShift AI. You must include the Ray cluster authentication code to enable the Ray client that runs within a Jupyter notebook to connect to a secure Ray cluster that has mTLS enabled.

  9. Run the Jupyter notebooks in the order indicated by the file-name prefix (0_, 1_, and so on).

    1. In each Jupyter notebook, run each cell in turn, and review the cell output.
    2. If an error is shown, review the output to find information about the problem and the required corrective action. For example, replace any deprecated parameters as instructed. See also Troubleshooting common problems with distributed workloads for users.
    3. For more information about the interactive browser controls that you can use to simplify Ray cluster tasks when working within a Jupyter notebook, see Managing Ray clusters from within a Jupyter notebook.

Verification

  1. The Jupyter notebooks run to completion without errors.
  2. In the Jupyter notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.

3.1.3. Managing Ray clusters from within a Jupyter notebook

You can use interactive browser controls to simplify Ray cluster tasks when working within a Jupyter notebook.

The interactive browser controls provide an alternative to the equivalent commands, but do not replace them. You can continue to manage the Ray clusters by running commands within the Jupyter notebook, for ease of use in scripts and pipelines.

Several different interactive browser controls are available:

  • When you run a cell that provides the cluster configuration, the Jupyter notebook automatically shows the controls for starting or deleting the cluster.
  • You can run the view_clusters() command to add controls that provide the following functionality:

    • View a list of the Ray clusters that you can access.
    • View cluster information, such as cluster status and allocated resources, for the selected Ray cluster. You can view this information from within the Jupyter notebook, without switching to the OpenShift console or the Ray dashboard.
    • Open the Ray dashboard directly from the Jupyter notebook, to view the submitted jobs.
    • Refresh the Ray cluster list and the cluster information for the selected cluster.

    You can add these controls to existing Jupyter notebooks, or manage the Ray clusters from a separate Jupyter notebook.

The 3_widget_example.ipynb demo Jupyter notebook shows all of the available interactive browser controls. In the example in this procedure, you create a new Jupyter notebook to manage the Ray clusters, similar to the example provided in the 3_widget_example.ipynb demo Jupyter notebook.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
  • You can access a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about projects and workbenches, see Working on data science projects.
  • You have administrator access for the data science project.

    • If you created the project, you automatically have administrator access.
    • If you did not create the project, your cluster administrator must give you administrator access.
  • You have logged in to Red Hat OpenShift AI, started your workbench, and logged in to JupyterLab.
  • You have downloaded the demo Jupyter notebooks provided by the CodeFlare SDK, as described in Downloading the demo Jupyter notebooks from the CodeFlare SDK.

Procedure

  1. Run all of the demo Jupyter notebooks in the order indicated by the file-name prefix (0_, 1_, and so on), as described in Running the demo Jupyter notebooks from the CodeFlare SDK.
  2. In each demo Jupyter notebook, when you run the cluster configuration step, the following interactive controls are automatically shown in the Jupyter notebook:

    • Cluster Up: You can click this button to start the Ray cluster. This button is equivalent to the cluster.up() command. When you click this button, a message indicates whether the cluster was successfully created.
    • Cluster Down: You can click this button to delete the Ray cluster. This button is equivalent to the cluster.down() command. The cluster is deleted immediately; you are not prompted to confirm the deletion. When you click this button, a message indicates whether the cluster was successfully deleted.
    • Wait for Cluster: You can select this option to specify that the notebook cell should wait for the Ray cluster dashboard to be ready before proceeding to the next step. This option is equivalent to the cluster.wait_ready() command.
  3. In the JupyterLab interface, create a new Jupyter notebook to manage the Ray clusters, as follows:

    1. Click File > New > Notebook. Specify your preferred Python version, and then click Select.

      A new Jupyter notebook file is created with the .ipynb file name extension.

    2. Add the following code to a cell in the new Jupyter notebook:

      Code to import the required packages

      from codeflare_sdk import TokenAuthentication, view_clusters

      The view_clusters function provides the interactive browser controls for listing the clusters, showing the cluster details, opening the Ray dashboard, and refreshing the cluster data.

    3. Add a new notebook cell, and add the following code to the new cell:

      Code to authenticate

      auth = TokenAuthentication(
          token = "XXXXX",
          server = "XXXXX",
          skip_tls=False
      )
      auth.login()

      For information about how to find the token and server values, see Running the demo Jupyter notebooks from the CodeFlare SDK.

    4. Add a new notebook cell, and add the following code to the new cell:

      Code to view clusters in the current project

      view_clusters()

      When you run the view_clusters() command with no arguments, it lists all of the Ray clusters in the current project and displays information similar to the output of the cluster.details() function.

      If you have access to another project, you can list the Ray clusters in that project by specifying the project name as shown in the following example:

      Code to view clusters in another project

      view_clusters("my_second_project")

    5. Click File > Save Notebook As, enter demo-notebooks/guided-demos/manage_ray_clusters.ipynb, and click Save.
  4. In the demo-notebooks/guided-demos/manage_ray_clusters.ipynb Jupyter notebook, select each cell in turn, and click Run > Run selected cell.
  5. When you run the cell with the view_clusters() function, the output depends on whether any Ray clusters exist.

    If no Ray clusters exist, the following text is shown, where [project-name] is the name of the target project:

    No clusters found in the [project-name] namespace.

    Otherwise, the Jupyter notebook shows the following information about the existing Ray clusters:

    • Select an existing cluster

      Under this heading, a toggle button is shown for each existing cluster. Click a cluster name to select the cluster. The cluster details section is updated to show details about the selected cluster; for example, cluster name, OpenShift AI project name, cluster resource information, and cluster status.

    • Delete cluster

      Click this button to delete the selected cluster. This button is equivalent to the Cluster Down button. The cluster is deleted immediately; you are not prompted to confirm the deletion. A message indicates whether the cluster was successfully deleted, and the corresponding button is no longer shown under the Select an existing cluster heading.

    • View Jobs

      Click this button to open the Jobs tab in the Ray dashboard for the selected cluster, and view details of the submitted jobs. The corresponding URL is shown in the Jupyter notebook.

    • Open Ray Dashboard

      Click this button to open the Overview tab in the Ray dashboard for the selected cluster. The corresponding URL is shown in the Jupyter notebook.

    • Refresh Data

      Click this button to refresh the list of Ray clusters, and the cluster details for the selected cluster, on demand. The cluster details are automatically refreshed when you select a cluster and when you delete the selected cluster.

Verification

  1. The demo Jupyter notebooks run to completion without errors.
  2. In the manage_ray_clusters.ipynb Jupyter notebook, the output from the view_clusters() function is correct.

3.2. Running distributed data science workloads from data science pipelines

To run a distributed workload from a pipeline, you must first update the pipeline to include a link to your Ray cluster image.

Prerequisites

  • You can access a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
  • You can access the following software from your data science cluster:

    • A Ray cluster image that is compatible with your hardware architecture
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server
  • You can access a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about projects and workbenches, see Working on data science projects.
  • You have administrator access for the data science project.

    • If you created the project, you automatically have administrator access.
    • If you did not create the project, your cluster administrator must give you administrator access.
  • You have access to S3-compatible object storage.
  • You have logged in to Red Hat OpenShift AI.

Procedure

  1. Create a connection to connect the object storage to your data science project, as described in Adding a connection to your data science project.
  2. Configure a pipeline server to use the connection, as described in Configuring a pipeline server.
  3. Create the data science pipeline as follows:

    1. Install the kfp Python package, which is required for all pipelines:

      $ pip install kfp
    2. Install any other dependencies that are required for your pipeline.
    3. Build your data science pipeline in Python code.

      For example, create a file named compile_example.py with the following content.

      from kfp import dsl


      @dsl.component(
          base_image="registry.redhat.io/ubi9/python-311:latest",
          packages_to_install=['codeflare-sdk']
      )
      def ray_fn():
          import ray  # (1)
          from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert  # (2)

          cluster = Cluster(  # (3)
              ClusterConfiguration(
                  namespace="my_project",  # (4)
                  name="raytest",
                  num_workers=1,
                  head_cpu_requests="500m",
                  head_cpu_limits="500m",
                  worker_memory_requests=1,
                  worker_memory_limits=1,
                  worker_extended_resource_requests={"nvidia.com/gpu": 1},  # (5)
                  image="quay.io/modh/ray:2.47.1-py311-cu121",  # (6)
                  local_queue="local_queue_name",  # (7)
              )
          )

          print(cluster.status())
          cluster.up()  # (8)
          cluster.wait_ready()  # (9)
          print(cluster.status())
          print(cluster.details())

          ray_dashboard_uri = cluster.cluster_dashboard_uri()
          ray_cluster_uri = cluster.cluster_uri()
          print(ray_dashboard_uri, ray_cluster_uri)

          # Enable the Ray client to connect to a secure Ray cluster that has mTLS enabled
          generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)  # (10)
          generate_cert.export_env(cluster.config.name, cluster.config.namespace)

          ray.init(address=ray_cluster_uri)
          print("Ray cluster is up and running: ", ray.is_initialized())

          @ray.remote
          def train_fn():  # (11)
              # complex training function
              return 100

          result = ray.get(train_fn.remote())
          assert result == 100
          ray.shutdown()
          cluster.down()  # (12)
          return result


      @dsl.pipeline(  # (13)
          name="Ray Simple Example",
          description="Ray Simple Example",
      )
      def ray_integration():
          ray_fn()


      if __name__ == '__main__':  # (14)
          from kfp.compiler import Compiler
          Compiler().compile(ray_integration, 'compiled-example.yaml')

      (1) Imports Ray.
      (2) Imports packages from the CodeFlare SDK to define the cluster functions.
      (3) Specifies the Ray cluster configuration: replace these example values with the values for your Ray cluster.
      (4) Optional: Specifies the project where the Ray cluster is created. Replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.
      (5) Optional: Specifies the requested accelerators for the Ray cluster (in this example, 1 NVIDIA GPU). If you do not use NVIDIA GPUs, replace nvidia.com/gpu with the correct value for your accelerator; for example, specify amd.com/gpu for AMD GPUs. If no accelerators are required, set the value to 0 or omit the line.
      (6) Specifies the location of the Ray cluster image. The Python version in the Ray cluster image must be the same as the Python version in the workbench. If you omit this line, one of the default CUDA-compatible Ray cluster images is used, based on the Python version detected in the workbench. The default Ray images are AMD64 images, which might not work on other architectures. If you are running this code in a disconnected environment, replace the default value with the location for your environment. For information about the latest available training images and their preinstalled packages, see Red Hat OpenShift AI: Supported Configurations.
      (7) Specifies the local queue to which the Ray cluster is submitted. If a default local queue is configured, you can omit this line.
      (8) Creates a Ray cluster by using the specified image and configuration.
      (9) Waits until the Ray cluster is ready before proceeding.
      (10) Enables the Ray client to connect to a secure Ray cluster that has mutual Transport Layer Security (mTLS) enabled. mTLS is enabled by default in the CodeFlare component in OpenShift AI.
      (11) Replace the example details in this section with the details for your workload.
      (12) Removes the Ray cluster when your workload is finished.
      (13) Replace the example name and description with the values for your workload.
      (14) Compiles the Python code and saves the output in a YAML file.
    4. Compile the Python file (in this example, the compile_example.py file):

      $ python compile_example.py

      This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.

  4. Import your data science pipeline, as described in Importing a data science pipeline.
  5. Schedule the pipeline run, as described in Scheduling a pipeline run.
  6. When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing the details of a pipeline run.

Verification

The YAML file is created and the pipeline run completes without errors.

You can view the run details, as described in Viewing the details of a pipeline run.

3.3. Running distributed data science workloads in a disconnected environment

To run a distributed data science workload in a disconnected environment, you must be able to access a Ray cluster image, and the data sets and Python dependencies used by the workload, from the disconnected environment.

Prerequisites

  • You have logged in to OpenShift with the cluster-admin role.
  • You have access to the disconnected data science cluster.
  • You have installed Red Hat OpenShift AI and created a mirror image as described in Installing and uninstalling OpenShift AI Self-Managed in a disconnected environment.
  • You can access the following software from the disconnected cluster:

    • A Ray cluster image
    • The data sets and models to be used by the workload
    • The Python dependencies for the workload, either in a Ray image or in your own Python Package Index (PyPI) server that is available from the disconnected cluster
  • You have logged in to Red Hat OpenShift AI.
  • You have created a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about how to create a project, see Creating a data science project.
  • You have administrator access for the data science project.

    • If you created the project, you automatically have administrator access.
    • If you did not create the project, your cluster administrator must give you administrator access.

Procedure

  1. Configure the disconnected data science cluster to run distributed workloads as described in Managing distributed workloads.
  2. In the ClusterConfiguration section of the Jupyter notebook or pipeline, ensure that the image value specifies a Ray cluster image that you can access from the disconnected environment:

    • Jupyter notebooks use the Ray cluster image to create a Ray cluster when running the notebook cells.
    • Pipelines use the Ray cluster image to create a Ray cluster during the pipeline run.
  3. If any of the Python packages required by the workload are not available in the Ray cluster, configure the Ray cluster to download the Python packages from a private PyPI server.

    For example, set the PIP_INDEX_URL and PIP_TRUSTED_HOST environment variables for the Ray cluster, to specify the location of the Python dependencies, as shown in the following example:

    PIP_INDEX_URL: https://pypi-notebook.apps.mylocation.com/simple
    PIP_TRUSTED_HOST: pypi-notebook.apps.mylocation.com

    where

    • PIP_INDEX_URL specifies the base URL of your private PyPI server (pip's default index is https://pypi.org/simple).
    • PIP_TRUSTED_HOST configures Python to mark the specified host as trusted, regardless of whether that host has a valid SSL certificate or is using a secure channel.
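    These are pip's standard configuration variables: PIP_INDEX_URL and PIP_TRUSTED_HOST correspond to the --index-url and --trusted-host command-line options. The following sketch shows that mapping, using the placeholder server from the example above:

    ```python
    def pip_args_from_env(env: dict) -> list:
        """Translate PIP_* environment variables into equivalent pip install flags."""
        args = ["pip", "install"]
        if "PIP_INDEX_URL" in env:
            args += ["--index-url", env["PIP_INDEX_URL"]]
        if "PIP_TRUSTED_HOST" in env:
            args += ["--trusted-host", env["PIP_TRUSTED_HOST"]]
        return args

    env = {
        "PIP_INDEX_URL": "https://pypi-notebook.apps.mylocation.com/simple",
        "PIP_TRUSTED_HOST": "pypi-notebook.apps.mylocation.com",
    }
    print(" ".join(pip_args_from_env(env)))
    ```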
  4. Run the distributed data science workload, as described in Running distributed data science workloads from Jupyter notebooks or Running distributed data science workloads from data science pipelines.

Verification

The Jupyter notebook or pipeline run completes without errors:

  • For Jupyter notebooks, the output from the cluster.status() function or cluster.details() function indicates that the Ray cluster is Active.
  • For pipeline runs, you can view the run details as described in Viewing the details of a pipeline run.