3.2. Running distributed data science workloads from data science pipelines

To run a distributed workload from a pipeline, you must first update the pipeline to include a link to your Ray cluster image.

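Prerequisites

You have access to a data science project that is configured to run distributed workloads, as described in Managing distributed workloads.
You can access the following software from your data science cluster:
- A Ray cluster image that is compatible with your hardware architecture
- The data sets and models that the workload uses
- The Python dependencies for the workload, either in a Ray image or on your own Python Package Index (PyPI) server
You have access to a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
You have Admin access for the data science project.
- If you created the project, you automatically have Admin access.
- If you did not create the project, your cluster administrator must grant you Admin access.
You have access to S3-compatible object storage.
You have logged in to Red Hat OpenShift AI.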

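Procedure

Create a connection to connect the object storage to your data science project, as described in Adding a connection to your data science project.
Configure a pipeline server to use the connection, as described in Configuring a pipeline server.

Create the data science pipeline as follows:

Install the kfp Python package, which is required for all pipelines: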
```
pip install kfp
```
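Install any other dependencies that your pipeline requires.

Build your data science pipeline in Python code.

For example, if you use NVIDIA GPUs, create a file named compile_example.py with the following content: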

```
from kfp import dsl


@dsl.component(
    base_image="registry.redhat.io/ubi8/python-39:latest",
    packages_to_install=['codeflare-sdk']
)
def ray_fn():
    import ray  # (1)
    from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert  # (2)

    # If you do not use NVIDIA GPUs, substitute "nvidia.com/gpu" with the correct value for your accelerator
    cluster = Cluster(  # (3)
        ClusterConfiguration(
            namespace="my_project",  # (4)
            name="raytest",
            num_workers=1,
            head_cpus="500m",
            min_memory=1,
            max_memory=1,
            worker_extended_resource_requests={"nvidia.com/gpu": 1},  # (5)
            image="quay.io/modh/ray:2.35.0-py39-cu121",  # (6)
            local_queue="local_queue_name",  # (7)
        )
    )

    print(cluster.status())
    cluster.up()  # (8)
    cluster.wait_ready()  # (9)
    print(cluster.status())
    print(cluster.details())

    ray_dashboard_uri = cluster.cluster_dashboard_uri()
    ray_cluster_uri = cluster.cluster_uri()
    print(ray_dashboard_uri, ray_cluster_uri)

    # Enable Ray client to connect to secure Ray cluster that has mTLS enabled
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)  # (10)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)

    ray.init(address=ray_cluster_uri)
    print("Ray cluster is up and running: ", ray.is_initialized())

    @ray.remote
    def train_fn():  # (11)
        # complex training function
        return 100

    result = ray.get(train_fn.remote())
    assert 100 == result
    ray.shutdown()
    cluster.down()  # (12)
    return result


@dsl.pipeline(  # (13)
    name="Ray Simple Example",
    description="Ray Simple Example",
)
def ray_integration():
    ray_fn()


if __name__ == '__main__':  # (14)
    from kfp.compiler import Compiler
    Compiler().compile(ray_integration, 'compiled-example.yaml')
```

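1: Imports Ray.
2: Imports packages from the CodeFlare SDK to define the cluster functions.
3: Specifies the Ray cluster configuration (the Cluster and ClusterConfiguration call). Replace these example values with the values for your Ray cluster.
4: Optional: The namespace value specifies the project where the Ray cluster is created. Replace the example value with the name of your project. If you omit this line, the Ray cluster is created in the current project.
5: Optional: The worker_extended_resource_requests value specifies the requested accelerators for the Ray cluster (in this example, 1 NVIDIA GPU). If no accelerators are required, set the value to 0 or omit the line. Note: To specify the requested accelerators for the Ray cluster, use the worker_extended_resource_requests parameter instead of the deprecated num_gpus parameter. For more details, see the CodeFlare SDK documentation.
6: The image value specifies the location of the Ray cluster image. If you omit this line, one of the default CUDA-compatible Ray cluster images is used, based on the Python version detected in the workbench. The default Ray images are AMD64 images, which might not work on other architectures. If you are running this code in a disconnected environment, replace the default value with the location for your environment. For information about the latest available training images and their preinstalled packages, see Red Hat OpenShift AI: Supported Configurations.
7: The local_queue value specifies the local queue to which the Ray cluster is submitted. If a default local queue is configured, you can omit this line.
8: The cluster.up() call creates a Ray cluster by using the specified image and configuration.
9: The cluster.wait_ready() call waits until the Ray cluster is ready before proceeding.
10: The generate_cert calls enable the Ray client to connect to a secure Ray cluster that has mutual Transport Layer Security (mTLS) enabled. mTLS is enabled by default in the CodeFlare component in OpenShift AI.
11: Replace the example details in the train_fn section with the details for your workload.
12: The cluster.down() call removes the Ray cluster when your workload is finished.
13: In the @dsl.pipeline decorator, replace the example name and description with the values for your workload.
14: Compiles the Python code and saves the output in a YAML file.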
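The example removes the Ray cluster only after the workload succeeds. If the training step raises an error, the cluster is left running. The following is a minimal sketch of one way to guarantee teardown, not part of the documented procedure. FakeCluster, running_cluster, and run_workload are hypothetical names introduced for illustration. FakeCluster is a stand-in that exposes only the up(), wait_ready(), and down() methods used in the example, so the sketch runs without a real cluster.

```python
from contextlib import contextmanager


class FakeCluster:
    """Hypothetical stand-in for a Ray cluster object; records calls only."""

    def __init__(self):
        self.calls = []

    def up(self):
        self.calls.append("up")

    def wait_ready(self):
        self.calls.append("wait_ready")

    def down(self):
        self.calls.append("down")


@contextmanager
def running_cluster(cluster):
    """Start the cluster, yield it once it is ready, and always tear it down."""
    cluster.up()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        # Runs on success and on failure, so the cluster is not left behind.
        cluster.down()


def run_workload(cluster, workload):
    """Run workload(cluster) inside the managed lifecycle and return its result."""
    with running_cluster(cluster) as ready:
        return workload(ready)


if __name__ == "__main__":
    demo = FakeCluster()
    print(run_workload(demo, lambda _cluster: 100))
    print(demo.calls)
```

In a real component, the body of the with block would hold the ray.init(), training, and ray.shutdown() steps, assuming the real cluster object is passed in place of FakeCluster.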

Compile the Python file (in this example, the compile_example.py file):
```
python compile_example.py
```
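This command creates a YAML file (in this example, compiled-example.yaml), which you can import in the next step.

Import your data science pipeline, as described in Importing a data science pipeline.
Schedule the pipeline run, as described in Scheduling a pipeline run.
When the pipeline run is complete, confirm that it is included in the list of triggered pipeline runs, as described in Viewing the details of a pipeline run.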

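Verification

The YAML file is created and the pipeline run completes without errors.

You can view the run details, as described in Viewing the details of a pipeline run.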
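Before importing the pipeline, it can help to confirm that the compile step produced a usable file. The following is a rough sketch of such a check that uses only the Python standard library. The function name check_compiled_pipeline and the optional required_markers parameter are hypothetical and not part of kfp. Any marker strings you pass are your own assumption about what the compiled file should contain.

```python
from pathlib import Path


def check_compiled_pipeline(path, required_markers=()):
    """Return a list of problems with a compiled pipeline file.

    An empty list means the file exists, is not blank, and contains
    every string in required_markers.
    """
    file = Path(path)
    if not file.is_file():
        return [f"{file} does not exist"]
    text = file.read_text(encoding="utf-8")
    if not text.strip():
        return [f"{file} is empty"]
    # Plain substring checks only; this does not validate the YAML itself.
    return [
        f"{file} does not contain '{marker}'"
        for marker in required_markers
        if marker not in text
    ]


if __name__ == "__main__":
    # File name taken from the example in this section.
    for problem in check_compiled_pipeline("compiled-example.yaml"):
        print(problem)
```

This check only covers the first half of the verification. Whether the pipeline run completes without errors still has to be confirmed in the run details.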
