Chapter 1. Overview of distributed workloads


You can use the distributed workloads feature to queue, scale, and manage the resources required to run data science workloads across multiple nodes in an OpenShift cluster simultaneously. Data science workloads typically span several types of artificial intelligence (AI) workloads, such as machine learning (ML) and Python workloads.

Distributed workloads provide the following benefits:

  • You can iterate faster and experiment more frequently because of the reduced processing time.
  • You can use larger datasets, which can lead to more accurate models.
  • You can use complex models that could not be trained on a single node.
  • You can submit distributed workloads at any time; the system queues each workload and schedules it when the required resources become available.
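
The last benefit, submitting at any time and running once resources are free, can be illustrated with a toy model. The following sketch is not how the system implements scheduling: it assumes a single GPU quota and strict first-in, first-out admission, and the `Workload` class, the `admit` function, and the workload names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Workload:
    name: str  # hypothetical workload name
    gpus: int  # number of GPUs the workload requests


def admit(pending, free_gpus):
    """Admit queued workloads in FIFO order while the head of the queue fits.

    Toy model only: one resource type, no priorities, no preemption.
    Returns the names admitted in this round and the GPUs still free.
    """
    admitted = []
    while pending and pending[0].gpus <= free_gpus:
        workload = pending.popleft()
        free_gpus -= workload.gpus
        admitted.append(workload.name)
    return admitted, free_gpus


# Three workloads are submitted while only 4 GPUs are free.
pending = deque([Workload("train-a", 2), Workload("train-b", 4), Workload("train-c", 1)])
first_round, free = admit(pending, free_gpus=4)

# "train-a" finishes and releases its 2 GPUs, so the next workload can start.
second_round, free = admit(pending, free_gpus=free + 2)
```

In this model, `train-b` waits in the queue until enough GPUs are released, and `train-c` waits behind it because admission is strictly ordered. A real queueing system might apply different ordering and fairness rules.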

1.1. Distributed workloads infrastructure

The distributed workloads infrastructure includes the following components:

CodeFlare SDK

Defines and controls the remote distributed compute jobs and infrastructure for any Python-based environment.

Note

The CodeFlare SDK is not installed as part of OpenShift AI, but it is included in some of the workbench images provided by OpenShift AI.

Kubeflow Training Operator

Provides fine-tuning and scalable distributed training of ML models created with different ML frameworks such as PyTorch.

Kubeflow Training Operator Python Software Development Kit (Training Operator SDK)

Simplifies the creation of distributed training and fine-tuning jobs.

Note

The Training Operator SDK is not installed as part of OpenShift AI, but it is included in some of the workbench images provided by OpenShift AI.

KubeRay Operator

Manages and secures remote Ray clusters on OpenShift for running distributed compute workloads, and enforces a controlled-network environment.

Red Hat build of Kueue Operator

Manages quotas and how distributed workloads consume them, and manages the queueing of distributed workloads with respect to quotas.

cert-manager Operator

Enables integration with external certificate authorities and provides certificate provisioning, renewal, and retirement.

For information about installing these components, see Installing the distributed workloads components. For disconnected environments, see the section with the same title in the installation documentation for disconnected environments.

1.2. Types of distributed workloads

Depending on the type of distributed workload that you want to run, you must use different OpenShift AI components:

  • Ray-based distributed workloads: Use the kueue and ray components.
  • Training Operator-based distributed workloads: Use the trainingoperator and kueue components.

For both Ray-based and Training Operator-based distributed workloads, you can use Kueue and supported accelerators:

  • Use Kueue to manage the resources for the distributed workload.
  • Use CUDA training images for NVIDIA GPUs, and ROCm-based training images for AMD GPUs.

For more information about supported accelerators, see the Supported Configurations for 3.x Knowledgebase article.
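
As a quick reference, the pairings described in this section can be captured in a small lookup helper. This is only a sketch: the component names and image families come from the lists above, while the dictionary keys (`"ray"`, `"training-operator"`, `"nvidia"`, `"amd"`) and the `plan` function are hypothetical labels chosen for illustration and are not part of any OpenShift AI API.

```python
# Pairings taken from this section; the keys are hypothetical labels.
COMPONENTS_BY_WORKLOAD = {
    "ray": ("kueue", "ray"),
    "training-operator": ("trainingoperator", "kueue"),
}
IMAGE_FAMILY_BY_VENDOR = {
    "nvidia": "CUDA",
    "amd": "ROCm",
}


def plan(workload_type, gpu_vendor):
    """Return the components and training image family for a workload.

    Raises ValueError for a workload type or GPU vendor not listed above.
    """
    try:
        components = COMPONENTS_BY_WORKLOAD[workload_type]
        image_family = IMAGE_FAMILY_BY_VENDOR[gpu_vendor]
    except KeyError as err:
        raise ValueError(f"unsupported option: {err.args[0]}") from None
    return {"components": sorted(components), "image_family": image_family}


# Example: a Ray-based workload on AMD GPUs.
ray_plan = plan("ray", "amd")
```

The helper only encodes what this section states. It does not check whether a given accelerator is supported; for that, the Supported Configurations article remains the reference.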

You can run distributed workloads from AI pipelines, from Jupyter notebooks, or from Microsoft Visual Studio Code files.

Note

AI pipelines workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.
