Chapter 2. Preparing the distributed training environment


Before you run a distributed training or tuning job, prepare your training environment as follows:

  • Create a workbench with the appropriate workbench image. Review the list of packages in each workbench image to find the most suitable image for your distributed training workload.
  • Ensure that you have the credentials to authenticate to the OpenShift cluster.
  • Select a suitable training image. Choose from the list of base training images provided with Red Hat OpenShift AI, or create a custom training image.

For information about the workbench images and training images provided with Red Hat OpenShift AI, and their preinstalled packages, see the Supported Configurations for 3.x Knowledgebase article.

2.1. Creating a workbench for distributed training

Create a workbench with the appropriate resources to run a distributed training or tuning job.

Prerequisites

Procedure

  1. Log in to the Red Hat OpenShift AI web console.
  2. If you want to add the workbench to an existing project, open the project and proceed to the next step.

    If you want to add the workbench to a new project, create the project as follows:

    1. In the left navigation pane, click Projects, and click Create project.
    2. Enter a project name, and optionally a description, and click Create. The project details page opens, with the Overview tab selected by default.
  3. Create a workbench as follows:

    1. On the project details page, click the Workbenches tab, and click Create workbench.
    2. Enter a workbench name, and optionally a description.
    3. In the Workbench image section, from the Image selection list, select the appropriate image for your training or tuning job. If project-scoped images exist, the Image selection list includes subheadings to distinguish between global images and project-scoped images.

      For example, to run the example fine-tuning job described in Fine-tuning a model by using Kubeflow Training, select PyTorch.

    4. In the Deployment size section, from the Hardware profile list, select a suitable hardware profile for your workbench.

      If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.

      The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both.

    5. If you want to change the default values, click Customize resource requests and limits and enter new minimum (request) and maximum (limit) values.
    6. In the Cluster storage section, click either Attach existing storage or Create storage to specify the storage details so that you can share data between the workbench and the training or tuning runs.

      For example, to run the example fine-tuning job described in Fine-tuning a model by using Kubeflow Training, specify a storage class with ReadWriteMany (RWX) capability.

    7. Review the storage configuration and click Create workbench.
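For reference, the cluster storage that the workbench attaches corresponds to a PersistentVolumeClaim. The following sketch shows a claim with the ReadWriteMany access mode that shared training data requires; the claim name, size, and storage class are illustrative placeholders, not values defined by this procedure:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-training-storage    # illustrative name
spec:
  accessModes:
    - ReadWriteMany                # RWX, so the workbench and training pods can share the volume
  resources:
    requests:
      storage: 100Gi               # illustrative size
  storageClassName: <rwx-storage-class>
```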

Verification

On the Workbenches tab, confirm that the workbench status changes from Starting to Running.

2.2. Using the cluster server and token to authenticate

To interact with the OpenShift cluster, you must authenticate to the OpenShift API by specifying the cluster server and token. You can find these values from the OpenShift Console.

Prerequisites

  • You can access the OpenShift Console.

Procedure

  1. Log in to the OpenShift Console.

    In the OpenShift AI top navigation bar, click the application launcher icon, and then click OpenShift Console.

  2. In the upper-right corner of the OpenShift Console, click your user name and click Copy login command.
  3. In the new tab that opens, log in as the user whose credentials you want to use.
  4. Click Display Token.
  5. In the Log in with this token section, find the required values as follows:

    • The token value is the text after the --token= prefix.
    • The server value is the text after the --server= prefix.
    Note

    The token and server values are security credentials; handle them with care.

    • Do not save the token and server details in a notebook file.
    • Do not store the token and server details in Git.

    The token expires after 24 hours.

  6. You can use the token and server details to authenticate in various ways, as shown in the following examples:

    • You can specify the values in a notebook cell:

      api_server = "<server>"
      token = "<token>"
    • You can log in to the OpenShift CLI (oc) by copying the entire Log in with this token command and pasting the command in a terminal window.

      $ oc login --token=<token> --server=<server>
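If you paste the copied login command into a notebook, you can also extract the two values programmatically instead of copying them by hand. A minimal sketch, where the token and server values are illustrative placeholders:

```python
import re

# Full command copied from the "Log in with this token" page.
# The token and server values below are illustrative placeholders.
login_command = (
    "oc login --token=sha256~EXAMPLETOKEN "
    "--server=https://api.cluster.example.com:6443"
)

# The token is the text after --token=; the server is the text after --server=.
token = re.search(r"--token=(\S+)", login_command).group(1)
api_server = re.search(r"--server=(\S+)", login_command).group(1)

print(token)       # sha256~EXAMPLETOKEN
print(api_server)  # https://api.cluster.example.com:6443
```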

2.3. Managing custom training images

To run distributed training jobs, you can use one of the base training images that are provided with OpenShift AI, or you can create your own custom training images. You can optionally push your custom training images to the integrated OpenShift image registry, to make your images available to other users.

2.3.1. About base training images

The base training images for distributed workloads are optimized with the tools and libraries that you need to run distributed training jobs. You can use the provided base images, or you can create custom images that are specific to your needs.

For information about Red Hat support of training images and packages, see Supported Configurations for 3.x.

The following table lists the training images that are installed with Red Hat OpenShift AI by default. These images are AMD64 images, which might not work on other architectures.

Table 2.1. Default training base images

Ray CUDA

If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Ray Compute Unified Device Architecture (CUDA) base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs.

Ray ROCm

If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.

KFTO CUDA

If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Kubeflow Training Operator CUDA base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs.

KFTO ROCm

If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Kubeflow Training Operator ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.

If the preinstalled packages that are provided in these images are not sufficient for your use case, you have the following options:

  • Install additional libraries after launching a default image. This option is good if you want to add libraries on an ad hoc basis as you run training jobs. However, it can be challenging to manage the dependencies of installed libraries.
  • Create a custom image that includes the additional libraries or packages. For more information, see Creating a custom training image.

2.3.2. Creating a custom training image

You can create a custom training image by adding packages to a base training image.

Prerequisites

  • You can access the training image that you have chosen to use as the base for your custom image.

    Select the image based on the image type (for example, Ray or Kubeflow Training Operator), the accelerator framework (for example, CUDA for NVIDIA GPUs, or ROCm for AMD GPUs), and the Python version (for example, 3.9 or 3.11).

    The following table shows some example base training images:

    Table 2.2. Example base training images

    Image type | Accelerator framework | Python version | Example base training image | Preinstalled packages
    Ray | CUDA | 3.9 | ray:2.35.0-py39-cu121 | Ray 2.35.0, Python 3.9, CUDA 12.1
    Ray | CUDA | 3.11 | ray:2.47.1-py311-cu121 | Ray 2.47.1, Python 3.11, CUDA 12.1
    Ray | ROCm | 3.9 | ray:2.35.0-py39-rocm62 | Ray 2.35.0, Python 3.9, ROCm 6.2
    Ray | ROCm | 3.11 | ray:2.47.1-py311-rocm62 | Ray 2.47.1, Python 3.11, ROCm 6.2
    KFTO | CUDA | 3.11 | training:py311-cuda124-torch251 | Python 3.11, CUDA 12.4, PyTorch 2.5.1
    KFTO | ROCm | 3.11 | training:py311-rocm62-torch251 | Python 3.11, ROCm 6.2, PyTorch 2.5.1

    For a complete list of the OpenShift AI base training images and their preinstalled packages, see Supported Configurations for 3.x.

  • You have Podman installed in your local environment, and you can access a container registry.

    For more information about Podman and container registries, see Building, running, and managing containers.

Procedure

  1. In a terminal window, create a directory for your work, and change to that directory.
  2. Set the IMG environment variable to the name of your custom image. In the example commands in this section, my_training_image is the name of the custom image.

    export IMG=my_training_image
  3. Create a file named Dockerfile with the following content:

    1. Use the FROM instruction to specify the location of a suitable base training image.

      In the following command, replace <base-training-image> with the name of your chosen base training image:

      FROM quay.io/modh/<base-training-image>

      Examples:

      FROM quay.io/modh/ray:2.47.1-py311-cu121
      FROM quay.io/modh/training:py311-rocm62-torch251
    2. Use the RUN instruction to install additional packages. You can also add comments to the Dockerfile by prefixing each comment line with a number sign (#).

      The following example shows how to install a specific version of the Python PyTorch package:

      # Install PyTorch
      RUN python3 -m pip install torch==2.5.1
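Putting these instructions together, a complete Dockerfile might look like the following sketch; the base image tag is one of the examples above, and the installed package is illustrative:

```dockerfile
# Example Dockerfile for a custom training image.
# The base image tag and installed package version are illustrative.
FROM quay.io/modh/ray:2.47.1-py311-cu121

# Install PyTorch
RUN python3 -m pip install torch==2.5.1
```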
  4. Build the image. Use the -t option with the podman build command to create an image tag that specifies the custom image name and version, to make it easier to reference and manage the image:

    podman build -t <custom-image-name>:<version> -f Dockerfile

    Example:

    podman build -t ${IMG}:0.0.1 -f Dockerfile

    The build output indicates when the build process is complete.

  5. Display a list of your images:

    podman images

    If your new image was created successfully, it is included in the list of images.

  6. Push the image to your container registry:

    podman push ${IMG}:0.0.1
  7. Optional: Make your new image available to other users, as described in Pushing an image to the integrated OpenShift image registry.

2.3.3. Pushing an image to the integrated OpenShift image registry

To make an image available to other users in your OpenShift cluster, you can push the image to the integrated OpenShift image registry, a built-in container image registry.

For more information about the integrated OpenShift image registry, see Integrated OpenShift image registry.

Prerequisites

Procedure

  1. In a terminal window, log in to the OpenShift CLI (oc) as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Set the IMG environment variable to the name of your image. In the example commands in this section, my_training_image is the name of the image.

    export IMG=my_training_image
  3. Log in to the integrated image registry:

    podman login -u $(oc whoami) -p $(oc whoami -t) $(oc registry info)
  4. Tag the image for the integrated image registry:

    podman tag ${IMG} $(oc registry info)/$(oc project -q)/${IMG}
  5. Push the image to the integrated image registry:

    podman push $(oc registry info)/$(oc project -q)/${IMG}
  6. Retrieve the image repository location for the tag that you want:

    oc get is ${IMG} -o jsonpath='{.status.tags[?(@.tag=="<TAG>")].items[0].dockerImageReference}'

    Any user can now use your image by specifying this retrieved image location value in the image parameter of a Ray cluster or training job.
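As an illustration, the retrieved image location is the value that you supply in the image field of a training job. The following fragment sketches a Kubeflow PyTorchJob that references such an image; the job name, project name, and image tag are placeholders, and the Worker replica spec is omitted for brevity:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-training-job          # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              # Value returned by the oc get is command above (placeholder shown)
              image: image-registry.openshift-image-registry.svc:5000/<project>/my_training_image:0.0.1
```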
