Chapter 2. Preparing the distributed training environment
Before you run a distributed training or tuning job, prepare your training environment as follows:
- Create a workbench with the appropriate workbench image. Review the list of packages in each workbench image to find the most suitable image for your distributed training workload.
- Ensure that you have the credentials to authenticate to the OpenShift cluster.
- Select a suitable training image. Choose from the list of base training images provided with Red Hat OpenShift AI, or create a custom training image.
For information about the workbench images and training images provided with Red Hat OpenShift AI, and their preinstalled packages, see the Supported Configurations for 3.x Knowledgebase article.
2.1. Creating a workbench for distributed training
Create a workbench with the appropriate resources to run a distributed training or tuning job.
Prerequisites
- You can access an OpenShift cluster that has sufficient worker nodes with supported accelerators to run your training or tuning job.
- Your cluster administrator has configured the cluster as follows:
- Installed and activated the Red Hat build of Kueue Operator, as described in Configuring workload management with Kueue.
- Installed Red Hat OpenShift AI with the required distributed training components, as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- Configured the distributed training resources, as described in Managing distributed workloads.
- Configured supported accelerators, as described in Working with accelerators.
Procedure
- Log in to the Red Hat OpenShift AI web console.
If you want to add the workbench to an existing project, open the project and proceed to the next step.
If you want to add the workbench to a new project, create the project as follows:
- In the left navigation pane, click Projects, and click Create project.
- Enter a project name, and optionally a description, and click Create. The project details page opens, with the Overview tab selected by default.
Create a workbench as follows:
- On the project details page, click the Workbench tab, and click Create workbench.
- Enter a workbench name, and optionally a description.
In the Workbench image section, from the Image selection list, select the appropriate image for your training or tuning job. If project-scoped images exist, the Image selection list includes subheadings to distinguish between global images and project-scoped images.
For example, to run the example fine-tuning job described in Fine-tuning a model by using Kubeflow Training, select PyTorch.
In the Deployment size section, from the Hardware profile list, select a suitable hardware profile for your workbench.
If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.
The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both.
- If you want to change the default values, click Customize resource requests and limit and enter new minimum (request) and maximum (limit) values.
In the Cluster storage section, click either Attach existing storage or Create storage to specify the storage details so that you can share data between the workbench and the training or tuning runs.
For example, to run the example fine-tuning job described in Fine-tuning a model by using Kubeflow Training, specify a storage class with ReadWriteMany (RWX) capability.
- Review the storage configuration and click Create workbench.
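For context, the request (guaranteed minimum) and limit (maximum) values that the hardware profile applies in the Deployment size step correspond to a standard Kubernetes resources block on the workbench container. The following fragment is an illustration with example values, not output generated by OpenShift AI:

```yaml
resources:
  requests:    # guaranteed minimum
    cpu: "2"
    memory: 8Gi
  limits:      # enforced maximum
    cpu: "4"
    memory: 16Gi
```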
Verification
On the Workbenches tab, the status changes from Starting to Running.
2.2. Using the cluster server and token to authenticate
To interact with the OpenShift cluster, you must authenticate to the OpenShift API by specifying the cluster server and token. You can find these values from the OpenShift Console.
Prerequisites
- You can access the OpenShift Console.
Procedure
- Log in to the OpenShift Console. To open the OpenShift Console, in the OpenShift AI top navigation bar, click the application launcher icon, and then click OpenShift Console.
- In the upper-right corner of the OpenShift Console, click your user name and click Copy login command.
- In the new tab that opens, log in as the user whose credentials you want to use.
- Click Display Token.
In the Log in with this token section, find the required values as follows:
- The token value is the text after the --token= prefix.
- The server value is the text after the --server= prefix.

Note: The token and server values are security credentials; treat them with care.
- Do not save the token and server details in a notebook file.
- Do not store the token and server details in Git.

The token expires after 24 hours.
You can use the token and server details to authenticate in various ways, as shown in the following examples:

- You can specify the values in a notebook cell:

  api_server = "<server>"
  token = "<token>"

- You can log in to the OpenShift CLI (oc) by copying the entire Log in with this token command and pasting the command in a terminal window:

  $ oc login --token=<token> --server=<server>
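If you paste the full login command into a notebook, you can also extract the token and server values from it programmatically. The following is a minimal sketch using only the Python standard library; the command string and token value shown are hypothetical examples:

```python
import re

def parse_login_command(cmd: str) -> dict:
    """Extract the --token and --server values from a copied 'oc login' command."""
    token = re.search(r"--token=(\S+)", cmd)
    server = re.search(r"--server=(\S+)", cmd)
    return {
        "token": token.group(1) if token else None,
        "server": server.group(1) if server else None,
    }

# Hypothetical example values; a real token looks like "sha256~..."
cmd = "oc login --token=sha256~EXAMPLE --server=https://api.cluster.example.com:6443"
creds = parse_login_command(cmd)
```

Remember that the extracted values are security credentials: do not persist them in the notebook file or in Git.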
2.3. Managing custom training images
To run distributed training jobs, you can use one of the base training images that are provided with OpenShift AI, or you can create your own custom training images. You can optionally push your custom training images to the integrated OpenShift image registry, to make your images available to other users.
2.3.1. About base training images
The base training images for distributed workloads are optimized with the tools and libraries that you need to run distributed training jobs. You can use the provided base images, or you can create custom images that are specific to your needs.
For information about Red Hat support of training images and packages, see Supported Configurations for 3.x.
The following table lists the training images that are installed with Red Hat OpenShift AI by default. These images are AMD64 images, which might not work on other architectures.
| Image type | Description |
|---|---|
| Ray CUDA | If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Ray Compute Unified Device Architecture (CUDA) base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs. |
| Ray ROCm | If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs. |
| KFTO CUDA | If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Kubeflow Training Operator CUDA base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs. |
| KFTO ROCm | If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Kubeflow Training Operator ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs. |
If the preinstalled packages that are provided in these images are not sufficient for your use case, you have the following options:
- Install additional libraries after launching a default image. This option is good if you want to add libraries on an ad hoc basis as you run training jobs. However, it can be challenging to manage the dependencies of installed libraries.
- Create a custom image that includes the additional libraries or packages. For more information, see Creating a custom training image.
2.3.2. Creating a custom training image
You can create a custom training image by adding packages to a base training image.
Prerequisites
- You can access the training image that you have chosen to use as the base for your custom image.
Select the image based on the image type (for example, Ray or Kubeflow Training Operator), the accelerator framework (for example, CUDA for NVIDIA GPUs, or ROCm for AMD GPUs), and the Python version (for example, 3.9 or 3.11).
The following table shows some example base training images:
Table 2.2. Example base training images

| Image type | Accelerator framework | Python version | Example base training image | Preinstalled packages |
|---|---|---|---|---|
| Ray | CUDA | 3.9 | ray:2.35.0-py39-cu121 | Ray 2.35.0, Python 3.9, CUDA 12.1 |
| Ray | CUDA | 3.11 | ray:2.47.1-py311-cu121 | Ray 2.47.1, Python 3.11, CUDA 12.1 |
| Ray | ROCm | 3.9 | ray:2.35.0-py39-rocm62 | Ray 2.35.0, Python 3.9, ROCm 6.2 |
| Ray | ROCm | 3.11 | ray:2.47.1-py311-rocm62 | Ray 2.47.1, Python 3.11, ROCm 6.2 |
| KFTO | CUDA | 3.11 | training:py311-cuda124-torch251 | Python 3.11, CUDA 12.4, PyTorch 2.5.1 |
| KFTO | ROCm | 3.11 | training:py311-rocm62-torch251 | Python 3.11, ROCm 6.2, PyTorch 2.5.1 |
For a complete list of the OpenShift AI base training images and their preinstalled packages, see Supported Configurations for 3.x.
- You have Podman installed in your local environment, and you can access a container registry.
For more information about Podman and container registries, see Building, running, and managing containers.
Procedure
- In a terminal window, create a directory for your work, and change to that directory.
- Set the IMG environment variable to the name of your custom image. In the example commands in this section, my_training_image is the name of the custom image.

  export IMG=my_training_image

- Create a file named Dockerfile with the following content:

  - Use the FROM instruction to specify the location of a suitable base training image. In the following command, replace <base-training-image> with the name of your chosen base training image:

    FROM quay.io/modh/<base-training-image>

    Examples:

    FROM quay.io/modh/ray:2.47.1-py311-cu121
    FROM quay.io/modh/training:py311-rocm62-torch251

  - Use the RUN instruction to install additional packages. You can also add comments to the Dockerfile by prefixing each comment line with a number sign (#). The following example shows how to install a specific version of the Python PyTorch package:

    # Install PyTorch
    RUN python3 -m pip install torch==2.5.1

- Build the image file. Use the -t option with the podman build command to create an image tag that specifies the custom image name and version, to make it easier to reference and manage the image:

  podman build -t <custom-image-name>:<version> -f Dockerfile

  Example:

  podman build -t ${IMG}:0.0.1 -f Dockerfile

  The build output indicates when the build process is complete.

- Display a list of your images:

  podman images

  If your new image was created successfully, it is included in the list of images.

- Push the image to your container registry:

  podman push ${IMG}:0.0.1

- Optional: Make your new image available to other users, as described in Pushing an image to the integrated OpenShift image registry.
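Putting the FROM and RUN instructions together, a complete Dockerfile might look like the following minimal sketch. It uses the Ray CUDA base image and PyTorch version from the examples in this section; adjust both for your workload:

```dockerfile
# Base training image provided with OpenShift AI (Ray 2.47.1, Python 3.11, CUDA 12.1)
FROM quay.io/modh/ray:2.47.1-py311-cu121

# Install PyTorch
RUN python3 -m pip install torch==2.5.1
```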
2.3.3. Pushing an image to the integrated OpenShift image registry
To make an image available to other users in your OpenShift cluster, you can push the image to the integrated OpenShift image registry, a built-in container image registry.
For more information about the integrated OpenShift image registry, see Integrated OpenShift image registry.
Prerequisites
- Your cluster administrator has exposed the integrated image registry, as described in Exposing the registry.
- You have Podman installed in your local environment.
For more information about Podman and container registries, see Building, running, and managing containers.
Procedure
- In a terminal window, log in to the OpenShift CLI (oc) as shown in the following example:

  $ oc login <openshift_cluster_url> -u <admin_username> -p <password>

- Set the IMG environment variable to the name of your image. In the example commands in this section, my_training_image is the name of the image.

  export IMG=my_training_image

- Log in to the integrated image registry:

  podman login -u $(oc whoami) -p $(oc whoami -t) $(oc registry info)

- Tag the image for the integrated image registry:

  podman tag ${IMG} $(oc registry info)/$(oc project -q)/${IMG}

- Push the image to the integrated image registry:

  podman push $(oc registry info)/$(oc project -q)/${IMG}

- Retrieve the image repository location for the tag that you want:

  oc get is ${IMG} -o jsonpath='{.status.tags[?(@.tag=="<TAG>")].items[0].dockerImageReference}'

Any user can now use your image by specifying this retrieved image location value in the image parameter of a Ray cluster or training job.
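The retrieved dockerImageReference value has the form <registry>/<namespace>/<name>:<tag>. As a minimal illustration of how such a reference is assembled (the registry hostname shown is the default internal service name for the integrated registry; the project name and tag are hypothetical):

```python
def registry_image_ref(registry: str, namespace: str, name: str, tag: str) -> str:
    """Build a fully qualified image reference for the integrated registry."""
    return f"{registry}/{namespace}/{name}:{tag}"

# "image-registry.openshift-image-registry.svc:5000" is the default internal
# service hostname of the integrated registry; project and tag are examples.
image = registry_image_ref(
    "image-registry.openshift-image-registry.svc:5000",
    "my-project",
    "my_training_image",
    "0.0.1",
)
```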