Chapter 2. Preparing the distributed training environment


Before you run a distributed training or tuning job, prepare your training environment as follows:

  • Create a workbench with the appropriate workbench image. Review the list of packages in each workbench image to find the most suitable image for your distributed training workload.
  • Ensure that you have the credentials to authenticate to the OpenShift cluster.
  • Select a suitable training image. Choose from the list of base training images provided with Red Hat OpenShift AI, or create a custom training image.

For information about the workbench images and training images provided with Red Hat OpenShift AI, and their preinstalled packages, see the Supported Configurations for 3.x Knowledgebase article.

2.1. Creating a workbench for distributed training

Create a workbench with the appropriate resources to run a distributed training or tuning job.

Prerequisites

Procedure

  1. Log in to the Red Hat OpenShift AI web console.
  2. If you want to add the workbench to an existing project, open the project and proceed to the next step.

    If you want to add the workbench to a new project, create the project as follows:

    1. In the left navigation pane, click Projects, and click Create project.
    2. Enter a project name, and optionally a description, and click Create. The project details page opens, with the Overview tab selected by default.
  3. Create a workbench as follows:

    1. On the project details page, click the Workbenches tab, and click Create workbench.
    2. Enter a workbench name, and optionally a description.
    3. In the Workbench image section, from the Image selection list, select the appropriate image for your training or tuning job. If project-scoped images exist, the Image selection list includes subheadings to distinguish between global images and project-scoped images.

      For example, to run the example fine-tuning job described in Fine-tuning a model by using Kubeflow Training, select PyTorch.

    4. In the Deployment size section, from the Hardware profile list, select a suitable hardware profile for your workbench.

      If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.

      The hardware profile specifies the number of CPUs and the amount of memory allocated to the container, setting the guaranteed minimum (request) and maximum (limit) for both.

    5. If you want to change the default values, click Customize resource requests and limits and enter new minimum (request) and maximum (limit) values.
    6. In the Cluster storage section, click either Attach existing storage or Create storage to specify the storage details so that you can share data between the workbench and the training or tuning runs.

      For example, to run the example fine-tuning job described in Fine-tuning a model by using Kubeflow Training, specify a storage class with ReadWriteMany (RWX) capability.

    7. Review the storage configuration and click Create workbench.
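For reference, the cluster storage that the workbench attaches corresponds to a PersistentVolumeClaim. The following sketch shows a claim with the ReadWriteMany access mode that shared training data requires; the claim name, size, and storage class are illustrative placeholders, not values defined by this procedure:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-training-storage    # illustrative name
spec:
  accessModes:
    - ReadWriteMany                # RWX, so the workbench and training pods can share the volume
  resources:
    requests:
      storage: 100Gi               # illustrative size
  storageClassName: <rwx-storage-class>
```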

Verification

On the Workbenches tab, confirm that the workbench status changes from Starting to Running.

2.2. Using the cluster server and token to authenticate

To interact with the OpenShift cluster, you must authenticate to the OpenShift API by specifying the cluster server and token. You can find these values from the OpenShift Console.

Prerequisites

  • You can access the OpenShift Console.

Procedure

  1. Log in to the OpenShift Console.

    In the OpenShift AI top navigation bar, click the application launcher icon, and then click OpenShift Console.

  2. In the upper-right corner of the OpenShift Console, click your user name and click Copy login command.
  3. In the new tab that opens, log in as the user whose credentials you want to use.
  4. Click Display Token.
  5. In the Log in with this token section, find the required values as follows:

    • The token value is the text after the --token= prefix.
    • The server value is the text after the --server= prefix.
    Note

    The token and server values are security credentials; handle them with care.

    • Do not save the token and server details in a notebook file.
    • Do not store the token and server details in Git.

    The token expires after 24 hours.

  6. You can use the token and server details to authenticate in various ways, as shown in the following examples:

    • You can specify the values in a notebook cell:

      api_server = "<server>"
      token = "<token>"
    • You can log in to the OpenShift CLI (oc) by copying the entire Log in with this token command and pasting the command in a terminal window.

      $ oc login --token=<token> --server=<server>
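If you paste the copied login command into a notebook, you can also extract the two values programmatically instead of copying them by hand. A minimal sketch, where the token and server values are illustrative placeholders:

```python
import re

# Full command copied from the "Log in with this token" page.
# The token and server values below are illustrative placeholders.
login_command = (
    "oc login --token=sha256~EXAMPLETOKEN "
    "--server=https://api.cluster.example.com:6443"
)

# The token is the text after --token=; the server is the text after --server=.
token = re.search(r"--token=(\S+)", login_command).group(1)
api_server = re.search(r"--server=(\S+)", login_command).group(1)

print(token)       # sha256~EXAMPLETOKEN
print(api_server)  # https://api.cluster.example.com:6443
```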

2.3. Managing custom training images

To run distributed training jobs, you can use one of the base training images that are provided with OpenShift AI, or you can create your own custom training images. You can optionally push your custom training images to the integrated OpenShift image registry, to make your images available to other users.

2.3.1. About base training images

The base training images for distributed workloads are optimized with the tools and libraries that you need to run distributed training jobs. You can use the provided base images, or you can create custom images that are specific to your needs.

For information about Red Hat support of training images and packages, see Supported Configurations for 3.x.

The following table lists the training images that are installed with Red Hat OpenShift AI by default. These images are AMD64 images, which might not work on other architectures.

Table 2.1. Default training base images

Ray CUDA

If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Ray Compute Unified Device Architecture (CUDA) base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs.

Ray ROCm

If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.

KFTO CUDA

If you are working with compute-intensive models and you want to accelerate the training job with NVIDIA GPU support, you can use the Kubeflow Training Operator CUDA base image to gain access to the NVIDIA CUDA Toolkit. Using this toolkit, you can accelerate your work by using libraries and tools that are optimized for NVIDIA GPUs.

KFTO ROCm

If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Kubeflow Training Operator ROCm base image to gain access to the AMD ROCm software stack. Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.

If the preinstalled packages that are provided in these images are not sufficient for your use case, you have the following options:

  • Install additional libraries after launching a default image. This option is good if you want to add libraries on an ad hoc basis as you run training jobs. However, it can be challenging to manage the dependencies of installed libraries.
  • Create a custom image that includes the additional libraries or packages. For more information, see Creating a custom training image.

2.3.2. Creating a custom training image

You can create a custom training image by adding packages to a base training image.

Prerequisites

  • You can access the training image that you have chosen to use as the base for your custom image.

    Select the image based on the image type (for example, Ray or Kubeflow Training Operator), the accelerator framework (for example, CUDA for NVIDIA GPUs, or ROCm for AMD GPUs), and the Python version (for example, 3.9 or 3.11).

    The following table shows some example base training images:

    Table 2.2. Example base training images

    Image type | Accelerator framework | Python version | Example base training image | Preinstalled packages
    Ray | CUDA | 3.9 | ray:2.35.0-py39-cu121 | Ray 2.35.0, Python 3.9, CUDA 12.1
    Ray | CUDA | 3.11 | ray:2.47.1-py311-cu121 | Ray 2.47.1, Python 3.11, CUDA 12.1
    Ray | ROCm | 3.9 | ray:2.35.0-py39-rocm62 | Ray 2.35.0, Python 3.9, ROCm 6.2
    Ray | ROCm | 3.11 | ray:2.47.1-py311-rocm62 | Ray 2.47.1, Python 3.11, ROCm 6.2
    KFTO | CUDA | 3.11 | training:py311-cuda124-torch251 | Python 3.11, CUDA 12.4, PyTorch 2.5.1
    KFTO | ROCm | 3.11 | training:py311-rocm62-torch251 | Python 3.11, ROCm 6.2, PyTorch 2.5.1

    For a complete list of the OpenShift AI base training images and their preinstalled packages, see Supported Configurations for 3.x.

  • You have Podman installed in your local environment, and you can access a container registry.

    For more information about Podman and container registries, see Building, running, and managing containers.

Procedure

  1. In a terminal window, create a directory for your work, and change to that directory.
  2. Set the IMG environment variable to the name of your custom image. In the example commands in this section, my_training_image is the name of the custom image.

    export IMG=my_training_image
  3. Create a file named Dockerfile with the following content:

    1. Use the FROM instruction to specify the location of a suitable base training image.

      In the following command, replace <base-training-image> with the name of your chosen base training image:

      FROM quay.io/modh/<base-training-image>

      Examples:

      FROM quay.io/modh/ray:2.47.1-py311-cu121
      FROM quay.io/modh/training:py311-rocm62-torch251
    2. Use the RUN instruction to install additional packages. You can also add comments to the Dockerfile by prefixing each comment line with a number sign (#).

      The following example shows how to install a specific version of the Python PyTorch package:

      # Install PyTorch
      RUN python3 -m pip install torch==2.5.1
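Putting these instructions together, a complete Dockerfile might look like the following sketch; the base image tag is one of the examples above, and the installed package is illustrative:

```dockerfile
# Example Dockerfile for a custom training image.
# The base image tag and installed package version are illustrative.
FROM quay.io/modh/ray:2.47.1-py311-cu121

# Install PyTorch
RUN python3 -m pip install torch==2.5.1
```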
  4. Build the image. Use the -t option with the podman build command to create an image tag that specifies the custom image name and version, to make it easier to reference and manage the image:

    podman build -t <custom-image-name>:<version> -f Dockerfile

    Example:

    podman build -t ${IMG}:0.0.1 -f Dockerfile

    The build output indicates when the build process is complete.

  5. Display a list of your images:

    podman images

    If your new image was created successfully, it is included in the list of images.

  6. Push the image to your container registry:

    podman push ${IMG}:0.0.1
  7. Optional: Make your new image available to other users, as described in Pushing an image to the integrated OpenShift image registry.

2.3.3. Pushing an image to the integrated OpenShift image registry

To make an image available to other users in your OpenShift cluster, you can push the image to the integrated OpenShift image registry, a built-in container image registry.

For more information about the integrated OpenShift image registry, see Integrated OpenShift image registry.

Prerequisites

Procedure

  1. In a terminal window, log in to the OpenShift CLI (oc) as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Set the IMG environment variable to the name of your image. In the example commands in this section, my_training_image is the name of the image.

    export IMG=my_training_image
  3. Log in to the integrated image registry:

    podman login -u $(oc whoami) -p $(oc whoami -t) $(oc registry info)
  4. Tag the image for the integrated image registry:

    podman tag ${IMG} $(oc registry info)/$(oc project -q)/${IMG}
  5. Push the image to the integrated image registry:

    podman push $(oc registry info)/$(oc project -q)/${IMG}
  6. Retrieve the image repository location for the tag that you want:

    oc get is ${IMG} -o jsonpath='{.status.tags[?(@.tag=="<TAG>")].items[0].dockerImageReference}'

    Any user can now use your image by specifying this retrieved image location value in the image parameter of a Ray cluster or training job.
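As an illustration, the retrieved image location is the value that you supply in the image field of a training job. The following fragment sketches a Kubeflow PyTorchJob that references such an image; the job name, project name, and image tag are placeholders, and the Worker replica spec is omitted for brevity:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-training-job          # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              # Value returned by the oc get is command above (placeholder shown)
              image: image-registry.openshift-image-registry.svc:5000/<project>/my_training_image:0.0.1
```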
