Chapter 4. Running Training Operator-based distributed training workloads
To reduce the time needed to train a Large Language Model (LLM), you can run the training job in parallel. In Red Hat OpenShift AI, the Kubeflow Training Operator and Kubeflow Training Operator Python Software Development Kit (Training Operator SDK) simplify the job configuration.
You can use the Training Operator and the Training Operator SDK to configure a training job in a variety of ways. For example, you can use multiple nodes and multiple GPUs per node, fine-tune a model, or configure a training job to use Remote Direct Memory Access (RDMA).
4.1. Using the Kubeflow Training Operator to run distributed training workloads
You can use the Training Operator PyTorchJob API to configure a PyTorchJob resource so that the training job runs on multiple nodes with multiple GPUs.
You can store the training script in a ConfigMap resource, or include it in a custom container image.
4.1.1. Creating a Training Operator PyTorch training script ConfigMap resource
You can create a ConfigMap resource to store the Training Operator PyTorch training script.
Alternatively, you can use the example Dockerfile to include the training script in a custom container image, as described in Creating a custom training image.
Prerequisites
- Your cluster administrator has installed Red Hat OpenShift AI with the required distributed training components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- You can access the OpenShift Console for the cluster where OpenShift AI is installed.
Procedure
- Log in to the OpenShift Console.
Create a ConfigMap resource, as follows:
- In the Administrator perspective, click Workloads → ConfigMaps.
- From the Project list, select your project.
- Click Create ConfigMap.
- In the Configure via section, select the YAML view option.
The Create ConfigMap page opens, with default YAML code automatically added.
- Replace the default YAML code with your training-script code.
For example training scripts, see Example Training Operator PyTorch training scripts.
- Click Create.
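Assuming a script file named train.py and the ConfigMap name used later in this chapter (training-script-configmap), the resulting resource might look like the following sketch; the namespace placeholder is yours to fill in:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-script-configmap
  namespace: <your-namespace>
data:
  train.py: |
    # your training script content
```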
Verification
- In the OpenShift Console, in the Administrator perspective, click Workloads → ConfigMaps.
- From the Project list, select your project.
- Click your ConfigMap resource to display the training script details.
4.1.2. Creating a Training Operator PyTorchJob resource
You can create a PyTorchJob resource to run the Training Operator PyTorch training script.
Prerequisites
- You can access an OpenShift cluster that has multiple worker nodes with supported NVIDIA GPUs or AMD GPUs.
Your cluster administrator has configured the cluster as follows:
- Installed Red Hat OpenShift AI with the required distributed training components, as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- Configured the distributed training resources, as described in Managing distributed workloads.
- You can access a workbench that is suitable for distributed training, as described in Creating a workbench for distributed training.
You have administrator access for the project.
- If you created the project, you automatically have administrator access.
- If you did not create the project, your cluster administrator must give you administrator access.
Procedure
- Log in to the OpenShift Console.
Create a PyTorchJob resource, as follows:
- In the Administrator perspective, click Home → Search.
- From the Project list, select your project.
- Click the Resources list, and in the search field, start typing PyTorchJob.
- Select PyTorchJob, and click Create PyTorchJob. The Create PyTorchJob page opens, with default YAML code automatically added.
- Update the metadata to replace the name and namespace values with the values for your environment, as shown in the following example:

  ```yaml
  metadata:
    name: pytorch-multi-node-job
    namespace: test-namespace
  ```

- Configure the master node. In the replicas entry, specify 1. Only one master node is needed.
- To use a ConfigMap resource to provide the training script for the PyTorchJob pods, add the ConfigMap volume mount information.
- Add the appropriate resource constraints for your environment.
- Make similar edits in the Worker section of the PyTorchJob resource, and update the replicas entry to specify the number of worker nodes.
For a complete example PyTorchJob resource, see Example Training Operator PyTorchJob resource for multi-node training.
- Click Create.
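Within each replica's pod template, the ConfigMap volume mount and resource constraints described above might look like the following sketch; the container name, mount path, and resource amounts are illustrative, not prescriptive:

```yaml
containers:
  - name: pytorch
    volumeMounts:
      - name: training-script
        mountPath: /workspace
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: 1
volumes:
  - name: training-script
    configMap:
      name: training-script-configmap
```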
Verification
- In the OpenShift Console, open the Administrator perspective.
- From the Project list, select your project.
- Click Home → Search, search for PyTorchJob, and verify that the job was created.
- Click Workloads → Pods and verify that the requested head pod and worker pods are running.
4.1.3. Creating a Training Operator PyTorchJob resource by using the CLI
You can use the OpenShift CLI (oc) to create a PyTorchJob resource to run the Training Operator PyTorch training script.
Prerequisites
- You can access an OpenShift cluster that has multiple worker nodes with supported NVIDIA GPUs or AMD GPUs.
Your cluster administrator has configured the cluster as follows:
- Installed Red Hat OpenShift AI with the required distributed training components, as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- Configured the distributed training resources, as described in Managing distributed workloads.
- You can access a workbench that is suitable for distributed training, as described in Creating a workbench for distributed training.
You have administrator access for the project.
- If you created the project, you automatically have administrator access.
- If you did not create the project, your cluster administrator must give you administrator access.
You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
- Installing the OpenShift CLI for OpenShift Container Platform
- Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
Procedure
Log in to the OpenShift CLI (oc), as follows:

Logging in to the OpenShift CLI (oc)

```shell
oc login --token=<token> --server=<server>
```

For information about how to find the server and token details, see Using the cluster server and token to authenticate.
Create a file named train.py and populate it with your training script, as follows:

Creating the training script

```shell
cat <<EOF > train.py
<paste your content here>
EOF
```

Replace <paste your content here> with your training script content.
For example training scripts, see Example Training Operator PyTorch training scripts.
Create a
ConfigMapresource to store the training script, as follows:Creating the ConfigMap resource
oc create configmap training-script-configmap --from-file=train.py -n <your-namespace>
oc create configmap training-script-configmap --from-file=train.py -n <your-namespace>Copy to Clipboard Copied! Toggle word wrap Toggle overflow Replace <your-namespace> with the name of your project.
Create a file named pytorchjob.yaml to define the distributed training job setup, as follows:

Defining the distributed training job

```shell
cat <<EOF > pytorchjob.yaml
<paste your content here>
EOF
```

Replace <paste your content here> with your training job content.
For an example training job, see Example Training Operator PyTorchJob resource for multi-node training.
Create the distributed training job, as follows:

Creating the distributed training job

```shell
oc apply -f pytorchjob.yaml
```
Verification
Monitor the running distributed training job, as follows:

Monitoring the distributed training job

```shell
oc get pytorchjobs -n <your-namespace>
```

Replace <your-namespace> with the name of your project.
Check the pod logs, as follows:

Checking the pod logs

```shell
oc logs <pod-name> -n <your-namespace>
```

Replace <pod-name> with the name of the pod, and <your-namespace> with the name of your project.
When you want to delete the job, run the following command:

Deleting the job

```shell
oc delete pytorchjobs/pytorch-multi-node-job -n <your-namespace>
```

Replace <your-namespace> with the name of your project.
4.1.4. Example Training Operator PyTorch training scripts
The following examples show how to configure a PyTorch training script for NVIDIA Collective Communications Library (NCCL), Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP) training jobs.
If you have the required resources, you can run the example code without editing it.
Alternatively, you can modify the example code to specify the appropriate configuration for your training job.
4.1.4.1. Example Training Operator PyTorch training script: NCCL
This NVIDIA Collective Communications Library (NCCL) example returns the rank and tensor value for each accelerator.
The backend value is automatically set to one of the following values:
- nccl: Uses NVIDIA Collective Communications Library (NCCL) for NVIDIA GPUs or ROCm Communication Collectives Library (RCCL) for AMD GPUs
- gloo: Uses Gloo for CPUs
Specify backend="nccl" for both NVIDIA GPUs and AMD GPUs.
For AMD GPUs, even though the backend value is set to nccl, the ROCm environment uses RCCL for communication.
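A minimal sketch of such a script, assuming the standard torch.distributed API and the RANK and WORLD_SIZE environment variables that the distributed launcher injects into each pod:

```python
import os

def main():
    # Imports live inside the function so the file can also be shipped
    # as a self-contained train_func through the Training Operator SDK.
    import torch
    import torch.distributed as dist

    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # "nccl" covers NVIDIA GPUs and, on ROCm, is routed to RCCL;
    # "gloo" is the CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    if torch.cuda.is_available():
        device = torch.device("cuda", rank % torch.cuda.device_count())
    else:
        device = torch.device("cpu")
    tensor = torch.ones(1, device=device) * rank

    # Sum the per-rank tensors so every rank can report the combined value.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"rank={rank}/{world_size}: tensor={tensor.item()}")

    dist.destroy_process_group()

# The launcher (torchrun or the PyTorchJob pods) sets WORLD_SIZE;
# outside a launcher the function is only defined, not run.
if __name__ == "__main__" and "WORLD_SIZE" in os.environ:
    main()
```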
4.1.4.2. Example Training Operator PyTorch training script: DDP
This example shows how to configure a training script for a Distributed Data Parallel (DDP) training job.
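A minimal sketch of a DDP training script, assuming a toy linear model and random data; the model, batch sizes, and step count are illustrative:

```python
import os

def train():
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")

    # Wrap the model so gradients are averaged across ranks on backward().
    model = nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()   # gradient all-reduce happens here
        optimizer.step()
        if rank == 0:
            print(f"step={step} loss={loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__" and "WORLD_SIZE" in os.environ:
    train()
```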
4.1.4.3. Example Training Operator PyTorch training script: FSDP
This example shows how to configure a training script for a Fully Sharded Data Parallel (FSDP) training job.
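A minimal FSDP sketch, again with an illustrative toy model; in a real job you would typically add an auto-wrap policy and mixed-precision settings appropriate to your model:

```python
import os

def train():
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if use_cuda:
        torch.cuda.set_device(local_rank)

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so each rank holds only a fraction of the full model.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
    fsdp_model = FSDP(model.cuda() if use_cuda else model)
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)

    for step in range(10):
        inputs = torch.randn(16, 128)
        targets = torch.randint(0, 10, (16,))
        if use_cuda:
            inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(fsdp_model(inputs), targets)
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step={step} loss={loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__" and "WORLD_SIZE" in os.environ:
    train()
```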
4.1.5. Example Dockerfile for a Training Operator PyTorch training script
You can use this example Dockerfile to include the training script in a custom training image.
```dockerfile
FROM registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0
WORKDIR /workspace
COPY train.py /workspace/train.py
CMD ["python", "train.py"]
```
This example copies the training script to the default PyTorch image, and runs the script.
For more information about how to use this Dockerfile to include the training script in a custom container image, see Creating a custom training image.
4.1.6. Example Training Operator PyTorchJob resource for multi-node training
This example shows how to create a Training Operator PyTorch training job that runs on multiple nodes with multiple GPUs.
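A skeleton of such a resource, assuming one master and three workers, one NVIDIA GPU per replica, and the ConfigMap-mounted training script from the earlier procedure; adjust names, replica counts, and resources for your environment:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-multi-node-job
  namespace: test-namespace
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0
              command: ["python", "/workspace/train.py"]
              volumeMounts:
                - name: training-script
                  mountPath: /workspace
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: training-script
              configMap:
                name: training-script-configmap
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0
              command: ["python", "/workspace/train.py"]
              volumeMounts:
                - name: training-script
                  mountPath: /workspace
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: training-script
              configMap:
                name: training-script-configmap
```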
4.2. Using the Training Operator SDK to run distributed training workloads
You can use the Training Operator SDK to configure a distributed training job to run on multiple nodes with multiple accelerators per node.
You can configure the PyTorchJob resource so that the training job runs on multiple nodes with multiple GPUs.
4.2.1. Configuring a training job by using the Training Operator SDK
Before you can run a job to train a model, you must configure the training job. You must set the training parameters, define the training function, and configure the Training Operator SDK.
The code in this procedure specifies how to configure an example training job. If you have the specified resources, you can run the example code without editing it.
Alternatively, you can modify the example code to specify the appropriate configuration for your training job.
Prerequisites
- You can access an OpenShift cluster that has sufficient worker nodes with supported accelerators to run your training or tuning job.
- You can access a workbench that is suitable for distributed training, as described in Creating a workbench for distributed training.
You have administrator access for the project.
- If you created the project, you automatically have administrator access.
- If you did not create the project, your cluster administrator must give you administrator access.
Procedure
Open the workbench, as follows:
- Log in to the Red Hat OpenShift AI web console.
- Click Projects and click your project.
- Click the Workbenches tab.
- If your workbench is not already running, start the workbench.
- Click the Open link to open the IDE in a new window.
- Click File → New Notebook.
- Create the training function as shown in the following example:
Create a cell with the following content:
Example training function
Note: For this example training job, you do not need to install any additional packages or set any training parameters.
For more information about how to add additional packages and set the training parameters, see Configuring the fine-tuning job.
- Optional: Edit the content to specify the appropriate values for your environment.
- Run the cell to create the training function.
Configure the Training Operator SDK client authentication as follows:
Create a cell with the following content:
Example Training Operator SDK client authentication
Edit the api_server and token parameters to enter the values to authenticate to your OpenShift cluster.
For information on how to find the server and token details, see Using the cluster server and token to authenticate.
- Run the cell to configure the Training Operator SDK client authentication.
- Click File > Save Notebook As, enter an appropriate file name, and click Save.
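The training-function and client-authentication cells might look like the following sketch. The all-reduce body of train_func and the TrainingClient client_configuration pattern are assumptions for illustration; the SDK serializes train_func and runs it inside each PyTorchJob pod, which is why all of its imports live inside the function body:

```python
def train_func():
    # Self-contained: the Training Operator SDK ships this function to
    # each PyTorchJob pod, so imports must live inside it.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    tensor = torch.ones(1) * rank
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"rank={rank}: sum of ranks = {tensor.item()}")
    dist.destroy_process_group()


def make_client(api_server: str, token: str):
    # Authenticate the Training Operator SDK client with a bearer token.
    from kubernetes import client as k8s
    from kubeflow.training import TrainingClient

    configuration = k8s.Configuration()
    configuration.host = api_server
    configuration.api_key = {"authorization": f"Bearer {token}"}
    # configuration.verify_ssl = False  # only if the API server uses an untrusted CA
    return TrainingClient(client_configuration=configuration)
```

In the notebook, you would then run `client = make_client(api_server, token)` with the values for your cluster.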
Verification
- All cells run successfully.
4.2.2. Running a training job by using the Training Operator SDK
When you run a training job to tune a model, you must specify the resources needed, and provide any authorization credentials required.
The code in this procedure specifies how to run the example training job. If you have the specified resources, you can run the example code without editing it.
Alternatively, you can modify the example code to specify the appropriate details for your training job.
Prerequisites
- You can access an OpenShift cluster that has sufficient worker nodes with supported accelerators to run your training or tuning job.
- You can access a workbench that is suitable for distributed training, as described in Creating a workbench for distributed training.
You have administrator access for the project.
- If you created the project, you automatically have administrator access.
- If you did not create the project, your cluster administrator must give you administrator access.
- You have enabled your project for Kueue management by applying the kueue.openshift.io/managed=true label to the project namespace.
- You have created resource flavor, cluster queue, and local queue Kueue objects for your project. For more information about creating these objects, see Configuring quota management for distributed workloads.
- You have access to a model.
- You have access to data that you can use to train the model.
- You have configured the training job as described in Configuring a training job by using the Training Operator SDK.
Procedure
Open the workbench, as follows:
- Log in to the Red Hat OpenShift AI web console.
- Click Projects and click your project.
- Click the Workbenches tab. If your workbench is not already running, start the workbench.
- Click the Open link to open the IDE in a new window.
- Click File → Open, and open the Jupyter notebook that you used to configure the training job.
- Create a cell to run the job, and add the following content:
Edit the content to specify the appropriate values for your environment, as follows:
- Edit the num_workers value to specify the number of worker nodes.
- Update the resources_per_worker values according to the job requirements and the resources available.
- Edit the value of the kueue.x-k8s.io/queue-name label to match the name of your target LocalQueue.
- The example provided is for NVIDIA GPUs. If you use AMD accelerators, make the following additional changes:
  - In the resources_per_worker entry, change nvidia.com/gpu to amd.com/gpu.
  - Change the base_image value to the corresponding ROCm-based training image for your environment. For a list of supported training images, see Supported Configurations.
  - Remove the NCCL_DEBUG entry.
If the job_kind value is not explicitly set, the TrainingClient API automatically sets the job_kind value to PyTorchJob.
- Run the cell to run the job.
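The job-submission cell might look like the following sketch, assuming the create_job parameters named in the procedure (num_workers, resources_per_worker, base_image, the Kueue queue-name label, NCCL_DEBUG); the queue name local-queue-test and the resource amounts are hypothetical placeholders:

```python
def train_func():
    # Placeholder: use the training function defined earlier in the notebook.
    print("training...")


def run_job(client):
    # client is a configured TrainingClient instance.
    client.create_job(
        job_kind="PyTorchJob",   # also the default when not set explicitly
        name="pytorch-ddp",
        train_func=train_func,
        num_workers=2,
        resources_per_worker={"nvidia.com/gpu": 1, "cpu": 4, "memory": "8Gi"},
        base_image="registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0",
        labels={"kueue.x-k8s.io/queue-name": "local-queue-test"},  # hypothetical queue name
        env_vars={"NCCL_DEBUG": "INFO"},
    )
```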
Verification
View the progress of the job as follows:
Create a cell with the following content:
```python
client.get_job_logs(
    name="pytorch-ddp",
    job_kind="PyTorchJob",
    follow=True,
)
```

- Run the cell to view the job progress.
4.3. Fine-tuning a model by using Kubeflow Training
Supervised fine-tuning (SFT) is the process of customizing a Large Language Model (LLM) for a specific task by using labeled data. In this example, you use the Kubeflow Training Operator and Kubeflow Training Operator Python Software Development Kit (Training Operator SDK) to run supervised fine-tuning of an LLM in Red Hat OpenShift AI, by using the Hugging Face SFT Trainer.
Optionally, you can use Low-Rank Adaptation (LoRA) to efficiently fine-tune large language models. LoRA optimizes computational requirements and reduces memory footprint, enabling you to fine-tune on consumer-grade GPUs. With SFT, you can combine PyTorch Fully Sharded Data Parallel (FSDP) and LoRA to enable scalable, cost-effective model training and inference, enhancing the flexibility and performance of AI workloads within OpenShift environments.
4.3.1. Configuring the fine-tuning job
Before you can use a training job to fine-tune a model, you must configure the training job. You must set the training parameters, define the training function, and configure the Training Operator SDK.
The code in this procedure specifies how to configure an example fine-tuning job. If you have the specified resources, you can run the example code without editing it.
Alternatively, you can modify the example code to specify the appropriate configuration for your fine-tuning job.
Prerequisites
You can access an OpenShift cluster that has sufficient worker nodes with supported accelerators to run your training or tuning job.
The example fine-tuning job requires 8 worker nodes, where each worker node has 64 GiB memory, 4 CPUs, and 1 NVIDIA GPU.
- You can access a workbench that is suitable for distributed training, as described in Creating a workbench for distributed training.
- You can access a dynamic storage provisioner that supports ReadWriteMany (RWX) Persistent Volume Claim (PVC) provisioning, such as Red Hat OpenShift Data Foundation.
You have administrator access for the project.
- If you created the project, you automatically have administrator access.
- If you did not create the project, your cluster administrator must give you administrator access.
Procedure
Open the workbench, as follows:
- Log in to the Red Hat OpenShift AI web console.
- Click Projects and click your project.
- Click the Workbenches tab.
- Ensure that the workbench uses a storage class with RWX capability.
- If your workbench is not already running, start the workbench.
- Click the Open link to open the IDE in a new window.
- Click File → New Notebook.
- Install any additional packages that are needed to run the training or tuning job.
In a notebook cell, add the code to install the additional packages, as follows:
Code to install dependencies
```python
# Install the yamlmagic package
!pip install yamlmagic
%load_ext yamlmagic

!pip install git+https://github.com/kubeflow/trainer.git@release-1.9#subdirectory=sdk/python
```

Select the cell, and click Run > Run selected cell.
The additional packages are installed.
Set the training parameters as follows:
Create a cell with the following content:
- Optional: If you specify a different model or dataset, edit the parameters to suit your model, dataset, and resources. If necessary, update the previous cell to specify the dependencies for your training or tuning job.
- Run the cell to set the training parameters.
Create the training function as follows:
Create a cell with the following content:
- Optional: If you specify a different model or dataset, edit the tokenizer.chat_template parameter to specify the appropriate value for your model and dataset.
- Run the cell to create the training function.
Configure the Training Operator SDK client authentication as follows:
Create a cell with the following content:
Edit the api_server and token parameters to enter the values to authenticate to your OpenShift cluster.
For information about how to find the server and token details, see Using the cluster server and token to authenticate.
- Run the cell to configure the Training Operator SDK client authentication.
- Click File > Save Notebook As, enter an appropriate file name, and click Save.
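The training-function cell might look like the following sketch. The library choices (datasets, transformers, trl, peft), the model and dataset placeholders, and the /mnt/shared output path are illustrative assumptions; the function combines FSDP with LoRA as described above, and all imports live inside it so the SDK can serialize it into the job pods:

```python
def train_func():
    # Runs inside each PyTorchJob pod; library and parameter choices below
    # are assumptions for this sketch, not a prescribed configuration.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTConfig, SFTTrainer
    from peft import LoraConfig

    model_name = "<model-id>"                              # hypothetical model ID
    dataset = load_dataset("<dataset-id>", split="train")  # hypothetical dataset

    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Edit tokenizer.chat_template here if your model and dataset require it.

    training_args = SFTConfig(
        output_dir="/mnt/shared/output",  # on the shared RWX PVC
        per_device_train_batch_size=1,
        num_train_epochs=1,
        fsdp="full_shard auto_wrap",      # FSDP combined with LoRA
    )
    peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        peft_config=peft_config,
    )
    trainer.train()
```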
Verification
- All cells run successfully.
4.3.2. Running the fine-tuning job
When you run a training job to tune a model, you must specify the resources needed, and provide any authorization credentials required.
The code in this procedure specifies how to run the example fine-tuning job. If you have the specified resources, you can run the example code without editing it.
Alternatively, you can modify the example code to specify the appropriate details for your fine-tuning job.
Prerequisites
You can access an OpenShift cluster that has sufficient worker nodes with supported accelerators to run your training or tuning job.
The example fine-tuning job requires 8 worker nodes, where each worker node has 64 GiB memory, 4 CPUs, and 1 NVIDIA GPU.
- You can access a workbench that is suitable for distributed training, as described in Creating a workbench for distributed training.
You have administrator access for the project.
- If you created the project, you automatically have administrator access.
- If you did not create the project, your cluster administrator must give you administrator access.
- You have access to a model.
- You have access to data that you can use to train the model.
- You have configured the fine-tuning job as described in Configuring the fine-tuning job.
- You can access a dynamic storage provisioner that supports ReadWriteMany (RWX) Persistent Volume Claim (PVC) provisioning, such as Red Hat OpenShift Data Foundation.
- A PersistentVolumeClaim resource named shared with RWX access mode is attached to your workbench.
- You have a Hugging Face account and access token. For more information, search for "user access tokens" in the Hugging Face documentation.
Procedure
Open the workbench, as follows:
- Log in to the Red Hat OpenShift AI web console.
- Click Projects and click your project.
- Click the Workbenches tab. If your workbench is not already running, start the workbench.
- Click the Open link to open the IDE in a new window.
- Click File → Open, and open the Jupyter notebook that you used to configure the fine-tuning job.
- Create a cell to run the job, and add the following content:
- Edit the HF_TOKEN value to specify your Hugging Face access token.
- Optional: If you specify a different model, and your model is not a gated model from the Hugging Face Hub, remove the HF_HOME and HF_TOKEN entries.
- Optional: Edit the other content to specify the appropriate values for your environment, as follows:
  - Edit the num_workers value to specify the number of worker nodes.
  - Update the resources_per_worker values according to the job requirements and the resources available.
  - The example provided is for NVIDIA GPUs. If you use AMD accelerators, make the following additional changes:
    - In the resources_per_worker entry, change nvidia.com/gpu to amd.com/gpu.
    - Change the base_image value to the corresponding ROCm-based training image for your environment. For a list of supported training images, see Supported Configurations.
    - Remove the CUDA and NCCL entries.
  - If the RWX PersistentVolumeClaim resource that is attached to your workbench has a different name instead of shared, update the following values to replace shared with your PVC name:
    - In this cell, update the HF_HOME value.
    - In this cell, in the volumes entry, update the PVC details:
      - In the V1Volume entry, update the name and claim_name values.
      - In the volume_mounts entry, update the name and mount_path values.
    - In the cell where you set the training parameters, update the output_dir value. For more information about setting the training parameters, see Configuring the fine-tuning job.
- Run the cell to run the job.
- Run the cell to run the job.
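The job-submission cell might look like the following sketch, assuming the parameters named in the procedure (num_workers, resources_per_worker, base_image, HF_HOME, HF_TOKEN, the V1Volume and volume_mounts entries for the shared PVC); the exact create_job signature can vary with the SDK version installed in your workbench:

```python
def submit_finetune_job(client, train_func):
    # client is a configured TrainingClient; train_func is the fine-tuning
    # function defined earlier in the notebook.
    from kubernetes.client import (
        V1Volume,
        V1VolumeMount,
        V1PersistentVolumeClaimVolumeSource,
    )

    client.create_job(
        job_kind="PyTorchJob",
        name="sft",
        train_func=train_func,
        num_workers=8,
        resources_per_worker={"nvidia.com/gpu": 1, "cpu": 4, "memory": "64Gi"},
        base_image="registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0",
        env_vars={
            "HF_HOME": "/mnt/shared/.cache",
            "HF_TOKEN": "<your-hugging-face-token>",
            "NCCL_DEBUG": "INFO",
        },
        volumes=[
            V1Volume(
                name="shared",
                persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                    claim_name="shared"
                ),
            )
        ],
        volume_mounts=[V1VolumeMount(name="shared", mount_path="/mnt/shared")],
    )
```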
Verification
View the progress of the job as follows:
Create a cell with the following content:
```python
client.get_job_logs(
    name="sft",
    job_kind="PyTorchJob",
    follow=True,
)
```

- Run the cell to view the job progress.
4.3.3. Deleting the fine-tuning job
When you no longer need the fine-tuning job, delete the job to release the resources.
The code in this procedure specifies how to delete the example fine-tuning job. If you created the example fine-tuning job named sft, you can run the example code without editing it.
Alternatively, you can modify this example code to specify the name of your fine-tuning job.
Prerequisites
- You have created a fine-tuning job as described in Running the fine-tuning job.
Procedure
Open the workbench, as follows:
- Log in to the Red Hat OpenShift AI web console.
- Click Projects and click your project.
- Click the Workbenches tab. If your workbench is not already running, start the workbench.
- Click the Open link to open the IDE in a new window.
- Click File → Open, and open the Jupyter notebook that you used to configure and run the example fine-tuning job.
- Create a cell with the following content:

  ```python
  client.delete_job(name="sft")
  ```

- Optional: If you want to delete a different job, edit the content to replace sft with the name of your job.
- Run the cell to delete the job.
Verification
- In the OpenShift Console, in the Administrator perspective, click Workloads → Jobs.
- From the Project list, select your project.
- Verify that the specified job is not listed.
4.4. Creating a multi-node PyTorch training job with RDMA
NVIDIA GPUDirect RDMA uses Remote Direct Memory Access (RDMA) to provide direct GPU interconnect, enabling peripheral devices to access NVIDIA GPU memory in remote systems directly. RDMA improves the training job performance because it eliminates the overhead of using the operating system CPUs and memory. Running a training job on multiple nodes using multiple GPUs can significantly reduce the completion time.
In Red Hat OpenShift AI, NVIDIA GPUs can communicate directly by using GPUDirect RDMA across the following types of network:
- Ethernet: RDMA over Converged Ethernet (RoCE)
- InfiniBand
Before you create a PyTorch training job in a cluster configured for RDMA, you must configure the job to use the high-speed network interfaces.
Prerequisites
- You can access an OpenShift cluster that has multiple worker nodes with supported NVIDIA GPUs.
Your cluster administrator has configured the cluster as follows:
- Installed Red Hat OpenShift AI with the required distributed training components, as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
- Configured the distributed training resources, as described in Managing distributed workloads.
- Configured the cluster for RDMA, as described in Configuring a cluster for RDMA.
Procedure
- Log in to the OpenShift Console.
Create a PyTorchJob resource, as follows:
- In the Administrator perspective, click Home → Search.
- From the Project list, select your project.
- Click the Resources list, and in the search field, start typing PyTorchJob.
- Select PyTorchJob, and click Create PyTorchJob. The Create PyTorchJob page opens, with default YAML code automatically added.
Attach the high-speed network interface to the PyTorchJob pods, as follows:
- Edit the PyTorchJob resource YAML code to include an annotation that adds the pod to an additional network.
- Replace the example network name example-net with the appropriate value for your configuration.
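A sketch of such an annotation in the pod template, assuming the standard Multus k8s.v1.cni.cncf.io/networks network-attachment key:

```yaml
template:
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks: example-net
```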
Configure the job to use NVIDIA Collective Communications Library (NCCL) interfaces, as follows:
- Edit the PyTorchJob resource YAML code to add the NCCL environment variables.
Replace the example environment-variable values with the appropriate values for your configuration:
- Set the NCCL_SOCKET_IFNAME environment variable to specify the IP interface to use for communication.
- Optional: To explicitly specify the Host Channel Adapter (HCA) that NCCL should use, set the NCCL_IB_HCA environment variable.
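A sketch of the container env entries; the interface name net1 and HCA name mlx5_1 are illustrative placeholders for your configuration:

```yaml
env:
  - name: NCCL_SOCKET_IFNAME
    value: net1      # IP interface for NCCL communication (placeholder)
  - name: NCCL_IB_HCA
    value: mlx5_1    # optional: explicit HCA selection (placeholder)
```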
Specify the base training image name, as follows:
Edit the PyTorchJob resource YAML code to add the following text:

Example base training image

```yaml
image: registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0
```

If you want to use a different base training image, replace the image name accordingly.
For a list of supported training images, see Supported Configurations for 3.x.
Specify the requests and limits for the network interface resources.
The name of the resource varies, depending on the NVIDIA Network Operator configuration. The resource name might depend on the deployment mode, and is specified in the NicClusterPolicy resource.
Note: You must use the resource name that matches your configuration. The name must correspond to the value advertised by the NVIDIA Network Operator on the cluster nodes.
The following example is for RDMA over Converged Ethernet (RoCE), where the Ethernet RDMA devices are using the RDMA shared device mode.
- Review the NicClusterPolicy resource to identify the resourceName value. In this example NicClusterPolicy resource, the resourceName value is rdma_shared_device_eth.
- Edit the PyTorchJob resource YAML code to add the requests and limits for the network interface resources.
- In the limits and requests sections, replace the resource name with the resource name from your NicClusterPolicy resource (in this example, rdma_shared_device_eth).
- Replace the specified value 1 with the number that you require. Ensure that the specified amount is available on your OpenShift cluster.
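A sketch of the container resources section, assuming the rdma_shared_device_eth resource name from the text and the rdma/ prefix commonly advertised by the RDMA shared device plugin (verify the full name on your cluster nodes):

```yaml
resources:
  requests:
    nvidia.com/gpu: "1"
    rdma/rdma_shared_device_eth: "1"
  limits:
    nvidia.com/gpu: "1"
    rdma/rdma_shared_device_eth: "1"
```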
- Repeat the above steps to make the same edits in the Worker section of the PyTorchJob YAML code.
- Click Create.
You have created a multi-node PyTorch training job that is configured to run with RDMA.
You can see the entire YAML code for this example PyTorchJob resource in the Example Training Operator PyTorchJob resource configured to run with RDMA.
Verification
- In the OpenShift Console, open the Administrator perspective.
- From the Project list, select your project.
- Click Home → Search, search for PyTorchJob, and verify that the job was created.
- Click Workloads → Pods and verify that the requested head pod and worker pods are running.
4.5. Example Training Operator PyTorchJob resource configured to run with RDMA
This example shows how to create a Training Operator PyTorch training job that is configured to run with Remote Direct Memory Access (RDMA).
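A skeleton combining the RDMA-specific pieces from the procedure above (network annotation, NCCL environment variables, and RDMA resource requests); the job name, network name, interface and HCA names, and replica counts are illustrative placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-rdma-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: example-net
        spec:
          containers:
            - name: pytorch
              image: registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0
              env:
                - name: NCCL_SOCKET_IFNAME
                  value: net1
                - name: NCCL_IB_HCA
                  value: mlx5_1
              resources:
                requests:
                  nvidia.com/gpu: "1"
                  rdma/rdma_shared_device_eth: "1"
                limits:
                  nvidia.com/gpu: "1"
                  rdma/rdma_shared_device_eth: "1"
    Worker:
      replicas: 3
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: example-net
        spec:
          containers:
            - name: pytorch
              image: registry.redhat.io/rhoai/odh-training-cuda128-torch28-py312-rhel9:v3.0
              env:
                - name: NCCL_SOCKET_IFNAME
                  value: net1
                - name: NCCL_IB_HCA
                  value: mlx5_1
              resources:
                requests:
                  nvidia.com/gpu: "1"
                  rdma/rdma_shared_device_eth: "1"
                limits:
                  nvidia.com/gpu: "1"
                  rdma/rdma_shared_device_eth: "1"
```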