Chapter 6. Running a distributed workload
You can distribute the training of a machine learning model across many CPUs by using Ray or the Training Operator.
6.1. Distributing training jobs
Earlier, you trained the fraud detection model directly in a notebook and then in a pipeline. You can also distribute the training of a machine learning model across many CPUs.
Distributing training is not necessary for a simple model. However, applying it to the example fraud model teaches you the workflow for training more complex models that require more compute power.
NOTE: Distributed training in OpenShift AI uses the Red Hat build of Kueue for admission and scheduling. Before you run the Ray or Training Operator examples in this tutorial, complete the setup tasks in Setting up Kueue resources.
You can try one or both of the following options:
- The Ray distributed computing framework, as described in Distributing training jobs with Ray.
- The Training Operator, as described in Distributing training jobs with the Training Operator.
6.1.1. Distributing training jobs with Ray
You can use Ray, a distributed computing framework, to parallelize Python code across many CPUs or GPUs.
In your notebook environment, open the 6_distributed_training.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting authentication, creating Ray clusters, and working with jobs.
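The notebook steps follow the general shape sketched below, which uses the CodeFlare SDK. This is an illustrative outline only, not the notebook's exact code: the token, server URL, cluster name, and resource sizes are placeholders, and `ClusterConfiguration` parameter names vary between SDK releases.

```python
# Hedged sketch of the notebook flow: authenticate, create a Ray cluster,
# submit a job, and tear the cluster down. Requires the codeflare-sdk
# package and a live OpenShift AI cluster; all values are placeholders.
from codeflare_sdk import TokenAuthentication, Cluster, ClusterConfiguration

# Authenticate against the OpenShift API server with a user token.
auth = TokenAuthentication(
    token="<YOUR_OPENSHIFT_TOKEN>",
    server="<YOUR_API_SERVER_URL>",
    skip_tls=False,
)
auth.login()

# Request a small Ray cluster; Kueue admits it when quota is available.
cluster = Cluster(ClusterConfiguration(
    name="raytest",          # placeholder cluster name
    num_workers=2,
    worker_cpu_requests=1,   # parameter names vary by SDK version
    worker_cpu_limits=1,
    worker_memory_requests=4,
    worker_memory_limits=4,
))
cluster.up()
cluster.wait_ready()

# Submit the training script as a Ray job.
client = cluster.job_client
job_id = client.submit_job(
    entrypoint="python train_tf_cpu.py",
    runtime_env={"working_dir": "./ray-scripts", "pip": ["tensorflow"]},
)
print(client.get_job_status(job_id))

# Tear the cluster down when the job finishes.
cluster.down()
```

Calling `cluster.down()` releases the quota that Kueue reserved for the Ray cluster, so other workloads in the queue can be admitted.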
If you want to view the Python code for this step, see the ray-scripts/train_tf_cpu.py file.
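For orientation, a distributed TensorFlow script for Ray typically has the shape below. This is a minimal sketch of the Ray Train API, not the contents of train_tf_cpu.py; the model architecture and worker count are illustrative assumptions.

```python
# Hedged sketch of a Ray Train TensorFlow script (not the actual
# train_tf_cpu.py). Requires the ray[train] and tensorflow packages.
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker(config):
    import tensorflow as tf
    # Ray Train sets TF_CONFIG on each worker, so
    # MultiWorkerMirroredStrategy can coordinate them automatically.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Placeholder model; the tutorial's fraud model differs.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
    # Load the training data and call model.fit(...) here.

# Run the training loop on 2 CPU workers across the Ray cluster.
trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```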
For more information about TensorFlow training on Ray, see the Ray TensorFlow guide.
6.1.2. Distributing training jobs with the Training Operator
The Training Operator provides scalable, distributed training of machine learning (ML) models created with various ML frameworks, such as PyTorch.
In your notebook environment, open the 7_distributed_training_kfto.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting authentication, initializing the Training Operator client, and submitting a PyTorchJob.
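Those steps follow the general pattern sketched below, using the Kubeflow training SDK. This is an assumption-laden outline rather than the notebook's exact code: the job name, worker count, and resources are placeholders, and the `TrainingClient` method signatures vary between SDK releases.

```python
# Hedged sketch of submitting a PyTorchJob with the Kubeflow training
# SDK (kubeflow-training package). Requires a live cluster; all names
# and resource values are placeholders.
from kubeflow.training import TrainingClient

def train_func():
    # This function runs on every worker. See
    # kfto-scripts/train_pytorch_cpu.py for the tutorial's real
    # training logic; this body only verifies the process group.
    import torch.distributed as dist
    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()

# The client reads cluster credentials from the notebook environment.
client = TrainingClient()
client.create_job(
    name="pytorch-fraud",    # placeholder job name
    train_func=train_func,
    num_workers=2,
    resources_per_worker={"cpu": "1", "memory": "2Gi"},
)

# Wait for the PyTorchJob to finish, then read the worker logs.
client.wait_for_job_conditions(name="pytorch-fraud")
print(client.get_job_logs(name="pytorch-fraud"))
```

The Training Operator wraps `train_func` into a container command for each worker pod and sets the environment variables that `torch.distributed` needs to form the process group.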
You can also view the complete Python code in the kfto-scripts/train_pytorch_cpu.py file.
For more information about PyTorchJob training with the Training Operator, see the Training Operator PyTorchJob guide.