Chapter 6. Running a distributed workload


You can distribute the training of a machine learning model across many CPUs by using Ray or the Training Operator.

6.1. Distributing training jobs

Earlier, you trained the fraud detection model directly in a notebook and then in a pipeline. You can also distribute the training of a machine learning model across many CPUs.

Distributing training is not necessary for a simple model. However, by applying it to the example fraud model, you learn how to train more complex models that require more compute power.

NOTE: Distributed training in OpenShift AI uses the Red Hat build of Kueue for admission and scheduling. Before you run the Ray or Training Operator examples in this tutorial, complete the setup tasks in Setting up Kueue resources.

You can try one or both of the following options:

6.1.1. Distributing training jobs with Ray

You can use Ray, a distributed computing framework, to parallelize Python code across many CPUs or GPUs.


In your notebook environment, open the 6_distributed_training.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting up authentication, creating a Ray cluster, and working with jobs.

Optionally, if you want to view the Python code for this step, you can find it in the ray-scripts/train_tf_cpu.py file.
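The notebook steps can be sketched roughly as follows with the CodeFlare SDK, which OpenShift AI uses to manage Ray clusters. This is an illustrative outline, not the notebook's exact code: the token, server URL, cluster name, and resource sizes are placeholders, and parameter names can vary between CodeFlare SDK versions.

```python
# Sketch of the Ray workflow: authenticate, create a cluster, submit a job.
# All names and sizes below are illustrative placeholders.
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

# Authenticate against the OpenShift cluster (placeholder credentials).
auth = TokenAuthentication(
    token="sha256~example-token",            # placeholder OpenShift API token
    server="https://api.example.com:6443",   # placeholder API server URL
    skip_tls=False,
)
auth.login()

# Request a small Ray cluster; sizes here are illustrative, not prescriptive.
cluster = Cluster(ClusterConfiguration(
    name="fraud-detection-ray",   # assumed cluster name
    num_workers=2,
    worker_cpu_requests=1,
    worker_cpu_limits=1,
    worker_memory_requests=4,
    worker_memory_limits=4,
))

cluster.up()          # create the RayCluster resource
cluster.wait_ready()  # block until the head and worker pods are running

# Submit the training script as a Ray job, then tear the cluster down.
client = cluster.job_client
submission_id = client.submit_job(
    entrypoint="python ray-scripts/train_tf_cpu.py",
    runtime_env={"working_dir": "./", "pip": ["tensorflow"]},
)
print(client.get_job_status(submission_id))

cluster.down()
```

Tearing the cluster down with cluster.down() matters in a Kueue-managed environment, because idle Ray clusters continue to hold the quota that admitted them.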


For more information about TensorFlow training on Ray, see the Ray TensorFlow guide.

6.1.2. Distributing training jobs with the Training Operator

The Training Operator is a tool for scalable distributed training of machine learning (ML) models created with various ML frameworks, such as PyTorch.


In your notebook environment, open the 7_distributed_training_kfto.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting up authentication, initializing the Training Operator client, and submitting a PyTorchJob.

You can also view the complete Python code in the kfto-scripts/train_pytorch_cpu.py file.
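The PyTorchJob submission the notebook walks through can be sketched with the Kubeflow Training Operator Python SDK. This is an illustrative outline under stated assumptions: the job name, worker counts, and resource sizes are placeholders, the training function is a stub standing in for the real logic in kfto-scripts/train_pytorch_cpu.py, and method names can differ across SDK versions.

```python
# Sketch of submitting a distributed PyTorchJob with the Kubeflow
# Training Operator SDK. Names and sizes are illustrative placeholders.
from kubeflow.training import TrainingClient

def train_func():
    # Stub training function; the real training logic lives in
    # kfto-scripts/train_pytorch_cpu.py.
    import torch
    model = torch.nn.Linear(10, 2)
    # ... training loop elided ...

# The client reads cluster credentials from the notebook's environment.
client = TrainingClient()

# Submit a PyTorchJob that runs train_func on two CPU workers.
client.create_job(
    name="fraud-detection-pytorch",  # assumed job name
    train_func=train_func,
    num_workers=2,
    resources_per_worker={"cpu": 1, "memory": "4Gi"},
)

# Check the job's conditions, then clean up when finished.
print(client.get_job_conditions(name="fraud-detection-pytorch"))
client.delete_job(name="fraud-detection-pytorch")
```

As with the Ray example, deleting the finished job releases the Kueue quota it was admitted under.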


For more information about PyTorchJob training with the Training Operator, see the Training Operator PyTorchJob guide.
