Chapter 6. Running a distributed workload


6.1. Distributing training jobs with Ray

In previous sections of this tutorial, you trained the fraud model directly in a notebook and then in a pipeline. In this section, you learn how to train the model by using Ray. Ray is a distributed computing framework that you can use to parallelize Python code across multiple CPUs or GPUs.
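To illustrate the basic idea, the following sketch parallelizes a plain Python function with Ray's task API. The function name and workload are illustrative placeholders, not part of the tutorial code.

```python
import ray

# Connect to an existing Ray cluster, or start a local one if none is found.
ray.init()

@ray.remote
def score_batch(batch_id):
    # Placeholder for CPU-bound work, such as scoring a batch of transactions.
    return sum(i * i for i in range(batch_id * 1000))

# Launch the tasks in parallel across the available CPUs and gather the results.
futures = [score_batch.remote(i) for i in range(8)]
results = ray.get(futures)
print(results)
```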

This section demonstrates how you can use Ray to distribute the training of a machine learning model across multiple CPUs. While distributed training is not necessary for a simple model, applying it to the example fraud model is a good way for you to learn how to use Ray for more complex models that need more compute resources, such as multiple GPUs across multiple machines.

In your notebook environment, open the 8_distributed_training.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting up authentication, creating a Ray cluster, and working with Ray jobs.
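The notebook walks through these steps interactively. As a rough sketch of the job-submission step, Ray's JobSubmissionClient can submit a training script to a running cluster; the dashboard address, dependencies, and directory layout below are placeholder assumptions, not values from the tutorial.

```python
from ray.job_submission import JobSubmissionClient

# Address of the Ray dashboard on the cluster head node (placeholder value).
client = JobSubmissionClient("http://ray-head.example.com:8265")

# Submit the training script as a Ray job, shipping the local scripts directory
# and the Python dependencies the script needs to the cluster.
job_id = client.submit_job(
    entrypoint="python train_tf_cpu.py",
    runtime_env={
        "working_dir": "./ray-scripts",
        "pip": ["tensorflow"],
    },
)

# Check the job status; in practice you would poll until it completes.
print(client.get_job_status(job_id))
```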

Optionally, if you want to view the Python code for this section, you can find it in the ray-scripts/train_tf_cpu.py file.
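The actual script contains the full fraud-model training loop. The outline below is only a minimal sketch of the Ray Train pattern such a script typically follows, assuming the TensorflowTrainer API; the model architecture and the randomly generated data are illustrative stand-ins for the fraud model and dataset.

```python
import numpy as np
import tensorflow as tf
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker(config):
    # Each Ray worker runs this function; MultiWorkerMirroredStrategy keeps
    # the model replicas synchronized across workers.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(config["num_features"],)),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Illustrative random data; the real script loads the fraud dataset instead.
    x = np.random.rand(1024, config["num_features"]).astype("float32")
    y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
    model.fit(x, y, epochs=config["epochs"], batch_size=64)

# Distribute training across two CPU-only workers.
trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_features": 30, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```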

[Image: Jupyter Notebook]

For more information about TensorFlow training on Ray, see the Ray TensorFlow guide.
