Chapter 6. Running a distributed workload
6.1. Distributing training jobs with Ray
In previous sections of this tutorial, you trained the fraud model directly in a notebook and then in a pipeline. In this section, you learn how to train the model by using Ray. Ray is a distributed computing framework that you can use to parallelize Python code across multiple CPUs or GPUs.
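As a rough illustration of what "parallelizing Python code" with Ray means, the minimal sketch below runs an ordinary function as parallel Ray tasks. The function name and values are illustrative only and are not part of the tutorial's notebooks:

```python
import ray

# Connect to a running Ray cluster if one is configured;
# otherwise start a local Ray instance.
ray.init()

@ray.remote
def square(x):
    # Each remote call can be scheduled on a different CPU.
    return x * x

# Launch four tasks in parallel and collect the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```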
This section demonstrates how you can use Ray to distribute the training of a machine learning model across multiple CPUs. While distributed training is not necessary for a simple model, applying it to the example fraud model is a good way for you to learn how to use Ray for more complex models that require greater compute resources, such as multiple GPUs across multiple machines.
In your notebook environment, open the 8_distributed_training.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting up authentication, creating a Ray cluster, and submitting and monitoring training jobs.
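As a rough illustration of those steps, the sketch below uses the CodeFlare SDK to authenticate, start a small Ray cluster, and submit the training script as a Ray job. The token, server address, cluster name, resource sizes, and the job_client usage are assumptions (parameter names vary between codeflare-sdk versions); follow the notebook for the exact code:

```python
from codeflare_sdk import TokenAuthentication, Cluster, ClusterConfiguration

# Authenticate to the OpenShift cluster (token and server are placeholders).
auth = TokenAuthentication(
    token="sha256~XXXX",
    server="https://api.cluster.example.com:6443",
    skip_tls=False,
)
auth.login()

# Request a small, CPU-only Ray cluster (name and size are hypothetical).
cluster = Cluster(ClusterConfiguration(
    name="fraud-detection-ray",
    num_workers=2,
))
cluster.up()
cluster.wait_ready()

# Submit the training script as a Ray job and check its status.
client = cluster.job_client
submission_id = client.submit_job(
    entrypoint="python ray-scripts/train_tf_cpu.py",
    runtime_env={"working_dir": "./", "pip": "requirements.txt"},
)
print(client.get_job_status(submission_id))

# Tear the cluster down when the job finishes.
cluster.down()
```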
Optionally, if you want to view the Python code for this section, you can find it in the ray-scripts/train_tf_cpu.py file.
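The general shape of such a script, using Ray Train's TensorflowTrainer to run a Keras training loop across multiple CPU workers, looks roughly like the sketch below. The model layers and feature count are placeholders, not the tutorial's actual fraud model:

```python
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker(config):
    import tensorflow as tf

    # MultiWorkerMirroredStrategy synchronizes gradients across the Ray workers.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu",
                                  input_shape=(config["num_features"],)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
    # Load the training data and call model.fit() here.

# Two CPU-only workers; set use_gpu=True to train on GPUs instead.
trainer = TensorflowTrainer(
    train_loop_per_worker,
    train_loop_config={"num_features": 5},  # placeholder feature count
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```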
For more information about TensorFlow training on Ray, see the Ray TensorFlow guide.