Distributed Training (TensorFlow, MPI, & Horovod)
Distributed training enables training workloads to scale up beyond the capacity of a single compute instance. Model training is performed across multiple instances, often called “workers,” which can reduce training time dramatically. Distributed training therefore tightens the feedback loop between training and evaluation, enabling data scientists to iterate more quickly.
The two most common approaches to distributed training are MPI/Horovod, where Horovod is an open-source, framework-agnostic tool from Uber built on top of MPI, and Distributed TensorFlow, TensorFlow's native distribution mechanism from Google. A minimal Horovod example is sketched below.
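To give a concrete sense of what Horovod involves, here is a minimal sketch of a Horovod-instrumented TensorFlow/Keras training script. The model, dataset, and hyperparameters are illustrative placeholders, not part of any Gradient sample project.

```python
# Minimal sketch of Horovod with TensorFlow/Keras. Model and dataset are illustrative.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each worker process to a single GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Each worker trains on its own shard of the data.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train[..., tf.newaxis] / 255.0, y_train))
    .shard(hvd.size(), hvd.rank())
    .shuffle(10_000)
    .batch(128)
)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

callbacks = [
    # Broadcast initial variables from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(dataset, epochs=3, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with `horovodrun -np 2 python train.py` (or directly via MPI), each process trains on its own shard while gradients are averaged across workers. Note that the Horovod-specific changes are limited to initialization, data sharding, the optimizer wrapper, and the broadcast callback; the rest is an ordinary Keras training loop.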
Distributed Training + Gradient
Gradient provides first-class support for distributed training with both Distributed TensorFlow and MPI. With Gradient, you can run large-scale distributed training with almost no changes to your code. A distributed training experiment is defined by a handful of parameters, such as the number of workers, the machine type each worker runs on, the container image, and the command each worker executes:
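As an illustration, the following is a hedged sketch of the kinds of parameters a multi-node experiment might take. The parameter names, machine types, and container images are assumptions for illustration only and should be checked against the Gradient documentation; they are not an exact copy of Gradient's CLI or SDK interface.

```python
# Illustrative only: parameter names, machine types, and container images below
# are assumptions, not an exact reproduction of Gradient's CLI/SDK interface.
distributed_experiment = {
    "name": "mnist-distributed",
    "project_id": "<your-project-id>",          # placeholder
    "experiment_type": "GRPC",                  # Distributed TensorFlow; an MPI type covers Horovod-style jobs
    "worker_count": 2,
    "worker_machine_type": "P4000",             # assumed GPU machine type
    "worker_container": "tensorflow/tensorflow:2.4.1-gpu",
    "worker_command": "python train.py",
    "parameter_server_count": 1,
    "parameter_server_machine_type": "P4000",
    "parameter_server_container": "tensorflow/tensorflow:2.4.1-gpu",
    "parameter_server_command": "python train.py",
}
```

In broad strokes, a distributed experiment pairs a set of workers (and, for Distributed TensorFlow, parameter servers) with a container image and a command to run on each node, and Gradient provisions and coordinates the instances.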
Here's a GitHub repo with a sample project.