Distributed Training (TensorFlow, MPI, & Horovod)

Distributed training enables training workloads to scale up beyond the capacity of a single compute instance. Model training is performed across multiple instances, often called “workers,” which can decrease training time dramatically. Distributed training therefore tightens the feedback loop between training and evaluation, enabling data scientists to iterate more quickly.

The two most common approaches to distributed training are MPI/Horovod, where Horovod is a framework-agnostic library from Uber built on top of MPI, and Distributed TensorFlow, the TensorFlow-specific mechanism from Google.
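
To make the Horovod side concrete, here is a minimal sketch, assuming TensorFlow 1.x and Horovod are installed, of the usual pattern: initialize Horovod, pin one GPU per process, scale the learning rate by the number of workers, and wrap the optimizer so gradients are averaged across processes. The model is a toy stand-in, not a real training script.

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; one process is launched per GPU (e.g. via horovodrun or mpirun).
hvd.init()

# Pin each process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model: fit y = 2x with a single weight, just to show the wiring.
x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])
w = tf.Variable([[0.0]])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers via ring-allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast the initial variable state from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=200)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)

A script like this is launched with one process per worker, for example with horovodrun -np 2 python train.py (or an equivalent mpirun invocation), where train.py is a placeholder filename.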

Distributed Training + Gradient

Gradient provides first-class support for distributed training with both Distributed TensorFlow and MPI. With Gradient, you can run large-scale distributed training with almost no changes to your code. Here's a CLI snippet showing the parameters of a multinode distributed training experiment:

gradient experiments run multinode \
  --name multiEx \
  --projectId <your-project-id> \
  --experimentType GRPC \
  --workerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --workerMachineType K80 \
  --workerCommand "python mnist.py" \
  --workerCount 2 \
  --parameterServerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --parameterServerMachineType K80 \
  --parameterServerCommand "python mnist.py" \
  --parameterServerCount 1 \
  --workspaceUrl https://github.com/Paperspace/mnist-sample.git \
  --modelType Tensorflow

The sample project used above is available on GitHub: https://github.com/Paperspace/mnist-sample
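
For context on what the worker and parameter-server commands execute, here is a hypothetical, stripped-down stand-in for a script like mnist.py (it is not the code from the sample repo). It assumes that, under the GRPC experiment type, each node receives a TF_CONFIG environment variable describing the cluster and the node's role; the tf.estimator API reads that variable and coordinates workers and parameter servers accordingly.

import json
import os

import tensorflow as tf

# Each node is expected to receive a TF_CONFIG environment variable describing
# the cluster and this node's role, e.g.:
# {"cluster": {"worker": [...], "ps": [...]}, "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("Running as:", tf_config.get("task", {}))

def model_fn(features, labels, mode):
    # Toy classifier standing in for the real MNIST network.
    logits = tf.layers.dense(features["x"], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

def input_fn():
    # Random stand-in data with MNIST-like shapes.
    features = {"x": tf.random_uniform([32, 784])}
    labels = tf.random_uniform([32], maxval=10, dtype=tf.int32)
    return tf.data.Dataset.from_tensors((features, labels)).repeat()

# The Estimator reads TF_CONFIG (via its RunConfig) and coordinates the workers
# and parameter servers; train_and_evaluate runs the appropriate role on each node.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/mnist-model")
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10),
)

In a real multinode job, model_dir should point to storage that every node can reach, so checkpoints and evaluation results are shared.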
