# Distributed Training (TensorFlow, MPI, & Horovod)

Distributed training enables training workloads to scale up beyond the capacity of a single compute instance. Model training is performed across multiple instances, often called "workers," and training time can decrease dramatically. Distributed training therefore helps tighten the feedback loop between training and evaluation, enabling data scientists to iterate more quickly.

The two most common approaches to distributed training are MPI with [Horovod](https://eng.uber.com/horovod/), a multi-framework tool from Uber, and Distributed TensorFlow, a TensorFlow-specific mechanism from Google.
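Horovod's core idea is that every worker computes gradients on its own shard of data, and a ring-allreduce averages those gradients so all workers apply the same update. The toy function below simulates just that averaging step in plain Python (it does not use the real Horovod API, and the gradient values are made up for illustration):

```python
# Toy illustration of the gradient averaging that Horovod's
# ring-allreduce performs across workers (not the Horovod API).
def allreduce_average(worker_grads):
    """Element-wise average of per-worker gradient lists."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

# Two workers, each holding a 2-element gradient vector.
grads = [[1.0, 2.0], [3.0, 4.0]]
print(allreduce_average(grads))  # [2.0, 3.0]
```

Because every worker ends up with the identical averaged gradient, the model replicas stay in sync without a central coordinator, which is what lets Horovod scale with little code change.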

## Distributed Training + Gradient

Gradient provides first-class support for distributed training with both Distributed TensorFlow and MPI. With Gradient, you can run large-scale distributed training with almost no changes to your code. Here's an example CLI command that launches a multinode distributed training experiment:

```bash
gradient experiments run multinode \
  --name multiEx \
  --projectId <your-project-id> \
  --experimentType GRPC \
  --workerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --workerMachineType K80 \
  --workerCommand "python mnist.py" \
  --workerCount 2 \
  --parameterServerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --parameterServerMachineType K80 \
  --parameterServerCommand "python mnist.py" \
  --parameterServerCount 1 \
  --workspaceUrl https://github.com/Paperspace/mnist-sample.git \
  --modelType Tensorflow
```
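The command above requests two workers and one parameter server. Distributed TensorFlow coordinates such a cluster through the `TF_CONFIG` environment variable, which Gradient populates on each node when the experiment starts. As a rough sketch of what worker 0 would see (the hostnames and port are placeholders, not values Gradient guarantees):

```python
import json
import os

# Hypothetical addresses for illustration only; Gradient assigns
# the real ones at experiment launch.
cluster = {
    "worker": ["worker-0:5000", "worker-1:5000"],  # --workerCount 2
    "ps": ["ps-0:5000"],                           # --parameterServerCount 1
}

# Each node also receives its own role and index within the cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})

print(os.environ["TF_CONFIG"])
```

TensorFlow's distributed runtime reads `TF_CONFIG` at startup to learn the cluster topology and the current node's role, which is why the same `python mnist.py` command can be given for both the worker and parameter-server containers.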

Here's a GitHub [repo](https://github.com/Paperspace/mnist-sample) with a sample project.

### Related Material

{% embed url="https://docs.paperspace.com/gradient/experiments/run-experiments-cli#creating-a-multinode-experiment-using-the-cli" %}
