Distributed Training (TensorFlow, MPI, & Horovod)

Distributed training enables training workloads to scale up beyond the capacity of a single compute instance. Model training is performed across multiple instances, often called “workers,” and training time can decrease dramatically. Distributed training therefore helps tighten the feedback loop between training and evaluation, enabling data scientists to iterate more quickly.

The two most common approaches to distributed training are MPI/Horovod, a multi-framework tool from Uber, and Distributed TensorFlow, a TensorFlow-specific tool from Google.
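
With Horovod, every worker runs the same training script; Horovod initializes communication (over MPI or NCCL), averages gradients across workers with allreduce, and keeps the workers in sync. Below is a minimal, TensorFlow 2.x-style sketch of that pattern for a small Keras MNIST model (the model and hyperparameters are illustrative and not taken from the Gradient sample):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single GPU (Horovod runs one process per GPU).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across all workers before each update.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=1,
    # Broadcast rank 0's initial weights so every worker starts from the same state.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)

Launched with, for example, horovodrun -np 2 python mnist.py, both processes average their gradients every step, so the effective batch size is 2 × 64; in practice you would also shard or independently shuffle the data per worker.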

Distributed Training + Gradient

Gradient provides first-class support for distributed training with both Distributed TensorFlow and MPI. With Gradient, you can run large-scale distributed training with almost no changes to your code. Here's a Gradient CLI snippet showing the parameters of a multi-node distributed training experiment:

gradient experiments run multinode \
  --name multiEx \
  --projectId <your-project-id> \
  --experimentType GRPC \
  --workerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --workerMachineType K80 \
  --workerCommand "python mnist.py" \
  --workerCount 2 \
  --parameterServerContainer tensorflow/tensorflow:1.13.1-gpu-py3 \
  --parameterServerMachineType K80 \
  --parameterServerCommand "python mnist.py" \
  --parameterServerCount 1 \
  --workspaceUrl https://github.com/Paperspace/mnist-sample.git \
  --modelType Tensorflow

Here's a GitHub with a sample project.

Related Material

  • Horovod repo: https://github.com/horovod/horovod
  • Creating a multi-node experiment using the Gradient CLI: https://docs.paperspace.com/gradient/experiments/run-experiments-cli#creating-a-multinode-experiment-using-the-cli