Model Deployment (Inference)

Once a model is trained, it is typically deployed as an online API endpoint within a web service or used to make batch predictions. For latency-sensitive applications, or for devices that may have intermittent or no connectivity, models can also be deployed to edge devices: embedded as a component within an iPhone app, a driverless car, a robot, an IoT device, or wherever the model is needed.

Models are deployed to customer-facing applications, as with an ecommerce site that makes real-time product recommendations, or to internal services, as with a company that performs real-time financial forecasting, sentiment analysis, or risk management.

Deployed models should be monitored in terms of infrastructure health (e.g. requests, response time, and load), model drift and decay (where live model performance degrades on new, unseen data or the underlying assumptions about the data change), and other performance criteria.

In the case of a web service, a model may need to be autoscaled based on requests and/or load.
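
Infrastructure health can be surfaced with standard application metrics. Below is a minimal sketch, assuming a Python service instrumented with the prometheus_client library; the metric names, port, and simulated model call are illustrative, and the same request and latency signals could feed both monitoring dashboards and an autoscaler.

```python
# Illustrative only: export request count and latency so a monitoring
# system or autoscaler can act on them. Metric names are made up.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total prediction requests served")
LATENCY = Histogram("inference_latency_seconds", "Time spent producing a prediction")

def handle_prediction(features):
    """Wrap model inference with basic health metrics."""
    REQUESTS.inc()
    with LATENCY.time():
        # model.predict(features) would go here; we simulate work instead
        time.sleep(random.uniform(0.01, 0.05))
        return {"prediction": 0.5}

if __name__ == "__main__":
    start_http_server(9100)  # metrics exposed at http://localhost:9100/metrics
    while True:
        handle_prediction({"x": 1.0})
```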

Frameworks and runtimes: In a basic scenario, a model can be deployed as a traditional web server with something like Flask. In a more sophisticated environment, inferencing will happen in an optimized ML-specific model-serving framework such as TensorFlow Serving, Clipper, TensorRT, or Seldon.
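
As a minimal sketch of the basic scenario, the following assumes a pickled scikit-learn-style model saved as model.pkl and an illustrative JSON input schema; neither is prescribed by Flask or by any particular serving framework.

```python
# Minimal Flask inference server; model.pkl and the input schema are
# assumptions for illustration, not a prescribed layout.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # assumed to be a scikit-learn-style estimator

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()          # e.g. {"instances": [[5.1, 3.5, 1.4, 0.2]]}
    predictions = model.predict(payload["instances"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In production such a server would typically run behind a WSGI server (e.g. gunicorn) and a load balancer rather than Flask's built-in development server.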

Server: Most model-serving frameworks expose HTTP/REST endpoints, though TensorFlow Serving and TensorRT also offer gRPC endpoints, which are fussier but more performant.
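
For the REST style, a request against TensorFlow Serving's predict API could look like the following; the host, port, model name, and input values are placeholders, and the equivalent gRPC call would go through the PredictionService interface instead of JSON.

```python
# Query a TensorFlow Serving REST endpoint; host, port, model name,
# and input values are placeholders for illustration.
import requests

SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

payload = {"instances": [[1.0, 2.0, 5.0]]}
response = requests.post(SERVING_URL, json=payload, timeout=5)
response.raise_for_status()

print(response.json()["predictions"])
```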

Canary rollouts, blue-green deployments, multi-armed bandit, & A/B testing: These methods are not specific to machine learning, but they govern how models are rolled out to production in order to catch errors, run experiments, and find the best-performing model.
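
A canary rollout, for example, sends a small fraction of traffic to the new model version and widens the split only if its metrics hold up. A toy sketch of per-request weighted routing follows; the 5% weight and the two predict functions are placeholders, and real deployments usually implement the split at the load balancer or service mesh rather than in application code.

```python
# Toy canary router: send ~5% of requests to the candidate model.
# The weight and the predict functions are illustrative placeholders.
import random

CANARY_WEIGHT = 0.05  # fraction of traffic for the new model version

def stable_predict(features):
    return {"version": "v1", "prediction": 0.4}

def canary_predict(features):
    return {"version": "v2", "prediction": 0.6}

def route(features):
    """Pick a model version per request."""
    if random.random() < CANARY_WEIGHT:
        return canary_predict(features)
    return stable_predict(features)

if __name__ == "__main__":
    served = [route({"x": 1.0})["version"] for _ in range(1000)]
    print("canary share:", served.count("v2") / len(served))
```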

There is no de facto standard, or even an established set of best practices, for deploying, managing, and monitoring models in any of these scenarios.

Inference + Gradient

Gradient from Paperspace streamlines model deployment:

  • Any framework/runtime is supported as well as both gRPC and HTTP inference endpoints

  • Models can be deployed to a wide variety of CPU and GPU instance types

  • Single or multiple instances are supported with out-of-the-box load balancing

  • Both basic auth (username & password) and JWT auth are supported

  • Autoscaling and monitoring of endpoints are available

Related Material

  • Model Drift & Decay
  • REST and gRPC
  • Gradient deployments documentation on docs.paperspace.com: https://docs.paperspace.com/gradient/deployments/about