# Model Deployment (Inference)

Once a model is trained, it is typically deployed either as an online API endpoint within a web service or as a job that makes batch predictions. For latency-sensitive applications, or for devices with intermittent or no connectivity, models can also be deployed to the edge: embedded as a component in an iPhone app, or running in a driverless car, robot, IoT device, or wherever the model is needed.

Models may be deployed to customer-facing applications, such as an ecommerce site that makes real-time product recommendations, or to internal services, such as a company's real-time financial forecasting, sentiment analysis, or risk management.

Deployed models should be monitored for infrastructure health (e.g. requests, response time, and load), for [model drift and decay](https://machine-learning.paperspace.com/wiki/model-drift-and-decay) (where live model performance degrades on new, unseen data or the underlying assumptions about the data change), and for other performance criteria.
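One way to make drift monitoring concrete is to compare the distribution of a single input feature at training time with its distribution in live traffic. The sketch below uses the Population Stability Index (PSI) for that comparison. The function name, the bin count, and the `eps` floor are illustrative assumptions, and any alerting threshold would have to be chosen per application.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,      # assumption: equal-width bins over the reference range
    eps: float = 1e-6,   # assumption: floor that keeps empty bins out of log(0)
) -> float:
    """Return the PSI between two samples of one feature (0.0 means identical bins)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    # Each term is non-negative, so the sum grows as the two histograms diverge.
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))
```

Calling the function with the same sample twice returns `0.0`; a live sample whose values have shifted away from the reference returns a positive score that can be tracked over time alongside the infrastructure metrics.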

For a web service, the deployed model may need to be autoscaled based on request volume and/or load.
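As a rough illustration of request-based autoscaling, the sketch below applies a proportional rule of the kind documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of the observed metric to its target, round up, and clamp to configured bounds. The function name, parameter names, and default bounds are assumptions made for this example.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_rps_per_replica: float,
    target_rps_per_replica: float,
    min_replicas: int = 1,    # assumption: never scale to zero
    max_replicas: int = 10,   # assumption: hard upper bound on cost
) -> int:
    """Proportional scaling: replicas grow with the observed/target load ratio."""
    raw = math.ceil(
        current_replicas * observed_rps_per_replica / target_rps_per_replica
    )
    # Clamp the result so a traffic spike or lull cannot exceed the bounds.
    return max(min_replicas, min(max_replicas, raw))
```

With two replicas each handling 150 requests per second against a target of 100, the rule asks for three replicas. A production autoscaler would typically add smoothing or cooldown windows on top of this so that brief spikes do not cause thrashing.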

**Frameworks and runtimes**: In a basic scenario, a model can be deployed behind a traditional web server using something like Flask. In a more sophisticated environment, inference runs in an optimized, ML-specific model-serving framework or runtime such as TensorFlow Serving, Clipper, TensorRT, or Seldon.
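The basic scenario can be sketched as a small WSGI application, which is the same server interface that Flask implements. To stay dependency-free, this example uses only the standard library rather than Flask itself. The `predict` function is a hypothetical stand-in for a trained model, and the `/predict` route and the `features` field are assumptions chosen for illustration.

```python
import json
from typing import Callable, Iterable, List


def predict(features: List[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


def app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """WSGI application that answers POST /predict with a JSON prediction."""
    json_header = [("Content-Type", "application/json")]
    is_predict = (
        environ.get("REQUEST_METHOD") == "POST"
        and environ.get("PATH_INFO") == "/predict"
    )
    if not is_predict:
        start_response("404 Not Found", json_header)
        return [b'{"error": "not found"}']
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        result = predict([float(x) for x in payload["features"]])
    except (KeyError, TypeError, ValueError):
        # Malformed JSON, a missing "features" key, or non-numeric values.
        start_response("400 Bad Request", json_header)
        return [b'{"error": "bad request"}']
    start_response("200 OK", json_header)
    return [json.dumps({"prediction": result}).encode("utf-8")]
```

To try it locally, `app` can be passed to `wsgiref.simple_server.make_server` from the standard library. A dedicated model-serving framework adds what this sketch leaves out, such as request batching, model versioning, and hardware-specific optimization.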

**Server:** Most model-serving frameworks expose [REST](https://machine-learning.paperspace.com/wiki/rest-and-grpc) endpoints, though TensorFlow Serving and NVIDIA's TensorRT Inference Server (since renamed Triton Inference Server) also offer [gRPC](https://machine-learning.paperspace.com/wiki/rest-and-grpc) endpoints, which take more effort to set up and debug but are generally more performant.
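For a sense of what the REST side looks like in practice, the sketch below builds (but does not send) a prediction request in the shape used by TensorFlow Serving's REST API, where a model is addressed as `/v1/models/<name>:predict` and inputs are passed under an `instances` key. The host, the model name `my_model`, and the input values are hypothetical; port 8501 is assumed as the REST port.

```python
import json
import urllib.request
from typing import Any, List


def build_predict_request(
    host: str,
    model_name: str,
    instances: List[Any],
    port: int = 8501,  # assumption: TensorFlow Serving's usual REST port
) -> urllib.request.Request:
    """Build a POST request for a TensorFlow Serving style predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model name and input row, for illustration only.
request = build_predict_request("localhost", "my_model", [[1.0, 2.0, 3.0]])
```

Passing `request` to `urllib.request.urlopen` would send it to a running server. A gRPC client, by contrast, needs generated stubs and protobuf messages, which is much of the extra setup cost that buys the lower overhead.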

**Canary rollouts, blue-green deployments, multi-armed bandits, & A/B testing**: These methods are not specific to machine learning, but they govern how models are rolled out to production in order to catch errors, run tests, and find the best-performing model.
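A canary rollout, for example, sends a small share of traffic to the new model while the rest continues to hit the current one. The sketch below shows one possible way to do that split: hash a request or user identifier into one of 100 buckets and compare it with the canary percentage. The function name and the `"stable"` and `"canary"` labels are assumptions for illustration.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically assign an identifier to the stable or canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto buckets 0..99; the same id always gets the same bucket.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Because the assignment depends only on the identifier, a given user sees the same model on every request, and raising `canary_percent` only ever moves users from the stable model to the canary, never back. An A/B test can reuse the same split with a fixed percentage, while a multi-armed bandit would adjust the share over time based on observed performance.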

There is no de facto standard, nor even a set of established best practices, for deploying, managing, and monitoring models in any of these scenarios.

## Inference + Gradient

Gradient from [Paperspace](https://www.paperspace.com/gradient) streamlines model deployment:

* Any framework or runtime is supported, as are both gRPC and HTTP inference endpoints
* Models can be deployed to a wide variety of CPU and GPU instance types
* Single or multiple instances are supported with out-of-the-box load balancing
* Both basic auth (username & password) and JWT auth are supported
* Autoscaling and monitoring of endpoints are available
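To illustrate the two authentication styles in the list above, the sketch below builds the standard HTTP `Authorization` header values for basic auth and for a bearer token such as a JWT. It uses generic HTTP conventions only and is not taken from Gradient's API; the helper names, credentials, and token are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return an HTTP Basic Authorization value: base64 of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth_header(token: str) -> str:
    """Return a Bearer Authorization value, the usual way to send a JWT."""
    return "Bearer " + token


# Placeholder credentials, for illustration only.
headers = {"Authorization": basic_auth_header("Aladdin", "open sesame")}
```

Either value would be sent as the `Authorization` header on each inference request. Basic auth credentials are only encoded, not encrypted, so they should travel over HTTPS.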

### Related Material

{% embed url="<https://docs.paperspace.com/gradient/deployments/about>" %}

