Model Deployment (Inference)

Once a model is trained, it is typically deployed as an online API endpoint within a web service or used to make batch predictions. For latency-sensitive applications, or for devices with intermittent or no connectivity, a model can also be deployed to the edge: embedded as a component within an iPhone app, or running inside a driverless car, a robot, an IoT device, or wherever the model is needed.

Models are deployed to customer-facing applications, as in the case of an e-commerce site that makes real-time product recommendations, or to internal services, as in the case of a company that performs real-time financial forecasting, sentiment analysis, or risk management.

Deployed models should be monitored for infrastructure health (e.g. requests, response time, and load), for model drift and decay (where live model performance degrades on new, unseen data, or the underlying assumptions about the data change), and for other performance criteria.
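
Model drift can be monitored with simple statistical checks. Below is a minimal sketch, assuming SciPy and NumPy are available, that flags a single numeric feature whose live distribution has shifted away from a reference sample drawn from the training data; the two-sample Kolmogorov-Smirnov test and the alpha threshold are illustrative choices, not part of any particular serving framework.

```python
# Minimal drift check: compare the live distribution of one feature
# against a reference sample taken from the training data.
import numpy as np
from scipy.stats import ks_2samp

def feature_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the live sample looks significantly different
    from the reference sample (two-sample Kolmogorov-Smirnov test)."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Stand-in data: a reference sample from training vs. the last hour of live traffic
reference_sample = np.random.normal(0.0, 1.0, size=5_000)
live_sample = np.random.normal(0.4, 1.0, size=1_000)

if feature_drift(reference_sample, live_sample):
    print("Possible data drift detected -- consider investigating or retraining.")
```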

In the case of a web service, a model may need to be autoscaled based on requests and/or load.
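
As a sketch of what such a scaling rule can look like, the function below applies the proportional rule used by autoscalers such as Kubernetes' Horizontal Pod Autoscaler: adjust the replica count so that each replica handles roughly a target request rate. The target values and bounds are hypothetical.

```python
import math

def desired_replicas(current_replicas: int,
                     observed_rps_per_replica: float,
                     target_rps_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule: scale the replica count so that each
    replica sees roughly the target request rate."""
    desired = math.ceil(current_replicas * observed_rps_per_replica / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# e.g. 2 replicas each seeing 90 req/s against a 50 req/s target -> 4 replicas
print(desired_replicas(current_replicas=2, observed_rps_per_replica=90, target_rps_per_replica=50))
```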

Frameworks and runtimes: In a basic scenario, a model can be served from a traditional web server built with something like Flask. In a more sophisticated environment, inference happens in an optimized, ML-specific model-serving framework such as TensorFlow Serving, Clipper, TensorRT, or Seldon.
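
A minimal sketch of the basic Flask scenario is shown below; the model file, input schema, and port are assumptions for illustration, not requirements of any framework.

```python
# Minimal Flask prediction endpoint (illustrative only).
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # e.g. a scikit-learn model trained elsewhere

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)            # expects {"instances": [[...], ...]}
    predictions = model.predict(payload["instances"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```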

Server: Most model-serving frameworks expose REST endpoints, though TensorFlow Serving and TensorRT also offer gRPC endpoints, which are fussier to work with but more performant.
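
For example, a REST-based framework such as TensorFlow Serving can be called with a plain HTTP POST; the host, model name, and input row below are placeholders.

```python
# Calling a TensorFlow Serving REST endpoint.
import requests

SERVER_URL = "http://localhost:8501/v1/models/my_model:predict"  # placeholder host and model name

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # one input row (shape depends on the model)
response = requests.post(SERVER_URL, json=payload, timeout=5)
response.raise_for_status()
print(response.json()["predictions"])
```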

Canary rollouts, blue-green deployments, multi-armed bandits, & A/B testing: These methods are not specific to machine learning, but they govern how models are rolled out to production in order to catch errors, run experiments, and find the best-performing model.
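
As a toy sketch of the multi-armed-bandit idea, an epsilon-greedy router sends most traffic to the model version with the best observed reward (e.g. click-through) while continuing to explore the alternative. The version names and reward signal here are hypothetical.

```python
# Toy epsilon-greedy router between two deployed model versions.
import random

class EpsilonGreedyRouter:
    def __init__(self, versions, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {v: {"reward": 0.0, "count": 0} for v in versions}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.stats))            # explore
        return max(self.stats, key=self._mean_reward)         # exploit

    def record(self, version: str, reward: float) -> None:
        self.stats[version]["reward"] += reward
        self.stats[version]["count"] += 1

    def _mean_reward(self, version: str) -> float:
        s = self.stats[version]
        return s["reward"] / s["count"] if s["count"] else 0.0

router = EpsilonGreedyRouter(["model-v1", "model-v2"])
version = router.choose()           # route this request to `version`
router.record(version, reward=1.0)  # e.g. 1.0 if the user clicked the recommendation
```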

There is no de facto standard, or even established best practice, for deploying, managing, and monitoring models in any of these scenarios.

Inference + Gradient

Gradient from Paperspace streamlines model deployment:

  • Any framework/runtime is supported, as well as both gRPC and HTTP inference endpoints (see the request sketch after this list)

  • Models can be deployed to a wide variety of CPU and GPU instance types

  • Single or multiple instances are supported with out-of-the-box load balancing

  • Both basic auth (username & password) and JWT auth are supported

  • Autoscaling and monitoring of endpoints are available
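
As a rough sketch of what calling such an HTTP endpoint with JWT auth might look like, the snippet below uses a placeholder URL, token, and payload; the actual request shape depends on the deployment and the serving framework behind it.

```python
# Calling a deployed HTTP inference endpoint with a JWT bearer token.
# The URL, token, and payload are placeholders, not real values.
import requests

ENDPOINT_URL = "https://example-deployment.example.com/predict"  # placeholder endpoint
JWT_TOKEN = "<your-token>"                                        # placeholder token

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {JWT_TOKEN}"},
    json={"instances": [[5.1, 3.5, 1.4, 0.2]]},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```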
