Machine Learning Model Deployment Patterns

The Gap Between Notebooks and Production

The Jupyter notebook that achieves ninety-six percent accuracy on a held-out test set is not a model. It is a hypothesis. The gap between that hypothesis and a production system that serves predictions to real users at acceptable latency, cost, and reliability is where the majority of machine learning projects fail. Industry surveys consistently report that fewer than half of ML prototypes ever reach production, and the reasons are almost never about model accuracy.

Deploying a model means deploying a system, and systems raise concerns that notebooks abstract away: dependency management, version drift, upstream data pipelines, monitoring, rollback, and the unstated assumption that production data will keep resembling the training distribution. It frequently does not.

Deployment Patterns and When to Use Each

The choice of deployment pattern should follow the latency, throughput, and cost requirements of the use case, not the other way around. I have seen teams default to real-time serving because it feels modern, then discover that their use case could have been served perfectly well by a nightly batch job that cost ninety-eight percent less to operate.

Batch scoring — precompute predictions on a schedule, store results, serve from a cache. Ideal for recommendation ranking where slight staleness is acceptable and cost matters.
Real-time inference — load the model in a serving process, accept requests, return predictions within an SLO. Required for fraud detection, content moderation, and any use case where latency is the product.
Edge deployment — ship the model to the client or a nearby CDN worker. Reduces latency and cost but introduces version fragmentation and makes model updates slower.
Streaming inference — consume from a message queue, produce predictions asynchronously. Fits pipelines where the input is a continuous event stream rather than a request.
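To make the first pattern in the list concrete, here is a minimal sketch of batch scoring in Python. It is an illustration under stated assumptions, not a reference implementation: the names `run_batch_job`, `serve`, and `toy_score` are hypothetical, a plain dictionary stands in for whatever cache or key-value store a real system would use, and the staleness limit is an arbitrary parameter.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical in-memory cache: user_id -> (prediction, computed_at).
# A real deployment would use a key-value store; a dict keeps the sketch runnable.
Cache = Dict[str, Tuple[float, float]]


def run_batch_job(
    user_ids: Iterable[str],
    score: Callable[[str], float],
    cache: Cache,
    now: Optional[float] = None,
) -> int:
    """Precompute one prediction per user and store it with a timestamp."""
    computed_at = time.time() if now is None else now
    count = 0
    for user_id in user_ids:
        cache[user_id] = (score(user_id), computed_at)
        count += 1
    return count


def serve(
    user_id: str,
    cache: Cache,
    max_age_seconds: float,
    default: float,
    now: Optional[float] = None,
) -> float:
    """Answer from the cache; fall back to a default if missing or too stale."""
    current = time.time() if now is None else now
    entry = cache.get(user_id)
    if entry is None:
        return default
    prediction, computed_at = entry
    if current - computed_at > max_age_seconds:
        return default
    return prediction


def toy_score(user_id: str) -> float:
    # Stand-in for a real model: a deterministic score from the id length.
    return len(user_id) / 10.0
```

The serving path never touches the model. That is where the cost saving comes from, and the `max_age_seconds` check is where the "slight staleness" trade-off becomes an explicit, tunable number.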

The Infrastructure Tax

Machine learning models are expensive to serve in ways that traditional software is not. A model with three hundred million parameters consumes memory proportional to its parameter count on every replica, regardless of request volume. A sudden traffic spike does not just increase CPU usage; it can cause out-of-memory errors, because every replica added during horizontal scale-out must load the full model weights before it can serve a single request. Capacity planning for ML serving is a different discipline from capacity planning for stateless web services.
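The memory arithmetic behind that paragraph is simple enough to sketch. The snippet below estimates the weight footprint of a model and how many replicas fit on one node. The function names are hypothetical, and the 20 percent per-replica overhead (activations, runtime, buffers) is an assumed placeholder, not a measured figure.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_bytes(num_params: int, dtype: str = "float32") -> int:
    """Memory needed to hold the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


def max_replicas(
    node_memory_bytes: int,
    num_params: int,
    dtype: str = "float32",
    overhead_fraction: float = 0.2,  # assumed overhead beyond raw weights
) -> int:
    """How many replicas fit on one node if each needs weights plus overhead."""
    per_replica = weights_bytes(num_params, dtype) * (1.0 + overhead_fraction)
    return int(node_memory_bytes // per_replica)


# Three hundred million float32 parameters: 300e6 * 4 bytes = 1.2 GB of weights.
print(weights_bytes(300_000_000))
# Replicas that fit on a node with 16 GB of memory under the assumptions above.
print(max_replicas(16 * 10**9, 300_000_000))
```

Under these assumptions the replica ceiling is set by memory, not CPU, which is why an autoscaler tuned for stateless web services can schedule one replica too many.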

The mitigation strategies draw from classical systems engineering: model quantization to reduce memory footprint, request batching to amortize inference overhead, GPU sharing to improve utilization, and graceful degradation when capacity is exhausted. With the partial exception of quantization, none of these are ML techniques. They are infrastructure techniques applied to ML artifacts, which is why ML engineering is increasingly indistinguishable from platform engineering.
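Two of those strategies, request batching and graceful degradation, can be sketched together. The `MicroBatcher` class below is a hypothetical helper, not any serving framework's API. It is deliberately synchronous so that it stays runnable; a production server would typically flush on a timer or when a batch fills, and `toy_predict_batch` stands in for a vectorized model call.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


class MicroBatcher:
    """Queues requests and runs the model once per batch (hypothetical helper)."""

    def __init__(
        self,
        predict_batch: Callable[[Sequence[float]], List[float]],
        max_batch_size: int,
        max_queue_size: int,
    ) -> None:
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_queue_size = max_queue_size
        self.queue: Deque[float] = deque()
        self.model_calls = 0

    def submit(self, features: float) -> bool:
        """Enqueue a request; return False (shed load) when the queue is full."""
        if len(self.queue) >= self.max_queue_size:
            return False
        self.queue.append(features)
        return True

    def flush(self) -> List[float]:
        """Drain the queue in arrival order, one model call per batch."""
        results: List[float] = []
        while self.queue:
            size = min(self.max_batch_size, len(self.queue))
            batch = [self.queue.popleft() for _ in range(size)]
            results.extend(self.predict_batch(batch))
            self.model_calls += 1
        return results


def toy_predict_batch(batch: Sequence[float]) -> List[float]:
    # Stand-in for a vectorized model call.
    return [2.0 * x for x in batch]
```

With `max_batch_size=4`, ten accepted requests cost three model calls instead of ten. The bounded queue is the degradation policy: once it is full, new requests are rejected immediately rather than left to time out.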

A model in production is a liability the moment it ships. The question is not whether it will degrade — data drift guarantees it will — but whether you have the instrumentation to detect the degradation before your users do.

Monitoring Beyond Uptime

Traditional uptime monitoring is necessary but insufficient for ML systems. A serving endpoint can return HTTP 200 for every request while quietly producing garbage predictions because the input distribution has shifted. Detecting this requires monitoring the statistical properties of inputs and outputs, not just the operational properties of the infrastructure.

The practical implementation involves logging prediction confidence distributions, tracking feature drift via statistical tests comparing live inputs to training baselines, and setting alerting thresholds on business metrics that correlate with model quality — conversion rates, false positive costs, user complaints. The system that catches a degrading model before the support team notices is worth more than the model itself. Deployment is not a one-time event; it is the beginning of an operational commitment that lasts until the model is retired.
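One way to implement the feature drift check described above is a two-sample Kolmogorov-Smirnov test on a single numeric feature. The sketch below is one option among several, not a prescribed method. The function names are hypothetical, the default significance level of 0.05 is an assumption, and the rejection threshold uses the standard large-sample approximation, so it is unreliable for very small samples.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(baseline: Sequence[float], live: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(baseline), sorted(live)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap between two step functions peaks at one of the observed points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def drift_detected(
    baseline: Sequence[float],
    live: Sequence[float],
    alpha: float = 0.05,  # assumed significance level
) -> bool:
    """Flag drift when the KS statistic exceeds the asymptotic critical value."""
    n, m = len(baseline), len(live)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, live) > threshold


# Training baseline versus a live window whose values have shifted upward.
training = [float(i) for i in range(100)]
shifted = [float(i) for i in range(50, 150)]
print(ks_statistic(training, shifted), drift_detected(training, shifted))
```

A check like this runs per feature on a sliding window of live inputs. Because it needs no labels, it can raise an alert well before business metrics move.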
