17  MLOps: From Notebook to Production

17.1 The Deployment Gap

Most data science education — including the earlier chapters in this course — focuses on building models: training, evaluating, and tuning within a Jupyter notebook. The value of a model, however, only materializes when it is actually used — by a system, an application, or a decision-making process running in the real world.

Research suggests that fewer than 20% of ML models make it from notebook to production, and many that do are abandoned within months. Common reasons include:

  • No reproducible training pipeline
  • Model performance that degrades as the world changes
  • No monitoring, so failures are silent
  • Experiments tracked in spreadsheets, so nobody knows which configuration was best
  • No versioning, so nobody knows what is running in production

MLOps (Machine Learning Operations) is the set of practices that close this gap. It applies the principles of software engineering — version control, testing, CI/CD, monitoring — to machine learning systems. In this chapter we cover the pieces most relevant to individual data scientists: reproducible pipelines, experiment tracking, model serialization, serving, drift detection, and model documentation.

Code
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score
from scipy import stats
import matplotlib.pyplot as plt
import joblib, json, os, datetime

np.random.seed(42)

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

17.2 The ML Lifecycle

A production ML system goes through a cycle, not a one-time pipeline. Data is ingested, features are engineered, a model is trained and evaluated, then registered, deployed, monitored, and eventually retrained as the world changes.

Stage        Key Practice
Data         Versioning (DVC), validation (Great Expectations, Pandera)
Training     Reproducible pipelines, experiment tracking (MLflow)
Evaluation   Performance gates, held-out test set, model registry
Deployment   Docker, FastAPI, cloud endpoints (SageMaker, Vertex AI)
Monitoring   Data drift, performance tracking (Evidently, Arize)
Retraining   Scheduled or drift-triggered pipelines

This chapter covers the practices in the training through monitoring stages. Deployment infrastructure (containers, cloud platforms) is a topic on its own.

17.3 Reproducible Pipelines

The most common reproducibility bug in ML is preprocessing leakage: fitting a scaler on all the data (including the test set) rather than on the training set alone. The second most common is serving inconsistency: the scaling applied at training time differs from what is applied when the model runs in production.

Both bugs are solved by wrapping preprocessing and the model into a single sklearn Pipeline. When we call pipeline.fit(X_train, y_train), all transformers are fit on training data only. When we call pipeline.predict(X_new), the same transformations are applied automatically. The pipeline object can be serialized as a single artifact — no separate preprocessing code to maintain or coordinate.

A good rule of thumb: every production model should be a Pipeline, not a loose collection of preprocessing calls followed by a model call. The Pipeline is the contract between training and serving.

Code
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)
print(f'MAE: {mae:.4f}')
print(f'R2:  {r2:.4f}')
print()
print('Pipeline steps:')
for name, step in pipeline.steps:
    print(f'  {name}: {type(step).__name__}')
print()
print('Scaling and prediction happen in a single call:')
print(pipeline.predict(X_test.iloc[:3]))
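To make the leakage bug described above concrete, here is a small synthetic illustration (not the housing data) showing how a scaler fit on all rows absorbs test-set statistics that the training-only scaler never sees:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic illustration: the test distribution is shifted, as it
# often is in production
rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(800, 1))
X_te = rng.normal(3.0, 1.0, size=(200, 1))

# Leaky: the scaler's mean absorbs test-set statistics
leaky = StandardScaler().fit(np.vstack([X_tr, X_te]))

# Correct: fit on training rows only -- exactly what Pipeline.fit does
clean = StandardScaler().fit(X_tr)

print(f'leaky scaler mean: {leaky.mean_[0]:.3f}')   # pulled toward the test set
print(f'clean scaler mean: {clean.mean_[0]:.3f}')   # close to 0
```

The leaky scaler centers the data partway between the two distributions, so its test-set transformations look better than what the model would actually see on genuinely unseen data.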

17.4 Experiment Tracking with MLflow

Data scientists routinely run dozens or hundreds of experiments while building a model: different algorithms, hyperparameters, feature sets, preprocessing choices. Without systematic tracking, choosing the best configuration to deploy becomes nearly impossible.

MLflow stores parameters, metrics, and model artifacts for every training run. It also provides a model registry where models are versioned and promoted through stages: Staging → Production → Archived. The web UI (launched with mlflow ui) makes it straightforward to compare runs side by side.

Install: pip install mlflow

Code
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri('file:./mlruns')
mlflow.set_experiment('california_housing')

configs = [
    {'n_estimators': 50,  'max_depth': 3},
    {'n_estimators': 100, 'max_depth': 4},
    {'n_estimators': 200, 'max_depth': 5},
]

for cfg in configs:
    with mlflow.start_run():
        mlflow.log_params(cfg)
        pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('model',  GradientBoostingRegressor(**cfg, random_state=42))
        ])
        pipe.fit(X_train, y_train)
        preds = pipe.predict(X_test)
        mlflow.log_metric('mae', mean_absolute_error(y_test, preds))
        mlflow.log_metric('r2',  r2_score(y_test, preds))
        mlflow.sklearn.log_model(pipe, 'pipeline')
        print(f"n_est={cfg['n_estimators']:3d}, depth={cfg['max_depth']}: "
              f"MAE={mean_absolute_error(y_test,preds):.4f}, R2={r2_score(y_test,preds):.4f}")

print()
print("Run: mlflow ui  to compare results in your browser")

17.5 Model Serialization

Before a model can be deployed, it must be serialized — saved from memory to disk. joblib is the standard for sklearn objects; it is faster and more compact than pickle for large numpy arrays.

What to save alongside the model:

  • The serialized pipeline
  • Training data version or hash
  • Hyperparameters and evaluation metrics
  • Expected input feature schema
  • Python environment (requirements.txt)

The question "what model is running in production, and why did we choose it?" should always have a clear, traceable answer. Saving these artifacts at training time, linked to the MLflow run ID, provides that traceability.

Code
os.makedirs('model_artifacts', exist_ok=True)

joblib.dump(pipeline, 'model_artifacts/pipeline_v1.pkl')

metadata = {
    'model_version':   '1.0.0',
    'trained_on':      str(datetime.date.today()),
    'algorithm':       'GradientBoostingRegressor',
    'hyperparameters': {'n_estimators': 100, 'max_depth': 4},
    'features':        list(X_train.columns),
    'test_mae':        round(mae, 4),
    'test_r2':         round(r2, 4),
}
with open('model_artifacts/model_card.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print('Saved artifacts:')
for f in os.listdir('model_artifacts'):
    print(f'  {f}  ({os.path.getsize("model_artifacts/"+f):,} bytes)')

# Verify the pipeline round-trips correctly
loaded = joblib.load('model_artifacts/pipeline_v1.pkl')
assert np.allclose(y_pred, loaded.predict(X_test))
print("Round-trip: OK")

17.6 Model Serving

The most common pattern for serving a sklearn model in production is a lightweight HTTP API built with FastAPI. It accepts a JSON payload, runs the pipeline, and returns the prediction. Save the file below as serve.py and start it with uvicorn serve:app.

import joblib, numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app      = FastAPI()
pipeline = joblib.load('model_artifacts/pipeline_v1.pkl')

class HousingFeatures(BaseModel):
    MedInc: float
    HouseAge: float
    AveRooms: float
    AveBedrms: float
    Population: float
    AveOccup: float
    Latitude: float
    Longitude: float

@app.post('/predict')
def predict(f: HousingFeatures):
    X = np.array([[f.MedInc, f.HouseAge, f.AveRooms, f.AveBedrms,
                   f.Population, f.AveOccup, f.Latitude, f.Longitude]])
    return {'predicted_value': round(float(pipeline.predict(X)[0]), 4)}

@app.get('/health')
def health(): return {'status': 'ok'}

Install: pip install fastapi uvicorn
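A request body for the /predict endpoint carries one JSON field per attribute of HousingFeatures; the values below are illustrative (they correspond to the first row of the dataset):

```json
{
  "MedInc": 8.3252,
  "HouseAge": 41.0,
  "AveRooms": 6.9841,
  "AveBedrms": 1.0238,
  "Population": 322.0,
  "AveOccup": 2.5556,
  "Latitude": 37.88,
  "Longitude": -122.23
}
```

With the server running on the default port, `curl -X POST http://localhost:8000/predict -H 'Content-Type: application/json' -d @payload.json` returns the prediction, and pydantic rejects payloads with missing or mistyped fields automatically.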

For larger scale, look at BentoML, Seldon Core, or cloud-native endpoints (SageMaker, Vertex AI, Azure ML) which add auto-scaling, A/B traffic splitting, and integrated monitoring.

17.7 Monitoring: Data Drift

Data drift occurs when the statistical properties of the model’s input features change after deployment. The model was trained on distribution \(P_{\text{train}}(X)\); data arriving in production follows a different \(P_{\text{prod}}(X)\). Performance degrades because the model is generalizing beyond its training distribution.

Common examples: a fraud model trained on pre-pandemic spending patterns; a churn model that encounters a new demographic segment; a demand forecast deployed in a new region with different seasonality.

The Kolmogorov-Smirnov (KS) test detects distribution shift in a single continuous feature. The Population Stability Index (PSI) quantifies the magnitude: PSI < 0.1 is stable; 0.1–0.25 is moderate drift; above 0.25 signals significant drift and should trigger a review.

Code
def psi(expected, actual, buckets=10):
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets+1))
    breakpoints[0] = -np.inf; breakpoints[-1] = np.inf
    exp_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    act_pct = np.histogram(actual,   breakpoints)[0] / len(actual)
    exp_pct = np.where(exp_pct==0, 1e-6, exp_pct)
    act_pct = np.where(act_pct==0, 1e-6, act_pct)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

train_medinc = X_train['MedInc'].values
prod_stable  = np.random.choice(train_medinc, size=1000, replace=True)  # same distribution
prod_drifted = np.random.normal(train_medinc.mean()+2.0, train_medinc.std()*1.5, 1000)

for label, prod in [('Stable', prod_stable), ('Drifted', prod_drifted)]:
    ks, ks_p = stats.ks_2samp(train_medinc, prod)
    p = psi(train_medinc, prod)
    alert = '  --> ALERT: significant drift' if p > 0.25 else ''
    print(f'[{label}]  KS={ks:.3f} (p={ks_p:.4f}),  PSI={p:.3f}{alert}')

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, prod, label in zip(axes, [prod_stable, prod_drifted], ['Stable','Drifted']):
    ax.hist(train_medinc, bins=40, alpha=0.5, label='Training', density=True, color='steelblue')
    ax.hist(prod, bins=40, alpha=0.5, label='Production', density=True, color='darkorange')
    ax.set_title(f'MedInc — {label}'); ax.legend()
plt.tight_layout(); plt.show()

17.8 Monitoring: Concept Drift

Concept drift is distinct from data drift: the relationship \(P(Y \mid X)\) changes, even if the input distribution \(P(X)\) stays stable. Consumer spending behavior shifts during an economic downturn; fraud patterns evolve as fraudsters adapt; a sentiment classifier degrades as language changes.

Concept drift cannot be detected from inputs alone — we need ground truth labels, which often arrive with a delay. The most practical approach is to track a rolling window of model performance over time and alert when it crosses a threshold. The Page-Hinkley test and CUSUM are more rigorous sequential change-point methods for high-stakes environments.

Code
np.random.seed(10)
n_periods = 24

stable_maes  = mae + np.random.normal(0, 0.005, 12)
drifted_maes = mae + np.linspace(0, 0.15, 12) + np.random.normal(0, 0.008, 12)
monthly_maes = np.concatenate([stable_maes, drifted_maes])
alert_threshold = stable_maes.mean() + 2*stable_maes.std()

months = np.arange(1, n_periods+1)
colors = ['darkorange' if m > alert_threshold else 'steelblue' for m in monthly_maes]
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(months, monthly_maes, color=colors, alpha=0.8)
ax.axhline(mae,             color='steelblue', linestyle='--', lw=1.5, label='Baseline MAE')
ax.axhline(alert_threshold, color='red',       linestyle='--', lw=1.5, label='Alert threshold')
ax.axvline(12.5, color='black', linestyle=':', lw=1.5, label='Drift begins')
ax.set_xlabel('Month'); ax.set_ylabel('MAE')
ax.set_title('Model Performance Over Time — Concept Drift Detection')
ax.legend(); plt.tight_layout(); plt.show()

alert_months = [m for m,v in zip(months, monthly_maes) if v > alert_threshold]
print(f"Months triggering alert: {alert_months}")

17.9 Retraining

When drift is detected, the model needs to be retrained. The key decisions are: how often, on what data, and triggered by what?

Scheduled retraining (weekly, monthly) is simple to implement but wastes compute if drift is infrequent, and is too slow if drift is rapid. Triggered retraining fires when a performance or drift metric crosses a threshold — more efficient, but requires reliable ground truth labels. Online learning updates the model incrementally without full retraining; the river library (pip install river) supports this.

What data to train on also depends on the setting. A rolling window of the most recent months is appropriate when the distribution changes rapidly, such as fraud detection. For stable relationships, using all historical data is generally better.

One important constraint: every retrained model should go through the same evaluation gate as the original. Automatically promoting a retrained model that performs worse than the deployed version is a production failure mode that is surprisingly common.
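The evaluation gate in that last paragraph is simple to encode. A minimal sketch on synthetic data (passes_gate is illustrative, and the deliberately weaker "retrained" model stands in for a bad retraining run):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def passes_gate(candidate_mae, incumbent_mae, tolerance=0.0):
    """Promote a retrained model only if it is at least as good as the
    deployed one on the same held-out set."""
    return candidate_mae <= incumbent_mae + tolerance

# Synthetic stand-in for the housing data, so the sketch is self-contained
X, y = make_regression(n_samples=2000, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

incumbent = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
retrained = GradientBoostingRegressor(n_estimators=10,  random_state=0).fit(X_tr, y_tr)  # underfit

inc_mae = mean_absolute_error(y_te, incumbent.predict(X_te))
ret_mae = mean_absolute_error(y_te, retrained.predict(X_te))
verdict = 'promote' if passes_gate(ret_mae, inc_mae) else 'keep incumbent'
print(f'incumbent MAE={inc_mae:.2f}, retrained MAE={ret_mae:.2f} -> {verdict}')
```

In an automated pipeline this check sits between training and registry promotion; a nonzero tolerance allows promoting a marginally worse model when, say, it is trained on fresher data.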

17.10 Model Cards

A model card documents what a model does, how it was built, and what its limitations are. Introduced by Google (Mitchell et al., 2019), model cards are now a standard practice for responsible deployment and are required for high-risk AI systems under the EU AI Act.

A model card typically covers:

  • Model details (algorithm, version, training date, author)
  • Intended use and out-of-scope uses
  • Training data (dataset, size, known limitations)
  • Evaluation results, broken down by subgroup where relevant
  • Ethical considerations: fairness analysis, potential for misuse
  • Caveats and conditions that may cause the model to fail

Because it serves multiple audiences — engineers, product managers, auditors, regulators — a model card is the single most important governance document for a deployed model.

Code
model_card = {
    'model_details': {
        'name': 'California Housing Price Estimator',
        'version': '1.0.0',
        'type': 'GradientBoostingRegressor (sklearn Pipeline)',
        'date': str(datetime.date.today()),
    },
    'intended_use': {
        'primary_use': 'Estimate median house values in California census tracts.',
        'out_of_scope': [
            'Individual property appraisals (unit is census tract, not house)',
            'Regions outside California',
            'Data from after 1990 (model trained on 1990 census)'
        ]
    },
    'evaluation': {
        'test_set_size': f'{X_test.shape[0]:,} census tracts',
        'MAE': round(mae, 4),
        'R2':  round(r2, 4),
        'note': 'Target is median_house_value / 100000'
    },
    'ethical_considerations': (
        'No subgroup fairness analysis performed. '
        'Should not be used for individual loan decisions without human review.'
    ),
    'caveats': [
        'Performance degrades significantly outside the 1990 California context.',
        'Monitor for data drift if deploying in a different time period.'
    ]
}

with open('model_artifacts/model_card_full.json', 'w') as f:
    json.dump(model_card, f, indent=2)
print(json.dumps(model_card, indent=2))

17.11 Key Takeaways

A model is not done when it achieves good accuracy on a test set. It is done when it is deployed, monitored, versioned, documented, and has a clear path to being updated when the world changes.

  • Wrap every production model in a Pipeline — no separate preprocessing code
  • Track every training run in MLflow — log parameters, metrics, and the model artifact
  • Serialize the pipeline and a model card together as a single deployment artifact
  • Monitor for data drift (KS test, PSI) and concept drift (rolling performance windows)
  • Every retrained model goes through the same evaluation gate before promotion

Recommended reading:

  • Designing Machine Learning Systems — Chip Huyen
  • Machine Learning Engineering — Andriy Burkov (free PDF)
  • Evidently AI documentation: evidentlyai.com