Most data science education — including the earlier chapters in this course — focuses on building models: training, evaluating, and tuning within a Jupyter notebook. The value of a model, however, only materializes when it is actually used — by a system, an application, or a decision-making process running in the real world.
Research suggests that fewer than 20% of ML models make it from notebook to production, and many that do are abandoned within months. Common reasons include: no reproducible training pipeline, model performance that degrades as the world changes, no monitoring so failures are silent, experiments tracked in spreadsheets so nobody knows which model configuration was best, and no versioning so nobody knows what is running in production.
MLOps (Machine Learning Operations) is the set of practices that close this gap. It applies the principles of software engineering — version control, testing, CI/CD, monitoring — to machine learning systems. In this chapter we cover the pieces most relevant to individual data scientists: reproducible pipelines, experiment tracking, model serialization, serving, drift detection, and model documentation.
A production ML system goes through a cycle, not a one-time pipeline. Data is ingested, features are engineered, a model is trained and evaluated, then registered, deployed, monitored, and eventually retrained as the world changes.
Monitoring: data drift and performance tracking (Evidently, Arize)
Retraining: scheduled or drift-triggered pipelines
This chapter covers the practices in the training through monitoring stages. Deployment infrastructure (containers, cloud platforms) is a topic on its own.
17.3 Reproducible Pipelines
The most common reproducibility bug in ML is preprocessing leakage: fitting a scaler on all the data (including the test set) rather than on the training set alone. The second most common is serving inconsistency: the scaling applied at training time differs from what is applied when the model runs in production.
Both bugs are solved by wrapping preprocessing and the model into a single sklearn Pipeline. When we call pipeline.fit(X_train, y_train), all transformers are fit on training data only. When we call pipeline.predict(X_new), the same transformations are applied automatically. The pipeline object can be serialized as a single artifact — no separate preprocessing code to maintain or coordinate.
A good rule of thumb: every production model should be a Pipeline, not a loose collection of preprocessing calls followed by a model call. The Pipeline is the contract between training and serving.
Code
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Preprocessing and model in one object: fit on training data only,
# and the same transformations are applied automatically at predict time
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=42))
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MAE: {mae:.4f}')
print(f'R2: {r2:.4f}')
print()
print('Pipeline steps:')
for name, step in pipeline.steps:
    print(f'  {name}: {type(step).__name__}')
print()
print('Scaling and prediction happen in a single call:')
print(pipeline.predict(X_test.iloc[:3]))
```
17.4 Experiment Tracking with MLflow
Data scientists routinely run dozens or hundreds of experiments while building a model: different algorithms, hyperparameters, feature sets, preprocessing choices. Without systematic tracking, choosing the best configuration to deploy becomes nearly impossible.
MLflow stores parameters, metrics, and model artifacts for every training run. It also provides a model registry where models are versioned and promoted through stages: Staging → Production → Archived. The web UI (launched with mlflow ui) makes it straightforward to compare runs side by side.
17.5 Model Serialization
Before a model can be deployed, it must be serialized — saved from memory to disk. joblib is the standard for sklearn objects; it is faster and more compact than pickle for large numpy arrays.
What to save alongside the model:
The serialized pipeline
Training data version or hash
Hyperparameters and evaluation metrics
Expected input feature schema
Python environment (requirements.txt)
The question "what model is running in production, and why did we choose it?" should always have a clear, traceable answer. Saving these artifacts at training time, linked to the MLflow run ID, provides that traceability.
Code
```python
import os
import json
import datetime

import joblib
import numpy as np

os.makedirs('model_artifacts', exist_ok=True)
joblib.dump(pipeline, 'model_artifacts/pipeline_v1.pkl')

metadata = {
    'model_version': '1.0.0',
    'trained_on': str(datetime.date.today()),
    'algorithm': 'GradientBoostingRegressor',
    'hyperparameters': {'n_estimators': 100, 'max_depth': 4},
    'features': list(X_train.columns),
    'test_mae': round(mae, 4),
    'test_r2': round(r2, 4),
}
with open('model_artifacts/model_card.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print('Saved artifacts:')
for fname in os.listdir('model_artifacts'):
    print(f'  {fname} ({os.path.getsize("model_artifacts/" + fname):,} bytes)')

# Verify the pipeline round-trips correctly
loaded = joblib.load('model_artifacts/pipeline_v1.pkl')
assert np.allclose(y_pred, loaded.predict(X_test))
print('Round-trip: OK')
```
17.6 Model Serving
The most common pattern for serving a sklearn model in production is a lightweight HTTP API built with FastAPI. It accepts a JSON payload, runs the pipeline, and returns the prediction. Save the file below as serve.py and start it with uvicorn serve:app.
For larger scale, look at BentoML, Seldon Core, or cloud-native endpoints (SageMaker, Vertex AI, Azure ML) which add auto-scaling, A/B traffic splitting, and integrated monitoring.
17.7 Monitoring: Data Drift
Data drift occurs when the statistical properties of the model’s input features change after deployment. The model was trained on distribution \(P_{\text{train}}(X)\); data arriving in production follows a different \(P_{\text{prod}}(X)\). Performance degrades because the model is generalizing beyond its training distribution.
Common examples: a fraud model trained on pre-pandemic spending patterns; a churn model that encounters a new demographic segment; a demand forecast deployed in a new region with different seasonality.
The Kolmogorov-Smirnov (KS) test detects distribution shift in a single continuous feature. The Population Stability Index (PSI) quantifies the magnitude: PSI < 0.1 is stable; 0.1–0.25 is moderate drift; above 0.25 signals significant drift and should trigger a review.
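Both checks take only a few lines. The sketch below (the psi helper is my own implementation of the standard formula) bins the baseline feature into deciles and compares a production sample against them, alongside a two-sample KS test from scipy:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index of `actual` relative to baseline `expected`."""
    # Bin edges come from the baseline (training) distribution
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Clip to avoid log(0) when a bin is empty
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train = rng.normal(0, 1, 10_000)         # training distribution
prod_same = rng.normal(0, 1, 10_000)     # no drift
prod_shift = rng.normal(0.5, 1, 10_000)  # mean shifted by 0.5 sd

print(f"PSI, no drift: {psi(train, prod_same):.3f}")
print(f"PSI, shifted:  {psi(train, prod_shift):.3f}")
print(f"KS p-value, shifted: {ks_2samp(train, prod_shift).pvalue:.2e}")
```

The unshifted sample lands well below the 0.1 threshold, while the 0.5-sd mean shift pushes PSI into the review zone.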
17.8 Monitoring: Concept Drift
Concept drift is distinct from data drift: the relationship \(P(Y \mid X)\) changes, even if the input distribution \(P(X)\) stays stable. Consumer spending behavior shifts during an economic downturn; fraud patterns evolve as fraudsters adapt; a sentiment classifier degrades as language changes.
Concept drift cannot be detected from inputs alone — we need ground truth labels, which often arrive with a delay. The most practical approach is to track a rolling window of model performance over time and alert when it crosses a threshold. The Page-Hinkley test and CUSUM are more rigorous sequential change-point methods for high-stakes environments.
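As a sketch of the sequential approach, here is a textbook Page-Hinkley detector for an upward shift in a monitored error metric (the delta and threshold values are illustrative and would need tuning for a real system):

```python
def page_hinkley(values, delta=0.005, threshold=0.05):
    """Return the 1-based index where an upward shift is detected, else None.

    delta: change magnitude to tolerate; threshold: alarm level (lambda).
    """
    mean = 0.0     # running mean of the series
    cum = 0.0      # cumulative deviation above the mean
    min_cum = 0.0  # minimum of the cumulative sum so far
    for t, x in enumerate(values, start=1):
        mean += (x - mean) / t
        cum += x - mean - delta
        min_cum = min(min_cum, cum)
        # Alarm when the cumulative sum rises far above its historical minimum
        if cum - min_cum > threshold:
            return t
    return None

# 12 stable months at MAE 0.30, then a steady upward drift
maes = [0.30] * 12 + [0.30 + 0.01 * i for i in range(1, 13)]
print(f"Drift detected at month: {page_hinkley(maes)}")
```

The detector fires a few months after the drift begins, once the cumulative evidence exceeds the alarm level, rather than reacting to any single noisy observation.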
Code
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
n_periods = 24

# Simulate 12 stable months followed by 12 months of gradual drift
stable_maes = mae + np.random.normal(0, 0.005, 12)
drifted_maes = mae + np.linspace(0, 0.15, 12) + np.random.normal(0, 0.008, 12)
monthly_maes = np.concatenate([stable_maes, drifted_maes])

alert_threshold = stable_maes.mean() + 2 * stable_maes.std()
months = np.arange(1, n_periods + 1)
colors = ['darkorange' if m > alert_threshold else 'steelblue' for m in monthly_maes]

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(months, monthly_maes, color=colors, alpha=0.8)
ax.axhline(mae, color='steelblue', linestyle='--', lw=1.5, label='Baseline MAE')
ax.axhline(alert_threshold, color='red', linestyle='--', lw=1.5, label='Alert threshold')
ax.axvline(12.5, color='black', linestyle=':', lw=1.5, label='Drift begins')
ax.set_xlabel('Month')
ax.set_ylabel('MAE')
ax.set_title('Model Performance Over Time — Concept Drift Detection')
ax.legend()
plt.tight_layout()
plt.show()

alert_months = [m for m, v in zip(months, monthly_maes) if v > alert_threshold]
print(f"Months triggering alert: {alert_months}")
```
17.9 Retraining
When drift is detected, the model needs to be retrained. The key decisions are: how often, on what data, and triggered by what?
Scheduled retraining (weekly, monthly) is simple to implement but wastes compute if drift is infrequent, and is too slow if drift is rapid. Triggered retraining fires when a performance or drift metric crosses a threshold — more efficient, but requires reliable ground truth labels. Online learning updates the model incrementally without full retraining; the river library (pip install river) supports this.
What data to train on also depends on the setting. A rolling window of the most recent months is appropriate when the distribution changes rapidly, such as fraud detection. For stable relationships, using all historical data is generally better.
One important constraint: every retrained model should go through the same evaluation gate as the original. Automatically promoting a retrained model that performs worse than the deployed version is a production failure mode that is surprisingly common.
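A minimal sketch of such a gate (the function name, metric choice, and tolerance are illustrative):

```python
def should_promote(candidate_metrics, production_metrics, max_mae_regression=0.0):
    """Gate a retrained model: promote only if it does not regress.

    Returns (decision, reason) so the outcome can be logged for auditability.
    """
    cand, prod = candidate_metrics["mae"], production_metrics["mae"]
    if cand > prod + max_mae_regression:
        return False, f"candidate MAE {cand:.4f} worse than production {prod:.4f}"
    return True, f"candidate MAE {cand:.4f} <= production {prod:.4f}"

# The retrained model must clear the same bar the original did
ok, reason = should_promote({"mae": 0.35}, {"mae": 0.31})
print(ok, "-", reason)
```

In practice this check runs on the same held-out evaluation set as the original model, and the decision is recorded alongside the MLflow run so the audit trail survives.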
17.10 Model Cards
A model card documents what a model does, how it was built, and what its limitations are. Introduced by Google (Mitchell et al., 2019), model cards are now a standard practice for responsible deployment and are required for high-risk AI systems under the EU AI Act.
A model card typically covers:
Model details (algorithm, version, training date, author)
Intended use and out-of-scope uses
Training data (dataset, size, known limitations)
Evaluation results, broken down by subgroup where relevant
Ethical considerations: fairness analysis, potential for misuse
Caveats and conditions that may cause the model to fail
A model card serves multiple audiences — engineers, product managers, auditors, regulators — and is the single most important governance document for a deployed model.
Code
```python
import json
import datetime

model_card = {
    'model_details': {
        'name': 'California Housing Price Estimator',
        'version': '1.0.0',
        'type': 'GradientBoostingRegressor (sklearn Pipeline)',
        'date': str(datetime.date.today()),
    },
    'intended_use': {
        'primary_use': 'Estimate median house values in California census tracts.',
        'out_of_scope': [
            'Individual property appraisals (unit is census tract, not house)',
            'Regions outside California',
            'Data from after 1990 (model trained on 1990 census)',
        ],
    },
    'evaluation': {
        'test_set_size': f'{X_test.shape[0]:,} census tracts',
        'MAE': round(mae, 4),
        'R2': round(r2, 4),
        'note': 'Target is median_house_value / 100000',
    },
    'ethical_considerations': (
        'No subgroup fairness analysis performed. '
        'Should not be used for individual loan decisions without human review.'
    ),
    'caveats': [
        'Performance degrades significantly outside the 1990 California context.',
        'Monitor for data drift if deploying in a different time period.',
    ],
}

with open('model_artifacts/model_card_full.json', 'w') as f:
    json.dump(model_card, f, indent=2)

print(json.dumps(model_card, indent=2))
```
17.11 Key Takeaways
A model is not done when it achieves good accuracy on a test set. It is done when it is deployed, monitored, versioned, documented, and has a clear path to being updated when the world changes.
Wrap every production model in a Pipeline — no separate preprocessing code
Track every training run in MLflow — log parameters, metrics, and the model artifact
Serialize the pipeline and a model card together as a single deployment artifact
Monitor for data drift (KS test, PSI) and concept drift (rolling performance windows)
Every retrained model goes through the same evaluation gate before promotion