claude-skills-reference/engineering-team/senior-ml-engineer/references/mlops_production_patterns.md
Alireza Rezvani 757b0c9dd7 fix(skill): rewrite senior-ml-engineer with real ML content (#74) (#141)
- Replace 3 boilerplate reference files with real technical content:
  - mlops_production_patterns.md: deployment, feature stores, A/B testing
  - llm_integration_guide.md: provider abstraction, cost management
  - rag_system_architecture.md: vector DBs, chunking, reranking
- Rewrite SKILL.md: add trigger phrases, TOC, numbered workflows
- Remove "world-class" marketing language (appeared 5+ times)
- Standardize terminology to "MLOps" (not "Mlops")
- Add validation checkpoints to all workflows

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 13:03:09 +01:00


MLOps Production Patterns

Production ML infrastructure patterns for model deployment, monitoring, and lifecycle management.


Table of Contents

  • Model Deployment Pipeline
  • Feature Store Architecture
  • Model Monitoring
  • A/B Testing Infrastructure
  • Automated Retraining

Model Deployment Pipeline

Deployment Workflow

  1. Export trained model to standardized format (ONNX, TorchScript, SavedModel)
  2. Package model with dependencies in Docker container
  3. Deploy to staging environment
  4. Run integration tests against staging
  5. Deploy canary (5% traffic) to production
  6. Monitor latency and error rates for 1 hour
  7. Promote to full production if metrics pass
  8. Validation: p95 latency < 100ms, error rate < 0.1%
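The promotion gate in steps 6-8 can be expressed as a small check against metrics collected during the canary window (the metric dict keys here are illustrative, matching whatever your monitoring system exports):

```python
def canary_passes(metrics: dict, p95_limit_ms: float = 100.0, max_error_rate: float = 0.001) -> bool:
    """Gate for step 7: promote only if the canary met the step-8 thresholds."""
    return (
        metrics["p95_latency_ms"] < p95_limit_ms
        and metrics["error_rate"] < max_error_rate
    )
```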

Container Structure

FROM python:3.11-slim

# curl is needed by the HEALTHCHECK below; it is not in the slim base image
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model artifacts
COPY model/ /app/model/
COPY src/ /app/src/

# Health check endpoint
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]

Model Serving Options

| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model
        image: model:v1.0.0
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

Feature Store Architecture

Feature Store Components

| Component | Purpose | Tools |
|---|---|---|
| Offline Store | Training data, batch features | BigQuery, Snowflake, S3 |
| Online Store | Low-latency serving | Redis, DynamoDB, Feast |
| Feature Registry | Metadata, lineage | Feast, Tecton, Hopsworks |
| Transformation | Feature engineering | Spark, Flink, dbt |

Feature Pipeline Workflow

  1. Define feature schema in registry
  2. Implement transformation logic (SQL or Python)
  3. Backfill historical features to offline store
  4. Schedule incremental updates
  5. Materialize to online store for serving
  6. Monitor feature freshness and quality
  7. Validation: Feature values within expected ranges, no nulls in required fields
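The step-7 validation can be automated with a simple null/range check before materialization (the field names and expected ranges below are illustrative):

```python
import math

def validate_features(rows, required, ranges):
    """Return step-7 violations: nulls in required fields, values outside expected ranges."""
    problems = []
    for i, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                problems.append(f"row {i}: {field} is null")
        for field, (lo, hi) in ranges.items():
            value = row.get(field)
            if value is not None and not lo <= value <= hi:
                problems.append(f"row {i}: {field}={value} outside [{lo}, {hi}]")
    return problems
```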

Feature Definition Example

from datetime import timedelta

from feast import Entity, Feature, FeatureView, FileSource, ValueType

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
    ],
    online=True,
    batch_source=FileSource(path="data/user_features.parquet"),
)
# Note: newer Feast releases replace Feature/features with Field/schema
# and batch_source with source; adjust to your installed version.

Model Monitoring

Monitoring Dimensions

| Dimension | Metrics | Alert Threshold |
|---|---|---|
| Latency | p50, p95, p99 | p95 > 100ms |
| Throughput | requests/sec | < 80% of baseline |
| Errors | error rate, 5xx count | > 0.1% |
| Data Drift | PSI, KS statistic | PSI > 0.2 |
| Model Drift | accuracy, AUC decay | > 5% drop |

Data Drift Detection

from scipy.stats import ks_2samp
import numpy as np

def detect_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.05):
    """Detect distribution drift using Kolmogorov-Smirnov test."""
    statistic, p_value = ks_2samp(reference, current)

    drift_detected = p_value < threshold

    return {
        "drift_detected": drift_detected,
        "ks_statistic": statistic,
        "p_value": p_value,
        "threshold": threshold
    }
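The PSI threshold from the monitoring table can be computed in the same job; a common sketch, using deciles of the reference distribution as buckets (the bucketing scheme is a convention, not a standard):

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between two samples, bucketed by quantiles of the reference distribution."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    # Clip current values into the reference range so every observation lands in a bucket
    cur_pct = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    # Floor to avoid log(0) for empty buckets
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

PSI > 0.2 is the conventional "significant shift" alert level used in the table above.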

Monitoring Dashboard Metrics

Infrastructure:

  • Request latency (p50, p95, p99)
  • Requests per second
  • Error rate by type
  • CPU/memory utilization
  • GPU utilization (if applicable)

Model Performance:

  • Prediction distribution
  • Feature value distributions
  • Model output confidence
  • Ground truth vs predictions (when available)

A/B Testing Infrastructure

Experiment Workflow

  1. Define experiment hypothesis and success metrics
  2. Calculate required sample size for statistical power
  3. Configure traffic split (control vs treatment)
  4. Deploy treatment model alongside control
  5. Route traffic based on user/session hash
  6. Collect metrics for both variants
  7. Run statistical significance test
  8. Validation: p-value < 0.05, minimum sample size reached
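Step 2 (sample size) for a conversion-rate experiment can be approximated with the standard two-proportion formula; a sketch using the usual alpha = 0.05 / power = 0.8 defaults:

```python
from scipy.stats import norm

def samples_per_variant(p_baseline: float, min_detectable_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-variant sample size for a two-sided two-proportion z-test."""
    p_treat = p_baseline + min_detectable_lift
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    p_bar = (p_baseline + p_treat) / 2
    numerator = (
        z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
        + z_beta * (p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)) ** 0.5
    ) ** 2
    return int(numerator / min_detectable_lift ** 2) + 1
```

For example, detecting a 2-point lift on a 10% baseline requires roughly 3,800 users per variant.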

Traffic Splitting

import hashlib

def get_variant(user_id: str, experiment: str, control_pct: float = 0.5) -> str:
    """Deterministic traffic splitting based on user ID."""
    hash_input = f"{user_id}:{experiment}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 100) / 100.0

    return "control" if bucket < control_pct else "treatment"

Metrics Collection

| Metric Type | Examples | Collection Method |
|---|---|---|
| Primary | Conversion rate, revenue | Event logging |
| Secondary | Latency, engagement | Request logs |
| Guardrail | Error rate, crashes | Monitoring system |
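Step 7 of the experiment workflow (statistical significance on a primary conversion metric) can be sketched as a pooled two-proportion z-test:

```python
from scipy.stats import norm

def two_proportion_p_value(conversions_a: int, n_a: int, conversions_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rate between control and treatment."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))
```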

Automated Retraining

Retraining Triggers

| Trigger | Detection Method | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |

Retraining Pipeline

  1. Trigger detection (schedule, drift, performance)
  2. Fetch latest training data from feature store
  3. Run training job with hyperparameter config
  4. Evaluate model on holdout set
  5. Compare against production model
  6. If improved: register new model version
  7. Deploy to staging for validation
  8. Promote to production via canary
  9. Validation: New model outperforms baseline on key metrics
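The step-5 comparison can be codified as a simple champion/challenger gate (the metric names and strict-improvement rule here are illustrative defaults):

```python
def should_promote(candidate: dict, production: dict,
                   key_metrics=("auc", "accuracy"), min_improvement: float = 0.0) -> bool:
    """Promote only if the candidate beats the production model on every key metric."""
    return all(candidate[m] > production[m] + min_improvement for m in key_metrics)
```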

MLflow Model Registry Integration

import mlflow

def register_model(model, metrics: dict, model_name: str):
    """Register trained model with MLflow."""
    with mlflow.start_run():
        # Log metrics
        for name, value in metrics.items():
            mlflow.log_metric(name, value)

        # Log model
        mlflow.sklearn.log_model(model, "model")

        # Register in model registry
        model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
        mlflow.register_model(model_uri, model_name)