Module 3 - Infrastructure & Deployment

Leverage your infrastructure expertise to build scalable, reliable ML systems.

Learning Objectives

  • Design ML-specific infrastructure patterns
  • Implement containerized ML deployments
  • Orchestrate ML workloads on Kubernetes
  • Optimize model serving for production

Week 9: Containerization for ML

Docker for ML Applications

Multi-stage Build Pattern

# Build stage
FROM python:3.9 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime stage
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "serve.py"]

GPU Support

FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# Install Python, then PyTorch wheels built against CUDA 11.8
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# At runtime the host needs the NVIDIA Container Toolkit, e.g. docker run --gpus all

Container Optimization

  • Layer caching strategies (see the sketch below)
  • Size reduction techniques
  • Security scanning
  • Registry management
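
A minimal sketch combining the first two points: order layers so the dependency install stays cached until requirements.txt changes, and keep the image small with a slim base and --no-cache-dir (file names are carried over from the earlier example):

# Dependencies first: this layer is rebuilt only when
# requirements.txt changes, so day-to-day code edits reuse the cache
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code last, since it changes most often
COPY . .
CMD ["python", "serve.py"]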

Week 10: Kubernetes for ML

ML Workload Patterns

Training Jobs

apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: ml-training:latest
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"  # extended resources like GPUs must be set as limits
      restartPolicy: OnFailure

Model Serving

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  selector:            # required in apps/v1; must match the pod labels
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model
        image: model-serve:latest
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
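
Autoscaling

A serving Deployment is typically paired with a HorizontalPodAutoscaler. A minimal sketch, assuming CPU-based scaling against the model-server Deployment above (the cluster needs a metrics server for utilization data):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70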

Kubeflow Components

  • Pipelines for orchestration (see the sketch below)
  • KServe (formerly KFServing) for model serving
  • Katib for hyperparameter tuning
  • Notebooks for development
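
As a taste of Pipelines, a minimal KFP v2 sketch (the component body and all names are illustrative placeholders, not a real training step):

from kfp import dsl, compiler

@dsl.component(base_image="python:3.9")
def train_model(learning_rate: float) -> str:
    # Placeholder body; a real component would load data, train,
    # and push the resulting artifact to object storage
    return f"trained with lr={learning_rate}"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Compile to a YAML spec that can be uploaded to a Kubeflow cluster
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")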

Week 11: Model Serving Architectures

Serving Frameworks Comparison

Framework          | Use Case        | Pros             | Cons
-------------------|-----------------|------------------|--------------
TensorFlow Serving | TF models       | High performance | TF-specific
TorchServe         | PyTorch models  | Easy deployment  | PyTorch only
Triton             | Multi-framework | GPU optimization | Complex setup
Seldon Core        | Any framework   | K8s native       | Overhead

Implementation Examples

FastAPI Model Server

from fastapi import FastAPI
import torch

app = FastAPI()
model = torch.load("model.pt")  # full model saved with torch.save(model, "model.pt")
model.eval()  # inference mode: disables dropout and batch-norm updates

@app.post("/predict")
async def predict(data: dict):
    input_tensor = torch.tensor(data["features"], dtype=torch.float32)
    with torch.no_grad():  # skip gradient bookkeeping for lower latency
        prediction = model(input_tensor)
    return {"prediction": prediction.tolist()}

gRPC for Low Latency

import grpc
import numpy as np
from concurrent import futures

import model_pb2       # generated by protoc from model.proto
import model_pb2_grpc

class ModelService(model_pb2_grpc.ModelServicer):
    def __init__(self, model):
        self.model = model  # any object exposing a predict() method

    def Predict(self, request, context):
        features = np.array(request.features, dtype=np.float32)
        prediction = self.model.predict(features)
        return model_pb2.PredictionResponse(
            predictions=prediction.tolist()
        )

def serve(model):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    model_pb2_grpc.add_ModelServicer_to_server(ModelService(model), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

Edge Deployment

  • Model quantization
  • ONNX conversion (see the sketch below)
  • TensorFlow Lite
  • Core ML for iOS
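
For the ONNX route, a minimal export sketch (the model here is a hypothetical stand-in for a trained network):

import torch
import torch.nn as nn

# Hypothetical stand-in for a trained network
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
model.eval()

# Trace with a representative input and export for ONNX Runtime,
# which is available on a wide range of edge hardware
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["features"],
    output_names=["prediction"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)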

Week 12: Infrastructure as Code

Terraform for ML Infrastructure

# GPU-enabled training cluster
resource "aws_eks_node_group" "gpu_nodes" {
  cluster_name    = aws_eks_cluster.ml_cluster.name
  node_group_name = "gpu-nodes"
  node_role_arn   = aws_iam_role.node_role.arn  # node IAM role, defined elsewhere
  subnet_ids      = var.private_subnet_ids      # cluster subnets, defined elsewhere

  instance_types = ["p3.2xlarge"]

  scaling_config {
    desired_size = 2
    max_size     = 10
    min_size     = 1
  }

  labels = {
    workload = "ml-training"
    gpu      = "true"
  }
}

# Model storage (in AWS provider v4+, versioning and lifecycle
# rules are standalone resources rather than inline blocks)
resource "aws_s3_bucket" "model_artifacts" {
  bucket = "ml-model-artifacts"
}

resource "aws_s3_bucket_versioning" "model_artifacts" {
  bucket = aws_s3_bucket.model_artifacts.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "model_artifacts" {
  bucket = aws_s3_bucket.model_artifacts.id

  rule {
    id     = "archive-old-artifacts"
    status = "Enabled"

    filter {}  # apply to every object in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
  }
}

Cost Optimization

  • Spot instances for training (see the sketch below)
  • Auto-scaling policies
  • Resource tagging
  • Cost monitoring dashboards
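
For the first point, EKS managed node groups can request Spot capacity directly. A minimal sketch extending the cluster above (the IAM role and subnet references are assumed to exist elsewhere; interruption handling is left to the training job):

resource "aws_eks_node_group" "spot_training" {
  cluster_name    = aws_eks_cluster.ml_cluster.name
  node_group_name = "spot-training"
  node_role_arn   = aws_iam_role.node_role.arn  # assumed node IAM role
  subnet_ids      = var.private_subnet_ids      # assumed subnet list

  capacity_type  = "SPOT"  # spare capacity at a steep discount
  instance_types = ["g4dn.xlarge", "g4dn.2xlarge"]

  scaling_config {
    desired_size = 0  # scale up only while training jobs run
    max_size     = 8
    min_size     = 0
  }

  labels = {
    workload = "ml-training"
    capacity = "spot"
  }
}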

Hands-on Projects

Project 1: Production ML Service

Build a complete ML service with:

  • Containerized model server
  • Kubernetes deployment
  • Auto-scaling based on metrics
  • Blue-green deployment
  • Monitoring and logging

Project 2: Multi-Cloud ML Platform

Design infrastructure supporting:

  • Training on AWS/GCP
  • Model registry on S3
  • Serving on edge devices
  • Cost optimization

Performance Optimization

Model Optimization Techniques

  • Quantization to INT8 or FP16 (see the sketch below)
  • Pruning unused weights
  • Knowledge distillation
  • Batch inference
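
For the quantization point, a minimal dynamic-quantization sketch in PyTorch (the model is a hypothetical stand-in for a trained network):

import torch
import torch.nn as nn

# Hypothetical stand-in for a trained network
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Dynamic quantization: weights are stored as INT8 and activations are
# quantized on the fly, which typically shrinks Linear-heavy models
# several-fold with little accuracy loss
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)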

Infrastructure Optimization

  • GPU utilization monitoring
  • Memory optimization
  • Network bandwidth management
  • Caching strategies (see the sketch below)
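
For the caching point, a minimal sketch of memoizing repeated inference requests, assuming deterministic inference and hashable inputs (all names are illustrative):

from functools import lru_cache

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1))  # hypothetical stand-in model
model.eval()

@lru_cache(maxsize=4096)
def cached_predict(features: tuple) -> float:
    # Tuples are hashable, so identical feature vectors are served
    # from the cache instead of re-running the model
    with torch.no_grad():
        return model(torch.tensor(features)).item()

print(cached_predict((0.1,) * 10))  # first call runs the model
print(cached_predict((0.1,) * 10))  # second call hits the cache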

Assessment

Practical Tasks

  • Containerize an ML application with GPU support
  • Deploy model on Kubernetes with auto-scaling
  • Implement A/B testing for models
  • Create IaC for ML infrastructure
  • Optimize model serving latency

Performance Metrics

  • Model serving latency < 100 ms (see the benchmark sketch below)
  • 99.9% availability SLA
  • Cost per inference < $0.001
  • GPU utilization > 80%
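
A quick way to check the latency target: a minimal client-side benchmark, assuming the FastAPI server from Week 11 is listening locally on port 8080 (the endpoint and feature shape are illustrative):

import time

import numpy as np
import requests

URL = "http://localhost:8080/predict"  # illustrative endpoint
latencies = []

for _ in range(200):
    start = time.perf_counter()
    requests.post(URL, json={"features": [0.1] * 10}, timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"p50: {np.percentile(latencies, 50):.1f} ms")
print(f"p99: {np.percentile(latencies, 99):.1f} ms")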

Tools & Resources

Essential Tools

  • Containers: Docker, Podman
  • Orchestration: Kubernetes, EKS, GKE
  • IaC: Terraform, Pulumi
  • Serving: TF Serving, TorchServe, Triton
  • Monitoring: Prometheus, Grafana

Next Steps

Proceed to Module 4: Data Engineering to master data pipeline design and feature engineering for ML systems.