Docker Best Practices for ML Projects
Practical advice for containerizing machine learning workflows — from multi-stage builds to GPU passthrough and reproducible environments.
Machine learning projects have notoriously complex dependency chains — specific Python versions, CUDA toolkit versions, cuDNN, framework-specific requirements, and system libraries that conflict with each other. Docker solves this, but ML containers come with their own challenges.
Base Image Selection
Start with the right base. For GPU workloads:
# NVIDIA's CUDA images are the standard
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
# For PyTorch specifically
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# For TensorFlow
FROM tensorflow/tensorflow:2.15.0-gpu
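Whichever GPU base you choose, the CUDA version inside the image must be one your host driver supports; the "CUDA Version" field in the header of nvidia-smi on the host shows the highest version the installed driver can handle:
nvidia-smi   # check the "CUDA Version" field against your base image's CUDA version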
For CPU-only workloads, use slim Python images:
FROM python:3.11-slim
Avoid python:latest or ubuntu:latest. Pin your versions.
Multi-Stage Builds
ML Docker images bloat quickly. Multi-stage builds keep the final image small:
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime
FROM python:3.11-slim
COPY --from=builder /install /usr/local
COPY . /app
WORKDIR /app
CMD ["python", "inference.py"]
This keeps build-time artifacts (compilers, headers, pip's working files) out of the runtime image: only the installed packages are copied across. I've seen 60% size reductions with this pattern.
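If some dependencies have no prebuilt wheel and must compile from source, install the toolchain in the builder stage only; nothing from it reaches the runtime stage. A sketch of the extra lines (build-essential here is an assumption about what your packages actually need):
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
# ...same pip install as above; compilers and headers stay behind in this stage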
Layer Caching Strategy
Docker caches layers sequentially, and a change to one layer invalidates every layer after it. Order your Dockerfile from least-frequently changed to most-frequently changed:
# 1. System dependencies (rarely change)
RUN apt-get update && apt-get install -y \
libgl1-mesa-glx \
libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# 2. Python dependencies (change occasionally)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 3. Model weights (change sometimes)
COPY models/ ./models/
# 4. Application code (changes frequently)
COPY src/ ./src/
This way, changing your code doesn't trigger a full pip install rebuild.
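If you build with BuildKit (the default builder in recent Docker releases), a cache mount goes one step further: pip's download cache survives between builds without ever being stored in an image layer. A minimal sketch, with --no-cache-dir dropped so the cache is actually reused (the syntax line must be the first line of the Dockerfile):
# syntax=docker/dockerfile:1
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt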
Dependency Management
Use pinned dependencies. Always.
# requirements.txt — pin everything
torch==2.1.0
numpy==1.24.3
scikit-learn==1.3.0
pandas==2.0.3
Generate with pip freeze > requirements.txt, then manually review and clean up.
For better reproducibility, consider pip-tools:
# requirements.in (your direct dependencies)
torch>=2.0
numpy
scikit-learn
# Generate pinned requirements.txt
pip-compile requirements.in
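pip-tools is a separate package, and pip-sync is its companion command for making the installed environment match the compiled file exactly; a typical sequence looks like this:
pip install pip-tools
pip-compile requirements.in   # resolve and pin the full dependency tree
pip-sync requirements.txt     # install/remove packages until the env matches the pins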
GPU Passthrough
To use the GPU inside Docker:
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Run with GPU access
docker run --gpus all my-ml-image
Verify inside the container:
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
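To sanity-check the host setup before building your own image, a stock CUDA image works as a probe (nvidia-smi comes from the host driver, injected by the container toolkit):
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi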
Model Weight Management
Don't bake large model files into the image. Options:
- Volume mounts: Mount a host directory with model files
  docker run -v /path/to/models:/app/models my-ml-image
- Download at startup: Fetch from S3/GCS on container start (see the sketch after this list)
  # entrypoint.py
  if not os.path.exists("models/model.pt"):
      download_model("s3://bucket/model.pt", "models/model.pt")
- Docker volumes: Named volumes persist across container restarts
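Expanding on the download-at-startup option: the download_model helper above isn't defined anywhere, so here is a minimal sketch of what it could look like with boto3 (the bucket name, key, and paths are placeholders, not real artifacts):
# entrypoint.py — minimal download-at-startup sketch; assumes boto3 is installed in the image
import os
import boto3

MODEL_PATH = "models/model.pt"

def download_model(bucket: str, key: str, dest: str) -> None:
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    boto3.client("s3").download_file(bucket, key, dest)

if not os.path.exists(MODEL_PATH):
    download_model("my-model-bucket", "model.pt", MODEL_PATH)
# ...then start the actual inference server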
Health Checks
For ML services, add meaningful health checks:
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD python -c "import requests; r = requests.get('http://localhost:8000/health'); assert r.status_code == 200"
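A health endpoint is more useful when it proves the model can actually produce output, not just that the process is up. A minimal sketch of such an endpoint, assuming a FastAPI service and a PyTorch model (the tiny Linear model and the /health route are illustrative placeholders):
# app.py — /health runs a tiny forward pass so "healthy" means "can serve predictions"
import torch
from fastapi import FastAPI, Response

app = FastAPI()
model = torch.nn.Linear(4, 2)  # stand-in for your real model, loaded at startup
model.eval()

@app.get("/health")
def health():
    try:
        with torch.no_grad():
            model(torch.zeros(1, 4))  # dummy inference
        return {"status": "ok"}
    except Exception:
        return Response(status_code=503)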
.dockerignore
Essential for ML projects, where large local datasets and checkpoints would otherwise be sent to the Docker daemon as part of the build context:
# .dockerignore
data/
*.pt
*.pth
*.h5
*.ckpt
__pycache__/
.git/
notebooks/
wandb/
mlruns/
Docker Compose for ML Pipelines
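A minimal example: a GPU-backed inference service alongside a Redis cache, sharing a named volume for model weights.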
services:
  inference:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - model-cache:/app/models
    ports:
      - "8000:8000"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
volumes:
  model-cache:
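With Compose v2, the whole stack comes up with a single command; the device reservation under deploy.resources is roughly the Compose equivalent of --gpus on the docker run CLI:
docker compose up --build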
Conclusion
Docker transforms ML projects from "it works on my machine" to genuinely reproducible environments. The key principles: pin everything, use multi-stage builds, order layers for cache efficiency, and don't bake large files into images. Your future self (and your teammates) will thank you.