Docker Best Practices for ML Projects
Practical advice for containerizing machine learning workflows — from multi-stage builds to GPU passthrough and reproducible environments.
Machine learning projects have notoriously complex dependency chains — specific Python versions, CUDA toolkit versions, cuDNN, framework-specific requirements, and system libraries that conflict with each other. Docker solves this, but ML containers come with their own challenges.
Base Image Selection
Start with the right base. For GPU workloads:
# NVIDIA's CUDA images are the standard
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
# For PyTorch specifically
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# For TensorFlow
FROM tensorflow/tensorflow:2.15.0-gpu
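Whichever GPU base you choose, the CUDA version inside the image must be one your host driver supports; the "CUDA Version" field in the header of nvidia-smi on the host shows the highest version the installed driver can handle:
nvidia-smi   # check the "CUDA Version" field against your base image's CUDA version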
For CPU-only workloads, use slim Python images:
FROM python:3.11-slim
Avoid python:latest or ubuntu:latest. Pin your versions.
Multi-Stage Builds
ML Docker images bloat quickly. Multi-stage builds keep the final image small:
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime
FROM python:3.11-slim
COPY --from=builder /install /usr/local
COPY . /app
WORKDIR /app
CMD ["python", "inference.py"]
This keeps build-time artifacts (compilers, headers, pip's working files) out of the runtime image: only the installed packages are copied across. I've seen 60% size reductions with this pattern.
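If some dependencies have no prebuilt wheel and must compile from source, install the toolchain in the builder stage only; nothing from it reaches the runtime stage. A sketch of the extra lines (build-essential here is an assumption about what your packages actually need):
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
# ...same pip install as above; compilers and headers stay behind in this stage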
Layer Caching Strategy
Docker caches layers sequentially, and a change to one layer invalidates every layer after it. Order your Dockerfile from least-frequently changed to most-frequently changed:
# 1. System dependencies (rarely change)
RUN apt-get update && apt-get install -y \
libgl1-mesa-glx \
libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# 2. Python dependencies (change occasionally)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 3. Model weights (change sometimes)
COPY models/ ./models/
# 4. Application code (changes frequently)
COPY src/ ./src/
This way, changing your code doesn't trigger a full pip install rebuild.
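If you build with BuildKit (the default builder in recent Docker releases), a cache mount goes one step further: pip's download cache survives between builds without ever being stored in an image layer. A minimal sketch, with --no-cache-dir dropped so the cache is actually reused (the syntax line must be the first line of the Dockerfile):
# syntax=docker/dockerfile:1
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt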
Dependency Management
Use pinned dependencies. Always.
# requirements.txt — pin everything
torch==2.1.0
numpy==1.24.3
scikit-learn==1.3.0
pandas==2.0.3
Generate with pip freeze > requirements.txt, then manually review and clean up.
For better reproducibility, consider pip-tools:
# requirements.in (your direct dependencies)
torch>=2.0
numpy
scikit-learn
# Generate pinned requirements.txt
pip-compile requirements.in
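pip-tools is a separate package, and pip-sync is its companion command for making the installed environment match the compiled file exactly; a typical sequence looks like this:
pip install pip-tools
pip-compile requirements.in   # resolve and pin the full dependency tree
pip-sync requirements.txt     # install/remove packages until the env matches the pins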
GPU Passthrough
To use the GPU inside Docker:
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Run with GPU access
docker run --gpus all my-ml-image
Verify inside the container:
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
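To sanity-check the host setup before building your own image, a stock CUDA image works as a probe (nvidia-smi comes from the host driver, injected by the container toolkit):
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi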
Model Weight Management
Don't bake large model files into the image. Options:
- Volume mounts: Mount a host directory with model files
  docker run -v /path/to/models:/app/models my-ml-image
- Download at startup: Fetch from S3/GCS on container start (see the sketch after this list)
  # entrypoint.py
  if not os.path.exists("models/model.pt"):
      download_model("s3://bucket/model.pt", "models/model.pt")
- Docker volumes: Named volumes persist across container restarts
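Expanding on the download-at-startup option: the download_model helper above isn't defined anywhere, so here is a minimal sketch of what it could look like with boto3 (the bucket name, key, and paths are placeholders, not real artifacts):
# entrypoint.py — minimal download-at-startup sketch; assumes boto3 is installed in the image
import os
import boto3

MODEL_PATH = "models/model.pt"

def download_model(bucket: str, key: str, dest: str) -> None:
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    boto3.client("s3").download_file(bucket, key, dest)

if not os.path.exists(MODEL_PATH):
    download_model("my-model-bucket", "model.pt", MODEL_PATH)
# ...then start the actual inference server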
Health Checks
For ML services, add meaningful health checks:
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD python -c "import requests; r = requests.get('http://localhost:8000/health'); assert r.status_code == 200"
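A health endpoint is more useful when it proves the model can actually produce output, not just that the process is up. A minimal sketch of such an endpoint, assuming a FastAPI service and a PyTorch model (the tiny Linear model and the /health route are illustrative placeholders):
# app.py — /health runs a tiny forward pass so "healthy" means "can serve predictions"
import torch
from fastapi import FastAPI, Response

app = FastAPI()
model = torch.nn.Linear(4, 2)  # stand-in for your real model, loaded at startup
model.eval()

@app.get("/health")
def health():
    try:
        with torch.no_grad():
            model(torch.zeros(1, 4))  # dummy inference
        return {"status": "ok"}
    except Exception:
        return Response(status_code=503)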
.dockerignore
Essential for ML projects, where large local datasets and checkpoints would otherwise be sent to the Docker daemon as part of the build context:
# .dockerignore
data/
*.pt
*.pth
*.h5
*.ckpt
__pycache__/
.git/
notebooks/
wandb/
mlruns/
Docker Compose for ML Pipelines
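A minimal example: a GPU-backed inference service alongside a Redis cache, sharing a named volume for model weights.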
services:
  inference:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - model-cache:/app/models
    ports:
      - "8000:8000"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
volumes:
  model-cache:
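With Compose v2, the whole stack comes up with a single command; the device reservation under deploy.resources is roughly the Compose equivalent of --gpus on the docker run CLI:
docker compose up --build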
Conclusion
Docker transforms ML projects from "it works on my machine" to genuinely reproducible environments. The key principles: pin everything, use multi-stage builds, order layers for cache efficiency, and don't bake large files into images. Your future self (and your teammates) will thank you.