Building and Deploying a Machine Learning Model API with FastAPI and Docker

Introduction

Deploying machine learning models into production is one of the most critical — and often most challenging — steps in the ML lifecycle. While building a high-accuracy model in a Jupyter notebook is rewarding, making that model accessible to applications, users, and other services through a reliable API requires a solid engineering foundation.

In this tutorial, you will learn how to build a containerized ML model serving API using FastAPI and Docker. FastAPI provides automatic OpenAPI documentation, asynchronous support, and high performance, while Docker keeps the runtime environment consistent across development, staging, and production. By the end, you'll have a containerized REST API that serves predictions from a scikit-learn model, complete with health checks, input validation, and multi-worker scaling with Gunicorn.

Prerequisites: Basic knowledge of Python, familiarity with pip and virtual environments, and Docker installed on your machine (Docker Desktop or a Linux Docker engine). No prior FastAPI experience is required.

Why FastAPI for ML Model Serving?

FastAPI has quickly become the go-to Python web framework for ML serving, and for good reason:

  • Performance: FastAPI is built on Starlette and Pydantic, and its own documentation cites performance on par with Node.js and Go.
  • Automatic Documentation: Every endpoint you build is automatically documented with Swagger UI and ReDoc, making it easy for frontend teams to integrate.
  • Built-in Validation: Pydantic models provide automatic request/response validation, which is critical when accepting numerical features for predictions.
  • Type Hints: Full type hint support makes code self-documenting and reduces runtime errors.

Project Structure

We will build a project that predicts diabetes progression using a simple linear regression model trained on the diabetes dataset bundled with scikit-learn. Here is the complete project structure:

ml-api/
├── app/
│   ├── __init__.py
│   ├── main.py           # FastAPI application and endpoints
│   ├── model.py          # Model loading and prediction logic
│   └── schemas.py        # Pydantic request/response models
├── model/
│   └── train.py          # Script to train and save the model
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Step 1: Train and Save the Model

First, let's create the training script that will train a simple regression model and save it to disk. Create the file model/train.py:

import pickle
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset (a standard regression dataset)
data = load_diabetes()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model trained successfully!")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R2 Score: {r2:.4f}")

# Save the model and feature names
with open('model.pkl', 'wb') as f:
    pickle.dump({
        'model': model,
        'feature_names': data.feature_names,
        'description': data.DESCR[:200]
    }, f)

print("Model saved to model.pkl")

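Run this script from inside the model/ directory (cd model && python train.py) before building the image, so that model.pkl is written to model/model.pkl, where both MODEL_PATH and the Dockerfile's COPY model/ step expect it. The model predicts diabetes progression from ten baseline variables: age, sex, BMI, average blood pressure, and six blood serum measurements.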
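
Because the script writes model.pkl relative to the current working directory, the output location depends on where it is started from. One way to remove that dependency is sketched below using only the standard library. The helper names save_artifact and load_artifact are hypothetical and not part of the project above:

```python
import pickle
from pathlib import Path


def save_artifact(artifact: dict, directory: Path, name: str = "model.pkl") -> Path:
    """Pickle `artifact` into `directory` and return the full path written."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    with open(target, "wb") as f:
        pickle.dump(artifact, f)
    return target


def load_artifact(path: Path) -> dict:
    """Load a pickled artifact. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In train.py, passing Path(__file__).resolve().parent as the directory would place model.pkl next to the script regardless of the working directory.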

Step 2: Define the FastAPI Application

Now let's build the API layer. Start with the Pydantic schemas in app/schemas.py:

from pydantic import BaseModel, Field

class PredictionInput(BaseModel):
    """Input features for prediction. All 10 features from the diabetes dataset."""
    age: float = Field(..., description="Age (normalized)")
    sex: float = Field(..., description="Sex (normalized)")
    bmi: float = Field(..., description="Body Mass Index (normalized)")
    bp: float = Field(..., description="Average Blood Pressure (normalized)")
    s1: float = Field(..., description="Blood serum measurement 1 (normalized)")
    s2: float = Field(..., description="Blood serum measurement 2 (normalized)")
    s3: float = Field(..., description="Blood serum measurement 3 (normalized)")
    s4: float = Field(..., description="Blood serum measurement 4 (normalized)")
    s5: float = Field(..., description="Blood serum measurement 5 (normalized)")
    s6: float = Field(..., description="Blood serum measurement 6 (normalized)")

class PredictionOutput(BaseModel):
    prediction: float = Field(..., description="Predicted diabetes progression value")
    message: str = Field(default="Prediction generated successfully")

class HealthResponse(BaseModel):
    status: str = Field(default="healthy")
    model_loaded: bool

Next, create the model loading logic in app/model.py:

import pickle
import os
from typing import Dict, Any
import numpy as np

MODEL_PATH = os.environ.get("MODEL_PATH", "model/model.pkl")

_model = None

def load_model() -> Dict[str, Any]:
    """Load the trained model from disk."""
    global _model
    if _model is None:
        with open(MODEL_PATH, 'rb') as f:
            _model = pickle.load(f)
        print(f"Model loaded from {MODEL_PATH}")
    return _model

def predict(features: list) -> float:
    """Run prediction on a list of feature values."""
    model_data = load_model()
    model = model_data['model']
    features_array = np.array(features).reshape(1, -1)
    prediction = model.predict(features_array)
    return float(prediction[0])

def is_model_loaded() -> bool:
    """Check if the model is loaded and ready."""
    try:
        load_model()
        return True
    except Exception:
        return False

Finally, create the main FastAPI application in app/main.py:

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from app.schemas import PredictionInput, PredictionOutput, HealthResponse
from app.model import predict, is_model_loaded

app = FastAPI(
    title="ML Model Serving API",
    description="Production-ready API for serving machine learning predictions",
    version="1.0.0"
)

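# Allow cross-origin requests (for frontend integration).
# A wildcard origin is convenient for local testing; list the real frontend
# origins before exposing the API, especially if credentials are needed.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)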

@app.get("/", tags=["Root"])
def root():
    return {
        "service": "ML Model API",
        "version": "1.0.0",
        "docs": "/docs"
    }

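@app.get("/health", response_model=HealthResponse, tags=["Health"])
def health_check():
    """Health check endpoint for container orchestration."""
    if not is_model_loaded():
        # A non-2xx status lets Docker and orchestrators detect the failure
        raise HTTPException(status_code=503, detail="Model not loaded")
    return HealthResponse(status="healthy", model_loaded=True)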

@app.post("/predict", response_model=PredictionOutput, tags=["Prediction"])
def predict_endpoint(input_data: PredictionInput):
    """
    Predict diabetes progression from input features.
    
    Accepts 10 normalized medical features and returns a
    quantitative measure of diabetes progression one year after baseline.
    """
    if not is_model_loaded():
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Please try again later."
        )
    
    features = [
        input_data.age, input_data.sex, input_data.bmi, input_data.bp,
        input_data.s1, input_data.s2, input_data.s3, input_data.s4,
        input_data.s5, input_data.s6
    ]
    
    try:
        result = predict(features)
        return PredictionOutput(prediction=round(result, 2))
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Prediction failed: {str(e)}"
        )
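
The endpoint above lists the ten fields by hand, so the feature order lives in two places: the training data and main.py. Since train.py already stores feature_names in the pickle, the order could instead be derived from that list. A minimal sketch, assuming a hypothetical helper named ordered_features:

```python
from typing import Mapping, Sequence


def ordered_features(payload: Mapping[str, float], feature_names: Sequence[str]) -> list:
    """Return payload values in the order given by feature_names."""
    missing = [name for name in feature_names if name not in payload]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(payload[name]) for name in feature_names]


# Column order of the diabetes dataset, as stored under 'feature_names' by train.py
FEATURE_NAMES = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
```

Inside the endpoint, this might be called as ordered_features(input_data.model_dump(), load_model()["feature_names"]), which assumes load_model is also imported from app.model.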

Step 3: Define Dependencies

Create requirements.txt:

fastapi==0.104.1
uvicorn[standard]==0.24.0
scikit-learn==1.3.2
numpy==1.24.3
pydantic==2.5.2
python-multipart==0.0.6

Step 4: Containerize with Docker

Now for the most important part — creating a Docker image that packages everything together. Create the Dockerfile:

# Stage 1: Build stage
FROM python:3.11-slim AS builder

WORKDIR /build

# Install system dependencies for scikit-learn
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc build-essential && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Runtime stage (slimmer final image)
FROM python:3.11-slim

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /root/.local /root/.local

# Copy application code
COPY app/ ./app/
COPY model/ ./model/

# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH
ENV MODEL_PATH=/app/model/model.pkl

# Expose the API port
EXPOSE 8000

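# Health check (standard library only; urlopen raises on connection errors
# and on HTTP error statuses, which makes the command exit non-zero)
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1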

# Run with uvicorn
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

This Dockerfile uses a multi-stage build pattern. The first stage installs compilers and builds scikit-learn dependencies, while the second stage copies only the built Python packages — resulting in a significantly smaller final image.
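
A HEALTHCHECK only needs a command that exits non-zero on failure. As an alternative to an inline one-liner, a small probe script can also inspect the response body. The sketch below uses only the standard library, so it adds nothing to requirements.txt. The file name healthcheck.py and the function names are hypothetical:

```python
import json
import urllib.request


def body_is_healthy(raw: bytes) -> bool:
    """True only if the body is a JSON object with model_loaded set to true."""
    try:
        body = json.loads(raw.decode("utf-8"))
    except ValueError:
        # Covers invalid JSON and undecodable bytes
        return False
    return isinstance(body, dict) and body.get("model_loaded") is True


def probe(url: str = "http://localhost:8000/health", timeout: float = 3.0) -> int:
    """Return 0 if the service reports a loaded model, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
    except OSError:
        # Connection refused, timeout, or an HTTP error status
        return 1
    return 0 if body_is_healthy(raw) else 1
```

Saved as healthcheck.py with a final line such as raise SystemExit(probe()), it could be referenced from the Dockerfile as CMD python /app/healthcheck.py, provided the file is copied into the image.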

Step 5: Orchestrate with Docker Compose

For local development and testing, Docker Compose provides a convenient way to run the API with proper configuration. Create docker-compose.yml:

services:
  ml-api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ml-api
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/app/model/model.pkl
    volumes:
      - ./model:/app/model
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 20s

Step 6: Build and Run

With all files in place, build and start the service:

# Build the Docker image
docker compose build

# Start the service in detached mode
docker compose up -d

# Check logs
docker compose logs -f

# Verify the health endpoint
curl http://localhost:8000/health

# Open the interactive API docs in your browser
# Visit: http://localhost:8000/docs

Step 7: Test the API

Once the container is running, send a prediction request:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "age": 0.0380759,
    "sex": 0.0506801,
    "bmi": 0.0616962,
    "bp": 0.0218724,
    "s1": -0.0442235,
    "s2": -0.0348208,
    "s3": -0.0434008,
    "s4": -0.0025923,
    "s5": 0.0198619,
    "s6": -0.0176461
  }'

You should receive a response similar to:

{
  "prediction": 150.21,
  "message": "Prediction generated successfully"
}
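
The same request can be issued from Python, which is handy for smoke tests. A minimal sketch using only the standard library follows. The function names build_predict_request and send are illustrative, and the base URL assumes the Compose setup above:

```python
import json
import urllib.request


def build_predict_request(base_url: str, features: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, send(build_predict_request("http://localhost:8000", features)) should return a dictionary with the prediction and message fields. A validation failure (HTTP 422) surfaces as urllib.error.HTTPError.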

Step 8: Scale with Multiple Workers

For production deployments, you can increase throughput by running multiple workers with Gunicorn. Update the Dockerfile CMD:

CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "app.main:app", "--bind", "0.0.0.0:8000"]

Add gunicorn to your requirements.txt:

gunicorn==21.2.0
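
Instead of hard-coding four workers on the command line, the settings can live in a Gunicorn configuration file, which is itself a Python module. The sketch below assumes a file named gunicorn.conf.py in the working directory. The worker_count helper and its cap of 8 are illustrative choices, not Gunicorn defaults:

```python
import multiprocessing
import os


def worker_count(cpus: int, cap: int = 8) -> int:
    """Apply the common (2 x cores) + 1 rule of thumb, limited to `cap`."""
    return max(1, min(cap, 2 * cpus + 1))


# Settings read by Gunicorn when started with:
#   gunicorn -c gunicorn.conf.py app.main:app
bind = "0.0.0.0:8000"
worker_class = "uvicorn.workers.UvicornWorker"
# WEB_CONCURRENCY, if set, overrides the computed value
workers = int(os.environ.get("WEB_CONCURRENCY", worker_count(multiprocessing.cpu_count())))
timeout = 30
```

Each worker is a separate process that loads its own copy of the model, so memory use grows with the worker count. For large models, a lower cap may be the safer starting point.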

Best Practices for Production ML APIs

Practice | Why It Matters
Input Validation | Never trust raw user input. Pydantic models prevent malformed requests from crashing your model.
Health Checks | Essential for container orchestrators (Kubernetes, ECS, Nomad) to know when your service is ready.
Model Versioning | Store models with version tags (e.g., model_v2.1.pkl) so you can roll back if needed.
Asynchronous Inference | For heavy models (deep learning), use background tasks or a task queue like Celery + Redis.
Logging & Monitoring | Add structured logging and metrics (Prometheus) to track prediction latency, error rates, and model drift.
Resource Limits | Set CPU/memory limits in Docker Compose or Kubernetes to prevent a single model from starving other services.
Caching | For expensive predictions on repeated inputs, consider an LRU cache with TTL.
Security | Add API keys, rate limiting, and input sanitization for public-facing endpoints.
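
To make the caching row concrete, here is a minimal sketch of an LRU cache with per-entry expiry, built on the standard library. The class name TTLCache, its defaults, and the injectable clock are assumptions made for illustration:

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Optional


class TTLCache:
    """Least-recently-used cache whose entries expire after `ttl` seconds."""

    def __init__(self, maxsize: int = 1024, ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.maxsize = maxsize
        self.ttl = ttl
        self._clock = clock
        self._items: OrderedDict = OrderedDict()

    def get(self, key: Hashable) -> Optional[float]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]            # expired: drop and report a miss
            return None
        self._items.move_to_end(key)        # mark as most recently used
        return value

    def put(self, key: Hashable, value: float) -> None:
        self._items[key] = (self._clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.maxsize:
            self._items.popitem(last=False)  # evict least recently used
```

A prediction endpoint could use tuple(features) as the key. The class is not thread-safe, and FastAPI runs plain def endpoints in a thread pool, so shared use would need a threading.Lock around get and put.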

Conclusion

You have now built a complete, containerized ML model serving API using FastAPI and Docker. Your API automatically generates interactive documentation, validates all inputs, includes health checks, and can be scaled with additional Gunicorn workers or deployed to any container orchestration platform.

This architecture serves as a production-ready template that you can adapt for any ML model — whether it's a simple linear regression, a random forest, or a deep neural network. The key takeaway is that separating model training from model serving, containerizing the serving layer, and using a framework with built-in validation and documentation leads to more reliable, maintainable, and scalable ML deployments.

Next steps: Try adding a /batch-predict endpoint that accepts multiple input rows, integrate model monitoring with Prometheus metrics, or deploy the container to AWS ECS, Google Cloud Run, or a Kubernetes cluster.

Pro Tip: Store your trained models in a cloud object store (S3, GCS) and download them at container startup using a startup script. This lets you update the model without rebuilding the Docker image — simply restart the container and it fetches the latest version.
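
A startup download can be sketched with the standard library alone, assuming the model is reachable through a plain URL such as a pre-signed link. Private S3 or GCS buckets would normally use their own SDKs instead. The function name fetch_model and the MODEL_URL variable mentioned below are hypothetical:

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_model(url: str, destination: Path, timeout: float = 30.0) -> Path:
    """Download `url` to `destination`, replacing any existing file atomically."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename, so a
    # failed download never leaves a half-written model file behind.
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url, timeout=timeout) as src:
            shutil.copyfileobj(src, tmp)
        os.replace(tmp_name, destination)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return destination
```

A startup script might call fetch_model(os.environ["MODEL_URL"], Path(os.environ["MODEL_PATH"])) before launching uvicorn. Because pickle can execute arbitrary code on load, the URL should point only to a source you control.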
