Building and Deploying a Machine Learning Model API with FastAPI and Docker
Introduction
Deploying machine learning models into production is one of the most critical — and often most challenging — steps in the ML lifecycle. While building a high-accuracy model in a Jupyter notebook is rewarding, making that model accessible to applications, users, and other services through a reliable API requires a solid engineering foundation.
In this tutorial, you will learn how to build a complete, production-ready ML model serving API using FastAPI and Docker. FastAPI provides automatic OpenAPI documentation, asynchronous support, and high performance, while Docker guarantees consistency across development, staging, and production environments. By the end, you'll have a containerized REST API that can serve predictions from a scikit-learn model, complete with health checks, input validation, and easy horizontal scaling via Docker Compose.
Prerequisites: Basic knowledge of Python, familiarity with pip and virtual environments, and Docker installed on your machine (Docker Desktop or a Linux Docker engine). No prior FastAPI experience is required.
Why FastAPI for ML Model Serving?
FastAPI has quickly become the go-to Python web framework for ML serving, and for good reason:
- Performance: FastAPI is built on Starlette and Pydantic, with throughput that the FastAPI project describes as on par with Node.js and Go.
- Automatic Documentation: Every endpoint you build is automatically documented with Swagger UI and ReDoc, making it easy for frontend teams to integrate.
- Built-in Validation: Pydantic models provide automatic request/response validation, which is critical when accepting numerical features for predictions.
- Type Hints: Full type hint support makes code self-documenting and reduces runtime errors.
Project Structure
We will build a project that predicts diabetes disease progression using a simple linear regression model trained on the diabetes dataset bundled with scikit-learn (load_diabetes). Here is the complete project structure:
ml-api/
├── app/
│   ├── __init__.py
│   ├── main.py        # FastAPI application and endpoints
│   ├── model.py       # Model loading and prediction logic
│   └── schemas.py     # Pydantic request/response models
├── model/
│   └── train.py       # Script to train and save the model
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
Step 1: Train and Save the Model
First, let's create the training script that will train a simple regression model and save it to disk. Create the file model/train.py:
import pickle
from pathlib import Path

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Save next to this script so the file lands at model/model.pkl,
# which is where the API looks for it by default
MODEL_FILE = Path(__file__).resolve().parent / "model.pkl"

# Load the diabetes dataset (a standard regression dataset)
data = load_diabetes()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model trained successfully!")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R2 Score: {r2:.4f}")

# Save the model and feature names
with open(MODEL_FILE, "wb") as f:
    pickle.dump({
        "model": model,
        "feature_names": data.feature_names,
        "description": data.DESCR[:200],
    }, f)
print(f"Model saved to {MODEL_FILE}")
Run this script locally (for example, python model/train.py from the project root) to generate model/model.pkl before building the image. Train with the same scikit-learn version that is pinned in requirements.txt, since pickled models are not guaranteed to load across versions. This model predicts diabetes progression from ten baseline variables such as age, sex, BMI, and blood pressure.
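To make the two numbers printed by the training script less opaque, here is a minimal pure-Python sketch of what mean squared error and R2 measure. The helper names mse_plain and r2_plain and the toy values are made up for illustration; the training script itself uses the scikit-learn functions.

```python
from typing import Sequence


def mse_plain(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_plain(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the diabetes dataset
targets = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 8.0]
print(f"MSE: {mse_plain(targets, predictions):.4f}")
print(f"R2:  {r2_plain(targets, predictions):.4f}")
```

An R2 of 1.0 means the predictions match the targets exactly, while 0.0 means the model does no better than always predicting the mean of the targets.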
Step 2: Define the FastAPI Application
Now let's build the API layer. Start with the Pydantic schemas in app/schemas.py:
from pydantic import BaseModel, ConfigDict, Field


class PredictionInput(BaseModel):
    """Input features for prediction. All 10 features from the diabetes dataset."""
    age: float = Field(..., description="Age (normalized)")
    sex: float = Field(..., description="Sex (normalized)")
    bmi: float = Field(..., description="Body Mass Index (normalized)")
    bp: float = Field(..., description="Average Blood Pressure (normalized)")
    s1: float = Field(..., description="Blood serum measurement 1 (normalized)")
    s2: float = Field(..., description="Blood serum measurement 2 (normalized)")
    s3: float = Field(..., description="Blood serum measurement 3 (normalized)")
    s4: float = Field(..., description="Blood serum measurement 4 (normalized)")
    s5: float = Field(..., description="Blood serum measurement 5 (normalized)")
    s6: float = Field(..., description="Blood serum measurement 6 (normalized)")


class PredictionOutput(BaseModel):
    prediction: float = Field(..., description="Predicted diabetes progression value")
    message: str = Field(default="Prediction generated successfully")


class HealthResponse(BaseModel):
    # Pydantic v2 reserves the "model_" prefix by default and warns about
    # fields such as model_loaded; clearing the namespace avoids the warning
    model_config = ConfigDict(protected_namespaces=())

    status: str = Field(default="healthy")
    model_loaded: bool
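To see what this validation does with a malformed request, the sketch below runs a trimmed-down model against a bad payload. TinyInput and validation_problems are hypothetical names used only here, and the code assumes Pydantic v2 as pinned in requirements.txt. In the running API, FastAPI performs the same check and answers with a 422 response instead of calling your endpoint.

```python
from pydantic import BaseModel, Field, ValidationError


class TinyInput(BaseModel):
    """Hypothetical two-field stand-in for PredictionInput."""
    age: float = Field(..., description="Age (normalized)")
    bmi: float = Field(..., description="Body Mass Index (normalized)")


def validation_problems(payload: dict) -> list[tuple]:
    """Return the location of every field that failed validation."""
    try:
        TinyInput.model_validate(payload)
    except ValidationError as exc:
        return [error["loc"] for error in exc.errors()]
    return []


# "age" is not a number and "bmi" is missing, so both fields are reported
problems = validation_problems({"age": "not-a-number"})
print(problems)
```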
Next, create the model loading logic in app/model.py:
import os
import pickle
from typing import Any, Dict

import numpy as np

MODEL_PATH = os.environ.get("MODEL_PATH", "model/model.pkl")

# Cached model bundle; populated on first use
_model = None


def load_model() -> Dict[str, Any]:
    """Load the trained model from disk."""
    global _model
    if _model is None:
        with open(MODEL_PATH, "rb") as f:
            _model = pickle.load(f)
        print(f"Model loaded from {MODEL_PATH}")
    return _model


def predict(features: list) -> float:
    """Run prediction on a list of feature values."""
    model_data = load_model()
    model = model_data["model"]
    # scikit-learn expects a 2D array: one row per sample
    features_array = np.array(features).reshape(1, -1)
    prediction = model.predict(features_array)
    return float(prediction[0])


def is_model_loaded() -> bool:
    """Check if the model is loaded and ready."""
    try:
        load_model()
        return True
    except Exception:
        return False
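FastAPI runs plain def endpoints in a thread pool, so several first requests can arrive at once and each try to read the model file. One possible refinement of the lazy-loading idea, sketched here with the standard library only, guards the load with a lock. The names load_bundle and load_count are illustrative, and the pickled dict is a stand-in for the real model bundle.

```python
import pickle
import tempfile
import threading
from pathlib import Path
from typing import Any, Dict, Optional

_lock = threading.Lock()
_bundle: Optional[Dict[str, Any]] = None
load_count = 0  # exists only to make the behaviour observable in this demo


def load_bundle(path: Path) -> Dict[str, Any]:
    """Load the pickled bundle once, even if many threads call this at once."""
    global _bundle, load_count
    if _bundle is None:              # fast path once the bundle is cached
        with _lock:
            if _bundle is None:      # re-check after acquiring the lock
                with open(path, "rb") as f:
                    _bundle = pickle.load(f)
                load_count += 1
    return _bundle


# Demo: eight threads race to load a stand-in bundle from a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model.pkl"
    demo_path.write_bytes(
        pickle.dumps({"model": None, "feature_names": ["age", "bmi"]})
    )
    threads = [
        threading.Thread(target=load_bundle, args=(demo_path,)) for _ in range(8)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

print(f"File was read {load_count} time(s)")
```

Note that pickle.load executes code embedded in the file, so only load model files that you produced yourself or otherwise trust.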
Finally, create the main FastAPI application in app/main.py:
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware

from app.model import is_model_loaded, predict
from app.schemas import HealthResponse, PredictionInput, PredictionOutput

app = FastAPI(
    title="ML Model Serving API",
    description="Production-ready API for serving machine learning predictions",
    version="1.0.0",
)

# Allow cross-origin requests (for frontend integration).
# A wildcard origin should not be combined with credentials; list explicit
# origins here if the frontend needs to send cookies or auth headers.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.get("/", tags=["Root"])
def root():
    return {
        "service": "ML Model API",
        "version": "1.0.0",
        "docs": "/docs",
    }


@app.get("/health", response_model=HealthResponse, tags=["Health"])
def health_check():
    """Health check endpoint for container orchestration."""
    loaded = is_model_loaded()
    return HealthResponse(
        status="healthy" if loaded else "degraded",
        model_loaded=loaded,
    )


@app.post("/predict", response_model=PredictionOutput, tags=["Prediction"])
def predict_endpoint(input_data: PredictionInput):
    """
    Predict diabetes progression from input features.

    Accepts 10 normalized medical features and returns a
    quantitative measure of diabetes progression one year after baseline.
    """
    if not is_model_loaded():
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Please try again later.",
        )

    # Order must match the column order used during training
    features = [
        input_data.age, input_data.sex, input_data.bmi, input_data.bp,
        input_data.s1, input_data.s2, input_data.s3, input_data.s4,
        input_data.s5, input_data.s6,
    ]

    try:
        result = predict(features)
        return PredictionOutput(prediction=round(result, 2))
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Prediction failed: {e}",
        ) from e
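The endpoint builds its feature list by hand, which only works while that list stays in the same order as the training columns. Since the training script already saves feature_names alongside the model, one option is to derive the row from that list. This is a sketch: to_feature_row is a hypothetical helper, and FEATURE_NAMES below simply repeats the column order of scikit-learn's diabetes dataset rather than reading it from model.pkl.

```python
from typing import Dict, List, Sequence

# Column order of scikit-learn's diabetes dataset (data.feature_names)
FEATURE_NAMES = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]


def to_feature_row(
    payload: Dict[str, float], feature_names: Sequence[str]
) -> List[float]:
    """Arrange payload values in training order, failing loudly on gaps."""
    missing = [name for name in feature_names if name not in payload]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(payload[name]) for name in feature_names]


# Keys arrive in arbitrary order; the row still comes out in training order
payload = {"s6": 0.6, "bmi": 0.3, "age": 0.1, "sex": 0.2, "bp": 0.4,
           "s1": 0.5, "s2": 0.7, "s3": 0.8, "s4": 0.9, "s5": 1.0}
row = to_feature_row(payload, FEATURE_NAMES)
print(row)
```

Inside the endpoint, the payload could come from input_data.model_dump() and the names from the loaded bundle's feature_names entry.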
Step 3: Define Dependencies
Create requirements.txt:
fastapi==0.104.1
uvicorn[standard]==0.24.0
scikit-learn==1.3.2
numpy==1.24.3
pydantic==2.5.2
python-multipart==0.0.6
Step 4: Containerize with Docker
Now for the most important part — creating a Docker image that packages everything together. Create the Dockerfile:
# Stage 1: Build stage
FROM python:3.11-slim AS builder
WORKDIR /build
# Install system dependencies for scikit-learn
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc build-essential && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Stage 2: Runtime stage (slimmer final image)
FROM python:3.11-slim
WORKDIR /app
# Copy installed packages from builder
COPY --from=builder /root/.local /root/.local
# Copy application code
COPY app/ ./app/
COPY model/ ./model/
# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH
ENV MODEL_PATH=/app/model/model.pkl
# Expose the API port
EXPOSE 8000
# Health check (standard library only; requests is not in requirements.txt)
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1
# Run with uvicorn
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
This Dockerfile uses a multi-stage build. The first stage installs compilers and the Python dependencies; the second stage copies only the installed packages, so the build toolchain never reaches the final image and the result is significantly smaller.
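A container health check needs a command that exits non-zero when the service is unhealthy, and it can only use what is installed in the image. As a sketch, the same probe can live in a small standard-library script instead of an inline one-liner. The file name healthcheck.py and the function names are hypothetical and not part of the project above.

```python
import http.client
import urllib.request


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True only if the URL answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and 4xx/5xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those)
        return False


def exit_code(url: str = "http://localhost:8000/health") -> int:
    """Map the probe result to a process exit code: 0 healthy, 1 unhealthy."""
    return 0 if is_healthy(url) else 1
```

Saved as healthcheck.py with a final line such as sys.exit(exit_code()), the script could be called from the HEALTHCHECK instruction in place of the inline python -c command.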
Step 5: Orchestrate with Docker Compose
For local development and testing, Docker Compose provides a convenient way to run the API with proper configuration. Create docker-compose.yml:
version: '3.8'

services:
  ml-api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ml-api
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/app/model/model.pkl
    volumes:
      - ./model:/app/model
    restart: unless-stopped
    healthcheck:
      # Standard library only; the requests package is not installed in the image
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 20s
Step 6: Build and Run
With all files in place, build and start the service:
# Build the Docker image
docker compose build
# Start the service in detached mode
docker compose up -d
# Check logs
docker compose logs -f
# Verify the health endpoint
curl http://localhost:8000/health
# Open the interactive API docs in your browser
# Visit: http://localhost:8000/docs
Step 7: Test the API
Once the container is running, send a prediction request:
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "age": 0.0380759,
    "sex": 0.0506801,
    "bmi": 0.0616962,
    "bp": 0.0218724,
    "s1": -0.0442235,
    "s2": -0.0348208,
    "s3": -0.0434008,
    "s4": -0.0025923,
    "s5": 0.0198619,
    "s6": -0.0176461
  }'
You should receive a response of the following shape (the prediction value shown is illustrative; yours depends on the trained model):
{
  "prediction": 150.21,
  "message": "Prediction generated successfully"
}
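The same request can be sent from Python. The sketch below uses only the standard library; build_predict_request and request_prediction are hypothetical helper names, and the default base URL assumes the container is published on localhost:8000 as configured earlier.

```python
import json
import urllib.request
from typing import Dict


def build_predict_request(
    features: Dict[str, float], base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(
    features: Dict[str, float],
    base_url: str = "http://localhost:8000",
    timeout: float = 5.0,
) -> float:
    """Send the request and return the "prediction" field of the response."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["prediction"]
```

Calling request_prediction with the ten features from the curl example returns the predicted value as a float; a validation failure surfaces as urllib.error.HTTPError with status 422.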
Step 8: Scale with Multiple Workers
For production deployments, you can increase throughput by running multiple workers with Gunicorn. Update the Dockerfile CMD:
CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "app.main:app", "--bind", "0.0.0.0:8000"]
Add gunicorn to your requirements.txt:
gunicorn==21.2.0
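Instead of hard-coding four workers, the settings can move into a gunicorn.conf.py file, which is plain Python. The sketch below uses the (2 x cores) + 1 starting point suggested in the Gunicorn documentation; the worker_count helper is hypothetical, and WEB_CONCURRENCY is used here as an optional override.

```python
import os
from typing import Optional


def worker_count(cores: Optional[int] = None, override: Optional[str] = None) -> int:
    """Pick a worker count: explicit override first, else (2 x cores) + 1."""
    if override:
        return max(1, int(override))
    if cores is None:
        cores = os.cpu_count() or 1
    return 2 * cores + 1


# Settings read by Gunicorn when this file is used as its config
workers = worker_count(override=os.environ.get("WEB_CONCURRENCY"))
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"
```

Each worker is a separate process with its own copy of the model in memory, and os.cpu_count() may report the host's cores rather than the container's CPU limit, so treat the formula as a starting point and set the override when the container is constrained.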
Best Practices for Production ML APIs
| Practice | Why It Matters |
| --- | --- |
| Input Validation | Never trust raw user input. Pydantic models prevent malformed requests from crashing your model. |
| Health Checks | Essential for container orchestrators (Kubernetes, ECS, Nomad) to know when your service is ready. |
| Model Versioning | Store models with version tags (e.g., model_v2.1.pkl) so you can roll back if needed. |
| Asynchronous Inference | For heavy models (deep learning), use background tasks or a task queue like Celery + Redis. |
| Logging & Monitoring | Add structured logging and metrics (Prometheus) to track prediction latency, error rates, and model drift. |
| Resource Limits | Set CPU/memory limits in Docker Compose or Kubernetes to prevent a single model from starving other services. |
| Caching | For expensive predictions on repeated inputs, consider an LRU cache with TTL. |
| Security | Add API keys, rate limiting, and input sanitization for public-facing endpoints. |
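As an illustration of the caching row, here is a minimal LRU cache with a time-to-live, written with the standard library. TTLCache is a hypothetical class, not part of the project above; it is not thread-safe as written, and a single-process cache like this is not shared between Gunicorn workers.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Optional, Tuple


class TTLCache:
    """Small LRU cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        max_size: int = 1024,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[Hashable, Tuple[float, float]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[float]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]       # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return value

    def put(self, key: Hashable, value: float) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

A prediction endpoint could use the tuple of input features as the key, call get before running the model, and call put with the result afterwards.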
Conclusion
You have now built a complete, containerized ML model serving API using FastAPI and Docker. Your API automatically generates interactive documentation, validates all inputs, includes health checks, and can be scaled horizontally with Docker Compose or deployed to any container orchestration platform.
This architecture serves as a production-ready template that you can adapt for any ML model — whether it's a simple linear regression, a random forest, or a deep neural network. The key takeaway is that separating model training from model serving, containerizing the serving layer, and using a framework with built-in validation and documentation leads to more reliable, maintainable, and scalable ML deployments.
Next steps: Try adding a /batch-predict endpoint that accepts multiple input rows, integrate model monitoring with Prometheus metrics, or deploy the container to AWS ECS, Google Cloud Run, or a Kubernetes cluster.
Pro Tip: Store your trained models in a cloud object store (S3, GCS) and download them at container startup using a startup script. This lets you update the model without rebuilding the Docker image — simply restart the container and it fetches the latest version.
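A startup download along these lines might look like the sketch below. It assumes the model is reachable through a plain URL (for example a pre-signed object-store link) rather than through a cloud SDK; fetch_model is a hypothetical helper, and the MODEL_URL environment variable mentioned afterwards is likewise an assumption.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_model(url: str, destination: Path, timeout: float = 30.0) -> Path:
    """Download url to destination, replacing any existing file atomically."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Write to a temporary file in the same directory first, so a failed
        # download never leaves a half-written model.pkl behind
        with tempfile.NamedTemporaryFile(
            dir=destination.parent, delete=False
        ) as tmp:
            shutil.copyfileobj(response, tmp)
            tmp_path = Path(tmp.name)
    os.replace(tmp_path, destination)
    return destination
```

A startup script could call fetch_model(os.environ["MODEL_URL"], Path(os.environ["MODEL_PATH"])) before launching uvicorn. Because the file is unpickled afterwards, download only from locations you control.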