Introduction
Figure 1: Distributed tracing architecture with OpenTelemetry SDKs, Collector, and Jaeger/Tempo backends
Modern microservices architectures are distributed by nature. A single user request can traverse dozens of services, from API gateways to authentication services, databases, queues, and third-party integrations. When something goes wrong, identifying the root cause is slow and error-prone without proper observability tools.
While metrics (Prometheus) and logs (ELK/Loki) tell you what happened, distributed tracing shows where in the request path it happened and, by exposing timing and errors at each hop, helps explain why. This is where OpenTelemetry enters the picture.
OpenTelemetry (OTel) has become the industry standard for generating, collecting, and exporting telemetry data. In this tutorial, you will learn how to implement distributed tracing in a microservices environment using OpenTelemetry, configure end-to-end trace propagation, visualize traces in Jaeger, and apply best practices for production deployments.
📖 What You Will Learn
- Understanding distributed tracing and OpenTelemetry concepts
- Instrumenting Python and Node.js microservices with OpenTelemetry SDKs
- Configuring trace context propagation across service boundaries
- Setting up the OpenTelemetry Collector for trace aggregation
- Visualizing traces with Jaeger and Grafana Tempo
- Production best practices for sampling, storage, and cost management
Prerequisites
Before starting, ensure you have the following:
- Docker and Docker Compose installed (for running Jaeger and the OTel Collector)
- Python 3.9+ and Node.js 18+ for service instrumentation
- Basic familiarity with microservices architecture
- A Kubernetes cluster (optional, for advanced deployment scenarios)
1. Understanding Distributed Tracing and OpenTelemetry
1.1 What Is Distributed Tracing?
Distributed tracing tracks the journey of a single request as it travels through multiple services. Each trace is composed of spans—individual units of work representing operations within a service. Spans carry metadata such as start time, duration, status, and attributes (e.g., HTTP method, status code, database query).
A trace is identified by a unique Trace ID, which is propagated across service boundaries via HTTP headers (e.g., traceparent). This allows all spans belonging to the same request to be correlated, even across different programming languages and infrastructure boundaries.
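To make the header format concrete, here is a minimal sketch of parsing a W3C traceparent value using only the standard library. The function and class names (parse_traceparent, TraceParent) are hypothetical and not part of any OpenTelemetry package, and the sketch assumes the four-field layout used by version 00 of the header; in practice the SDK's propagators do this for you.

```python
import re
from typing import NamedTuple, Optional

# Layout: version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are treated as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(example)
print(parsed)
```

Every service that receives this header reuses the same trace_id and records the incoming parent_span_id as the parent of its own span, which is what lets the backend stitch the spans into one trace.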
1.2 OpenTelemetry Architecture
A typical OpenTelemetry tracing pipeline has three core components (the backend is a separate project that OpenTelemetry exports to):
| Component | Role | Example |
| OTel SDK | Instrumentation libraries embedded in your application code | Python SDK, Node.js SDK, Java SDK |
| OTel Collector | Central agent that receives, processes, and exports telemetry | otel-collector binary or Docker image |
| Backend | Storage and visualization platform | Jaeger, Grafana Tempo, Zipkin, Datadog |
The SDK sends trace data to the Collector via OTLP (OpenTelemetry Protocol). The Collector can then filter, batch, sample, and forward traces to one or more backends.
2. Setting Up the Tracing Infrastructure
2.1 Deploying Jaeger and the OTel Collector
We will use Docker Compose to spin up a development environment with Jaeger (for trace storage and visualization) and the OpenTelemetry Collector (for trace ingestion and processing).
Create a docker-compose.yml file:
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    depends_on:
      - jaeger
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "14250:14250" # Jaeger gRPC for OTel Collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
2.2 Configuring the OpenTelemetry Collector
Create otel-collector-config.yaml:
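receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  # The dedicated jaeger exporter has been removed from recent Collector
  # releases. Jaeger accepts OTLP natively (COLLECTOR_OTLP_ENABLED=true),
  # so export over OTLP gRPC to Jaeger's port 4317 instead.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]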
Start the infrastructure:
docker-compose up -d
Verify the services are running:
docker-compose ps
curl -s http://localhost:16686/ | head -5 # Jaeger UI should respond
3. Instrumenting a Python Microservice
3.1 Installing the OpenTelemetry Python SDK
Create a Python microservice that serves as an order service. Install the required packages:
pip install opentelemetry-api \
    opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests \
    flask requests
3.2 Writing the Instrumented Service
Create order_service.py:
from flask import Flask, jsonify, request
import requests
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the tracer provider
resource = Resource.create({"service.name": "order-service"})
provider = TracerProvider(resource=resource)

# Use the OTLP exporter to send traces to the Collector
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    insecure=True
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

app = Flask(__name__)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)


@app.route("/orders", methods=["POST"])
def create_order():
    order_data = request.get_json()
    user_id = order_data.get("user_id")

    # Call the payment service (simulated downstream service)
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.user_id", user_id)
        span.set_attribute("order.amount", order_data.get("amount"))
        payment_response = requests.post(
            "http://payment-service:5001/process-payment",
            json={"user_id": user_id, "amount": order_data.get("amount")}
        )
        span.set_attribute("payment.status", payment_response.status_code)

    return jsonify({"status": "order_created", "user_id": user_id}), 201


@app.route("/health")
def health():
    return jsonify({"status": "healthy"})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
3.3 Key Concepts in the Code
- FlaskInstrumentor automatically wraps every incoming HTTP request in a span, capturing method, URL, status code, and duration.
- RequestsInstrumentor wraps outgoing HTTP calls, propagating the trace context via the traceparent header.
- Manual spans (e.g., process_payment) let you add business-specific attributes and create logical groupings within a request.
- BatchSpanProcessor buffers and exports spans in batches for efficiency.
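The batching idea in the last point can be illustrated with a toy buffer. This is a deliberately simplified sketch, not the SDK's BatchSpanProcessor: the class name ToyBatcher and its max_batch_size parameter are hypothetical, it only flushes when the buffer fills or when flush() is called, and it omits the timer-based export and background thread a real processor would use.

```python
from typing import Callable, List


class ToyBatcher:
    """Buffers span names and hands them to `export` in batches."""

    def __init__(self, export: Callable[[List[str]], None], max_batch_size: int = 3):
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: List[str] = []

    def on_end(self, span_name: str) -> None:
        # Called once per finished span; export only when the batch is full.
        self._buffer.append(span_name)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Export whatever is buffered, then start a fresh batch.
        if self._buffer:
            self._export(self._buffer)
            self._buffer = []


exported: List[List[str]] = []
batcher = ToyBatcher(exported.append, max_batch_size=3)
for i in range(7):
    batcher.on_end(f"span-{i}")
batcher.flush()  # on shutdown, drain whatever is left
print([len(batch) for batch in exported])  # → [3, 3, 1]
```

Seven spans result in three export calls instead of seven, which is the efficiency gain batching is meant to provide. The final flush() also shows why a service should shut its tracer provider down cleanly: spans still in the buffer are otherwise lost.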
4. Instrumenting a Node.js Microservice
4.1 Installing the OpenTelemetry Node.js SDK
Now let us create a payment service in Node.js that the Python order service will call. Initialize the project:
mkdir payment-service && cd payment-service
npm init -y
npm install @opentelemetry/api \
    @opentelemetry/sdk-node \
    @opentelemetry/auto-instrumentations-node \
    @opentelemetry/exporter-trace-otlp-grpc \
    express
4.2 Setting Up Instrumentation
Create instrumentation.js as the entry point (must be loaded before any other module):
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  // serviceName sets the service.name resource attribute directly. This avoids
  // importing @opentelemetry/resources and @opentelemetry/semantic-conventions,
  // which are not in the install command and whose exports differ between SDK versions.
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
4.3 Writing the Service
Create payment_service.js:
const express = require('express');

const app = express();
app.use(express.json());
const port = 5001;

app.post('/process-payment', (req, res) => {
  const { user_id, amount } = req.body;
  console.log(`Processing payment for user ${user_id}: $${amount}`);

  // Simulate payment processing delay
  const paymentSuccess = Math.random() > 0.1; // 90% success rate
  setTimeout(() => {
    if (paymentSuccess) {
      res.json({ status: 'payment_success', transaction_id: `txn_${Date.now()}` });
    } else {
      res.status(402).json({ status: 'payment_failed', reason: 'insufficient_funds' });
    }
  }, 100 + Math.random() * 200); // 100-300ms delay
});

app.get('/health', (req, res) => {
  res.json({ status: 'healthy' });
});

app.listen(port, () => {
  console.log(`Payment service listening on port ${port}`);
});
Start the payment service with instrumentation enabled:
node -r ./instrumentation.js payment_service.js
5. Verifying Trace Propagation
5.1 Sending Test Requests
With both services and the tracing infrastructure running, send a test request through the order service:
curl -X POST http://localhost:5000/orders \
-H "Content-Type: application/json" \
-d '{"user_id": "user-42", "amount": 99.95}'
5.2 Visualizing Traces in Jaeger
Open the Jaeger UI at http://localhost:16686, then:
- Select order-service from the Service dropdown
- Click Find Traces to see recent traces
- Click on a trace to open its waterfall view
The waterfall should show:
- order-service: A root span for the HTTP POST /orders request
- process_payment: A child span created manually in the Python code
- payment-service: A child span for the HTTP POST /process-payment request (automatically linked via trace context propagation)
Each span displays start time, duration, status (OK/error), and custom attributes like order.user_id and payment.status.
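The waterfall is simply the parent/child relationships between spans rendered as a tree. The sketch below rebuilds that nesting from a flat list of spans; the span IDs are made up for illustration, and the list includes the client span that RequestsInstrumentor creates for the outgoing call, so a real trace may show a few more rows than the summary above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (span_id, parent_span_id, label) with hypothetical IDs for illustration.
Span = Tuple[str, Optional[str], str]

spans: List[Span] = [
    ("a1", None, "order-service: POST /orders"),
    ("b2", "a1", "order-service: process_payment"),
    ("c3", "b2", "order-service: POST (outgoing requests call)"),
    ("d4", "c3", "payment-service: POST /process-payment"),
]


def render_waterfall(spans: List[Span]) -> List[str]:
    """Return one indented line per span, children nested under their parent."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span_id, _, label in children[parent_id]:
            lines.append("  " * depth + label)
            walk(span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


for line in render_waterfall(spans):
    print(line)
```

The payment-service span is the only one recorded by a different process. Its parent ID comes from the traceparent header sent by the order service, which is why broken propagation shows up in Jaeger as two separate traces rather than one nested tree.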
6. Advanced Trace Enrichment
6.1 Adding Custom Attributes and Events
Attributes help you filter and search traces. Events let you record significant moments within a span:
with tracer.start_as_current_span("database_query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE user_id = ?")
    span.set_attribute("db.operation", "SELECT")

    # Record events at specific moments
    span.add_event("query_started", {"query_id": "q-123"})
    # ... execute query ...
    span.add_event("query_completed", {"rows_returned": 5})
6.2 Span Status and Error Tracking
Always mark spans that encounter errors:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("external_api_call") as span:
    try:
        response = requests.get("https://api.example.com/data")
        response.raise_for_status()
        span.set_status(Status(StatusCode.OK))
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
The record_exception() method adds an exception event to the span with the stack trace, making debugging significantly easier from the Jaeger UI.
Figure 2: Sampling strategies and pipeline configuration for production distributed tracing
7. Production Considerations
7.1 Sampling Strategies
In production, collecting every single trace is expensive and often unnecessary. OpenTelemetry supports several sampling strategies:
| Strategy | Description | When to Use |
| Head-Based (Probability) | Randomly sample a percentage of traces (e.g., 10%) | High-traffic services where you want statistical coverage |
| Tail-Based | Buffer all spans of a trace, then decide once the trace is complete whether to retain it (e.g., keep errors and slow requests) | When you need full coverage of interesting traces but want to control storage costs |
| Rate Limiting | Cap the number of traces per second | Services with unpredictable traffic spikes |
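Configure head-based sampling in the OTel Collector. The processor only takes effect once probabilistic_sampler is also listed under processors in the traces pipeline:
processors:
  probabilistic_sampler:
    hash_seed: 42
    sampling_percentage: 10.0 # Keep 10% of traces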
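The key property of probability sampling is that the decision is derived from the trace ID, so every service and Collector instance reaches the same verdict for a given trace without coordinating. The sketch below illustrates that idea only. It is not the Collector's actual algorithm (which hashes the trace ID together with hash_seed), the function name should_sample is hypothetical, and it assumes trace IDs are uniformly random.

```python
import random


def should_sample(trace_id_hex: str, sampling_percentage: float) -> bool:
    """Deterministic keep/drop decision derived from the trace ID alone."""
    # Treat the low 56 bits of the 128-bit trace ID as a pseudo-random number.
    randomness = int(trace_id_hex, 16) & ((1 << 56) - 1)
    threshold = int((sampling_percentage / 100.0) * (1 << 56))
    return randomness < threshold


rng = random.Random(7)  # fixed seed so the demo is repeatable
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(trace_id, 10.0) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the result depends only on the trace ID and the percentage, a trace is either kept in full or dropped in full, which avoids traces with missing spans.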
7.2 Context Propagation Across Protocols
OpenTelemetry supports the W3C Trace Context standard (traceparent header). When services communicate over message queues (Kafka, RabbitMQ, SQS), you must propagate the trace context manually through message headers:
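# Python example: Propagating context through Kafka headers
# (assumes `producer` is a kafka-python KafkaProducer)
from opentelemetry.propagate import inject

carrier = {}
inject(carrier)  # writes traceparent (and tracestate, if present) into the dict
# kafka-python expects headers as a list of (str, bytes) tuples, not a dict
headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]
producer.send('orders-topic', value=message_body, headers=headers)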
On the consumer side, extract the context and create a child span:
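from opentelemetry.propagate import extract

# Assumes `message` is a kafka-python ConsumerRecord, whose headers
# arrive as a list of (str, bytes) tuples
carrier = {key: value.decode("utf-8") for key, value in message.headers}
ctx = extract(carrier)
with tracer.start_as_current_span("process_order_from_queue", context=ctx):
    # Process the message
    pass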
7.3 Storage Considerations
Distributed tracing generates significant data volume. For production deployments:
- Use Grafana Tempo with object storage (S3, GCS) for cost-effective long-term retention
- Configure retention policies: keep raw traces for 7 days, aggregated metrics for 30+ days
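- Set span size limits (e.g., max 8KB per span attribute value, max 128 attributes per span); the SDKs expose these through the OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT and OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT environment variables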
- Consider using tail sampling to retain only error traces and slow traces while dropping everything else
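The tail sampling policy described in the last point can be sketched as follows. This is an illustration of the decision logic only, not the Collector's tail sampling processor: the FinishedSpan class, the traces_to_keep function, and the 500 ms threshold are assumptions, and the sketch presumes all spans of a trace have already been collected in one place.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FinishedSpan:
    trace_id: str
    duration_ms: float
    is_error: bool


def traces_to_keep(spans: List[FinishedSpan], slow_threshold_ms: float = 500.0) -> List[str]:
    """Return the IDs of traces that contain an error or a slow span."""
    # Group spans by trace, since the decision is made per trace, not per span.
    by_trace: Dict[str, List[FinishedSpan]] = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)

    keep: List[str] = []
    for trace_id, trace_spans in by_trace.items():
        has_error = any(s.is_error for s in trace_spans)
        is_slow = max(s.duration_ms for s in trace_spans) >= slow_threshold_ms
        if has_error or is_slow:
            keep.append(trace_id)
    return keep


demo_spans = [
    FinishedSpan("t1", 40.0, False),   # fast and healthy: dropped
    FinishedSpan("t2", 35.0, True),    # contains an error: kept
    FinishedSpan("t3", 900.0, False),  # slow: kept
    FinishedSpan("t1", 20.0, False),
]
print(traces_to_keep(demo_spans))  # → ['t2', 't3']
```

The need to see every span of a trace before deciding is what makes tail sampling more expensive to run than head sampling: spans must be buffered until the trace is considered complete, and all spans of one trace must reach the same Collector instance.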
7.4 Deploying the OTel Collector on Kubernetes
For Kubernetes deployments, deploy the OpenTelemetry Collector as a DaemonSet (one per node) for trace collection, plus a Gateway Collector for aggregation:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 1s
    exporters:
      otlp:
        endpoint: "otel-collector-gateway:4317"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
Deploy as a DaemonSet so every pod in your cluster can send traces to the local Collector agent.
💡 Pro Tip
Start by instrumenting just your critical path services (API gateway, auth, payment) and validate trace propagation before rolling out to your entire fleet. A phased approach reduces risk and helps your team learn the OpenTelemetry APIs gradually.
8. Integrating with Grafana for Unified Observability
To get a unified view of metrics, logs, and traces in one dashboard, configure Grafana with Tempo as the trace datasource:
- Install Tempo: docker run -p 3200:3200 grafana/tempo:latest
- Configure the OTel Collector to export traces to Tempo instead of Jaeger
- In Grafana, add a Tempo datasource pointing to http://tempo:3200
- Use Grafana Explore to search traces by service name, duration, or tags
- Link trace spans directly from your Grafana dashboards for seamless drill-down
This integration brings together the “three pillars” of observability: metrics alert you to a problem, logs provide context, and traces pinpoint the root cause.
Conclusion
Distributed tracing with OpenTelemetry transforms how you debug and optimize microservices architectures. By instrumenting your services with just a few lines of code, you gain end-to-end visibility into every request’s journey across your entire system.
In this tutorial, you:
- Deployed Jaeger and the OpenTelemetry Collector as your tracing backend
- Instrumented Python (Flask) and Node.js (Express) services with automatic and manual instrumentation
- Verified end-to-end trace propagation with context headers
- Applied production-grade sampling, error tracking, and storage strategies
The next step is to extend this setup to all your services, configure tail-based sampling, and integrate tracing into your existing Grafana dashboards. With OpenTelemetry’s growing ecosystem, you can also add support for gRPC, GraphQL, databases, and message queues—all with the same consistent API.
🏫 Next Steps
Explore the OpenTelemetry documentation for additional SDKs (Java, Go, .NET), advanced sampling configurations, and integration with your existing monitoring stack.