Distributed Tracing with OpenTelemetry

Introduction

Figure 1: Distributed tracing architecture with OpenTelemetry SDKs, Collector, and Jaeger/Tempo backends

Modern microservices architectures are distributed by nature. A single user request can traverse dozens of services—from API gateways to authentication services, databases, queues, and third-party integrations. When something goes wrong, identifying the root cause becomes a nightmare without proper observability tools.

Metrics (Prometheus) and logs (ELK/Loki) tell you what happened. Distributed tracing adds the missing piece: it shows where in the request path the problem occurred and which services contributed to it, which makes the why much easier to find. This is where OpenTelemetry enters the picture.

OpenTelemetry (OTel), a CNCF project, has become the de facto vendor-neutral standard for generating, collecting, and exporting telemetry data. In this tutorial, you will learn how to implement distributed tracing in a microservices environment using OpenTelemetry, configure end-to-end trace propagation, visualize traces in Jaeger, and apply best practices for production deployments.

📖 What You Will Learn
Understanding distributed tracing and OpenTelemetry concepts
Instrumenting Python and Node.js microservices with OpenTelemetry SDKs
Configuring trace context propagation across service boundaries
Setting up the OpenTelemetry Collector for trace aggregation
Visualizing traces with Jaeger and Grafana Tempo
Production best practices for sampling, storage, and cost management

Prerequisites

Before starting, ensure you have the following:

Docker and Docker Compose installed (for running Jaeger and the OTel Collector)
Python 3.9+ and Node.js 18+ for service instrumentation
Basic familiarity with microservices architecture
A Kubernetes cluster (optional, for advanced deployment scenarios)

1. Understanding Distributed Tracing and OpenTelemetry

1.1 What Is Distributed Tracing?

Distributed tracing tracks the journey of a single request as it travels through multiple services. Each trace is composed of spans—individual units of work representing operations within a service. Spans carry metadata such as start time, duration, status, and attributes (e.g., HTTP method, status code, database query).
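
To make the relationship between a trace and its spans concrete, here is a minimal sketch in plain Python. The ToySpan class and its field names are assumptions made for illustration; they are not the OpenTelemetry SDK's data model, which carries more fields and is created through a tracer rather than by hand.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySpan:
    """Illustrative stand-in for a span; not an OpenTelemetry class."""
    name: str
    trace_id: str                     # shared by every span in the trace
    parent_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def child(self, name: str) -> "ToySpan":
        # A child inherits the trace ID and points back at its parent.
        return ToySpan(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self) -> float:
        self.end = time.monotonic()
        return self.end - self.start  # duration in seconds


# One request: a root span with a single child operation.
root = ToySpan(name="POST /orders", trace_id=secrets.token_hex(16))
payment = root.child("process_payment")
payment.attributes["http.method"] = "POST"
payment.finish()
root.finish()
```

Both spans share one trace ID, and the child's parent_id is what lets a backend rebuild the tree shown in a waterfall view.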

A trace is identified by a unique Trace ID, which is propagated across service boundaries via HTTP headers (e.g., traceparent). This allows all spans belonging to the same request to be correlated, even across different programming languages and infrastructure boundaries.
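
The traceparent header follows the W3C Trace Context format: a version, a 32-character hex trace ID, a 16-character hex parent span ID, and a flags byte whose lowest bit marks the trace as sampled. The sketch below builds and parses that header with the standard library only. The function names are hypothetical; in real services the OpenTelemetry propagators do this for you.

```python
import re

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
parsed = parse_traceparent(header)
```

A service that receives this header keeps the trace ID, records the incoming span ID as the parent of its own span, and sends a new traceparent (same trace ID, its own span ID) on any outgoing call.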

1.2 OpenTelemetry Architecture

OpenTelemetry provides three core components:

Component	Role	Example
OTel SDK	Instrumentation libraries embedded in your application code	Python SDK, Node.js SDK, Java SDK
OTel Collector	Central agent that receives, processes, and exports telemetry	otel-collector binary or Docker image
Backend	Storage and visualization platform	Jaeger, Grafana Tempo, Zipkin, Datadog

The SDK sends trace data to the Collector via OTLP (OpenTelemetry Protocol). The Collector can then filter, batch, sample, and forward traces to one or more backends.

2. Setting Up the Tracing Infrastructure

2.1 Deploying Jaeger and the OTel Collector

We will use Docker Compose to spin up a development environment with Jaeger (for trace storage and visualization) and the OpenTelemetry Collector (for trace ingestion and processing).

Create a docker-compose.yml file:

version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "14250:14250" # Jaeger gRPC for OTel Collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true

2.2 Configuring the OpenTelemetry Collector

Create otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

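exporters:
  # The dedicated jaeger exporter has been removed from recent Collector
  # releases. Jaeger accepts OTLP directly (COLLECTOR_OTLP_ENABLED=true),
  # so the Collector no longer needs Jaeger's port 14250.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]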
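
A common mistake in Collector configs is a pipeline that names a component which is never defined, which typically stops the Collector at startup. The sketch below checks for that on a plain dict shaped like a Collector configuration. The undefined_components helper and the sample dict (including its debug exporter) are illustrative assumptions, not part of any OpenTelemetry tooling.

```python
def undefined_components(config: dict) -> list:
    """List 'pipeline/section/name' entries referenced but never defined."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section) or {}
            for component in pipeline.get(section, []):
                if component not in defined:
                    problems.append(f"{pipeline_name}/{section}/{component}")
    return problems


# Hypothetical config with the same four top-level sections as a real one.
sample_config = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"debug": {}},
    "service": {"pipelines": {"traces": {
        "receivers": ["otlp"],
        "processors": ["memory_limiter", "batch"],
        "exporters": ["debug"],
    }}},
}

issues = undefined_components(sample_config)  # empty list: every reference resolves
```

To run the same check on your own file, load the YAML into a dict first (for example with PyYAML, which is a third-party package) and pass the result to the function.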

Start the infrastructure:

docker-compose up -d

Verify the services are running:

docker-compose ps
curl -s http://localhost:16686/ | head -5  # Jaeger UI should respond
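
If you prefer to script this check, the following sketch probes the three ports published by the Compose file using only the standard library. The function names are hypothetical, and a successful TCP connection only shows that something is listening, not that the service behind the port is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host: str = "localhost") -> dict:
    """Probe the ports published by the docker-compose.yml in this tutorial."""
    ports = {"otlp-grpc": 4317, "otlp-http": 4318, "jaeger-ui": 16686}
    return {name: port_open(host, port) for name, port in ports.items()}
```

Calling check_stack() returns a dict such as one mapping "otlp-grpc" to True once the containers are up; any False entry points at the container to inspect with docker-compose logs.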

3. Instrumenting a Python Microservice

3.1 Installing the OpenTelemetry Python SDK

Create a Python microservice that serves as an order service. Install the required packages:

pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc \
            opentelemetry-instrumentation-flask \
            opentelemetry-instrumentation-requests \
            flask requests

3.2 Writing the Instrumented Service

Create order_service.py:

from flask import Flask, jsonify, request
import requests
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the tracer provider
resource = Resource.create({"service.name": "order-service"})
provider = TracerProvider(resource=resource)

# Use the OTLP exporter to send traces to the Collector
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    insecure=True
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

app = Flask(__name__)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

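@app.route("/orders", methods=["POST"])
def create_order():
    order_data = request.get_json(silent=True) or {}
    user_id = order_data.get("user_id")
    amount = order_data.get("amount")
    if user_id is None or amount is None:
        # Span attribute values must not be None, so reject incomplete orders early
        return jsonify({"error": "user_id and amount are required"}), 400

    # Call the payment service (simulated downstream service)
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.user_id", user_id)
        span.set_attribute("order.amount", amount)

        # The hostname payment-service must resolve (e.g. inside Docker Compose);
        # use localhost instead when both services run directly on your machine.
        payment_response = requests.post(
            "http://payment-service:5001/process-payment",
            json={"user_id": user_id, "amount": amount},
            timeout=5,
        )
        span.set_attribute("payment.status", payment_response.status_code)

    if not payment_response.ok:
        # Do not report the order as created when the payment was declined
        return jsonify({"status": "payment_failed", "user_id": user_id}), 402

    return jsonify({"status": "order_created", "user_id": user_id}), 201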

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

3.3 Key Concepts in the Code

FlaskInstrumentor automatically wraps every incoming HTTP request in a span, capturing method, URL, status code, and duration.
RequestsInstrumentor wraps outgoing HTTP calls, propagating the trace context via the traceparent header.
Manual spans (e.g., process_payment) let you add business-specific attributes and create logical groupings within a request.
BatchSpanProcessor buffers and exports spans in batches for efficiency.
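
The batching idea can be sketched in a few lines of plain Python: collect items, then export them together once the batch is full or the oldest item has waited too long. ToyBatcher and its parameters are assumptions for illustration. The real BatchSpanProcessor differs in important ways, including exporting from a background thread and bounding its queue.

```python
import time


class ToyBatcher:
    """Collects items and flushes on batch size or max delay (illustrative only)."""

    def __init__(self, export, max_batch_size=512, max_delay_s=5.0, clock=time.monotonic):
        self._export = export              # callable that receives a list of items
        self._max_batch_size = max_batch_size
        self._max_delay_s = max_delay_s
        self._clock = clock                # injectable so tests can fake time
        self._buffer = []
        self._first_item_at = None

    def add(self, item):
        if not self._buffer:
            self._first_item_at = self._clock()
        self._buffer.append(item)
        too_big = len(self._buffer) >= self._max_batch_size
        too_old = self._clock() - self._first_item_at >= self._max_delay_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        # Swap the buffer out before exporting so new items start a fresh batch.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


exported = []
batcher = ToyBatcher(exported.append, max_batch_size=3, max_delay_s=60)
for span_name in ["a", "b", "c", "d"]:
    batcher.add(span_name)
batcher.flush()  # flush the remainder, as you would on shutdown
```

The trade-off is the usual one for batching: larger batches mean fewer network calls, while a shorter delay means spans reach the backend sooner and fewer are lost if the process dies.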

4. Instrumenting a Node.js Microservice

4.1 Installing the OpenTelemetry Node.js SDK

Now let us create a payment service in Node.js that the Python order service will call. Initialize the project:

mkdir payment-service && cd payment-service
npm init -y
npm install @opentelemetry/api \
            @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-trace-otlp-grpc \
            express

4.2 Setting Up Instrumentation

Create instrumentation.js as the entry point (must be loaded before any other module):

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  // serviceName sets the service.name resource attribute without requiring
  // the @opentelemetry/resources and @opentelemetry/semantic-conventions
  // packages, which the install command above does not include.
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

4.3 Writing the Service

Create payment_service.js:

const express = require('express');
const app = express();
app.use(express.json());
const port = 5001;

app.post('/process-payment', (req, res) => {
  const { user_id, amount } = req.body;
  console.log(`Processing payment for user ${user_id}: $${amount}`);
  
  // Simulate payment processing delay
  const paymentSuccess = Math.random() > 0.1; // 90% success rate
  setTimeout(() => {
    if (paymentSuccess) {
      res.json({ status: 'payment_success', transaction_id: `txn_${Date.now()}` });
    } else {
      res.status(402).json({ status: 'payment_failed', reason: 'insufficient_funds' });
    }
  }, 100 + Math.random() * 200); // 100-300ms delay
});

app.get('/health', (req, res) => {
  res.json({ status: 'healthy' });
});

app.listen(port, () => {
  console.log(`Payment service listening on port ${port}`);
});

Start the payment service with instrumentation enabled:

node -r ./instrumentation.js payment_service.js

5. Verifying Trace Propagation

5.1 Sending Test Requests

With both services and the tracing infrastructure running, send a test request through the order service:

curl -X POST http://localhost:5000/orders \
  -H "Content-Type: application/json" \
  -d '{"user_id": "user-42", "amount": 99.95}'
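
When testing propagation it helps to choose the trace ID yourself, so you can look up that exact trace afterwards. The sketch below builds the same POST request with a traceparent header attached. It assumes the order service uses the default W3C propagator, in which case it continues the incoming trace instead of starting a new one. The build_order_request helper is hypothetical.

```python
import json
import secrets
import urllib.request


def build_order_request(base_url, user_id, amount, trace_id=None):
    """Build a POST /orders request carrying a caller-chosen trace ID."""
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)                 # 16 hex characters
    body = json.dumps({"user_id": user_id, "amount": amount}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/orders",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # "01" sets the sampled flag so downstream services record the trace
            "traceparent": f"00-{trace_id}-{span_id}-01",
        },
    )
    return request, trace_id


request, trace_id = build_order_request("http://localhost:5000", "user-42", 99.95)
```

Pass the request to urllib.request.urlopen(request, timeout=10) to send it, then paste trace_id into the trace ID lookup field in the Jaeger UI.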

5.2 Visualizing Traces in Jaeger

Open the Jaeger UI at http://localhost:16686, then:

Select order-service from the Service dropdown
Click Find Traces to see recent traces
Click on a trace to view its waterfall view

The waterfall should show:

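order-service: A root span for the HTTP POST /orders request
process_payment: A child span created manually in the Python code
order-service (HTTP client): A span created by RequestsInstrumentor for the outgoing POST to the payment service
payment-service: A server span for the HTTP POST /process-payment, linked automatically via trace context propagation (the Node.js auto-instrumentation may add further Express middleware spans beneath it)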

Each span displays start time, duration, status (OK/error), and custom attributes like order.user_id and payment.status.

6. Advanced Trace Enrichment

6.1 Adding Custom Attributes and Events

Attributes help you filter and search traces. Events let you record significant moments within a span:

with tracer.start_as_current_span("database_query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE user_id = ?")
    span.set_attribute("db.operation", "SELECT")
    
    # Record events at specific moments
    span.add_event("query_started", {"query_id": "q-123"})
    # ... execute query ...
    span.add_event("query_completed", {"rows_returned": 5})

6.2 Span Status and Error Tracking

Always mark spans that encounter errors:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("external_api_call") as span:
    try:
        response = requests.get("https://api.example.com/data")
        response.raise_for_status()
        span.set_status(Status(StatusCode.OK))
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise

The record_exception() method adds an exception event to the span with the stack trace, making debugging significantly easier from the Jaeger UI.
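
To show roughly what ends up on the span, the sketch below builds a comparable event as a plain dict. The exception.type, exception.message, and exception.stacktrace keys follow the OpenTelemetry semantic conventions; the exception_event helper itself is hypothetical, and the SDK's own output may differ in detail.

```python
import traceback


def exception_event(exc: BaseException) -> dict:
    """Build a dict shaped like the 'exception' span event (illustrative)."""
    return {
        "name": "exception",
        "attributes": {
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        },
    }


try:
    raise ValueError("payment gateway returned malformed JSON")
except ValueError as err:
    event = exception_event(err)
```

Because the stack trace travels with the span, you can read it next to the timing and attributes of the failing operation instead of searching the logs for a matching timestamp.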

Figure 2: Sampling strategies and pipeline configuration for production distributed tracing

7. Production Considerations

7.1 Sampling Strategies

In production, collecting every single trace is expensive and often unnecessary. OpenTelemetry supports several sampling strategies:

Strategy	Description	When to Use
Head-Based (Probability)	Randomly sample a percentage of traces (e.g., 10%)	High-traffic services where you want statistical coverage
Tail-Based	Keep all traces initially, then selectively retain interesting ones (e.g., errors, slow requests)	When you need full coverage but want to control storage costs
Rate Limiting	Cap the number of traces per second	Services with unpredictable traffic spikes

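Configure probabilistic sampling in the OTel Collector. Defining the processor is not enough: it only takes effect once probabilistic_sampler is also listed in the processors of the traces pipeline: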

processors:
  probabilistic_sampler:
    hash_seed: 42
    sampling_percentage: 10.0   # Keep 10% of traces
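
The key property of this kind of sampling is that the decision depends only on the trace ID and a seed, so every span of a trace gets the same verdict no matter which Collector instance sees it. The sketch below illustrates the idea with SHA-256 as a stand-in hash. That choice is an assumption for illustration; the Collector uses its own hashing scheme, and keep_trace is a hypothetical name.

```python
import hashlib


def keep_trace(trace_id: str, sampling_percentage: float, hash_seed: int = 42) -> bool:
    """Deterministically decide whether to keep a trace, based only on its ID."""
    digest = hashlib.sha256(f"{hash_seed}:{trace_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # 10.0 percent keeps buckets 0..999, i.e. roughly one trace in ten.
    return bucket < sampling_percentage * 100


trace_ids = [f"{i:032x}" for i in range(1, 20001)]
kept_fraction = sum(keep_trace(t, 10.0) for t in trace_ids) / len(trace_ids)
```

This also explains why Collectors that sample the same traffic should share one hash_seed: with different seeds they would keep different subsets and leave traces incomplete.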

7.2 Context Propagation Across Protocols

OpenTelemetry supports the W3C Trace Context standard (traceparent header). When services communicate over message queues (Kafka, RabbitMQ, SQS), you must propagate the trace context manually through message headers:

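# Python example: Propagating context through Kafka headers
# (assumes the kafka-python client and an existing KafkaProducer named producer)
from opentelemetry.propagate import inject

carrier = {}
inject(carrier)  # writes traceparent (and tracestate, if set) into the dict

# kafka-python expects headers as a list of (str, bytes) tuples, not a dict
headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]
producer.send('orders-topic', value=message_body, headers=headers)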

On the consumer side, extract the context and create a child span:

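from opentelemetry.propagate import extract

# message is assumed to be a record from a kafka-python KafkaConsumer;
# its headers arrive as (str, bytes) tuples and must be decoded first
carrier = {key: value.decode("utf-8") for key, value in (message.headers or [])}
ctx = extract(carrier)
with tracer.start_as_current_span("process_order_from_queue", context=ctx):
    # Process the message
    pass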

7.3 Storage Considerations

Distributed tracing generates significant data volume. For production deployments:

Use Grafana Tempo with object storage (S3, GCS) for cost-effective long-term retention
Configure retention policies: keep raw traces for 7 days, aggregated metrics for 30+ days
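Set span limits in the SDK, for example with the OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT and OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT environment variables (a reasonable starting point is 8KB per attribute value and the default of 128 attributes per span)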
Consider using tail sampling to retain only error traces and slow traces while dropping everything else
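
The retention rule in the last point can be sketched as a decision taken once all spans of a trace have arrived. The FinishedSpan type, the retain_trace function, and the 500 ms threshold are assumptions for illustration; in practice this logic is configured in the Collector's tail sampling processor rather than written by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    """Hypothetical summary of a completed span."""
    name: str
    duration_ms: float
    is_error: bool = False


def retain_trace(spans: List[FinishedSpan], slow_threshold_ms: float = 500.0) -> bool:
    """Tail-sampling style decision made once all spans of a trace are known."""
    if any(span.is_error for span in spans):
        return True                       # always keep traces containing errors
    longest = max((span.duration_ms for span in spans), default=0.0)
    return longest >= slow_threshold_ms   # keep slow traces, drop the rest


healthy = [FinishedSpan("POST /orders", 120.0), FinishedSpan("process_payment", 90.0)]
failed = [FinishedSpan("POST /orders", 130.0), FinishedSpan("process_payment", 95.0, is_error=True)]
slow = [FinishedSpan("POST /orders", 900.0)]
```

The cost of this approach is memory: the sampler has to hold every span of a trace until the trace is complete, which is why tail sampling usually runs on a dedicated gateway Collector.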

7.4 Deploying the OTel Collector on Kubernetes

For Kubernetes deployments, deploy the OpenTelemetry Collector as a DaemonSet (one per node) for trace collection, plus a Gateway Collector for aggregation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 1s
    exporters:
      otlp:
        endpoint: "otel-collector-gateway:4317"
        tls:
          insecure: true   # assumes plaintext gRPC inside the cluster; remove if the gateway serves TLS
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]

Mount this ConfigMap into a Collector DaemonSet so that every pod in your cluster can send traces to the Collector agent running on its own node.

💡 Pro Tip
Start by instrumenting just your critical path services (API gateway, auth, payment) and validate trace propagation before rolling out to your entire fleet. A phased approach reduces risk and helps your team learn the OpenTelemetry APIs gradually.

8. Integrating with Grafana for Unified Observability

To get a unified view of metrics, logs, and traces in one dashboard, configure Grafana with Tempo as the trace datasource:

Run Tempo: docker run -p 3200:3200 grafana/tempo:latest (in practice you will usually also mount a configuration file and pass it with -config.file)
Configure the OTel Collector to export traces to Tempo instead of Jaeger
In Grafana, add a Tempo datasource pointing to http://tempo:3200
Use Grafana Explore to search traces by service name, duration, or tags
Link to trace spans directly from your Grafana dashboards for seamless drill-downs

This integration brings together the “three pillars” of observability: metrics alert you to a problem, logs provide context, and traces pinpoint where in the request path it originated.

Conclusion

Distributed tracing with OpenTelemetry transforms how you debug and optimize microservices architectures. By instrumenting your services with just a few lines of code, you gain end-to-end visibility into every request’s journey across your entire system.

In this tutorial, you:

Deployed Jaeger and the OpenTelemetry Collector as your tracing backend
Instrumented Python (Flask) and Node.js (Express) services with automatic and manual instrumentation
Verified end-to-end trace propagation with context headers
Applied production-grade sampling, error tracking, and storage strategies

The next step is to extend this setup to all your services, configure tail-based sampling, and integrate tracing into your existing Grafana dashboards. With OpenTelemetry’s growing ecosystem, you can also add support for gRPC, GraphQL, databases, and message queues—all with the same consistent API.

🏫 Next Steps
Explore the OpenTelemetry documentation for additional SDKs (Java, Go, .NET), advanced sampling configurations, and integration with your existing monitoring stack.