Setting Up Prometheus and Grafana for Cloud-Native Application Monitoring

Introduction

In today’s cloud-native landscape, observability is not a luxury — it’s a necessity. Microservices architectures, ephemeral containers, and dynamic orchestration platforms like Kubernetes have rendered traditional monitoring approaches obsolete. When your application spans dozens of services that scale up and down in seconds, you need a monitoring stack designed for that complexity.

Enter Prometheus and Grafana — the de facto open-source observability stack. Prometheus handles time-series data collection and alerting, while Grafana provides powerful dashboards and visualization. Together, they give you full visibility into your cloud-native applications.

In this tutorial, you will learn:

  • What Prometheus and Grafana are and how they complement each other
  • How to deploy Prometheus to scrape metrics from your applications
  • How to install and configure Grafana with Prometheus as a data source
  • How to build meaningful dashboards for cloud-native workloads
  • How to set up alerting rules and receive notifications
  • Best practices for production deployments

Prerequisites: Basic knowledge of Linux commands, Docker, and Docker Compose. A server with Docker Engine installed (or a local machine with Docker Desktop). About 30 minutes of your time.

Understanding the Stack

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a CNCF (Cloud Native Computing Foundation) graduated project — the same foundation that governs Kubernetes. Prometheus works on a pull model: it scrapes metrics from HTTP endpoints on your services at regular intervals, stores them in a time-series database, and evaluates alerting rules against that data.

Key features:

  • Multi-dimensional data model — Metrics are identified by a metric name and key/value pairs called labels
  • PromQL — A powerful query language for slicing and aggregating time-series data
  • Pull-based architecture — No agents required; just expose an HTTP endpoint
  • Service discovery — Automatically finds targets in Kubernetes, EC2, Consul, and more
  • Alertmanager — A companion service that receives alerts from Prometheus and handles deduplication, grouping, and routing
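
To make the data model concrete, here is a minimal sketch of what scraped samples look like and how a metric name plus labels identifies a series. The parser is illustrative only: it handles simple lines (no escaping, timestamps, or exemplars), and the function name and sample text are assumptions rather than anything Prometheus ships.

```python
import re

# Matches simple exposition lines such as:
#   http_requests_total{method="GET",code="200"} 1027
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Parse simple metric lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output, in the text format served at /metrics
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
up 1
"""

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

Each distinct combination of name and label values is its own time series, which is why label choice matters so much later on.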

What is Grafana?

Grafana is an open-source analytics and interactive visualization platform. It connects to virtually any data source — Prometheus, Elasticsearch, InfluxDB, PostgreSQL, AWS CloudWatch, and hundreds more — and lets you build rich dashboards with graphs, tables, heatmaps, and alerts.

Grafana does not store the time-series data itself. It queries Prometheus (or other backends) on the fly and renders the results. This separation of concerns means you can swap visualization tools without touching your data layer.
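
As a rough illustration of "queries on the fly", the sketch below builds the kind of request a dashboard panel boils down to: a call to Prometheus's /api/v1/query_range HTTP endpoint, followed by parsing the JSON matrix it returns. The helper names are hypothetical, and Grafana's actual data source plugin does considerably more than this.

```python
import json
import urllib.parse
import urllib.request


def build_range_query_url(base_url, promql, start, end, step):
    """Build a URL for Prometheus's /api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(response):
    """Turn a query_range response into {label-tuple: [(ts, value), ...]}."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error')}")
    series = {}
    for item in response["data"]["result"]:
        key = tuple(sorted(item["metric"].items()))
        # Prometheus encodes sample values as strings, so convert them here.
        series[key] = [(float(ts), float(v)) for ts, v in item["values"]]
    return series


def fetch_range(base_url, promql, start, end, step):
    """Run the query over HTTP (needs a reachable Prometheus server)."""
    url = build_range_query_url(base_url, promql, start, end, step)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_matrix(json.load(resp))
```

Once the stack from Step 1 is running, fetch_range("http://localhost:9090", "up", start, end, "15s") with Unix timestamps for start and end returns one entry per target.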

Step 1: Deploying Prometheus

Let’s start by deploying Prometheus using Docker Compose. Create a directory structure for the project:

mkdir -p ~/monitoring-stack/prometheus
mkdir -p ~/monitoring-stack/grafana
cd ~/monitoring-stack

Prometheus Configuration

Create the Prometheus configuration file at ~/monitoring-stack/prometheus/prometheus.yml:

global:
  scrape_interval: 15s      # How often to scrape targets by default
  evaluation_interval: 15s  # How often to evaluate alerting rules
  scrape_timeout: 10s       # Timeout for each scrape request

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Rule files to load
rule_files:
  # - "alerts.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics (we'll add this)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # A sample application (we'll add this)
  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8080']

Docker Compose Setup

Create ~/monitoring-stack/docker-compose.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    command:
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro,rslave
    ports:
      - "9100:9100"
    restart: unless-stopped

  sample-app:
    image: prom/example-app:latest
    container_name: sample-app
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Start the Stack

Launch all services:

docker-compose up -d

Verify that everything is running:

docker-compose ps

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq .

Open your browser to http://localhost:9090/targets — you should see all targets as “UP”.
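
If you prefer a scripted check over the browser, the same /api/v1/targets endpoint can be summarized in a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and it assumes Prometheus is reachable on localhost:9090 as configured above.

```python
import json
import urllib.request


def summarize_targets(payload):
    """Return {job: {"up": n, "down": n}} from a /api/v1/targets response."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        counts = summary.setdefault(job, {"up": 0, "down": 0})
        # "health" is "up", "down", or "unknown"; count anything not "up" as down.
        counts["up" if target["health"] == "up" else "down"] += 1
    return summary


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets payload (needs the stack from this tutorial running)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

Calling summarize_targets(fetch_targets()) against the running stack should list the prometheus, node, and sample-app jobs.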

Tip: Prometheus exposes its own metrics at /metrics. Exporters use the same path by convention, and it is also Prometheus's default metrics_path, so most scrape jobs need only a host and port.

Step 2: Configuring Grafana

Open http://localhost:3000 in your browser. Log in with username admin and password admin (you’ll be prompted to change it on first login).

Add Prometheus as a Data Source

  1. Click Connections (gear icon) → Data Sources → Add data source
  2. Search for and select Prometheus
  3. In the URL field, enter: http://prometheus:9090
  4. Click Save & Test — you should see a green confirmation: “Data source is working”

The URL uses the Docker Compose service name (prometheus) because Grafana and Prometheus are on the same Docker network. If they were on separate machines, you’d use the actual IP or hostname.
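
The same data source can also be registered without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request so you can inspect it first; the function name is hypothetical, and it assumes the admin/admin credentials from the Compose file above.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password):
    """Build (but do not send) a request that registers Prometheus in Grafana."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus:9090",  # Compose service name, as in the URL field
        "access": "proxy",                # Grafana's backend makes the requests
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen, for example urllib.request.urlopen(build_datasource_request("http://localhost:3000", "admin", "admin")).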

Import a Pre-built Dashboard

Grafana has a huge library of community dashboards. Let’s import one for Node Exporter:

  1. Click the + icon → Import
  2. In the Import via grafana.com field, enter dashboard ID 1860 (Node Exporter Full)
  3. Click Load
  4. Select your Prometheus data source from the dropdown
  5. Click Import

You should now see a comprehensive dashboard showing CPU, memory, disk, and network metrics for the host running node-exporter.

Step 3: Writing PromQL Queries

PromQL (Prometheus Query Language) is the heart of working with Prometheus. Let’s cover the essential query patterns.

Basic Queries

Query | Description
----- | -----------
up | Returns 1 if a target is healthy, 0 if down
node_cpu_seconds_total | Total CPU seconds consumed (counter)
node_memory_MemAvailable_bytes | Available memory in bytes (gauge)
rate(node_cpu_seconds_total[5m]) | Per-second rate of CPU usage, averaged over 5 minutes
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) | CPU utilization percentage per instance
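
Since rate() appears in almost every query, it helps to see roughly what it computes. The function below is a simplified sketch, not Prometheus's implementation: it sums the increases between samples, treats any drop as a counter reset, and divides by the elapsed time. The real rate() also extrapolates to the edges of the window, which this version skips.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter from (timestamp, value) samples.

    A drop in value is treated as a restart from zero, which is the idea
    behind rate()'s counter-reset handling. No extrapolation is performed.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts at 0, so the new value is the increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped every 15s; the counter resets between t=15 and t=30.
window = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0), (60, 70.0)]
print(simple_rate(window))
```

Here the increases are 30, 10 (after the reset), 30, and 30, so the result is 100 over 60 seconds.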

Common Aggregation Operators

# Average CPU usage by mode across all instances
avg(rate(node_cpu_seconds_total[5m])) by (mode)

# Top 5 scraped processes by resident memory (one series per instrumented target)
topk(5, process_resident_memory_bytes)

# 95th percentile of request latency, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Total requests per second across all instances
sum(rate(http_requests_total[5m]))
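
The histogram_quantile() query above estimates a percentile from bucket counts rather than from raw observations. The sketch below shows the linear-interpolation idea under simplifying assumptions (cumulative buckets, a lowest bucket starting at 0); it is an approximation for illustration, not the PromQL source.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets [(upper_bound, count), ...].

    Finds the bucket containing the target rank, then interpolates
    linearly between that bucket's lower and upper bounds.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound  # cannot interpolate into +Inf or an empty bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets: le=0.1, 0.5, 1.0, +Inf
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the answer is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries match your real latency distribution.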

Building a Dashboard Panel

Let’s create a custom dashboard panel step by step:

  1. Click + → Dashboard → Add panel
  2. In the query editor, enter: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  3. Set the Title to “CPU Utilization %”
  4. Under Unit, select Percent (0-100)
  5. Click Apply

You’ve just built your first dashboard panel from scratch!
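
The panel you just clicked together is stored as JSON, so it can also be generated in code. Below is a minimal sketch of a payload for Grafana's POST /api/dashboards/db endpoint containing the same CPU panel. The dashboard title and function name are assumptions, only a handful of fields are set, and the panel omits a data source on the assumption that Grafana falls back to the default one.

```python
import json

CPU_QUERY = (
    '100 - (avg by(instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def build_cpu_dashboard(title="Custom Monitoring"):
    """Return a payload for Grafana's POST /api/dashboards/db endpoint."""
    panel = {
        "type": "timeseries",
        "title": "CPU Utilization %",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"refId": "A", "expr": CPU_QUERY}],
        # "percent" corresponds to the Percent (0-100) unit in the UI
        "fieldConfig": {"defaults": {"unit": "percent", "min": 0, "max": 100}},
    }
    return {
        # id and uid are None so Grafana creates a new dashboard
        "dashboard": {"id": None, "uid": None, "title": title, "panels": [panel]},
        "overwrite": False,
    }


print(json.dumps(build_cpu_dashboard(), indent=2))
```

Keeping dashboards as generated JSON makes them easy to review and version, which ties in with the provisioning advice later in this tutorial.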

Step 4: Instrumenting Your Own Application

To get the most out of Prometheus, your applications should expose custom metrics. Here’s a minimal Python example using the prometheus_client library.

Python Application with Prometheus Metrics

# app.py
from prometheus_client import start_http_server, Counter, Histogram, Gauge
import time
import random

# Define metrics
REQUESTS = Counter('app_requests_total', 'Total requests', ['endpoint', 'method'])
REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request latency', ['endpoint'])
IN_FLIGHT = Gauge('app_requests_in_flight', 'Current requests in flight')
ERRORS = Counter('app_errors_total', 'Total errors', ['type'])

def handle_request(endpoint):
    IN_FLIGHT.inc()
    start = time.time()

    try:
        # Simulate some work
        time.sleep(random.uniform(0.1, 0.5))
        REQUESTS.labels(endpoint=endpoint, method='GET').inc()

        # Simulate occasional errors
        if random.random() < 0.05:
            raise TimeoutError("simulated error")
    except TimeoutError:
        # Count the failure and keep serving; an uncaught exception would stop the loop
        ERRORS.labels(type='timeout').inc()
    finally:
        REQUEST_DURATION.labels(endpoint=endpoint).observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == '__main__':
    # Start metrics HTTP server on port 8000
    start_http_server(8000)
    
    while True:
        handle_request('/api/users')
        handle_request('/api/orders')
        time.sleep(1)

# Dockerfile
FROM python:3.11-slim
RUN pip install prometheus_client
COPY app.py /app/app.py
CMD ["python", "/app/app.py"]

When this container runs and prometheus.yml contains a scrape job that targets it on port 8000, Prometheus scrapes http://<container>:8000/metrics and collects:

  • app_requests_total — How many requests each endpoint handled
  • app_request_duration_seconds — Latency distribution (with histogram buckets)
  • app_requests_in_flight — Current concurrency
  • app_errors_total — Error count by type

Best Practice: Use _total suffix for counters, _seconds for durations, and _bytes for memory. Follow these conventions so your metrics are consistent with the ecosystem.
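
A small linter can catch naming slips before they reach production. The checker below is a hypothetical helper, not part of prometheus_client; it encodes only the conventions mentioned above plus the allowed character set for metric names.

```python
import re

# Classic Prometheus metric-name character set
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def check_metric_name(name, metric_type):
    """Return a list of naming problems for a metric (empty list means OK)."""
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append("name contains characters outside [a-zA-Z0-9_:]")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total is used for counters by convention")
    if re.search(r"_(ms|millis|milliseconds|kb|mb)(_|$)", name):
        problems.append("use base units such as _seconds or _bytes")
    return problems


for metric, kind in [
    ("app_requests_total", "counter"),
    ("app_latency_ms", "histogram"),
    ("app-errors", "counter"),
]:
    print(metric, check_metric_name(metric, kind) or "OK")
```

All four metrics defined in app.py pass these checks.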

Step 5: Setting Up Alerting

Monitoring is useless if you don't know when things break. Let's configure Prometheus alerting rules and route them through Alertmanager.

Alerting Rules

Create ~/monitoring-stack/prometheus/alerts.yml:

groups:
  - name: infrastructure
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 10 minutes."

  - name: application
    rules:
      - alert: HighErrorRate
        expr: rate(app_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 0.1/s for 5 minutes."

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "P95 latency is above 1 second for 5 minutes."
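
The for: field in each rule is what keeps brief blips from paging anyone. As a rough model of how it behaves, the sketch below walks through a sequence of rule evaluations and reports the alert state after each one. It is a simplification under stated assumptions (a fixed evaluation interval and a single series), with a hypothetical function name.

```python
def alert_states(condition_by_eval, for_seconds, interval_seconds=15):
    """Yield the alert state after each rule evaluation.

    condition_by_eval: iterable of booleans, one per evaluation. True means
    the rule expression returned a result (for example, `up == 0` matched).
    """
    active_since = None  # evaluation time at which the condition first held
    now = 0
    for condition in condition_by_eval:
        if not condition:
            active_since = None  # any gap resets the pending timer
            yield "inactive"
        else:
            if active_since is None:
                active_since = now
            # The alert fires only once the condition has held for for_seconds.
            yield "firing" if now - active_since >= for_seconds else "pending"
        now += interval_seconds


# Condition holds for three evaluations, clears once, then returns.
print(list(alert_states([True, True, True, False, True], for_seconds=30)))
```

Note how the single False evaluation sends the alert back to pending rather than straight to firing, which is the behavior that filters out flapping targets.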

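Update prometheus.yml to reference the rule file. Relative paths are resolved against the directory that contains prometheus.yml, so also mount the file into the container by adding ./prometheus/alerts.yml:/etc/prometheus/alerts.yml to the volumes of the prometheus service in docker-compose.yml: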

rule_files:
  - "alerts.yml"

Alertmanager Configuration

Create the directory with mkdir -p ~/monitoring-stack/alertmanager, then create ~/monitoring-stack/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
    email_configs:
      - to: 'ops@nova-tech.cloud'
        from: 'alerts@nova-tech.cloud'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@nova-tech.cloud'
        auth_password: 'your-password'
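
The group_by setting in the route decides which alerts are bundled into a single notification. Here is a minimal sketch of that grouping step, using made-up alerts that mirror the rules defined earlier; the function name is hypothetical, and real Alertmanager adds timing (group_wait, group_interval) on top.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts
alerts = [
    {"labels": {"alertname": "InstanceDown", "severity": "critical", "instance": "node-exporter:9100"}},
    {"labels": {"alertname": "InstanceDown", "severity": "critical", "instance": "sample-app:8080"}},
    {"labels": {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-exporter:9100"}},
]

for key, members in group_alerts(alerts, ["alertname", "severity"]).items():
    print(dict(key), "->", len(members), "alert(s)")
```

With the configuration above, the two InstanceDown alerts arrive as one Slack message instead of two.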

Add Alertmanager under the services key of your docker-compose.yml, at the same indentation as the other services:

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    restart: unless-stopped

Then in prometheus.yml, uncomment the alertmanager target:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'

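Apply the changes. Running up -d creates the new Alertmanager container, but Prometheus rereads prometheus.yml and alerts.yml only when it restarts or reloads, so restart it explicitly:

docker-compose up -d
docker-compose restart prometheus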

Step 6: Advanced Grafana Features

Dashboard Variables

Variables make dashboards dynamic. For example, you can add a dropdown to switch between instances:

  1. Go to your dashboard → Settings → Variables → Add variable
  2. Name: instance, Type: Query
  3. Data source: Prometheus, Query: label_values(up, instance)
  4. In your panel queries, use $instance as a filter: up{instance="$instance"}
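
Conceptually, a query variable does two things: it asks Prometheus for the possible label values, then substitutes the selected one into each panel query. The sketch below imitates both steps with Prometheus's label-values API. The helper names are hypothetical and the substitution is deliberately naive; Grafana's own interpolation handles multi-value variables and escaping.

```python
import urllib.parse


def label_values_url(base_url, label, selector):
    """URL for Prometheus's label-values API, restricted to one series selector."""
    query = urllib.parse.urlencode({"match[]": selector})
    return f"{base_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


def apply_variable(query_template, name, value):
    """Substitute a single-value variable such as $instance.

    Naive on purpose: a plain string replace would also touch longer
    variable names that merely start with the same prefix.
    """
    return query_template.replace(f"${name}", value)


print(label_values_url("http://localhost:9090", "instance", "up"))
print(apply_variable('up{instance="$instance"}', "instance", "node-exporter:9100"))
```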

Annotations

Annotations overlay events on your graphs. Link them to deployments:

# Prometheus setup for deployment events (requires an event exporter)
- job_name: 'deployments'
  static_configs:
    - targets: ['deployment-exporter:8080']

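In Grafana dashboard settings, add an annotation query that matches only when the counter changes, such as changes(deployments_total{status="success"}[1m]) > 0, where deployments_total is the metric your event exporter would expose. Each deployment then appears as a vertical line on your graphs, which helps correlate code changes with performance shifts.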

Grafana Alerting (Alternative to Prometheus Alertmanager)

Grafana 8+ has its own alerting engine. To configure:

  1. Go to Alerting → Alert rules → New alert rule
  2. Define a query condition (e.g., max(up) < 1)
  3. Set evaluation interval and pending period
  4. Add contact points (Slack, email, PagerDuty, webhook)
  5. Create a notification policy

Grafana alerts have a cleaner UI than Prometheus Alertmanager, but Prometheus alerts are more portable if you ever switch visualization tools.

Best Practices for Production

Prometheus

  • Storage retention: Set --storage.tsdb.retention.time based on your needs. 30 days is typical for operational data; archive historical data to Thanos or Cortex.
  • High availability: Run two identical Prometheus instances. They don't cluster — each pulls the same data. Use a load balancer in front of them for query resilience.
  • Cardinality management: Avoid labels with unbounded values (user IDs, email addresses, transaction IDs). High cardinality blows up the TSDB and slows queries.
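  • Resource sizing: Memory grows with the number of active series. A common starting estimate is 1 GB of RAM per 1 million time series, but real usage is often higher with many labels or high series churn, so measure rather than assume. Track prometheus_tsdb_head_series and process_resident_memory_bytes for Prometheus itself.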
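
To see why unbounded labels are dangerous, multiply the number of distinct values each label can take: that product is the worst-case series count for a single metric. The numbers below are invented for illustration.

```python
import math


def estimate_series(label_value_counts):
    """Worst-case series count for one metric: the product of label value counts."""
    return math.prod(label_value_counts.values())


# Hypothetical label sets for a request counter
bounded = {"endpoint": 20, "method": 4, "status": 5}
unbounded = {**bounded, "user_id": 100_000}

print(estimate_series(bounded))    # 400
print(estimate_series(unbounded))  # 40000000
```

One extra label turns 400 series into 40 million, and that is before multiplying by the number of instances exporting the metric.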

Grafana

  • Provisioning: Define dashboards and data sources as YAML files instead of clicking through the UI. Store them in Git for version control and CI/CD reproducibility.
  • Authentication: Integrate with OAuth, LDAP, or SAML. Never use the default admin credentials in production.
  • Performance: Query caching (available in Grafana Enterprise and Grafana Cloud) helps with slow queries. In any edition, set GF_DASHBOARDS_MIN_REFRESH_INTERVAL to prevent overly frequent refreshes.
  • Backups: Back up /var/lib/grafana/grafana.db (SQLite database) and your provisioning files.

Sample Provisioning Dashboard (YAML)

Create ~/monitoring-stack/grafana/provisioning/dashboards/system.yml:

apiVersion: 1

providers:
  - name: 'System Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards

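Mount this file into the Grafana container under /etc/grafana/provisioning/dashboards/, and mount your dashboard JSON files at /var/lib/grafana/dashboards (the path configured above). Grafana then loads the dashboards on startup, with no manual import needed.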

Real-World Architecture: Prometheus in Kubernetes

On Kubernetes, Prometheus is typically deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, and node exporters together. It also automatically discovers:

  • kube-state-metrics — Cluster-level metrics (deployments, pods, nodes)
  • cAdvisor — Container-level metrics (CPU, memory, network per container)
  • Kubernetes API server — Control plane health
  • Custom PodMonitors/ServiceMonitors — Your own applications

Installation takes two commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

After installation, you get a full observability stack for your Kubernetes cluster in under 5 minutes.

Conclusion

You've now set up a complete Prometheus and Grafana monitoring stack from scratch. You've learned how to:

  • Deploy Prometheus and configure it to scrape metrics
  • Use Grafana to visualize those metrics on interactive dashboards
  • Write PromQL queries for real-time analysis and troubleshooting
  • Instrument your own applications with custom metrics
  • Configure alerting rules for proactive incident response

This stack scales from a single Docker host to a multi-cluster Kubernetes environment spanning hundreds of microservices. The same principles apply — expose metrics at /metrics, scrape them with Prometheus, visualize with Grafana, and alert when thresholds are breached.

For next steps, consider adding:

  • Loki for log aggregation alongside your metrics
  • Tempo for distributed tracing
  • Thanos for long-term Prometheus storage and global querying
  • Karma for an Alertmanager dashboard, or Pyrra for SLO-based alerting

Remember: observability is a journey, not a destination. Start with the basics — CPU, memory, latency, error rates — and iterate from there. Happy monitoring!
