Setting Up Prometheus and Grafana for Cloud-Native Application Monitoring

Introduction

In today’s cloud-native landscape, observability is not a luxury — it’s a necessity. Microservices architectures, ephemeral containers, and dynamic orchestration platforms like Kubernetes have rendered traditional monitoring approaches obsolete. When your application spans dozens of services that scale up and down in seconds, you need a monitoring stack designed for that complexity.

Enter Prometheus and Grafana — the de facto open-source observability stack. Prometheus handles time-series data collection and alerting, while Grafana provides powerful dashboards and visualization. Together, they give you full visibility into your cloud-native applications.

In this tutorial, you will learn:

What Prometheus and Grafana are and how they complement each other
How to deploy Prometheus to scrape metrics from your applications
How to install and configure Grafana with Prometheus as a data source
How to build meaningful dashboards for cloud-native workloads
How to set up alerting rules and receive notifications
Best practices for production deployments

Prerequisites: Basic knowledge of Linux commands, Docker, and Docker Compose. A server with Docker Engine installed (or a local machine with Docker Desktop), plus curl and jq for the verification commands. About 30 minutes of your time.

Understanding the Stack

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a CNCF (Cloud Native Computing Foundation) graduated project — the same foundation that governs Kubernetes. Prometheus works on a pull model: it scrapes metrics from HTTP endpoints on your services at regular intervals, stores them in a time-series database, and evaluates alerting rules against that data.

Key features:

Multi-dimensional data model — Metrics are identified by a metric name and key/value pairs called labels
PromQL — A powerful query language for slicing and aggregating time-series data
Pull-based architecture — No agents required; just expose an HTTP endpoint
Service discovery — Automatically finds targets in Kubernetes, EC2, Consul, and more
Alertmanager — Built-in alerting with deduplication, grouping, and routing
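To make the data model concrete, here is a minimal sketch of how one sample with labels is written in the text exposition format that a /metrics endpoint returns. The helper names (format_sample, escape_label_value) and the example metric are illustrative assumptions, not part of any Prometheus library.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a single line of the text exposition format."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable; the order carries no meaning.
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


if __name__ == "__main__":
    print(format_sample("http_requests_total", {"method": "GET", "status": "200"}, 1027))
```

Each distinct combination of metric name and label values is its own time series, which is why label choice matters so much later on.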

What is Grafana?

Grafana is an open-source analytics and interactive visualization platform. It connects to virtually any data source — Prometheus, Elasticsearch, InfluxDB, PostgreSQL, AWS CloudWatch, and hundreds more — and lets you build rich dashboards with graphs, tables, heatmaps, and alerts.

Grafana does not store the time-series data itself. It queries Prometheus (or other backends) on the fly and renders the results. This separation of concerns means you can swap visualization tools without touching your data layer.
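As a rough illustration of what "querying on the fly" means, the sketch below builds the kind of request a client can send to the Prometheus HTTP API endpoint /api/v1/query_range. The function name and the example values are assumptions made for illustration; Grafana's own implementation is more involved.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url: str, expr: str, start: float, end: float, step: str) -> str:
    """Build a query_range URL for a PromQL expression over [start, end]."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # start and end are Unix timestamps; step is the resolution of the result.
    print(build_range_query_url("http://localhost:9090", "up", 1700000000, 1700003600, "15s"))
```

Every dashboard refresh repeats queries like this one, so nothing is copied into Grafana's own storage.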

Step 1: Deploying Prometheus

Let’s start by deploying Prometheus using Docker Compose. Create a directory structure for the project:

mkdir -p ~/monitoring-stack/prometheus
mkdir -p ~/monitoring-stack/grafana
cd ~/monitoring-stack

Prometheus Configuration

Create the Prometheus configuration file at ~/monitoring-stack/prometheus/prometheus.yml:

global:
  scrape_interval: 15s      # How often to scrape targets by default
  evaluation_interval: 15s  # How often to evaluate alerting rules
  scrape_timeout: 10s       # Timeout for each scrape request

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Rule files to load
rule_files:
  # - "alerts.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics (we'll add this)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # A sample application (we'll add this)
  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8080']
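The timing values in the global section interact: the scrape timeout must not exceed the scrape interval, and the interval determines how many samples each series produces. The following sketch works through that arithmetic. It only understands single-unit durations such as 15s or 5m, and the helper names are hypothetical.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a simple duration such as '15s' or '5m' to seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def samples_per_day(scrape_interval: str) -> int:
    """Number of samples one time series produces in 24 hours."""
    return 86_400 // parse_duration(scrape_interval)


def timeout_is_valid(scrape_interval: str, scrape_timeout: str) -> bool:
    """Prometheus rejects a scrape timeout longer than the scrape interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


if __name__ == "__main__":
    print(samples_per_day("15s"))          # samples per series per day
    print(timeout_is_valid("15s", "10s"))  # the values used in this tutorial
```

With the 15s interval used here, each series yields 5,760 samples per day.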

Docker Compose Setup

Create ~/monitoring-stack/docker-compose.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      # Mount the whole config directory so rule files next to prometheus.yml are visible in the container
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    command:
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro,rslave
    ports:
      - "9100:9100"
    restart: unless-stopped

  sample-app:
    image: prom/example-app:latest
    container_name: sample-app
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Start the Stack

Launch all services:

docker-compose up -d

Verify that everything is running:

docker-compose ps

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq .

Open your browser to http://localhost:9090/targets — you should see all targets as “UP”.
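If you prefer a scripted check over reading the jq output, the sketch below filters the JSON returned by /api/v1/targets down to the targets that are not healthy. The payload shown is a made-up, trimmed example of the response shape, and unhealthy_targets is a hypothetical helper name.

```python
def unhealthy_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (job, instance, lastError) for every active target that is not up."""
    result = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            result.append(
                (labels.get("job", ""), labels.get("instance", ""), target.get("lastError", ""))
            )
    return result


# Invented sample data, shaped like a trimmed /api/v1/targets response.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "sample-app", "instance": "sample-app:8080"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

if __name__ == "__main__":
    for job, instance, error in unhealthy_targets(SAMPLE_PAYLOAD):
        print(f"{job} {instance}: {error}")
```

In practice the payload would come from the same URL the curl command above requests, parsed with json.loads.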

Tip: Prometheus exposes its own metrics at /metrics. Exporters follow the same convention, and /metrics is also the default metrics_path in a scrape config, so most targets need no extra path configuration.

Step 2: Configuring Grafana

Open http://localhost:3000 in your browser. Log in with username admin and password admin (you’ll be prompted to change it on first login).

Add Prometheus as a Data Source

Click Connections (gear icon) → Data Sources → Add data source
Search for and select Prometheus
In the URL field, enter: http://prometheus:9090
Click Save & Test — you should see a green confirmation: “Data source is working”

The URL uses the Docker Compose service name (prometheus) because Grafana and Prometheus are on the same Docker network. If they were on separate machines, you’d use the actual IP or hostname.
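The same data source can also be registered without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request and does not send it. The function name is hypothetical, and the credentials are the tutorial defaults, which you would replace in any real setup.

```python
import base64
import json
from urllib.request import Request


def datasource_request(grafana_url: str, user: str, password: str, prometheus_url: str) -> Request:
    """Build (but do not send) a request that registers Prometheus as a data source."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,  # resolved from inside the Grafana container
        "access": "proxy",      # Grafana's backend performs the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


if __name__ == "__main__":
    req = datasource_request("http://localhost:3000", "admin", "admin", "http://prometheus:9090")
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would submit it. Note that the Prometheus URL still uses the service name, because Grafana's backend is what connects to it.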

Import a Pre-built Dashboard

Grafana has a huge library of community dashboards. Let’s import one for Node Exporter:

Click the + icon → Import
In the Import via grafana.com field, enter dashboard ID 1860 (Node Exporter Full)
Click Load
Select your Prometheus data source from the dropdown
Click Import

You should now see a comprehensive dashboard showing CPU, memory, disk, and network metrics for the host running node-exporter.

Step 3: Writing PromQL Queries

PromQL (Prometheus Query Language) is the heart of working with Prometheus. Let’s cover the essential query patterns.

Basic Queries

Query	Description
`up`	Returns 1 if the most recent scrape of a target succeeded, 0 if it failed
`node_cpu_seconds_total`	Total CPU seconds consumed (counter)
`node_memory_MemAvailable_bytes`	Available memory in bytes (gauge)
`rate(node_cpu_seconds_total[5m])`	Per-second rate of CPU usage, averaged over 5 minutes
`100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`	CPU utilization percentage per instance
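To build intuition for what rate() does with a counter, here is a simplified sketch. It handles counter resets but skips the extrapolation to the window boundaries that the real PromQL function performs, so treat it as an approximation. Function names and sample values are invented for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter.

    samples holds (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a restart): count from zero.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_utilization_percent(idle_rates: list[float]) -> float:
    """Mirror of 100 - avg(idle rate) * 100 for the CPUs of one instance."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


if __name__ == "__main__":
    window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
    print(simple_rate(window))
    print(cpu_utilization_percent([0.9, 0.8]))
```

Because node_cpu_seconds_total counts seconds, its rate in idle mode is the fraction of each second spent idle, which is why subtracting it from 100% gives utilization.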

Common Aggregation Operators

# Average CPU usage by mode across all instances
avg(rate(node_cpu_seconds_total[5m])) by (mode)

# Top 5 scrape targets by resident memory (reported per instrumented process, not per OS process)
topk(5, process_resident_memory_bytes)

# 95th percentile of request latency, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Total requests per second across all instances
sum(rate(http_requests_total[5m]))
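The histogram_quantile() function estimates a quantile by interpolating linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea for classic cumulative buckets. It assumes at least one finite bucket plus the +Inf bucket, and it is a simplified model rather than the exact Prometheus implementation.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report its upper bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

The result can never be more precise than the bucket boundaries allow, so choose buckets that bracket the latencies you care about.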

Building a Dashboard Panel

Let’s create a custom dashboard panel step by step:

Click + → Dashboard → Add panel
In the query editor, enter: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Set the Title to “CPU Utilization %”
Under Unit, select Percent (0-100)
Click Apply

You’ve just built your first dashboard panel from scratch!

Step 4: Instrumenting Your Own Application

To get the most out of Prometheus, your applications should expose custom metrics. Here’s a minimal Python example using the prometheus_client library.

Python Application with Prometheus Metrics

# app.py
from prometheus_client import start_http_server, Counter, Histogram, Gauge
import time
import random

# Define metrics
REQUESTS = Counter('app_requests_total', 'Total requests', ['endpoint', 'method'])
REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request latency', ['endpoint'])
IN_FLIGHT = Gauge('app_requests_in_flight', 'Current requests in flight')
ERRORS = Counter('app_errors_total', 'Total errors', ['type'])

def handle_request(endpoint):
    IN_FLIGHT.inc()
    start = time.time()

    try:
        # Simulate some work
        time.sleep(random.uniform(0.1, 0.5))
        REQUESTS.labels(endpoint=endpoint, method='GET').inc()

        # Simulate occasional errors
        if random.random() < 0.05:
            raise TimeoutError("simulated error")
    except TimeoutError:
        # Count the failure here so it does not propagate and stop the main loop
        ERRORS.labels(type='timeout').inc()
    finally:
        # Record latency and release the in-flight slot on success and failure alike
        REQUEST_DURATION.labels(endpoint=endpoint).observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == '__main__':
    # Start metrics HTTP server on port 8000
    start_http_server(8000)
    
    while True:
        handle_request('/api/users')
        handle_request('/api/orders')
        time.sleep(1)

# Dockerfile
FROM python:3.11-slim
RUN pip install prometheus_client
COPY app.py /app/app.py
CMD ["python", "/app/app.py"]

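When this container runs and is listed as a scrape target (for example, a sample-app:8000 entry under scrape_configs, since the script serves metrics on port 8000), Prometheus scrapes http://<container>:8000/metrics and collects: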

app_requests_total — How many requests each endpoint handled
app_request_duration_seconds — Latency distribution (with histogram buckets)
app_requests_in_flight — Current concurrency
app_errors_total — Error count by type
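The "histogram buckets" mentioned above are cumulative: each bucket counts every observation less than or equal to its upper bound. The standard-library sketch below models that behaviour. It is a teaching aid with invented names, not how prometheus_client is implemented internally.

```python
import math


class SketchHistogram:
    """Toy cumulative histogram that mimics the exposed _bucket/_sum/_count series."""

    def __init__(self, bounds: list[float]):
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # cumulative count per upper bound
        self.total = 0.0                      # sum of all observed values

    def observe(self, value: float) -> None:
        self.total += value
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1

    def exposition(self, name: str) -> list[str]:
        lines = []
        for bound, count in zip(self.bounds, self.counts):
            le = "+Inf" if math.isinf(bound) else str(bound)
            lines.append(f'{name}_bucket{{le="{le}"}} {count}')
        lines.append(f"{name}_sum {self.total}")
        lines.append(f"{name}_count {self.counts[-1]}")
        return lines


if __name__ == "__main__":
    hist = SketchHistogram([0.1, 0.5, 1.0])
    for latency in (0.05, 0.3, 2.0):
        hist.observe(latency)
    print("\n".join(hist.exposition("app_request_duration_seconds")))
```

One histogram therefore produces one series per bucket plus the _sum and _count series, for every label combination.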

Best Practice: Use _total suffix for counters, _seconds for durations, and _bytes for memory. Follow these conventions so your metrics are consistent with the ecosystem.
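Conventions like these are easy to check automatically. The sketch below is a small lint helper built on the classic metric name pattern [a-zA-Z_:][a-zA-Z0-9_:]*. The function name and the specific checks are assumptions chosen for illustration and cover only a few of the naming recommendations.

```python
import re

# Classic Prometheus metric name pattern.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def naming_problems(name: str, kind: str) -> list[str]:
    """Return a list of convention violations for a metric of the given kind."""
    problems = []
    if not METRIC_NAME.fullmatch(name):
        problems.append("name contains characters outside [a-zA-Z0-9_:] or starts with a digit")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    if kind != "counter" and name.endswith("_total"):
        problems.append("_total suffix is reserved for counters")
    if "milliseconds" in name or name.endswith("_ms"):
        problems.append("use base units: seconds rather than milliseconds")
    return problems


if __name__ == "__main__":
    for metric, kind in [("app_requests_total", "counter"), ("app_latency_ms", "histogram")]:
        print(metric, naming_problems(metric, kind))
```

A check like this could run in CI against the metric names your services register.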

Step 5: Setting Up Alerting

Monitoring is useless if you don't know when things break. Let's configure Prometheus alerting rules and route them through Alertmanager.

Alerting Rules

Create ~/monitoring-stack/prometheus/alerts.yml:

groups:
  - name: infrastructure
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 10 minutes."

  - name: application
    rules:
      - alert: HighErrorRate
        expr: rate(app_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 0.1/s for 5 minutes."

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "P95 latency is above 1 second for 5 minutes."
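The for field in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is pending. The sketch below simulates that state machine for a sequence of evaluation results. It is a simplified model with hypothetical names.

```python
def alert_states(condition_by_eval: list[bool], eval_interval: int, for_seconds: int) -> list[str]:
    """Return the alert state after each rule evaluation.

    condition_by_eval[i] says whether the expression matched at evaluation i.
    """
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not is_true:
            active_since = None  # any gap resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # 15s evaluation interval, for: 30s
    print(alert_states([True, True, False, True, True, True], 15, 30))
```

A longer for value trades detection speed for fewer false alarms from short blips.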

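Update prometheus.yml to reference the rule file. Prometheus resolves the path relative to prometheus.yml, so alerts.yml must also be available inside the container at /etc/prometheus/alerts.yml (mount it in docker-compose.yml alongside prometheus.yml):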

rule_files:
  - "alerts.yml"

Alertmanager Configuration

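Create the directory ~/monitoring-stack/alertmanager (mkdir -p ~/monitoring-stack/alertmanager), then create ~/monitoring-stack/alertmanager/alertmanager.yml: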

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
    email_configs:
      - to: 'ops@nova-tech.cloud'
        from: 'alerts@nova-tech.cloud'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@nova-tech.cloud'
        auth_password: 'your-password'
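The group_by setting decides which alerts are bundled into a single notification. As a rough model, alerts that share the same values for the listed labels land in the same group. The sketch below shows that bucketing with invented alert data and a hypothetical function name.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict[str, str]], group_by: list[str]) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    firing = [
        {"alertname": "InstanceDown", "severity": "critical", "instance": "node-exporter:9100"},
        {"alertname": "InstanceDown", "severity": "critical", "instance": "sample-app:8080"},
        {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-exporter:9100"},
    ]
    for key, members in group_alerts(firing, ["alertname", "severity"]).items():
        print(key, len(members))
```

With the configuration above, two instances going down at once produce one Slack message rather than two, sent after the group_wait delay.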

Add Alertmanager under the services: key in your docker-compose.yml, indented to match the other services:

alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  volumes:
    - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
  ports:
    - "9093:9093"
  restart: unless-stopped

Then in prometheus.yml, uncomment the alertmanager target:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'

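Apply the changes. Running docker-compose up -d creates the new Alertmanager container, but it does not restart a container when only the contents of a mounted file changed, so restart Prometheus explicitly to reload its configuration and rules:

docker-compose up -d
docker-compose restart prometheus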

Step 6: Advanced Grafana Features

Dashboard Variables

Variables make dashboards dynamic. For example, a dropdown to switch between instances:

Go to your dashboard → Settings → Variables → Add variable
Name: instance, Type: Query
Data source: Prometheus, Query: label_values(up, instance)
In your panel queries, use $instance as a filter: up{instance="$instance"}

Annotations

Annotations overlay events on your graphs. Link them to deployments:

# Prometheus setup for deployment events (requires an event exporter)
- job_name: 'deployments'
  static_configs:
    - targets: ['deployment-exporter:8080']

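In Grafana dashboard settings, add an annotation query such as changes(deployments_total{status="success"}[1m]) > 0. A bare counter like deployments_total{status="success"} returns a value at every step and would annotate the whole graph, whereas changes() only matches when the counter moved. Each deployment then appears as a vertical line on your graphs, which makes it easy to correlate code changes with performance shifts.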

Grafana Alerting (Alternative to Prometheus Alertmanager)

Grafana 8+ has its own alerting engine. To configure:

Go to Alerting → Alert rules → New alert rule
Define a query condition (e.g., min(up) < 1, which is true when at least one target is down)
Set evaluation interval and pending period
Add contact points (Slack, email, PagerDuty, webhook)
Create a notification policy

Grafana alerts have a cleaner UI than Prometheus Alertmanager, but Prometheus alerts are more portable if you ever switch visualization tools.

Best Practices for Production

Prometheus

Storage retention: Set --storage.tsdb.retention.time based on your needs. 30 days is typical for operational data; archive historical data to Thanos or Cortex.
High availability: Run two identical Prometheus instances. They don't cluster — each pulls the same data. Use a load balancer in front of them for query resilience.
Cardinality management: Avoid labels with unbounded values (user IDs, email addresses, transaction IDs). High cardinality blows up the TSDB and slows queries.
Resource sizing: Rule of thumb — 1 GB RAM per 1 million time series. Monitor Prometheus itself with prometheus_tsdb_head_series.
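To see why unbounded labels are dangerous, multiply the number of possible values per label. The sketch below computes that worst-case series count for one metric and applies the 1 GB per million series rule of thumb quoted above. The label counts are invented, and real usage is lower when not every combination occurs.

```python
import math


def series_count(label_cardinalities: dict[str, int], targets: int = 1) -> int:
    """Upper bound on time series for one metric across all scraped targets."""
    return targets * math.prod(label_cardinalities.values())


def estimated_ram_gb(series: int, gb_per_million: float = 1.0) -> float:
    """Apply the rough rule of thumb: gb_per_million GB of RAM per million series."""
    return series / 1_000_000 * gb_per_million


if __name__ == "__main__":
    bounded = series_count({"endpoint": 20, "method": 4}, targets=10)
    unbounded = series_count({"endpoint": 20, "method": 4, "user_id": 100_000}, targets=10)
    print(bounded, estimated_ram_gb(bounded))
    print(unbounded, estimated_ram_gb(unbounded))
```

Adding a single user_id label turns 800 series into 80 million in this example, which is far beyond what the sizing guideline allows for one metric.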

Grafana

Provisioning: Define dashboards and data sources as YAML files instead of clicking through the UI. Store them in Git for version control and CI/CD reproducibility.
Authentication: Integrate with OAuth, LDAP, or SAML. Never use the default admin credentials in production.
Performance: Use dashboard caching for slow queries. Set GF_DASHBOARDS_MIN_REFRESH_INTERVAL to prevent overly frequent refreshes.
Backups: Back up /var/lib/grafana/grafana.db (SQLite database) and your provisioning files.

Sample Provisioning Dashboard (YAML)

Create ~/monitoring-stack/grafana/provisioning/dashboards/system.yml:

apiVersion: 1

providers:
  - name: 'System Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards

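Mount the provisioning directory into the Grafana container at /etc/grafana/provisioning (for example, add ./grafana/provisioning:/etc/grafana/provisioning to the grafana service's volumes) and place your dashboard JSON files in the directory named by path. Grafana then loads the dashboards on startup, with no manual import needed.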
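The provider only points Grafana at a directory; the dashboards themselves are JSON files. As a hedged sketch, the script below generates a minimal dashboard with the CPU panel from Step 3. The field selection is an assumption based on the general dashboard JSON model, and the uid, title, and file name are invented. Exporting a dashboard from the UI is the most reliable way to see the full schema for your Grafana version.

```python
import json

CPU_EXPR = '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'


def cpu_dashboard(uid: str = "cpu-overview") -> dict:
    """Return a minimal dashboard definition with a single time series panel."""
    return {
        "uid": uid,
        "title": "CPU Overview",
        "panels": [
            {
                "id": 1,
                "type": "timeseries",
                "title": "CPU Utilization %",
                "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
                "fieldConfig": {"defaults": {"unit": "percent"}},
                "targets": [{"refId": "A", "expr": CPU_EXPR}],
            }
        ],
    }


def write_dashboard(path: str, dashboard: dict) -> None:
    """Write the dashboard as JSON so the file provider can pick it up."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dashboard, handle, indent=2)


if __name__ == "__main__":
    print(json.dumps(cpu_dashboard(), indent=2))
```

Calling write_dashboard with a path inside the provisioned dashboards directory, such as a hypothetical cpu-overview.json, keeps the dashboard in Git next to the rest of the stack.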

Real-World Architecture: Prometheus in Kubernetes

On Kubernetes, Prometheus is typically deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, and node exporters together. It also automatically discovers:

kube-state-metrics — Cluster-level metrics (deployments, pods, nodes)
cAdvisor — Container-level metrics (CPU, memory, network per container)
Kubernetes API server — Control plane health
Custom PodMonitors/ServiceMonitors — Your own applications

Installation takes two commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

After installation, you have a full observability stack for your Kubernetes cluster, typically within a few minutes.

Conclusion

You've now set up a complete Prometheus and Grafana monitoring stack from scratch. You've learned how to:

Deploy Prometheus and configure it to scrape metrics
Use Grafana to visualize those metrics on interactive dashboards
Write PromQL queries for real-time analysis and troubleshooting
Instrument your own applications with custom metrics
Configure alerting rules for proactive incident response

This stack scales from a single Docker host to a multi-cluster Kubernetes environment spanning hundreds of microservices. The same principles apply — expose metrics at /metrics, scrape them with Prometheus, visualize with Grafana, and alert when thresholds are breached.

For next steps, consider adding:

Loki for log aggregation alongside your metrics
Tempo for distributed tracing
Thanos for long-term Prometheus storage and global querying
Karma for an Alertmanager dashboard, or Pyrra for SLO-based alerting

Remember: observability is a journey, not a destination. Start with the basics — CPU, memory, latency, error rates — and iterate from there. Happy monitoring!