Introduction
In today’s cloud-native landscape, observability is a necessity, not a luxury. Microservices architectures, ephemeral containers, and dynamic orchestration platforms like Kubernetes strain traditional monitoring approaches that were built around long-lived hosts with fixed addresses. When your application spans dozens of services that scale up and down in seconds, you need a monitoring stack designed for that complexity.
Enter Prometheus and Grafana — the de facto open-source observability stack. Prometheus handles time-series data collection and alerting, while Grafana provides powerful dashboards and visualization. Together, they give you full visibility into your cloud-native applications.
In this tutorial, you will learn:
- What Prometheus and Grafana are and how they complement each other
- How to deploy Prometheus to scrape metrics from your applications
- How to install and configure Grafana with Prometheus as a data source
- How to build meaningful dashboards for cloud-native workloads
- How to set up alerting rules and receive notifications
- Best practices for production deployments
Prerequisites: Basic knowledge of Linux commands, Docker, and Docker Compose. A server with Docker Engine installed (or a local machine with Docker Desktop). About 30 minutes of your time.
Understanding the Stack
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a CNCF (Cloud Native Computing Foundation) graduated project — the same foundation that governs Kubernetes. Prometheus works on a pull model: it scrapes metrics from HTTP endpoints on your services at regular intervals, stores them in a time-series database, and evaluates alerting rules against that data.
Key features:
- Multi-dimensional data model — Metrics are identified by a metric name and key/value pairs called labels
- PromQL — A powerful query language for slicing and aggregating time-series data
- Pull-based architecture — No agents required; just expose an HTTP endpoint
- Service discovery — Automatically finds targets in Kubernetes, EC2, Consul, and more
- Alertmanager (a separate companion service): receives alerts fired by Prometheus and handles deduplication, grouping, and routing
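To make the data model concrete, here is a minimal Python sketch that parses a few lines of the text format served at a /metrics endpoint into a metric name, a label set, and a value. The sample metrics and label values are illustrative assumptions, and the regular expressions cover only the simple cases shown (no escaped quotes, timestamps, or exemplars), so treat it as a reading aid rather than a real parser.

```python
import re

# Illustrative sample of the text exposition format (values are made up).
SAMPLE = '''\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
process_resident_memory_bytes 2.5e+07
'''

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and label values is its own time series, which is why the two http_requests_total lines are stored separately.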
What is Grafana?
Grafana is an open-source analytics and interactive visualization platform. It connects to virtually any data source — Prometheus, Elasticsearch, InfluxDB, PostgreSQL, AWS CloudWatch, and hundreds more — and lets you build rich dashboards with graphs, tables, heatmaps, and alerts.
Grafana does not store the time-series data itself. It queries Prometheus (or other backends) on the fly and renders the results. This separation of concerns means you can swap visualization tools without touching your data layer.
Step 1: Deploying Prometheus
Let’s start by deploying Prometheus using Docker Compose. Create a directory structure for the project:
mkdir -p ~/monitoring-stack/prometheus
mkdir -p ~/monitoring-stack/grafana
cd ~/monitoring-stack
Prometheus Configuration
Create the Prometheus configuration file at ~/monitoring-stack/prometheus/prometheus.yml:
global:
  scrape_interval: 15s      # How often to scrape targets by default
  evaluation_interval: 15s  # How often to evaluate alerting rules
  scrape_timeout: 10s       # Timeout for each scrape request
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Rule files to load
rule_files:
  # - "alerts.yml"
# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node exporter for system metrics (we'll add this)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  # A sample application (we'll add this)
  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8080']
Docker Compose Setup
Create ~/monitoring-stack/docker-compose.yml:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    command:
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro,rslave
    ports:
      - "9100:9100"
    restart: unless-stopped
  sample-app:
    image: prom/example-app:latest
    container_name: sample-app
    ports:
      - "8080:8080"
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
Start the Stack
Launch all services:
docker-compose up -d
Verify that everything is running:
docker-compose ps
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq .
Open your browser to http://localhost:9090/targets — you should see all targets as “UP”.
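The same health check can be scripted. The sketch below assumes the stack from this tutorial is reachable at http://localhost:9090 and reads only the fields of the /api/v1/targets response that it needs. The example payload is hand-written in the same shape as the API response so that the summarizing logic can be read without a running server.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets payload from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Map 'job/instance' to the health string reported by Prometheus."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        key = f"{labels['job']}/{labels['instance']}"
        summary[key] = target["health"]  # "up", "down", or "unknown"
    return summary


def unhealthy(summary):
    """Return the sorted keys of targets that are not reporting 'up'."""
    return sorted(key for key, health in summary.items() if health != "up")


# Hand-written example payload; a real response carries more fields per target.
example = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"}, "health": "up"},
            {"labels": {"job": "node", "instance": "node-exporter:9100"}, "health": "up"},
            {"labels": {"job": "sample-app", "instance": "sample-app:8080"}, "health": "down"},
        ]
    },
}
print(unhealthy(summarize_targets(example)))
```

Against the live stack, you would call unhealthy(summarize_targets(fetch_targets())) and expect an empty list once every target is up.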
Tip: Prometheus exposes its own metrics at /metrics. Every Prometheus exporter follows this convention, which makes auto-discovery easy.
Step 2: Configuring Grafana
Open http://localhost:3000 in your browser. Log in with username admin and password admin (you’ll be prompted to change it on first login).
Add Prometheus as a Data Source
- Click Connections (gear icon) → Data Sources → Add data source
- Search for and select Prometheus
- In the URL field, enter: http://prometheus:9090
- Click Save & Test. You should see a green confirmation: “Data source is working”
The URL uses the Docker Compose service name (prometheus) because Grafana and Prometheus are on the same Docker network. If they were on separate machines, you’d use the actual IP or hostname.
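If you prefer to script this step, Grafana also accepts data sources over its HTTP API. The sketch below builds, without sending, a POST request to /api/datasources using the tutorial's default admin credentials. The field values mirror the UI steps above; the data source name "Prometheus" and the choice of basic authentication are assumptions for a local test stack.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password):
    """Build (but do not send) a POST request that registers Prometheus."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus:9090",  # Compose service name, as in the UI steps
        "access": "proxy",                # Grafana's backend performs the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_datasource_request("http://localhost:3000", "admin", "admin")
print(request.get_method(), request.full_url)
# To send it against a running Grafana: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example inspectable offline; in a real setup you would likely use a service account token instead of the admin password.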
Import a Pre-built Dashboard
Grafana has a huge library of community dashboards. Let’s import one for Node Exporter:
- Click the + icon → Import
- In the Import via grafana.com field, enter dashboard ID 1860 (Node Exporter Full)
- Click Load
- Select your Prometheus data source from the dropdown
- Click Import
You should now see a comprehensive dashboard showing CPU, memory, disk, and network metrics for the host running node-exporter.
Step 3: Writing PromQL Queries
PromQL (Prometheus Query Language) is the heart of working with Prometheus. Let’s cover the essential query patterns.
Basic Queries
| Query | Description |
|---|---|
| up | Returns 1 if a target is healthy, 0 if down |
| node_cpu_seconds_total | Total CPU seconds consumed (counter) |
| node_memory_MemAvailable_bytes | Available memory in bytes (gauge) |
| rate(node_cpu_seconds_total[5m]) | Per-second rate of CPU usage, averaged over 5 minutes |
| 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) | CPU utilization percentage per instance |
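It can help to see what rate() does to a counter before relying on it. The following Python sketch is a simplified model, not Prometheus's implementation: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The real function additionally extrapolates to the edges of the window, which this sketch leaves out. The sample values are made up.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    A drop in value is treated as a counter reset (process restart).
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Idle CPU seconds for one core, scraped every 15 s (made-up numbers).
idle = [(0, 1000.0), (15, 1012.0), (30, 1024.0), (45, 1036.0), (60, 1048.0)]
idle_rate = simple_rate(idle)  # idle seconds per second
print(f"idle fraction: {idle_rate:.2f}")
print(f"CPU utilization: {100 - idle_rate * 100:.0f}%")
```

With 48 idle seconds accumulated over 60 seconds, the idle fraction is 0.80, so the last query in the table would report 20% utilization for this core.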
Common Aggregation Operators
# Average CPU usage by mode across all instances
avg(rate(node_cpu_seconds_total[5m])) by (mode)
# Top 5 memory-consuming processes
topk(5, process_resident_memory_bytes)
# 95th percentile of request latency across all instances
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Total requests per second across all instances
sum(rate(http_requests_total[5m]))
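Of these operators, histogram_quantile is the least intuitive, because it estimates a percentile from bucket counts rather than from raw observations. The sketch below is a simplified Python model of that estimate for classic cumulative buckets, assuming linear interpolation inside the bucket that contains the requested rank. The bucket boundaries and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    with math.inf as the last upper bound (the "+Inf" bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# 100 requests: 10 took <= 0.1 s, 60 took <= 0.5 s, 90 took <= 1 s.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(f"p50 is about {histogram_quantile(0.5, buckets):.2f} s")
print(f"p95 is about {histogram_quantile(0.95, buckets):.2f} s")
```

The p95 result is capped at 1.00 s because the 95th request lands in the +Inf bucket, which shows why bucket boundaries should extend past the latencies you want to alert on.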
Building a Dashboard Panel
Let’s create a custom dashboard panel step by step:
- Click + → Dashboard → Add panel
- In the query editor, enter: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Set the Title to “CPU Utilization %”
- Under Unit, select Percent (0-100)
- Click Apply
You’ve just built your first dashboard panel from scratch!
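A panel is essentially a saved query that Grafana sends to Prometheus on each refresh. As a rough illustration, the Python sketch below builds the instant-query URL for the same expression and converts a response into per-instance numbers. It assumes Prometheus is reachable at http://localhost:9090; the example response is hand-written in the API's format, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CPU_QUERY = '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'


def build_query_url(expr, base_url="http://localhost:9090"):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(expr, base_url="http://localhost:9090"):
    """Send the query to a running Prometheus and return the decoded JSON."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        return json.load(resp)


def values_by_instance(response):
    """Turn an instant-vector response into {instance: float}."""
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in response["data"]["result"]
    }


# Hand-written response: each value is a [timestamp, "string"] pair.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "node-exporter:9100"}, "value": [1700000000, "12.5"]}
        ],
    },
}
print(build_query_url("up"))
print(values_by_instance(example))
```

With the stack running, values_by_instance(run_query(CPU_QUERY)) should return the same numbers the panel plots at its latest point.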
Step 4: Instrumenting Your Own Application
To get the most out of Prometheus, your applications should expose custom metrics. Here’s a minimal Python example using the prometheus_client library.
Python Application with Prometheus Metrics
# app.py
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define metrics
REQUESTS = Counter('app_requests_total', 'Total requests', ['endpoint', 'method'])
REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request latency', ['endpoint'])
IN_FLIGHT = Gauge('app_requests_in_flight', 'Current requests in flight')
ERRORS = Counter('app_errors_total', 'Total errors', ['type'])


def handle_request(endpoint):
    IN_FLIGHT.inc()
    start = time.time()
    try:
        # Simulate some work
        time.sleep(random.uniform(0.1, 0.5))
        REQUESTS.labels(endpoint=endpoint, method='GET').inc()
        # Simulate occasional errors
        if random.random() < 0.05:
            ERRORS.labels(type='timeout').inc()
            raise RuntimeError("simulated error")
    finally:
        # Record latency and concurrency even when the request fails
        REQUEST_DURATION.labels(endpoint=endpoint).observe(time.time() - start)
        IN_FLIGHT.dec()


if __name__ == '__main__':
    # Start metrics HTTP server on port 8000
    start_http_server(8000)
    while True:
        for endpoint in ('/api/users', '/api/orders'):
            try:
                handle_request(endpoint)
            except RuntimeError:
                # Already counted in app_errors_total; keep the loop running
                pass
        time.sleep(1)
# Dockerfile
FROM python:3.11-slim
RUN pip install prometheus_client
COPY app.py /app/app.py
CMD ["python", "/app/app.py"]
Once this container runs and Prometheus has a scrape job that targets it (for example, a target of <container>:8000), Prometheus scrapes http://<container>:8000/metrics and collects:
- app_requests_total: How many requests each endpoint handled
- app_request_duration_seconds: Latency distribution (with histogram buckets)
- app_requests_in_flight: Current concurrency
- app_errors_total: Error count by type
Best Practice: Use the _total suffix for counters, _seconds for durations, and _bytes for memory. Follow these conventions so your metrics are consistent with the ecosystem.
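These conventions are easy to check mechanically. The sketch below is a hypothetical mini-linter, not part of prometheus_client: it applies the suffix rules from the note above plus the classic metric-name pattern. The metric app_latency_ms is an invented bad example.

```python
import re

# Classic metric-name pattern: letters, digits, underscores, and colons.
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def lint_metric(name, kind):
    """Return a list of naming problems for a metric of the given kind."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("name does not match the classic metric-name pattern")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    if kind != "counter" and name.endswith("_total"):
        problems.append("_total suffix is reserved for counters")
    if re.search(r"_(ms|millis|milliseconds|kb|mb)(_|$)", name):
        problems.append("use base units such as _seconds or _bytes")
    return problems


METRICS = {
    "app_requests_total": "counter",
    "app_request_duration_seconds": "histogram",
    "app_requests_in_flight": "gauge",
    "app_errors_total": "counter",
    "app_latency_ms": "histogram",  # hypothetical bad example
}
for name, kind in METRICS.items():
    print(name, lint_metric(name, kind) or "ok")
```

A check like this fits naturally into a unit test, so a misnamed metric is caught before it reaches a dashboard.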
Step 5: Setting Up Alerting
Monitoring is useless if you don't know when things break. Let's configure Prometheus alerting rules and route them through Alertmanager.
Alerting Rules
Create ~/monitoring-stack/prometheus/alerts.yml:
groups:
  - name: infrastructure
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 10 minutes."
  - name: application
    rules:
      - alert: HighErrorRate
        expr: rate(app_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 0.1/s for 5 minutes."
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "P95 latency is above 1 second for 5 minutes."
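The for clause in each rule is what keeps brief blips from paging anyone: an alert stays pending until its expression has been true for the whole duration, and only then fires. The Python sketch below models that state machine under simplifying assumptions: one series, one boolean per evaluation, and a 60-second evaluation interval chosen for readability (the tutorial's configuration uses 15s).

```python
def alert_states(condition_by_tick, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_tick: booleans, True when the alert expression returns a result.
    interval_s: evaluation interval in seconds; for_s: the rule's `for` duration.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for tick, condition in enumerate(condition_by_tick):
        now = tick * interval_s
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# `up == 0` evaluated every 60 s with `for: 5m`: a brief blip, then a real outage.
history = [False, True, False] + [True] * 7
for minute, state in enumerate(alert_states(history, interval_s=60, for_s=300)):
    print(f"t={minute}m {state}")
```

The one-minute blip never leaves the pending state, while the sustained outage starting at minute 3 begins firing at minute 8, five minutes after it was first observed.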
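Mount the rule file into the Prometheus container by adding - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml under the prometheus service's volumes in docker-compose.yml (the Compose file above mounts only prometheus.yml). Then update prometheus.yml to reference the rule file; relative paths are resolved against the directory that contains prometheus.yml: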
rule_files:
- "alerts.yml"
Alertmanager Configuration
Create the directory with mkdir -p ~/monitoring-stack/alertmanager, then create ~/monitoring-stack/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
    email_configs:
      - to: 'ops@nova-tech.cloud'
        from: 'alerts@nova-tech.cloud'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@nova-tech.cloud'
        auth_password: 'your-password'
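The group_by setting in the route decides which alerts are bundled into a single notification. As a rough model of that behavior (the timing controlled by group_wait and group_interval is left out), the Python sketch below groups alert label sets by the same two labels. The firing alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts firing at the same time.
firing = [
    {"alertname": "InstanceDown", "severity": "critical", "instance": "node-exporter:9100"},
    {"alertname": "InstanceDown", "severity": "critical", "instance": "sample-app:8080"},
    {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-exporter:9100"},
]
for key, members in group_alerts(firing, ["alertname", "severity"]).items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Both InstanceDown alerts share one group, so an outage that takes down many instances would produce one Slack message rather than one per instance.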
Add Alertmanager to your docker-compose.yml:
  # Goes under the existing services: key
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    restart: unless-stopped
Then in prometheus.yml, uncomment the alertmanager target:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
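Apply the changes. Running docker-compose up -d creates the new Alertmanager container and recreates any service whose Compose definition changed; restart Prometheus explicitly so that it also rereads the edited prometheus.yml:
docker-compose up -d
docker-compose restart prometheus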
Step 6: Advanced Grafana Features
Dashboard Variables
Variables make dashboards dynamic. For example, a dropdown to switch between instances:
- Go to your dashboard → Settings → Variables → Add variable
- Name: instance, Type: Query
- Data source: Prometheus, Query: label_values(up, instance)
- In your panel queries, use $instance as a filter: up{instance="$instance"}
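Conceptually, a query variable does two things: it collects the distinct values of a label to fill the dropdown, and it substitutes the selected value into each panel query. The Python sketch below imitates both steps for a single-value variable. It is a simplified model, not Grafana's implementation, and the label sets are what the up metric might report for this tutorial's stack.

```python
import re


def label_values(series, label):
    """Collect the sorted distinct values of one label across series."""
    return sorted({labels[label] for labels in series if label in labels})


def interpolate(query, variables):
    """Replace $name placeholders with the selected variable values."""
    return re.sub(r"\$(\w+)", lambda m: variables.get(m.group(1), m.group(0)), query)


# Label sets as the `up` metric might report them (assumed for illustration).
up_series = [
    {"job": "prometheus", "instance": "localhost:9090"},
    {"job": "node", "instance": "node-exporter:9100"},
    {"job": "sample-app", "instance": "sample-app:8080"},
]
options = label_values(up_series, "instance")  # what the dropdown would list
print(options)
print(interpolate('up{instance="$instance"}', {"instance": options[0]}))
```

Unknown placeholders are left untouched, which mirrors how a query with a misspelled variable name reaches Prometheus unchanged and returns nothing.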
Annotations
Annotations overlay events on your graphs. Link them to deployments:
  # Prometheus setup for deployment events (requires an event exporter)
  - job_name: 'deployments'
    static_configs:
      - targets: ['deployment-exporter:8080']
In Grafana dashboard settings, add an annotation query: deployments_total{status="success"}. Now every deployment appears as a vertical line on your graphs — perfect for correlating code changes with performance shifts.
Grafana Alerting (Alternative to Prometheus Alertmanager)
Grafana 8+ has its own alerting engine. To configure:
- Go to Alerting → Alert rules → New alert rule
- Define a query condition (e.g., max(up) < 1)
- Set evaluation interval and pending period
- Add contact points (Slack, email, PagerDuty, webhook)
- Create a notification policy
Grafana alerts have a cleaner UI than Prometheus Alertmanager, but Prometheus alerts are more portable if you ever switch visualization tools.
Best Practices for Production
Prometheus
- Storage retention: Set --storage.tsdb.retention.time based on your needs. 30 days is typical for operational data; archive historical data to Thanos or Cortex.
- High availability: Run two identical Prometheus instances. They don't cluster; each pulls the same data. Use a load balancer in front of them for query resilience.
- Cardinality management: Avoid labels with unbounded values (user IDs, email addresses, transaction IDs). High cardinality blows up the TSDB and slows queries.
- Resource sizing: As a rule of thumb, budget 1 GB RAM per 1 million time series. Monitor Prometheus itself with prometheus_tsdb_head_series.
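The cardinality advice is easier to appreciate with numbers. In the worst case, the series count for one metric is the product of the distinct values of each label. The Python sketch below compares a bounded label set with one that adds a user ID label; all counts are hypothetical, and the memory figure simply applies the rule of thumb quoted above rather than a measurement.

```python
import math


def series_count(label_values):
    """Worst-case number of series: product of distinct values per label."""
    return math.prod(label_values.values())


# Hypothetical distinct-value counts per label for app_requests_total.
bounded = {"endpoint": 20, "method": 4, "instance": 10}
unbounded = {**bounded, "user_id": 50_000}  # the kind of label to avoid

for name, labels in (("bounded", bounded), ("with user_id", unbounded)):
    count = series_count(labels)
    # 1 GB per million series is the rule of thumb from the list above.
    print(f"{name}: {count:,} series, roughly {count / 1_000_000:.2f} GB RAM")
```

One extra label turns 800 series into 40 million, which is why identifiers like user IDs belong in logs or traces rather than in metric labels.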
Grafana
- Provisioning: Define dashboards and data sources as YAML files instead of clicking through the UI. Store them in Git for version control and CI/CD reproducibility.
- Authentication: Integrate with OAuth, LDAP, or SAML. Never use the default admin credentials in production.
- Performance: Use dashboard caching for slow queries. Set GF_DASHBOARDS_MIN_REFRESH_INTERVAL to prevent overly frequent refreshes.
- Backups: Back up /var/lib/grafana/grafana.db (SQLite database) and your provisioning files.
Sample Provisioning Dashboard (YAML)
Create ~/monitoring-stack/grafana/provisioning/dashboards/system.yml:
apiVersion: 1
providers:
  - name: 'System Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
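Mount this file into the Grafana container under /etc/grafana/provisioning/dashboards/, and mount a directory of dashboard JSON files at /var/lib/grafana/dashboards (the path set above), using volumes entries in docker-compose.yml. Grafana then loads the dashboards on startup, with no manual import needed.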
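The provider only points Grafana at a directory; the dashboards themselves are JSON files in that directory. The Python sketch below generates one such file with a single panel that reuses the CPU query from Step 3. The uid, file name, and the small subset of dashboard fields are assumptions for illustration; dashboards exported from the Grafana UI contain many more fields.

```python
import json
import tempfile
from pathlib import Path


def cpu_dashboard():
    """Return a minimal dashboard model with one time-series panel."""
    return {
        "uid": "cpu-overview",  # hypothetical identifier
        "title": "CPU Overview",
        "refresh": "30s",
        "panels": [
            {
                "id": 1,
                "type": "timeseries",
                "title": "CPU Utilization %",
                "gridPos": {"x": 0, "y": 0, "w": 24, "h": 8},
                "targets": [
                    {
                        "expr": '100 - (avg by(instance) '
                                '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
                        "legendFormat": "{{instance}}",
                    }
                ],
            }
        ],
    }


def write_dashboard(directory):
    """Write the dashboard as JSON into the given directory and return its path."""
    path = Path(directory) / "cpu-overview.json"
    path.write_text(json.dumps(cpu_dashboard(), indent=2))
    return path


# Written to a temporary directory here; in the stack, the target would be the
# host directory mounted at /var/lib/grafana/dashboards.
with tempfile.TemporaryDirectory() as tmp:
    print(write_dashboard(tmp).name)
```

Generating dashboards from code like this keeps them reviewable in Git alongside the provider file, in line with the provisioning advice above.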
Real-World Architecture: Prometheus in Kubernetes
On Kubernetes, Prometheus is typically deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, and node exporters together. It also automatically discovers:
- kube-state-metrics — Cluster-level metrics (deployments, pods, nodes)
- cAdvisor — Container-level metrics (CPU, memory, network per container)
- Kubernetes API server — Control plane health
- Custom PodMonitors/ServiceMonitors — Your own applications
Installation takes two commands, one to add the chart repository and one to install the chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
After installation, you get a full observability stack for your Kubernetes cluster in under 5 minutes.
Conclusion
You've now set up a complete Prometheus and Grafana monitoring stack from scratch. You've learned how to:
- Deploy Prometheus and configure it to scrape metrics
- Use Grafana to visualize those metrics on interactive dashboards
- Write PromQL queries for real-time analysis and troubleshooting
- Instrument your own applications with custom metrics
- Configure alerting rules for proactive incident response
This stack scales from a single Docker host to a multi-cluster Kubernetes environment spanning hundreds of microservices. The same principles apply — expose metrics at /metrics, scrape them with Prometheus, visualize with Grafana, and alert when thresholds are breached.
For next steps, consider adding:
- Loki for log aggregation alongside your metrics
- Tempo for distributed tracing
- Thanos for long-term Prometheus storage and global querying
- Karma or Pyrra for enhanced alert management
Remember: observability is a journey, not a destination. Start with the basics — CPU, memory, latency, error rates — and iterate from there. Happy monitoring!