Building a robust monitoring infrastructure is crucial for maintaining healthy systems and applications. Prometheus and Grafana form a powerful combination for metrics collection, storage, and visualization. This comprehensive guide walks you through deploying a complete monitoring stack on Rocky Linux, from basic setup to advanced configurations and custom dashboards.
Understanding Prometheus and Grafana
What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. Key features include:
- Pull-based metrics collection: Prometheus scrapes metrics from configured targets
- Time-series database: Efficient storage of metrics with timestamps
- Powerful query language: PromQL for data retrieval and analysis
- Service discovery: Automatic discovery of monitoring targets
- Built-in alerting: Alert rules and integration with Alertmanager
What is Grafana?
Grafana is a multi-platform analytics and visualization platform that supports multiple data sources:
- Beautiful dashboards: Create stunning visualizations of your metrics
- Multiple data sources: Supports Prometheus, InfluxDB, Elasticsearch, and more
- Alerting: Visual alert rules with multiple notification channels
- User management: Role-based access control and teams
- Plugin ecosystem: Extend functionality with community plugins
Why Use Them Together?
| Feature | Prometheus | Grafana | Combined Benefit |
|---|---|---|---|
| Data Collection | ✓ | ✗ | Reliable metrics gathering |
| Storage | ✓ | ✗ | Efficient time-series storage |
| Visualization | Basic | ✓ | Professional dashboards |
| Alerting | Rule-based | Visual | Comprehensive alerting |
| User Interface | Minimal | Rich | User-friendly monitoring |
Architecture Overview
Component Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Applications   │───▶│    Exporters    │◀───│   Prometheus    │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                                ┌──────────────────────┤
                                ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Alertmanager   │◀───│   Alert Rules   │    │     Grafana     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Network Ports
- Prometheus: 9090 (web UI and API)
- Grafana: 3000 (web UI)
- Node Exporter: 9100
- Alertmanager: 9093
- Pushgateway: 9091
Installing Prometheus
Prerequisites
# Update system
sudo dnf update -y
# Install dependencies
sudo dnf install -y wget curl tar
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /var/log/prometheus
Download and Install Prometheus
# Set version
PROMETHEUS_VERSION="2.45.0"
# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Extract archive
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Copy binaries
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool /usr/local/bin/
# Copy configuration files
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/console_libraries /etc/prometheus
# Set ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /var/log/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
Create Prometheus Configuration
# Create basic configuration
sudo nano /etc/prometheus/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    monitor: 'prometheus-stack'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'prometheus-node'

  # Grafana
  - job_name: 'grafana'
    static_configs:
      - targets: ['localhost:3000']
Create Systemd Service
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=10GB \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.enable-lifecycle \
--web.enable-admin-api \
--log.level=info \
--log.format=logfmt
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=prometheus
KillMode=mixed
KillSignal=SIGTERM
[Install]
WantedBy=multi-user.target
Start Prometheus
# Reload systemd
sudo systemctl daemon-reload
# Enable and start Prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Check status
sudo systemctl status prometheus
# Check logs
sudo journalctl -u prometheus -f
# Verify installation
curl http://localhost:9090/metrics
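The `/metrics` endpoint returns Prometheus's plain-text exposition format. As a sketch of how those lines break down into samples (the sample input is abbreviated; this toy parser is for illustration — real clients should use an official Prometheus client library):

```python
import re

def parse_metrics(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{(.*)\})?\s+(\S+)', line)
        if not m:
            continue
        name, _, label_str, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', label_str or ""))
        samples.append((name, labels, float(value)))
    return samples

sample = """
# HELP prometheus_build_info Build information
# TYPE prometheus_build_info gauge
prometheus_build_info{version="2.45.0"} 1
process_cpu_seconds_total 12.5
"""
for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

If the curl output parses into sample lines like these, the server is exposing metrics correctly.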
Installing Grafana
Method 1: Install from Repository
# Add Grafana repository
sudo nano /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# Install Grafana
sudo dnf install -y grafana
# Enable and start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Check status
sudo systemctl status grafana-server
Method 2: Install from Binary
# Set version
GRAFANA_VERSION="10.0.3"
# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
# Extract and install
tar -zxvf grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
sudo mv grafana-${GRAFANA_VERSION} /opt/grafana
# Create user
sudo useradd --no-create-home --shell /bin/false grafana
# Set permissions
sudo chown -R grafana:grafana /opt/grafana
# Create systemd service
sudo nano /etc/systemd/system/grafana.service
[Unit]
Description=Grafana
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=grafana
Group=grafana
ExecStart=/opt/grafana/bin/grafana-server \
--config=/opt/grafana/conf/defaults.ini \
--homepath=/opt/grafana
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
Configure Grafana
# Edit Grafana configuration
sudo nano /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = monitoring.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = false
cookie_samesite = lax
allow_embedding = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
[auth.anonymous]
enabled = false
[auth.basic]
enabled = true
[database]
type = sqlite3
path = grafana.db
[session]
provider = file
provider_config = sessions
[analytics]
reporting_enabled = false
check_for_updates = false
[log]
mode = console file
level = info
filters =
[alerting]
enabled = true
execute_alerts = true
Configuring Prometheus
Advanced Configuration
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  query_log_file: /var/log/prometheus/query.log
  external_labels:
    cluster: 'production'
    replica: '1'

# Remote write configuration (optional)
remote_write:
  - url: 'http://remote-storage:9201/write'
    queue_config:
      capacity: 10000
      max_shards: 5
      max_samples_per_send: 1000

# Alerting configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - 'alertmanager:9093'
      timeout: 10s

# Rule files
rule_files:
  - "/etc/prometheus/rules/alerts.yml"
  - "/etc/prometheus/rules/recording.yml"

# Scrape configurations
scrape_configs:
  # Service discovery for node exporters
  - job_name: 'node-exporter'
    consul_sd_configs:
      - server: 'consul:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_tags]
        regex: '.*,production,.*'
        action: keep

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # File-based service discovery
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/*.json'
        refresh_interval: 5m

  # Static targets with different intervals
  - job_name: 'high-frequency'
    scrape_interval: 5s
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
        labels:
          env: 'production'
          team: 'backend'

  # Blackbox exporter for endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
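The three blackbox relabel steps can be traced in plain Python — a simplified sketch of the relabeling semantics, not Prometheus internals:

```python
def blackbox_relabel(target):
    """Mimic the blackbox relabel_configs: the configured target becomes a
    ?target= URL parameter, the original URL becomes the instance label,
    and the actual scrape is redirected to the exporter itself."""
    labels = {"__address__": target}
    # Step 1: copy __address__ into __param_target
    labels["__param_target"] = labels["__address__"]
    # Step 2: copy __param_target into the instance label
    labels["instance"] = labels["__param_target"]
    # Step 3: replace __address__ with the blackbox exporter's address
    labels["__address__"] = "blackbox-exporter:9115"
    return labels

print(blackbox_relabel("https://example.com"))
```

The net effect: Prometheus scrapes `blackbox-exporter:9115/probe?target=https://example.com`, while dashboards still group results by the probed URL.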
Create Alert Rules
# Create rules directory
sudo mkdir -p /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus/rules
# Create alert rules
sudo nano /etc/prometheus/rules/alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current value: {{ $value }}%)"

      # High load average
      - alert: HighLoadAverage
        expr: node_load1 > (count by(instance)(node_cpu_seconds_total{mode="idle"})) * 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "Load average is high (current value: {{ $value }})"

  - name: prometheus_alerts
    rules:
      # Prometheus target down
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is down"
          description: "Target has been down for more than 5 minutes."

      # Too many scrape errors
      - alert: PrometheusScrapingError
        expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus scraping error"
          description: "Prometheus has scraping errors for {{ $labels.job }}/{{ $labels.instance }}"

      # Prometheus config reload failed
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus configuration reload failed"
          description: "Prometheus configuration reload has failed"
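The `for:` clause means an expression must stay true across consecutive evaluations before the alert fires. A toy evaluator makes the pending/firing transition concrete (a simplification of Prometheus's state machine, assuming fixed evaluation steps):

```python
def alert_states(values, threshold, for_steps):
    """Return the alert state at each evaluation: an alert only fires after
    the expression has been true for `for_steps` consecutive evaluations."""
    states, true_for = [], 0
    for v in values:
        if v > threshold:
            true_for += 1
            states.append("firing" if true_for >= for_steps else "pending")
        else:
            true_for = 0  # any false evaluation resets the pending timer
            states.append("inactive")
    return states

# CPU samples evaluated every 30s; threshold 80%, "for" = 3 evaluations
print(alert_states([70, 85, 90, 95, 88, 60], threshold=80, for_steps=3))
```

This is why short CPU spikes never page anyone: the alert stays "pending" until the condition has held the whole `for:` window.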
Create Recording Rules
sudo nano /etc/prometheus/rules/recording.yml
groups:
  - name: node_recording
    interval: 30s
    rules:
      # CPU usage percentage
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Memory usage percentage
      - record: instance:node_memory_utilisation:percentage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      # Disk usage percentage
      - record: instance:node_filesystem_utilisation:percentage
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
      # Network receive bandwidth
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by(instance) (rate(node_network_receive_bytes_total[5m]))
      # Network transmit bandwidth
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[5m]))

  - name: aggregated_metrics
    interval: 60s
    rules:
      # Average CPU across all nodes
      - record: job:node_cpu_utilisation:avg
        expr: avg(instance:node_cpu_utilisation:rate5m)
      # Total memory usage
      - record: job:node_memory_bytes:sum
        expr: sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
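The memory recording rule is just arithmetic on two gauges; worked through with example values:

```python
def memory_utilisation_pct(mem_available_bytes, mem_total_bytes):
    """(1 - MemAvailable/MemTotal) * 100, as in the recording rule above."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100

# 4 GiB available out of 16 GiB total -> 75% utilised
print(memory_utilisation_pct(4 * 2**30, 16 * 2**30))
```

Recording this once every 30s means dashboards query the precomputed series instead of re-evaluating the division over raw samples on every refresh.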
Setting Up Exporters
Node Exporter
# Download Node Exporter
NODE_EXPORTER_VERSION="1.6.1"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Set permissions
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Create systemd service
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.tcpstat \
--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
--collector.netclass.ignored-devices=^(veth.*)$$ \
--web.listen-address=:9100 \
--web.telemetry-path=/metrics
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Blackbox Exporter
# Download Blackbox Exporter
BLACKBOX_VERSION="0.24.0"
cd /tmp
wget https://github.com/prometheus/blackbox_exporter/releases/download/v${BLACKBOX_VERSION}/blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
sudo cp blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64/blackbox_exporter /usr/local/bin/
# Create configuration
sudo mkdir -p /etc/blackbox_exporter
sudo nano /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": "data"}'
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  dns_tcp:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      transport_protocol: "tcp"
MySQL/MariaDB Exporter
# Download MySQL Exporter
MYSQL_EXPORTER_VERSION="0.15.0"
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v${MYSQL_EXPORTER_VERSION}/mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/
# Create MySQL user for exporter
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF
# Create credentials file
sudo mkdir -p /etc/mysqld_exporter
sudo nano /etc/mysqld_exporter/.my.cnf
[client]
host=localhost
port=3306
user=exporter
password=ExporterPassword123!
# Set permissions
sudo chmod 600 /etc/mysqld_exporter/.my.cnf
sudo chown prometheus:prometheus /etc/mysqld_exporter/.my.cnf
Custom Application Metrics
# Example Python application with Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define metrics
request_count = Counter('app_requests_total', 'Total number of requests', ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration', ['method', 'endpoint'])
active_users = Gauge('app_active_users', 'Number of active users')

# Expose metrics on port 8000
start_http_server(8000)

# Application logic
while True:
    # Simulate requests
    method = random.choice(['GET', 'POST'])
    endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])

    with request_duration.labels(method=method, endpoint=endpoint).time():
        # Simulate processing time
        time.sleep(random.random())

    request_count.labels(method=method, endpoint=endpoint).inc()
    active_users.set(random.randint(50, 200))
    time.sleep(1)
Integrating Prometheus with Grafana
Add Prometheus Data Source
# Using Grafana API
curl -X POST http://admin:StrongAdminPassword123!@localhost:3000/api/datasources \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://localhost:9090",
"access": "proxy",
"isDefault": true,
"jsonData": {
"timeInterval": "15s",
"queryTimeout": "60s",
"httpMethod": "POST"
}
}'
Configure Data Source in UI
- Navigate to Configuration → Data Sources
- Click "Add data source"
- Select "Prometheus"
- Configure:
  - URL: http://localhost:9090
  - Access: Server (default)
  - Scrape interval: 15s
  - Query timeout: 60s
  - HTTP Method: POST
Creating Dashboards
Import Community Dashboards
# Popular dashboard IDs:
# 1860 - Node Exporter Full
# 7362 - MySQL Overview
# 3662 - Prometheus 2.0 Overview
# 11074 - Node Exporter for Prometheus
# Import via API
curl -X POST http://admin:StrongAdminPassword123!@localhost:3000/api/dashboards/import \
-H "Content-Type: application/json" \
-d '{
"dashboard": {
"id": 1860,
"uid": null,
"title": "Node Exporter Full"
},
"overwrite": true,
"inputs": [{
"name": "DS_PROMETHEUS",
"type": "datasource",
"pluginId": "prometheus",
"value": "Prometheus"
}]
}'
Create Custom Dashboard
{
"dashboard": {
"title": "System Metrics Overview",
"panels": [
{
"title": "CPU Usage",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"type": "graph",
"targets": [{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}",
"refId": "A"
}],
"yaxes": [{
"format": "percent",
"min": 0,
"max": 100
}]
},
{
"title": "Memory Usage",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"type": "graph",
"targets": [{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}",
"refId": "A"
}],
"yaxes": [{
"format": "percent",
"min": 0,
"max": 100
}]
},
{
"title": "Disk I/O",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"type": "graph",
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[5m])",
"legendFormat": "{{instance}} - Read",
"refId": "A"
},
{
"expr": "rate(node_disk_written_bytes_total[5m])",
"legendFormat": "{{instance}} - Write",
"refId": "B"
}
],
"yaxes": [{
"format": "Bps"
}]
},
{
"title": "Network Traffic",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"type": "graph",
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!~\"lo\"}[5m])",
"legendFormat": "{{instance}} - {{device}} RX",
"refId": "A"
},
{
"expr": "rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m])",
"legendFormat": "{{instance}} - {{device}} TX",
"refId": "B"
}
],
"yaxes": [{
"format": "Bps"
}]
}
],
"time": {"from": "now-1h", "to": "now"},
"refresh": "10s"
}
}
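Because panel layout is just `gridPos` rectangles on Grafana's 24-column grid, overlapping or out-of-bounds panels can be caught before importing. A small validation sketch (the panel list mirrors the JSON above):

```python
def panels_overlap(a, b):
    """True if two gridPos rectangles intersect."""
    return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"] or
                a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

def validate_layout(panels):
    """Check every panel fits the 24-column grid and none overlap."""
    pos = [p["gridPos"] for p in panels]
    for i in range(len(pos)):
        if pos[i]["x"] + pos[i]["w"] > 24:  # Grafana grid is 24 columns wide
            return False
        for j in range(i + 1, len(pos)):
            if panels_overlap(pos[i], pos[j]):
                return False
    return True

panels = [
    {"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},   # CPU Usage
    {"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},  # Memory Usage
    {"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}},   # Disk I/O
    {"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}},  # Network Traffic
]
print(validate_layout(panels))  # True
```

Running a check like this in CI keeps version-controlled dashboard JSON from silently rendering with stacked panels.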
Dashboard Best Practices
- Organization
  - Use folders for different environments
  - Consistent naming conventions
  - Version control dashboard JSON
- Design
  - Group related metrics
  - Use appropriate visualization types
  - Consistent color schemes
  - Meaningful panel titles
- Performance
  - Limit time ranges
  - Use recording rules for complex queries
  - Avoid too many panels per dashboard
  - Set appropriate refresh intervals
Alerting Configuration
Alertmanager Setup
# Download Alertmanager
ALERTMANAGER_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool /usr/local/bin/
# Create directories
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
# Create configuration
sudo nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
  smtp_require_tls: true

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-ops'
  routes:
    - match:
        severity: critical
      receiver: 'team-ops-critical'
      continue: true
    - match:
        team: database
      receiver: 'team-database'
    - match_re:
        service: ^(frontend|backend)$
      receiver: 'team-dev'

receivers:
  - name: 'team-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'
  - name: 'team-ops-critical'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
  - name: 'team-database'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#database-alerts'
  - name: 'team-dev'
    webhook_configs:
      - url: 'http://webhook.example.com/prometheus'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
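The routing tree resolves receivers in order, and `continue: true` lets an alert match further routes. A simplified sketch of that walk (real Alertmanager also handles regex matchers and nested routes):

```python
def match_receivers(labels, routes, default_receiver):
    """Walk routes in order; equality matchers only. `continue: true`
    keeps matching after a hit, otherwise the walk stops."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers
    # nothing matched (or the last match continued): fall back to the root receiver
    return receivers or [default_receiver]

# Mirrors the first two routes in the config above
routes = [
    {"match": {"severity": "critical"}, "receiver": "team-ops-critical", "continue": True},
    {"match": {"team": "database"}, "receiver": "team-database"},
]
print(match_receivers({"severity": "critical", "team": "database"}, routes, "team-ops"))
```

A critical database alert therefore notifies both the on-call critical channel and the database team, while an unmatched alert falls back to `team-ops`.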
Create Alert Templates
sudo mkdir -p /etc/alertmanager/templates
sudo nano /etc/alertmanager/templates/custom.tmpl
{{ define "custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}
{{ define "custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Value:* {{ .Value }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
{{ define "custom.slack.text" }}
{{ range .Alerts }}
:{{ if eq .Status "firing" }}red_circle{{ else }}green_circle{{ end }}: *{{ .Annotations.summary }}*
{{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Instance:* `{{ .Labels.instance }}`
*Value:* `{{ .Value }}`
{{ end }}
{{ end }}
Grafana Alerting
# Configure Grafana alerting
sudo nano /etc/grafana/grafana.ini
[unified_alerting]
enabled = true
execute_alerts = true
evaluation_timeout = 30s
notification_timeout = 30s
max_attempts = 3
min_interval = 10s
[unified_alerting.screenshots]
capture = true
capture_timeout = 10s
max_concurrent_screenshots = 5
upload_external_image_storage = false
Create Grafana Alert Rules
{
"uid": "cpu-alert",
"title": "High CPU Usage Alert",
"condition": "A",
"data": [
{
"refId": "A",
"queryType": "",
"model": {
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A"
},
"datasourceUid": "prometheus-uid",
"conditions": [
{
"evaluator": {
"params": [80],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A"]
},
"reducer": {
"params": [],
"type": "avg"
},
"type": "query"
}
],
"reducer": "last",
"expression": "A"
}
],
"noDataState": "NoData",
"execErrState": "Alerting",
"for": "5m",
"annotations": {
"description": "CPU usage is above 80% on {{ $labels.instance }}",
"runbook_url": "https://wiki.example.com/runbooks/cpu-high",
"summary": "High CPU usage detected"
},
"labels": {
"severity": "warning",
"team": "ops"
}
}
Service Discovery
Consul Integration
# Prometheus configuration for Consul
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        token: 'your-consul-token'
        datacenter: 'dc1'
        tag_separator: ','
        scheme: 'http'
        services: []  # All services
    relabel_configs:
      # Keep only services with 'prometheus' tag
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      # Use service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      # Use node name as instance label
      - source_labels: [__meta_consul_node]
        target_label: instance
      # Extract custom metrics path from tags
      - source_labels: [__meta_consul_tags]
        regex: '.*,metrics_path=([^,]+),.*'
        target_label: __metrics_path__
        replacement: '${1}'
Kubernetes Service Discovery
# Kubernetes pods discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production
    relabel_configs:
      # Only scrape pods with prometheus annotations
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified (joins address and port annotation with ';')
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Add kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Add namespace
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
File-based Service Discovery
# Create file SD directory
sudo mkdir -p /etc/prometheus/file_sd
# Example targets file
sudo nano /etc/prometheus/file_sd/webservers.json
[
  {
    "targets": ["web1.example.com:9100", "web2.example.com:9100"],
    "labels": {
      "env": "production",
      "role": "webserver",
      "datacenter": "us-east-1"
    }
  },
  {
    "targets": ["web3.example.com:9100"],
    "labels": {
      "env": "staging",
      "role": "webserver",
      "datacenter": "us-west-2"
    }
  }
]
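Since these files are plain JSON, they are easy to generate from an inventory source. A sketch (host names and label values are illustrative):

```python
import json

def make_file_sd(hosts, env, role, port=9100):
    """Build a Prometheus file_sd target group from a host list."""
    return [{
        "targets": [f"{h}:{port}" for h in hosts],
        "labels": {"env": env, "role": role},
    }]

groups = make_file_sd(["web1.example.com", "web2.example.com"], "production", "webserver")
print(json.dumps(groups, indent=2))
```

In practice, write to a temporary file and rename it into `/etc/prometheus/file_sd/` so Prometheus never reads a half-written file; it picks up changes automatically on the configured `refresh_interval`.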
DNS Service Discovery
scrape_configs:
  - job_name: 'dns-srv-records'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: 'SRV'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
        regex: '([^.]+)\..*'
        replacement: '${1}'
Security Hardening
Prometheus Security
# Enable basic authentication
sudo dnf install -y httpd-tools
# Generate password hash
htpasswd -nBC 10 "" | tr -d ':\n'
# Configure Prometheus
sudo nano /etc/prometheus/web.yml
basic_auth_users:
  admin: $2y$10$V2RmZ2wKC7cPiE3o/h3gLuUBUPGIM2Qm0x0W8X0gAB3sLNkVE3tEq
  prometheus: $2y$10$93m/Gk5HzNxwGqDG3zSJxuYCKNneOU5W.AXFyiKJhDRIAHsQBGtFa

tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ca.crt
# Update Prometheus service
sudo nano /etc/systemd/system/prometheus.service
# Add to ExecStart:
--web.config.file=/etc/prometheus/web.yml
# Restart Prometheus
sudo systemctl daemon-reload
sudo systemctl restart prometheus
Grafana Security
# /etc/grafana/grafana.ini
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict
strict_transport_security = true
strict_transport_security_max_age_seconds = 86400
strict_transport_security_preload = true
strict_transport_security_subdomains = true
x_content_type_options = true
x_xss_protection = true
content_security_policy = true
[auth]
disable_login_form = false
disable_signout_menu = false
oauth_auto_login = false
[auth.anonymous]
enabled = false
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
[auth.proxy]
enabled = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
viewers_can_edit = false
editors_can_admin = false
[database]
ssl_mode = require
ca_cert_path = /etc/grafana/ca.crt
client_key_path = /etc/grafana/client.key
client_cert_path = /etc/grafana/client.crt
server_cert_name = grafana.example.com
Network Security
# Configure firewall
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --reload
# Restrict access by source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="9090" protocol="tcp" accept'
sudo firewall-cmd --reload
SSL/TLS Configuration
# Generate certificates
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
-keyout prometheus.key -out prometheus.crt \
-subj "/C=US/ST=State/L=City/O=Organization/CN=prometheus.example.com"
# Configure Nginx reverse proxy
sudo dnf install -y nginx
sudo nano /etc/nginx/conf.d/monitoring.conf
# Prometheus
server {
listen 443 ssl http2;
server_name prometheus.example.com;
ssl_certificate /etc/nginx/ssl/prometheus.crt;
ssl_certificate_key /etc/nginx/ssl/prometheus.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
location / {
proxy_pass http://localhost:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
auth_basic "Prometheus";
auth_basic_user_file /etc/nginx/.htpasswd;
}
}
# Grafana
server {
listen 443 ssl http2;
server_name grafana.example.com;
ssl_certificate /etc/nginx/ssl/grafana.crt;
ssl_certificate_key /etc/nginx/ssl/grafana.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
location / {
proxy_pass http://localhost:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
Performance Optimization
Prometheus Optimization
# Storage and query tuning is configured with command-line flags,
# not in prometheus.yml. Add these to ExecStart in the systemd unit:

# Storage optimization
--storage.tsdb.retention.time=30d      # time-based retention
--storage.tsdb.retention.size=100GB    # size-based retention
--storage.tsdb.wal-compression         # compress the write-ahead log
--storage.tsdb.min-block-duration=2h   # block duration bounds
--storage.tsdb.max-block-duration=48h

# Query optimization
--query.max-concurrency=20             # concurrent queries
--query.timeout=2m                     # query timeout
--query.max-samples=50000000           # max samples a single query may load
--query.lookback-delta=5m              # staleness lookback window

# Scrape optimization (this part does go in prometheus.yml)
scrape_configs:
  - job_name: 'optimized'
    # Increase scrape interval for less critical metrics
    scrape_interval: 60s
    # Reduce scrape timeout
    scrape_timeout: 10s
    # Limit samples per scrape
    sample_limit: 10000
    # Limit label count
    label_limit: 30
    # Limit label name length
    label_name_length_limit: 200
    # Limit label value length
    label_value_length_limit: 200
Recording Rules for Performance
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate expensive queries
      - record: job:node_cpu:avg_rate5m
        expr: avg by(job) (rate(node_cpu_seconds_total[5m]))
      - record: job:node_memory:usage_percentage
        expr: |
          100 * (1 - (
            sum by(job) (node_memory_MemAvailable_bytes)
            /
            sum by(job) (node_memory_MemTotal_bytes)
          ))
      - record: instance:node_filesystem:usage_percentage
        expr: |
          100 - (
            100 * node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}
          )
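Recording rules can be unit-tested offline with promtool before deployment. A minimal sketch of a test file for the memory rule; the file name performance_rules.yml and the input series values are hypothetical:

```yaml
# memory_rules_test.yml -- run with: promtool test rules memory_rules_test.yml
rule_files:
  - /etc/prometheus/rules/performance_rules.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      - series: 'node_memory_MemTotal_bytes{job="node"}'
        values: '1000 1000 1000'
      - series: 'node_memory_MemAvailable_bytes{job="node"}'
        values: '250 250 250'
    promql_expr_test:
      # 100 * (1 - 250/1000) = 75% used
      - expr: job:node_memory:usage_percentage
        eval_time: 2m
        exp_samples:
          - labels: 'job:node_memory:usage_percentage{job="node"}'
            value: 75
```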
Grafana Performance
# Database optimization
[database]
max_open_conn = 100
max_idle_conn = 100
conn_max_lifetime = 14400
# Caching (the [caching] section is a Grafana Enterprise/Cloud feature)
[caching]
enabled = true
# Data proxy
[dataproxy]
timeout = 30
keep_alive_seconds = 30
tls_handshake_timeout_seconds = 10
expect_continue_timeout_seconds = 1
max_idle_connections = 100
idle_conn_timeout_seconds = 90
# Rendering
[rendering]
concurrent_render_limit = 5
# Query caching
[feature_toggles]
enable = queryCaching
Query Optimization Tips
1. Use Recording Rules
- Pre-calculate expensive queries
- Aggregate data at collection time
- Reduce query complexity
2. Optimize PromQL
# Works, but harder to read: trailing "by" clause
avg(rate(http_requests_total[5m])) by (job)
# Preferred: leading "by" clause keeps the grouping visible up front
avg by (job) (rate(http_requests_total[5m]))
3. Limit Time Ranges
- Use appropriate time ranges
- Avoid querying old data unnecessarily
- Use downsampling for historical data
4. Index Labels Properly
- Keep cardinality in check
- Use meaningful label names
- Avoid high-cardinality labels
High Availability Setup
Prometheus HA Configuration
# prometheus-1.yml
global:
  scrape_interval: 15s
  external_labels:
    replica: '1'
    cluster: 'prod'

# prometheus-2.yml
global:
  scrape_interval: 15s
  external_labels:
    replica: '2'
    cluster: 'prod'
Using Thanos for HA
# Install Thanos
THANOS_VERSION="0.32.0"
wget https://github.com/thanos-io/thanos/releases/download/v${THANOS_VERSION}/thanos-${THANOS_VERSION}.linux-amd64.tar.gz
tar xvf thanos-${THANOS_VERSION}.linux-amd64.tar.gz
sudo cp thanos-${THANOS_VERSION}.linux-amd64/thanos /usr/local/bin/
# Configure Thanos Sidecar
sudo nano /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=Thanos Sidecar
After=prometheus.service

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos sidecar \
    --tsdb.path=/var/lib/prometheus \
    --prometheus.url=http://localhost:9090 \
    --grpc-address=0.0.0.0:10901 \
    --http-address=0.0.0.0:10902

[Install]
WantedBy=multi-user.target
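With a sidecar next to each replica, a Thanos Query instance can fan out to both and deduplicate results on the replica external label set earlier. A sketch of a companion unit; the host names and ports are assumptions for this layout:

```ini
# /etc/systemd/system/thanos-query.service
[Unit]
Description=Thanos Query
After=network-online.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos query \
    --http-address=0.0.0.0:10904 \
    --grpc-address=0.0.0.0:10903 \
    --endpoint=prometheus-1.example.com:10901 \
    --endpoint=prometheus-2.example.com:10901 \
    --query.replica-label=replica
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Pointing Grafana's Prometheus data source at the Query HTTP port (10904 here) then yields a single deduplicated view across both replicas.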
Grafana HA with Database
# Use external database for HA
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = SecurePassword123!
ssl_mode = require
ca_cert_path = /etc/grafana/ca.pem
[session]
provider = postgres
provider_config = user=grafana password=SecurePassword123! host=postgres.example.com port=5432 dbname=grafana sslmode=require
[remote_cache]
type = redis
connstr = redis.example.com:6379
Troubleshooting
Common Prometheus Issues
# Check Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml
# Check rule files
promtool check rules /etc/prometheus/rules/*.yml
# Test service discovery
curl http://localhost:9090/api/v1/targets
# Check metrics ingestion
curl http://localhost:9090/api/v1/query?query=up
# Debug scraping issues
curl http://localhost:9090/api/v1/targets/metadata
# Check storage
du -sh /var/lib/prometheus/*
# Analyze cardinality (number of distinct metric names)
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
Common Grafana Issues
# Check Grafana logs
sudo journalctl -u grafana-server -f
# Test data source
curl -u admin:password http://localhost:3000/api/datasources
# Check plugin installation
grafana-cli plugins ls
# Re-encrypt data source passwords (e.g. after moving the database)
grafana-cli admin data-migration encrypt-datasource-passwords
# Reset admin password
grafana-cli admin reset-admin-password newpassword
Performance Diagnostics
# Prometheus metrics about itself
curl http://localhost:9090/metrics | grep prometheus_
# Key metrics to check:
# - prometheus_tsdb_head_series
# - prometheus_tsdb_symbol_table_size_bytes
# - prometheus_tsdb_head_chunks
# - prometheus_engine_query_duration_seconds
# - prometheus_http_request_duration_seconds
# Grafana performance metrics
curl http://localhost:3000/metrics | grep grafana_
# System resource usage
top -p $(pgrep prometheus)
top -p $(pgrep grafana)
Best Practices
Monitoring Best Practices
1. Label Management
- Keep label cardinality under control
- Use consistent label naming
- Avoid dynamic label values
- Document label meanings
2. Query Optimization
- Use recording rules for dashboards
- Limit query time ranges
- Avoid regex where possible
- Cache frequently used queries
3. Alert Design
- Alert on symptoms, not causes
- Include runbook links
- Set appropriate thresholds
- Test alerts regularly
4. Dashboard Design
- Group related metrics
- Use consistent layouts
- Include documentation
- Version control dashboards
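Label management can be enforced at scrape time with relabeling, so offending series never reach storage. A hypothetical snippet that drops a high-cardinality label before ingestion (the job, target, and request_id label are examples):

```yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Drop a hypothetical per-request label that would explode cardinality
      - regex: request_id
        action: labeldrop
```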
Operational Best Practices
1. Backup Strategy
# Backup Prometheus data
tar -czf prometheus-backup-$(date +%Y%m%d).tar.gz /var/lib/prometheus
# Backup Grafana (stop grafana-server first, or use sqlite3's .backup, for a consistent copy)
cp /var/lib/grafana/grafana.db grafana-backup-$(date +%Y%m%d).db
# Dashboards can also be exported as JSON via the HTTP API or the dashboard share menu
2. Monitoring the Monitors
- Monitor Prometheus with another instance
- Set up alerts for monitoring stack
- Track resource usage
- Monitor scrape performance
3. Capacity Planning
- Monitor storage growth
- Track cardinality increases
- Plan for retention needs
- Scale before hitting limits
4. Documentation
- Document architecture
- Maintain runbooks
- Record configuration decisions
- Keep dashboard documentation
Security Best Practices
1. Access Control
- Use strong authentication
- Implement RBAC
- Audit access logs
- Regular permission reviews
2. Network Security
- Use TLS everywhere
- Restrict network access
- Implement firewall rules
- Use VPN for remote access
3. Data Protection
- Encrypt data at rest
- Secure backups
- Limit data retention
- Anonymize sensitive data
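The firewall point can be made concrete with a firewalld zone that only admits a trusted subnet. A sketch of a zone file; the 10.0.0.0/24 source address is an assumption to replace with your admin network:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- /etc/firewalld/zones/monitoring.xml: allow Prometheus (9090) and
     Grafana (3000) only from an assumed trusted admin subnet. -->
<zone>
  <short>monitoring</short>
  <description>Monitoring stack access for administrators</description>
  <source address="10.0.0.0/24"/>
  <port protocol="tcp" port="9090"/>
  <port protocol="tcp" port="3000"/>
</zone>
```

After dropping the file in place, apply it with sudo firewall-cmd --reload.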
Conclusion
Deploying Prometheus and Grafana on Rocky Linux provides a powerful, scalable monitoring solution for modern infrastructure. This guide has covered:
- Complete installation and configuration
- Integration and dashboard creation
- Advanced features like service discovery
- Security hardening and best practices
- Performance optimization techniques
- High availability configurations
Remember that monitoring is an iterative process. Start with basic metrics, gradually add more sophisticated monitoring, and continuously refine based on your needs.