Building a robust monitoring infrastructure is crucial for maintaining healthy systems and applications. Prometheus and Grafana form a powerful combination for metrics collection, storage, and visualization. This comprehensive guide walks you through deploying a complete monitoring stack on Rocky Linux, from basic setup to advanced configurations and custom dashboards.
Understanding Prometheus and Grafana
What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. Key features include:
- Pull-based metrics collection: Prometheus scrapes metrics from configured targets
- Time-series database: Efficient storage of metrics with timestamps
- Powerful query language: PromQL for data retrieval and analysis
- Service discovery: Automatic discovery of monitoring targets
- Built-in alerting: Alert rules and integration with Alertmanager
What is Grafana?
Grafana is a multi-platform analytics and visualization platform that supports multiple data sources:
- Beautiful dashboards: Create stunning visualizations of your metrics
- Multiple data sources: Supports Prometheus, InfluxDB, Elasticsearch, and more
- Alerting: Visual alert rules with multiple notification channels
- User management: Role-based access control and teams
- Plugin ecosystem: Extend functionality with community plugins
Why Use Them Together?
| Feature | Prometheus | Grafana | Combined Benefit |
|---|---|---|---|
| Data Collection | ✓ | ✗ | Reliable metrics gathering |
| Storage | ✓ | ✗ | Efficient time-series storage |
| Visualization | Basic | ✓ | Professional dashboards |
| Alerting | Rule-based | Visual | Comprehensive alerting |
| User Interface | Minimal | Rich | User-friendly monitoring |
Architecture Overview
Component Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Applications   │───▶│    Exporters    │◀───│   Prometheus    │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                                ┌──────────────────────┤
                                ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Alertmanager   │◀───│   Alert Rules   │    │     Grafana     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Network Ports
- Prometheus: 9090 (web UI and API)
- Grafana: 3000 (web UI)
- Node Exporter: 9100
- Alertmanager: 9093
- Pushgateway: 9091
Installing Prometheus
Prerequisites
# Update system
sudo dnf update -y
# Install dependencies
sudo dnf install -y wget curl tar
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /var/log/prometheus
Download and Install Prometheus
# Set version
PROMETHEUS_VERSION="2.45.0"
# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Extract archive
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Copy binaries
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool /usr/local/bin/
# Copy configuration files
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/console_libraries /etc/prometheus
# Set ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /var/log/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
Create Prometheus Configuration
# Create basic configuration
sudo nano /etc/prometheus/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    monitor: 'prometheus-stack'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'prometheus-node'

  # Grafana
  - job_name: 'grafana'
    static_configs:
      - targets: ['localhost:3000']
Create Systemd Service
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=10GB \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.enable-lifecycle \
--web.enable-admin-api \
--log.level=info \
--log.format=logfmt
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=prometheus
KillMode=mixed
KillSignal=SIGTERM
[Install]
WantedBy=multi-user.target
Start Prometheus
# Reload systemd
sudo systemctl daemon-reload
# Enable and start Prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Check status
sudo systemctl status prometheus
# Check logs
sudo journalctl -u prometheus -f
# Verify installation
curl http://localhost:9090/metrics
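The `/metrics` endpoint returns Prometheus's plain-text exposition format. As a sketch of how those lines break down into samples (the sample input is abbreviated; this toy parser is for illustration — real clients should use an official Prometheus client library):

```python
import re

def parse_metrics(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{(.*)\})?\s+(\S+)', line)
        if not m:
            continue
        name, _, label_str, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', label_str or ""))
        samples.append((name, labels, float(value)))
    return samples

sample = """
# HELP prometheus_build_info Build information
# TYPE prometheus_build_info gauge
prometheus_build_info{version="2.45.0"} 1
process_cpu_seconds_total 12.5
"""
for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

If the curl output parses into sample lines like these, the server is exposing metrics correctly.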
Installing Grafana
Method 1: Install from Repository
# Add Grafana repository
sudo nano /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# Install Grafana
sudo dnf install -y grafana
# Enable and start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Check status
sudo systemctl status grafana-server
Method 2: Install from Binary
# Set version
GRAFANA_VERSION="10.0.3"
# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
# Extract and install
tar -zxvf grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
sudo mv grafana-${GRAFANA_VERSION} /opt/grafana
# Create user
sudo useradd --no-create-home --shell /bin/false grafana
# Set permissions
sudo chown -R grafana:grafana /opt/grafana
# Create systemd service
sudo nano /etc/systemd/system/grafana.service
[Unit]
Description=Grafana
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=grafana
Group=grafana
ExecStart=/opt/grafana/bin/grafana-server \
--config=/opt/grafana/conf/defaults.ini \
--homepath=/opt/grafana
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
Configure Grafana
# Edit Grafana configuration
sudo nano /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = monitoring.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = false
cookie_samesite = lax
allow_embedding = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
[auth.anonymous]
enabled = false
[auth.basic]
enabled = true
[database]
type = sqlite3
path = grafana.db
[session]
provider = file
provider_config = sessions
[analytics]
reporting_enabled = false
check_for_updates = false
[log]
mode = console file
level = info
filters =
[alerting]
enabled = true
execute_alerts = true
Configuring Prometheus
Advanced Configuration
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  query_log_file: /var/log/prometheus/query.log
  external_labels:
    cluster: 'production'
    replica: '1'

# Remote write configuration (optional)
remote_write:
  - url: 'http://remote-storage:9201/write'
    queue_config:
      capacity: 10000
      max_shards: 5
      max_samples_per_send: 1000

# Alerting configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - 'alertmanager:9093'
      timeout: 10s

# Rule files
rule_files:
  - "/etc/prometheus/rules/alerts.yml"
  - "/etc/prometheus/rules/recording.yml"

# Scrape configurations
scrape_configs:
  # Service discovery for node exporters
  - job_name: 'node-exporter'
    consul_sd_configs:
      - server: 'consul:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_tags]
        regex: '.*,production,.*'
        action: keep

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # File-based service discovery
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/*.json'
        refresh_interval: 5m

  # Static targets with different intervals
  - job_name: 'high-frequency'
    scrape_interval: 5s
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
        labels:
          env: 'production'
          team: 'backend'

  # Blackbox exporter for endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
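The three blackbox relabel steps can be traced in plain Python — a simplified sketch of the relabeling semantics, not Prometheus internals:

```python
def blackbox_relabel(target):
    """Mimic the blackbox relabel_configs: the configured target becomes a
    ?target= URL parameter, the original URL becomes the instance label,
    and the actual scrape is redirected to the exporter itself."""
    labels = {"__address__": target}
    # Step 1: copy __address__ into __param_target
    labels["__param_target"] = labels["__address__"]
    # Step 2: copy __param_target into the instance label
    labels["instance"] = labels["__param_target"]
    # Step 3: replace __address__ with the blackbox exporter's address
    labels["__address__"] = "blackbox-exporter:9115"
    return labels

print(blackbox_relabel("https://example.com"))
```

The net effect: Prometheus scrapes `blackbox-exporter:9115/probe?target=https://example.com`, while dashboards still group results by the probed URL.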
Create Alert Rules
# Create rules directory
sudo mkdir -p /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus/rules
# Create alert rules
sudo nano /etc/prometheus/rules/alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current value: {{ $value }}%)"

      # High load average
      - alert: HighLoadAverage
        expr: node_load1 > (count by(instance)(node_cpu_seconds_total{mode="idle"})) * 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "Load average is high (current value: {{ $value }})"

  - name: prometheus_alerts
    rules:
      # Prometheus target down
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is down"
          description: "Target has been down for more than 5 minutes."

      # Too many scrape errors
      - alert: PrometheusScrapingError
        expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus scraping error"
          description: "Prometheus has scraping errors for {{ $labels.job }}/{{ $labels.instance }}"

      # Prometheus config reload failed
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus configuration reload failed"
          description: "Prometheus configuration reload has failed"
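The `for:` clause means an expression must stay true across consecutive evaluations before the alert fires. A toy evaluator makes the pending/firing transition concrete (a simplification of Prometheus's state machine, assuming fixed evaluation steps):

```python
def alert_states(values, threshold, for_steps):
    """Return the alert state at each evaluation: an alert only fires after
    the expression has been true for `for_steps` consecutive evaluations."""
    states, true_for = [], 0
    for v in values:
        if v > threshold:
            true_for += 1
            states.append("firing" if true_for >= for_steps else "pending")
        else:
            true_for = 0  # any false evaluation resets the pending timer
            states.append("inactive")
    return states

# CPU samples evaluated every 30s; threshold 80%, "for" = 3 evaluations
print(alert_states([70, 85, 90, 95, 88, 60], threshold=80, for_steps=3))
```

This is why short CPU spikes never page anyone: the alert stays "pending" until the condition has held the whole `for:` window.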
Create Recording Rules
sudo nano /etc/prometheus/rules/recording.yml
groups:
  - name: node_recording
    interval: 30s
    rules:
      # CPU usage percentage
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Memory usage percentage
      - record: instance:node_memory_utilisation:percentage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      # Disk usage percentage
      - record: instance:node_filesystem_utilisation:percentage
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
      # Network receive bandwidth
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by(instance) (rate(node_network_receive_bytes_total[5m]))
      # Network transmit bandwidth
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[5m]))

  - name: aggregated_metrics
    interval: 60s
    rules:
      # Average CPU across all nodes
      - record: job:node_cpu_utilisation:avg
        expr: avg(instance:node_cpu_utilisation:rate5m)
      # Total memory usage
      - record: job:node_memory_bytes:sum
        expr: sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
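The memory recording rule is just arithmetic on two gauges; worked through with example values:

```python
def memory_utilisation_pct(mem_available_bytes, mem_total_bytes):
    """(1 - MemAvailable/MemTotal) * 100, as in the recording rule above."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100

# 4 GiB available out of 16 GiB total -> 75% utilised
print(memory_utilisation_pct(4 * 2**30, 16 * 2**30))
```

Recording this once every 30s means dashboards query the precomputed series instead of re-evaluating the division over raw samples on every refresh.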
Setting Up Exporters
Node Exporter
# Download Node Exporter
NODE_EXPORTER_VERSION="1.6.1"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Set permissions
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Create systemd service
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.tcpstat \
--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
--collector.netclass.ignored-devices=^(veth.*)$$ \
--web.listen-address=:9100 \
--web.telemetry-path=/metrics
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Blackbox Exporter
# Download Blackbox Exporter
BLACKBOX_VERSION="0.24.0"
cd /tmp
wget https://github.com/prometheus/blackbox_exporter/releases/download/v${BLACKBOX_VERSION}/blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
sudo cp blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64/blackbox_exporter /usr/local/bin/
# Create configuration
sudo mkdir -p /etc/blackbox_exporter
sudo nano /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": "data"}'
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  dns_tcp:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      transport_protocol: "tcp"
MySQL/MariaDB Exporter
# Download MySQL Exporter
MYSQL_EXPORTER_VERSION="0.15.0"
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v${MYSQL_EXPORTER_VERSION}/mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/
# Create MySQL user for exporter
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF
# Create credentials file
sudo mkdir -p /etc/mysqld_exporter
sudo nano /etc/mysqld_exporter/.my.cnf
[client]
host=localhost
port=3306
user=exporter
password=ExporterPassword123!
# Set permissions
sudo chmod 600 /etc/mysqld_exporter/.my.cnf
sudo chown prometheus:prometheus /etc/mysqld_exporter/.my.cnf
Custom Application Metrics
# Example Python application with Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define metrics
request_count = Counter('app_requests_total', 'Total number of requests', ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration', ['method', 'endpoint'])
active_users = Gauge('app_active_users', 'Number of active users')

# Expose metrics on port 8000
start_http_server(8000)

# Application logic
while True:
    # Simulate requests
    method = random.choice(['GET', 'POST'])
    endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])

    with request_duration.labels(method=method, endpoint=endpoint).time():
        # Simulate processing time
        time.sleep(random.random())

    request_count.labels(method=method, endpoint=endpoint).inc()
    active_users.set(random.randint(50, 200))
    time.sleep(1)
Integrating Prometheus with Grafana
Add Prometheus Data Source
# Using Grafana API
curl -X POST http://admin:StrongAdminPassword123!@localhost:3000/api/datasources \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://localhost:9090",
"access": "proxy",
"isDefault": true,
"jsonData": {
"timeInterval": "15s",
"queryTimeout": "60s",
"httpMethod": "POST"
}
}'
Configure Data Source in UI
- Navigate to Configuration → Data Sources
- Click "Add data source"
- Select "Prometheus"
- Configure:
  - URL: http://localhost:9090
  - Access: Server (default)
  - Scrape interval: 15s
  - Query timeout: 60s
  - HTTP Method: POST
Creating Dashboards
Import Community Dashboards
# Popular dashboard IDs:
# 1860 - Node Exporter Full
# 7362 - MySQL Overview
# 3662 - Prometheus 2.0 Overview
# 11074 - Node Exporter for Prometheus
# Import via API
curl -X POST http://admin:StrongAdminPassword123!@localhost:3000/api/dashboards/import \
-H "Content-Type: application/json" \
-d '{
"dashboard": {
"id": 1860,
"uid": null,
"title": "Node Exporter Full"
},
"overwrite": true,
"inputs": [{
"name": "DS_PROMETHEUS",
"type": "datasource",
"pluginId": "prometheus",
"value": "Prometheus"
}]
}'
Create Custom Dashboard
{
"dashboard": {
"title": "System Metrics Overview",
"panels": [
{
"title": "CPU Usage",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"type": "graph",
"targets": [{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}",
"refId": "A"
}],
"yaxes": [{
"format": "percent",
"min": 0,
"max": 100
}]
},
{
"title": "Memory Usage",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"type": "graph",
"targets": [{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}",
"refId": "A"
}],
"yaxes": [{
"format": "percent",
"min": 0,
"max": 100
}]
},
{
"title": "Disk I/O",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"type": "graph",
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[5m])",
"legendFormat": "{{instance}} - Read",
"refId": "A"
},
{
"expr": "rate(node_disk_written_bytes_total[5m])",
"legendFormat": "{{instance}} - Write",
"refId": "B"
}
],
"yaxes": [{
"format": "Bps"
}]
},
{
"title": "Network Traffic",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"type": "graph",
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!~\"lo\"}[5m])",
"legendFormat": "{{instance}} - {{device}} RX",
"refId": "A"
},
{
"expr": "rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m])",
"legendFormat": "{{instance}} - {{device}} TX",
"refId": "B"
}
],
"yaxes": [{
"format": "Bps"
}]
}
],
"time": {"from": "now-1h", "to": "now"},
"refresh": "10s"
}
}
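Because panel layout is just `gridPos` rectangles on Grafana's 24-column grid, overlapping or out-of-bounds panels can be caught before importing. A small validation sketch (the panel list mirrors the JSON above):

```python
def panels_overlap(a, b):
    """True if two gridPos rectangles intersect."""
    return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"] or
                a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

def validate_layout(panels):
    """Check every panel fits the 24-column grid and none overlap."""
    pos = [p["gridPos"] for p in panels]
    for i in range(len(pos)):
        if pos[i]["x"] + pos[i]["w"] > 24:  # Grafana grid is 24 columns wide
            return False
        for j in range(i + 1, len(pos)):
            if panels_overlap(pos[i], pos[j]):
                return False
    return True

panels = [
    {"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},   # CPU Usage
    {"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},  # Memory Usage
    {"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}},   # Disk I/O
    {"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}},  # Network Traffic
]
print(validate_layout(panels))  # True
```

Running a check like this in CI keeps version-controlled dashboard JSON from silently rendering with stacked panels.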
Dashboard Best Practices
- Organization
  - Use folders for different environments
  - Consistent naming conventions
  - Version control dashboard JSON
- Design
  - Group related metrics
  - Use appropriate visualization types
  - Consistent color schemes
  - Meaningful panel titles
- Performance
  - Limit time ranges
  - Use recording rules for complex queries
  - Avoid too many panels per dashboard
  - Set appropriate refresh intervals
Alerting Configuration
Alertmanager Setup
# Download Alertmanager
ALERTMANAGER_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool /usr/local/bin/
# Create directories
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
# Create configuration
sudo nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
  smtp_require_tls: true

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-ops'
  routes:
    - match:
        severity: critical
      receiver: 'team-ops-critical'
      continue: true
    - match:
        team: database
      receiver: 'team-database'
    - match_re:
        service: ^(frontend|backend)$
      receiver: 'team-dev'

receivers:
  - name: 'team-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'
  - name: 'team-ops-critical'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
  - name: 'team-database'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#database-alerts'
  - name: 'team-dev'
    webhook_configs:
      - url: 'http://webhook.example.com/prometheus'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
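The routing tree resolves receivers in order, and `continue: true` lets an alert match further routes. A simplified sketch of that walk (real Alertmanager also handles regex matchers and nested routes):

```python
def match_receivers(labels, routes, default_receiver):
    """Walk routes in order; equality matchers only. `continue: true`
    keeps matching after a hit, otherwise the walk stops."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers
    # nothing matched (or the last match continued): fall back to the root receiver
    return receivers or [default_receiver]

# Mirrors the first two routes in the config above
routes = [
    {"match": {"severity": "critical"}, "receiver": "team-ops-critical", "continue": True},
    {"match": {"team": "database"}, "receiver": "team-database"},
]
print(match_receivers({"severity": "critical", "team": "database"}, routes, "team-ops"))
```

A critical database alert therefore notifies both the on-call critical channel and the database team, while an unmatched alert falls back to `team-ops`.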
Create Alert Templates
sudo mkdir -p /etc/alertmanager/templates
sudo nano /etc/alertmanager/templates/custom.tmpl
{{ define "custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}
{{ define "custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Value:* {{ .Value }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
{{ define "custom.slack.text" }}
{{ range .Alerts }}
:{{ if eq .Status "firing" }}red_circle{{ else }}green_circle{{ end }}: *{{ .Annotations.summary }}*
{{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Instance:* `{{ .Labels.instance }}`
*Value:* `{{ .Value }}`
{{ end }}
{{ end }}
Grafana Alerting
# Configure Grafana alerting
sudo nano /etc/grafana/grafana.ini
[unified_alerting]
enabled = true
execute_alerts = true
evaluation_timeout = 30s
notification_timeout = 30s
max_attempts = 3
min_interval = 10s
[unified_alerting.screenshots]
capture = true
capture_timeout = 10s
max_concurrent_screenshots = 5
upload_external_image_storage = false
Create Grafana Alert Rules
{
"uid": "cpu-alert",
"title": "High CPU Usage Alert",
"condition": "A",
"data": [
{
"refId": "A",
"queryType": "",
"model": {
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A"
},
"datasourceUid": "prometheus-uid",
"conditions": [
{
"evaluator": {
"params": [80],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A"]
},
"reducer": {
"params": [],
"type": "avg"
},
"type": "query"
}
],
"reducer": "last",
"expression": "A"
}
],
"noDataState": "NoData",
"execErrState": "Alerting",
"for": "5m",
"annotations": {
"description": "CPU usage is above 80% on {{ $labels.instance }}",
"runbook_url": "https://wiki.example.com/runbooks/cpu-high",
"summary": "High CPU usage detected"
},
"labels": {
"severity": "warning",
"team": "ops"
}
}
Service Discovery
Consul Integration
# Prometheus configuration for Consul
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        token: 'your-consul-token'
        datacenter: 'dc1'
        tag_separator: ','
        scheme: 'http'
        services: []  # All services
    relabel_configs:
      # Keep only services with 'prometheus' tag
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      # Use service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      # Use node name as instance label
      - source_labels: [__meta_consul_node]
        target_label: instance
      # Extract custom metrics path from tags
      - source_labels: [__meta_consul_tags]
        regex: '.*,metrics_path=([^,]+),.*'
        target_label: __metrics_path__
        replacement: '${1}'
Kubernetes Service Discovery
# Kubernetes pods discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production
    relabel_configs:
      # Only scrape pods with prometheus annotations
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified (joins address and port annotation with ';')
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Add kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Add namespace
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
File-based Service Discovery
# Create file SD directory
sudo mkdir -p /etc/prometheus/file_sd
# Example targets file
sudo nano /etc/prometheus/file_sd/webservers.json
[
  {
    "targets": ["web1.example.com:9100", "web2.example.com:9100"],
    "labels": {
      "env": "production",
      "role": "webserver",
      "datacenter": "us-east-1"
    }
  },
  {
    "targets": ["web3.example.com:9100"],
    "labels": {
      "env": "staging",
      "role": "webserver",
      "datacenter": "us-west-2"
    }
  }
]
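Since these files are plain JSON, they are easy to generate from an inventory source. A sketch (host names and label values are illustrative):

```python
import json

def make_file_sd(hosts, env, role, port=9100):
    """Build a Prometheus file_sd target group from a host list."""
    return [{
        "targets": [f"{h}:{port}" for h in hosts],
        "labels": {"env": env, "role": role},
    }]

groups = make_file_sd(["web1.example.com", "web2.example.com"], "production", "webserver")
print(json.dumps(groups, indent=2))
```

In practice, write to a temporary file and rename it into `/etc/prometheus/file_sd/` so Prometheus never reads a half-written file; it picks up changes automatically on the configured `refresh_interval`.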
DNS Service Discovery
scrape_configs:
  - job_name: 'dns-srv-records'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: 'SRV'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
        regex: '([^.]+)\..*'
        replacement: '${1}'
Security Hardening
Prometheus Security
# Enable basic authentication
sudo dnf install -y httpd-tools
# Generate password hash
htpasswd -nBC 10 "" | tr -d ':\n'
# Configure Prometheus
sudo nano /etc/prometheus/web.yml
basic_auth_users:
  admin: $2y$10$V2RmZ2wKC7cPiE3o/h3gLuUBUPGIM2Qm0x0W8X0gAB3sLNkVE3tEq
  prometheus: $2y$10$93m/Gk5HzNxwGqDG3zSJxuYCKNneOU5W.AXFyiKJhDRIAHsQBGtFa

tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ca.crt
# Update Prometheus service
sudo nano /etc/systemd/system/prometheus.service
# Add to ExecStart:
--web.config.file=/etc/prometheus/web.yml
# Restart Prometheus
sudo systemctl daemon-reload
sudo systemctl restart prometheus
Grafana Security
# /etc/grafana/grafana.ini
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict
strict_transport_security = true
strict_transport_security_max_age_seconds = 86400
strict_transport_security_preload = true
strict_transport_security_subdomains = true
x_content_type_options = true
x_xss_protection = true
content_security_policy = true
[auth]
disable_login_form = false
disable_signout_menu = false
oauth_auto_login = false
[auth.anonymous]
enabled = false
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
[auth.proxy]
enabled = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
viewers_can_edit = false
editors_can_admin = false
[database]
ssl_mode = require
ca_cert_path = /etc/grafana/ca.crt
client_key_path = /etc/grafana/client.key
client_cert_path = /etc/grafana/client.crt
server_cert_name = grafana.example.com
Network Security
# Configure firewall
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --reload
# Restrict access by source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="9090" protocol="tcp" accept'
sudo firewall-cmd --reload
SSL/TLS Configuration
# Generate certificates
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
-keyout prometheus.key -out prometheus.crt \
-subj "/C=US/ST=State/L=City/O=Organization/CN=prometheus.example.com"
# Configure Nginx reverse proxy
sudo dnf install -y nginx
sudo nano /etc/nginx/conf.d/monitoring.conf
# Prometheus
server {
listen 443 ssl http2;
server_name prometheus.example.com;
ssl_certificate /etc/nginx/ssl/prometheus.crt;
ssl_certificate_key /etc/nginx/ssl/prometheus.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
location / {
proxy_pass http://localhost:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
auth_basic "Prometheus";
auth_basic_user_file /etc/nginx/.htpasswd;
}
}
# Grafana
server {
listen 443 ssl http2;
server_name grafana.example.com;
ssl_certificate /etc/nginx/ssl/grafana.crt;
ssl_certificate_key /etc/nginx/ssl/grafana.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
location / {
proxy_pass http://localhost:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
Performance Optimization
Prometheus Optimization
# Storage and query tuning is configured with command-line flags,
# not in prometheus.yml. Add these to ExecStart in the systemd unit:

# Storage optimization
--storage.tsdb.retention.time=30d      # time-based retention
--storage.tsdb.retention.size=100GB    # size-based retention
--storage.tsdb.wal-compression         # compress the write-ahead log
--storage.tsdb.min-block-duration=2h   # block duration bounds
--storage.tsdb.max-block-duration=48h

# Query optimization
--query.max-concurrency=20             # concurrent queries
--query.timeout=2m                     # query timeout
--query.max-samples=50000000           # max samples a single query may load
--query.lookback-delta=5m              # staleness lookback window

# Scrape optimization (this part does go in prometheus.yml)
scrape_configs:
  - job_name: 'optimized'
    # Increase scrape interval for less critical metrics
    scrape_interval: 60s
    # Reduce scrape timeout
    scrape_timeout: 10s
    # Limit samples per scrape
    sample_limit: 10000
    # Limit label count
    label_limit: 30
    # Limit label name length
    label_name_length_limit: 200
    # Limit label value length
    label_value_length_limit: 200
Recording Rules for Performance
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate expensive queries
      - record: job:node_cpu:avg_rate5m
        expr: avg by(job) (rate(node_cpu_seconds_total[5m]))
      - record: job:node_memory:usage_percentage
        expr: |
          100 * (1 - (
            sum by(job) (node_memory_MemAvailable_bytes)
            /
            sum by(job) (node_memory_MemTotal_bytes)
          ))
      - record: instance:node_filesystem:usage_percentage
        expr: |
          100 - (
            100 * node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}
          )
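Recording rules can be unit-tested offline with promtool before deployment. A minimal sketch of a test file for the memory rule; the file name performance_rules.yml and the input series values are hypothetical:

```yaml
# memory_rules_test.yml -- run with: promtool test rules memory_rules_test.yml
rule_files:
  - /etc/prometheus/rules/performance_rules.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      - series: 'node_memory_MemTotal_bytes{job="node"}'
        values: '1000 1000 1000'
      - series: 'node_memory_MemAvailable_bytes{job="node"}'
        values: '250 250 250'
    promql_expr_test:
      # 100 * (1 - 250/1000) = 75% used
      - expr: job:node_memory:usage_percentage
        eval_time: 2m
        exp_samples:
          - labels: 'job:node_memory:usage_percentage{job="node"}'
            value: 75
```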
Grafana Performance
# Database optimization
[database]
max_open_conn = 100
max_idle_conn = 100
conn_max_lifetime = 14400
# Caching (the [caching] section is a Grafana Enterprise/Cloud feature)
[caching]
enabled = true
# Data proxy
[dataproxy]
timeout = 30
keep_alive_seconds = 30
tls_handshake_timeout_seconds = 10
expect_continue_timeout_seconds = 1
max_idle_connections = 100
idle_conn_timeout_seconds = 90
# Rendering
[rendering]
concurrent_render_limit = 5
# Query caching
[feature_toggles]
enable = queryCaching
Query Optimization Tips
1. Use Recording Rules
- Pre-calculate expensive queries
- Aggregate data at collection time
- Reduce query complexity
2. Optimize PromQL
# Works, but harder to read: trailing "by" clause
avg(rate(http_requests_total[5m])) by (job)
# Preferred: leading "by" clause keeps the grouping visible up front
avg by (job) (rate(http_requests_total[5m]))
3. Limit Time Ranges
- Use appropriate time ranges
- Avoid querying old data unnecessarily
- Use downsampling for historical data
4. Index Labels Properly
- Keep cardinality in check
- Use meaningful label names
- Avoid high-cardinality labels
High Availability Setup
Prometheus HA Configuration
# prometheus-1.yml
global:
  scrape_interval: 15s
  external_labels:
    replica: '1'
    cluster: 'prod'

# prometheus-2.yml
global:
  scrape_interval: 15s
  external_labels:
    replica: '2'
    cluster: 'prod'
Using Thanos for HA
# Install Thanos
THANOS_VERSION="0.32.0"
wget https://github.com/thanos-io/thanos/releases/download/v${THANOS_VERSION}/thanos-${THANOS_VERSION}.linux-amd64.tar.gz
tar xvf thanos-${THANOS_VERSION}.linux-amd64.tar.gz
sudo cp thanos-${THANOS_VERSION}.linux-amd64/thanos /usr/local/bin/
# Configure Thanos Sidecar
sudo nano /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=Thanos Sidecar
After=prometheus.service

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos sidecar \
    --tsdb.path=/var/lib/prometheus \
    --prometheus.url=http://localhost:9090 \
    --grpc-address=0.0.0.0:10901 \
    --http-address=0.0.0.0:10902

[Install]
WantedBy=multi-user.target
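With a sidecar next to each replica, a Thanos Query instance can fan out to both and deduplicate results on the replica external label set earlier. A sketch of a companion unit; the host names and ports are assumptions for this layout:

```ini
# /etc/systemd/system/thanos-query.service
[Unit]
Description=Thanos Query
After=network-online.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos query \
    --http-address=0.0.0.0:10904 \
    --grpc-address=0.0.0.0:10903 \
    --endpoint=prometheus-1.example.com:10901 \
    --endpoint=prometheus-2.example.com:10901 \
    --query.replica-label=replica
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Pointing Grafana's Prometheus data source at the Query HTTP port (10904 here) then yields a single deduplicated view across both replicas.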
Grafana HA with Database
# Use external database for HA
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = SecurePassword123!
ssl_mode = require
ca_cert_path = /etc/grafana/ca.pem
[session]
provider = postgres
provider_config = user=grafana password=SecurePassword123! host=postgres.example.com port=5432 dbname=grafana sslmode=require
[remote_cache]
type = redis
connstr = redis.example.com:6379
Troubleshooting
Common Prometheus Issues
# Check Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml
# Check rule files
promtool check rules /etc/prometheus/rules/*.yml
# Test service discovery
curl http://localhost:9090/api/v1/targets
# Check metrics ingestion
curl http://localhost:9090/api/v1/query?query=up
# Debug scraping issues
curl http://localhost:9090/api/v1/targets/metadata
# Check storage
du -sh /var/lib/prometheus/*
# Analyze cardinality (number of distinct metric names)
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
Common Grafana Issues
# Check Grafana logs
sudo journalctl -u grafana-server -f
# Test data source
curl -u admin:password http://localhost:3000/api/datasources
# Check plugin installation
grafana-cli plugins ls
# Re-encrypt data source passwords (e.g. after moving the database)
grafana-cli admin data-migration encrypt-datasource-passwords
# Reset admin password
grafana-cli admin reset-admin-password newpassword
Performance Diagnostics
# Prometheus metrics about itself
curl http://localhost:9090/metrics | grep prometheus_
# Key metrics to check:
# - prometheus_tsdb_head_series
# - prometheus_tsdb_symbol_table_size_bytes
# - prometheus_tsdb_head_chunks
# - prometheus_engine_query_duration_seconds
# - prometheus_http_request_duration_seconds
# Grafana performance metrics
curl http://localhost:3000/metrics | grep grafana_
# System resource usage
top -p $(pgrep prometheus)
top -p $(pgrep grafana)
Best Practices
Monitoring Best Practices
1. Label Management
- Keep label cardinality under control
- Use consistent label naming
- Avoid dynamic label values
- Document label meanings
2. Query Optimization
- Use recording rules for dashboards
- Limit query time ranges
- Avoid regex where possible
- Cache frequently used queries
3. Alert Design
- Alert on symptoms, not causes
- Include runbook links
- Set appropriate thresholds
- Test alerts regularly
4. Dashboard Design
- Group related metrics
- Use consistent layouts
- Include documentation
- Version control dashboards
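Label management can be enforced at scrape time with relabeling, so offending series never reach storage. A hypothetical snippet that drops a high-cardinality label before ingestion (the job, target, and request_id label are examples):

```yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Drop a hypothetical per-request label that would explode cardinality
      - regex: request_id
        action: labeldrop
```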
Operational Best Practices
1. Backup Strategy
# Backup Prometheus data
tar -czf prometheus-backup-$(date +%Y%m%d).tar.gz /var/lib/prometheus
# Backup Grafana (stop grafana-server first, or use sqlite3's .backup, for a consistent copy)
cp /var/lib/grafana/grafana.db grafana-backup-$(date +%Y%m%d).db
# Dashboards can also be exported as JSON via the HTTP API or the dashboard share menu
2. Monitoring the Monitors
- Monitor Prometheus with another instance
- Set up alerts for monitoring stack
- Track resource usage
- Monitor scrape performance
3. Capacity Planning
- Monitor storage growth
- Track cardinality increases
- Plan for retention needs
- Scale before hitting limits
4. Documentation
- Document architecture
- Maintain runbooks
- Record configuration decisions
- Keep dashboard documentation
Security Best Practices
1. Access Control
- Use strong authentication
- Implement RBAC
- Audit access logs
- Regular permission reviews
2. Network Security
- Use TLS everywhere
- Restrict network access
- Implement firewall rules
- Use VPN for remote access
3. Data Protection
- Encrypt data at rest
- Secure backups
- Limit data retention
- Anonymize sensitive data
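The firewall point can be made concrete with a firewalld zone that only admits a trusted subnet. A sketch of a zone file; the 10.0.0.0/24 source address is an assumption to replace with your admin network:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- /etc/firewalld/zones/monitoring.xml: allow Prometheus (9090) and
     Grafana (3000) only from an assumed trusted admin subnet. -->
<zone>
  <short>monitoring</short>
  <description>Monitoring stack access for administrators</description>
  <source address="10.0.0.0/24"/>
  <port protocol="tcp" port="9090"/>
  <port protocol="tcp" port="3000"/>
</zone>
```

After dropping the file in place, apply it with sudo firewall-cmd --reload.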
Conclusion
Deploying Prometheus and Grafana on Rocky Linux provides a powerful, scalable monitoring solution for modern infrastructure. This guide has covered:
- Complete installation and configuration
- Integration and dashboard creation
- Advanced features like service discovery
- Security hardening and best practices
- Performance optimization techniques
- High availability configurations
Remember that monitoring is an iterative process. Start with basic metrics, gradually add more sophisticated monitoring, and continuously refine based on your needs.