Prometheus and Grafana Stack on Rocky Linux: Complete Monitoring Solution

Tags: Rocky Linux, Prometheus, Grafana

Published Jul 27, 2025

Deploy a comprehensive monitoring stack with Prometheus and Grafana on Rocky Linux. Learn installation, configuration, metrics collection, alerting, and creating beautiful dashboards.

28 min read

Building a robust monitoring infrastructure is crucial for maintaining healthy systems and applications. Prometheus and Grafana form a powerful combination for metrics collection, storage, and visualization. This comprehensive guide walks you through deploying a complete monitoring stack on Rocky Linux, from basic setup to advanced configurations and custom dashboards.

Understanding Prometheus and Grafana

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. Key features include:

  • Pull-based metrics collection: Prometheus scrapes metrics from configured targets
  • Time-series database: Efficient storage of metrics with timestamps
  • Powerful query language: PromQL for data retrieval and analysis
  • Service discovery: Automatic discovery of monitoring targets
  • Built-in alerting: Alert rules and integration with Alertmanager
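
A few PromQL examples give a feel for the query language (the metric names assume the Node Exporter and the sample Python application covered later in this guide):

```promql
# Per-instance CPU utilisation over the last 5 minutes
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 95th-percentile request latency from a histogram
histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m]))

# Targets that are currently down
up == 0
```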

What is Grafana?

Grafana is a multi-platform analytics and visualization platform that supports multiple data sources:

  • Beautiful dashboards: Create stunning visualizations of your metrics
  • Multiple data sources: Supports Prometheus, InfluxDB, Elasticsearch, and more
  • Alerting: Visual alert rules with multiple notification channels
  • User management: Role-based access control and teams
  • Plugin ecosystem: Extend functionality with community plugins

Why Use Them Together?

Feature         | Prometheus                     | Grafana                  | Combined Benefit
----------------|--------------------------------|--------------------------|---------------------------
Data Collection | Reliable metrics gathering     |                          |
Storage         | Efficient time-series storage  |                          |
Visualization   | Basic                          | Professional dashboards  |
Alerting        | Rule-based                     | Visual                   | Comprehensive alerting
User Interface  | Minimal                        | Rich                     | User-friendly monitoring

Architecture Overview

Component Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Applications  │────▶│   Exporters     │◀────│   Prometheus    │
└─────────────────┘     └─────────────────┘     └─────────────────┘

(Prometheus scrapes the exporters, evaluates the alert rules, and forwards
firing alerts to Alertmanager; Grafana queries Prometheus to render dashboards.)

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Alertmanager  │◀────│   Alert Rules   │     │     Grafana     │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Network Ports

  • Prometheus: 9090 (web UI and API)
  • Grafana: 3000 (web UI)
  • Node Exporter: 9100
  • Alertmanager: 9093
  • Pushgateway: 9091

Installing Prometheus

Prerequisites

# Update system
sudo dnf update -y

# Install dependencies
sudo dnf install -y wget curl tar

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /var/log/prometheus

Download and Install Prometheus

# Set version
PROMETHEUS_VERSION="2.45.0"

# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz

# Extract archive
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz

# Copy binaries
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool /usr/local/bin/

# Copy configuration files
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/console_libraries /etc/prometheus

# Set ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /var/log/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

Create Prometheus Configuration

# Create basic configuration
sudo nano /etc/prometheus/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    monitor: 'prometheus-stack'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'prometheus-node'

  # Grafana
  - job_name: 'grafana'
    static_configs:
      - targets: ['localhost:3000']

Create Systemd Service

sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --log.level=info \
  --log.format=logfmt

Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=prometheus
KillMode=mixed
KillSignal=SIGTERM

[Install]
WantedBy=multi-user.target

Start Prometheus

# Reload systemd
sudo systemctl daemon-reload

# Enable and start Prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus

# Check status
sudo systemctl status prometheus

# Check logs
sudo journalctl -u prometheus -f

# Verify installation
curl http://localhost:9090/metrics
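
Besides /metrics, Prometheus exposes a query API at /api/v1/query. As a small sketch using only the Python standard library, here is how such a request URL is built for the instance installed above (no request is actually sent; this only constructs the URL):

```python
from urllib.parse import urlencode

def build_query_url(base, query, time=None):
    """Build a Prometheus instant-query URL for the /api/v1/query endpoint."""
    params = {"query": query}
    if time is not None:
        params["time"] = time  # RFC3339 timestamp or unix time
    return f"{base}/api/v1/query?{urlencode(params)}"

url = build_query_url("http://localhost:9090", "up")
print(url)  # http://localhost:9090/api/v1/query?query=up
```

Fetching that URL (with curl or urllib) returns a JSON document whose `data.result` array holds one sample per matching series.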

Installing Grafana

Method 1: Install from Repository

# Add Grafana repository
sudo nano /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# Install Grafana
sudo dnf install -y grafana

# Enable and start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

# Check status
sudo systemctl status grafana-server

Method 2: Install from Binary

# Set version
GRAFANA_VERSION="10.0.3"

# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz

# Extract and install
tar -zxvf grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
sudo mv grafana-${GRAFANA_VERSION} /opt/grafana

# Create user
sudo useradd --no-create-home --shell /bin/false grafana

# Set permissions
sudo chown -R grafana:grafana /opt/grafana

# Create systemd service
sudo nano /etc/systemd/system/grafana.service
[Unit]
Description=Grafana
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target

[Service]
Type=notify
User=grafana
Group=grafana
ExecStart=/opt/grafana/bin/grafana-server \
  --config=/opt/grafana/conf/defaults.ini \
  --homepath=/opt/grafana

Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Configure Grafana

# Edit Grafana configuration (path for the repository install; the binary
# install from Method 2 keeps its configuration under /opt/grafana/conf/)
sudo nano /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = monitoring.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false

[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = false
cookie_samesite = lax
allow_embedding = false

[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer

[auth.anonymous]
enabled = false

[auth.basic]
enabled = true

[database]
type = sqlite3
path = grafana.db

[session]
provider = file
provider_config = sessions

[analytics]
reporting_enabled = false
check_for_updates = false

[log]
mode = console file
level = info
filters = 

[alerting]
enabled = true
execute_alerts = true

Configuring Prometheus

Advanced Configuration

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  query_log_file: /var/log/prometheus/query.log
  external_labels:
    cluster: 'production'
    replica: '1'

# Remote write configuration (optional)
remote_write:
  - url: 'http://remote-storage:9201/write'
    queue_config:
      capacity: 10000
      max_shards: 5
      max_samples_per_send: 1000

# Alerting configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
          - 'alertmanager:9093'
      timeout: 10s

# Rule files
rule_files:
  - "/etc/prometheus/rules/alerts.yml"
  - "/etc/prometheus/rules/recording.yml"

# Scrape configurations
scrape_configs:
  # Service discovery for node exporters
  - job_name: 'node-exporter'
    consul_sd_configs:
      - server: 'consul:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_tags]
        regex: '.*,production,.*'
        action: keep

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # File-based service discovery
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
        - '/etc/prometheus/file_sd/*.json'
        refresh_interval: 5m

  # Static targets with different intervals
  - job_name: 'high-frequency'
    scrape_interval: 5s
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
        labels:
          env: 'production'
          team: 'backend'

  # Blackbox exporter for endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Create Alert Rules

# Create rules directory
sudo mkdir -p /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus/rules

# Create alert rules
sudo nano /etc/prometheus/rules/alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current value: {{ $value }}%)"

      # High load average
      - alert: HighLoadAverage
        expr: node_load1 > (count by(instance)(node_cpu_seconds_total{mode="idle"})) * 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "Load average is high (current value: {{ $value }})"

  - name: prometheus_alerts
    rules:
      # Prometheus target down
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is down"
          description: "Target has been down for more than 5 minutes."

      # Too many scrape errors
      - alert: PrometheusScrapingError
        expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus scraping error"
          description: "Prometheus has scraping errors for {{ $labels.job }}/{{ $labels.instance }}"

      # Prometheus config reload failed
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus configuration reload failed"
          description: "Prometheus configuration reload has failed"
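
The CPU and memory expressions above are plain arithmetic over gauge values, so the thresholds can be sanity-checked outside Prometheus. A quick sketch in Python with made-up sample values:

```python
def cpu_usage_percent(idle_fraction):
    """Mirrors: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100."""
    return 100 - idle_fraction * 100

def memory_usage_percent(available_bytes, total_bytes):
    """Mirrors: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100."""
    return (1 - available_bytes / total_bytes) * 100

# Hypothetical values: 12% of CPU time idle, 2 GiB available out of 16 GiB
print(cpu_usage_percent(0.12))                      # ~88 -> HighCPUUsage fires (> 80)
print(memory_usage_percent(2 * 2**30, 16 * 2**30))  # 87.5 -> HighMemoryUsage fires (> 85)
```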

Create Recording Rules

sudo nano /etc/prometheus/rules/recording.yml
groups:
  - name: node_recording
    interval: 30s
    rules:
      # CPU usage percentage
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory usage percentage
      - record: instance:node_memory_utilisation:percentage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

      # Disk usage percentage
      - record: instance:node_filesystem_utilisation:percentage
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

      # Network receive bandwidth
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by(instance) (rate(node_network_receive_bytes_total[5m]))

      # Network transmit bandwidth
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[5m]))

  - name: aggregated_metrics
    interval: 60s
    rules:
      # Average CPU across all nodes
      - record: job:node_cpu_utilisation:avg
        expr: avg(instance:node_cpu_utilisation:rate5m)

      # Total memory usage
      - record: job:node_memory_bytes:sum
        expr: sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)

Setting Up Exporters

Node Exporter

# Download Node Exporter
NODE_EXPORTER_VERSION="1.6.1"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/

# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Set permissions
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Create systemd service
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.netclass.ignored-devices=^(veth.*)$$ \
  --web.listen-address=:9100 \
  --web.telemetry-path=/metrics

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Blackbox Exporter

# Download Blackbox Exporter
BLACKBOX_VERSION="0.24.0"
cd /tmp
wget https://github.com/prometheus/blackbox_exporter/releases/download/v${BLACKBOX_VERSION}/blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
sudo cp blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64/blackbox_exporter /usr/local/bin/

# Create configuration
sudo mkdir -p /etc/blackbox_exporter
sudo nano /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": "data"}'

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

  dns_tcp:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      transport_protocol: "tcp"
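
The steps above install the binary and its configuration but no service. A unit file sketch mirroring the node_exporter pattern (the dedicated `blackbox_exporter` user is an assumption and must be created first with `useradd --no-create-home --shell /bin/false blackbox_exporter`):

```ini
# /etc/systemd/system/blackbox_exporter.service
[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=blackbox_exporter
Group=blackbox_exporter
ExecStart=/usr/local/bin/blackbox_exporter \
  --config.file=/etc/blackbox_exporter/blackbox.yml \
  --web.listen-address=:9115
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```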

MySQL/MariaDB Exporter

# Download MySQL Exporter
MYSQL_EXPORTER_VERSION="0.15.0"
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v${MYSQL_EXPORTER_VERSION}/mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/

# Create MySQL user for exporter
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF

# Create credentials file
sudo mkdir -p /etc/mysqld_exporter
sudo nano /etc/mysqld_exporter/.my.cnf
[client]
host=localhost
port=3306
user=exporter
password=ExporterPassword123!
# Set permissions
sudo chmod 600 /etc/mysqld_exporter/.my.cnf
sudo chown prometheus:prometheus /etc/mysqld_exporter/.my.cnf
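
To run the exporter as a service, a unit sketch along the same lines (it runs as the `prometheus` user to match the `.my.cnf` ownership set above; 9104 is the exporter's conventional port):

```ini
# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=MySQL Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/mysqld_exporter \
  --config.my-cnf=/etc/mysqld_exporter/.my.cnf \
  --web.listen-address=:9104
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```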

Custom Application Metrics

# Example Python application with Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define metrics
request_count = Counter('app_requests_total', 'Total number of requests', ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration', ['method', 'endpoint'])
active_users = Gauge('app_active_users', 'Number of active users')

# Expose metrics
start_http_server(8000)

# Application logic
while True:
    # Simulate requests
    method = random.choice(['GET', 'POST'])
    endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
    
    with request_duration.labels(method=method, endpoint=endpoint).time():
        # Simulate processing time
        time.sleep(random.random())
        
    request_count.labels(method=method, endpoint=endpoint).inc()
    active_users.set(random.randint(50, 200))
    
    time.sleep(1)
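
The exporter above serves plain text in the Prometheus exposition format on port 8000. As a rough sketch of what that format looks like (not a full parser; it ignores HELP/TYPE lines, label escaping, and timestamps), a single sample line can be decoded like this:

```python
import re

# Matches a sample line such as:
# app_requests_total{method="GET",endpoint="/api/users"} 42.0
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{(.*)\})?\s+(\S+)$')

def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one exposition sample line."""
    m = SAMPLE_RE.match(line.strip())
    if not m:
        raise ValueError(f"not a sample line: {line!r}")
    name, _, labels_raw, value = m.groups()
    labels = {}
    if labels_raw:
        for key, val in re.findall(r'(\w+)="([^"]*)"', labels_raw):
            labels[key] = val
    return name, labels, float(value)

name, labels, value = parse_sample('app_requests_total{method="GET",endpoint="/api/users"} 42.0')
print(name, labels, value)
```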

Integrating Prometheus with Grafana

Add Prometheus Data Source

# Using Grafana API
curl -X POST 'http://admin:StrongAdminPassword123!@localhost:3000/api/datasources' \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true,
    "jsonData": {
      "timeInterval": "15s",
      "queryTimeout": "60s",
      "httpMethod": "POST"
    }
  }'

Configure Data Source in UI

  1. Navigate to Configuration → Data Sources
  2. Click “Add data source”
  3. Select “Prometheus”
  4. Configure:
    • URL: http://localhost:9090
    • Access: Server (default)
    • Scrape interval: 15s
    • Query timeout: 60s
    • HTTP Method: POST
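
Instead of the UI or API, the same data source can be provisioned from a file that Grafana reads at startup (the path below is the default for the repository install):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
```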

Creating Dashboards

Import Community Dashboards

# Popular dashboard IDs:
# 1860 - Node Exporter Full
# 7362 - MySQL Overview
# 3662 - Prometheus 2.0 Overview
# 11074 - Node Exporter for Prometheus

# Import via API: download the dashboard JSON from grafana.com first,
# then post it to the import endpoint
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o /tmp/node-exporter-full.json

curl -X POST 'http://admin:StrongAdminPassword123!@localhost:3000/api/dashboards/import' \
  -H "Content-Type: application/json" \
  -d "{
    \"dashboard\": $(cat /tmp/node-exporter-full.json),
    \"overwrite\": true,
    \"inputs\": [{
      \"name\": \"DS_PROMETHEUS\",
      \"type\": \"datasource\",
      \"pluginId\": \"prometheus\",
      \"value\": \"Prometheus\"
    }]
  }"

Create Custom Dashboard

{
  "dashboard": {
    "title": "System Metrics Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "type": "graph",
        "targets": [{
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}",
          "refId": "A"
        }],
        "yaxes": [{
          "format": "percent",
          "min": 0,
          "max": 100
        }]
      },
      {
        "title": "Memory Usage",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "type": "graph",
        "targets": [{
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
          "legendFormat": "{{instance}}",
          "refId": "A"
        }],
        "yaxes": [{
          "format": "percent",
          "min": 0,
          "max": 100
        }]
      },
      {
        "title": "Disk I/O",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_disk_read_bytes_total[5m])",
            "legendFormat": "{{instance}} - Read",
            "refId": "A"
          },
          {
            "expr": "rate(node_disk_written_bytes_total[5m])",
            "legendFormat": "{{instance}} - Write",
            "refId": "B"
          }
        ],
        "yaxes": [{
          "format": "Bps"
        }]
      },
      {
        "title": "Network Traffic",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total{device!~\"lo\"}[5m])",
            "legendFormat": "{{instance}} - {{device}} RX",
            "refId": "A"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m])",
            "legendFormat": "{{instance}} - {{device}} TX",
            "refId": "B"
          }
        ],
        "yaxes": [{
          "format": "Bps"
        }]
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "10s"
  }
}

Dashboard Best Practices

  1. Organization

    • Use folders for different environments
    • Consistent naming conventions
    • Version control dashboard JSON
  2. Design

    • Group related metrics
    • Use appropriate visualization types
    • Consistent color schemes
    • Meaningful panel titles
  3. Performance

    • Limit time ranges
    • Use recording rules for complex queries
    • Avoid too many panels per dashboard
    • Set appropriate refresh intervals
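
Version-controlled dashboard JSON pairs naturally with Grafana's dashboard provisioning. A sketch of a provider definition (the folder name and filesystem path are examples, not defaults):

```yaml
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Production'
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```

Any dashboard JSON dropped into that path is loaded automatically and re-read at the given interval.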

Alerting Configuration

Alertmanager Setup

# Download Alertmanager
ALERTMANAGER_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool /usr/local/bin/

# Create directories
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager

# Create configuration
sudo nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
  smtp_require_tls: true

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-ops'
  
  routes:
    - match:
        severity: critical
      receiver: 'team-ops-critical'
      continue: true
      
    - match:
        team: database
      receiver: 'team-database'
      
    - match_re:
        service: ^(frontend|backend)$
      receiver: 'team-dev'

receivers:
  - name: 'team-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'
    
  - name: 'team-ops-critical'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        
  - name: 'team-database'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#database-alerts'
        
  - name: 'team-dev'
    webhook_configs:
      - url: 'http://webhook.example.com/prometheus'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
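
The install steps above leave Alertmanager without a unit file. A sketch mirroring the Prometheus service (a dedicated `alertmanager` user is an assumption and must be created first):

```ini
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=:9093
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```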

Create Alert Templates

sudo mkdir -p /etc/alertmanager/templates
sudo nano /etc/alertmanager/templates/custom.tmpl
{{ define "custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}

{{ define "custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Value:* {{ .Value }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}

{{ define "custom.slack.text" }}
{{ range .Alerts }}
:{{ if eq .Status "firing" }}red_circle{{ else }}green_circle{{ end }}: *{{ .Annotations.summary }}*
{{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Instance:* `{{ .Labels.instance }}`
*Value:* `{{ .Value }}`
{{ end }}
{{ end }}

Grafana Alerting

# Configure Grafana alerting
sudo nano /etc/grafana/grafana.ini
[unified_alerting]
enabled = true
execute_alerts = true
evaluation_timeout = 30s
notification_timeout = 30s
max_attempts = 3
min_interval = 10s

[unified_alerting.screenshots]
capture = true
capture_timeout = 10s
max_concurrent_screenshots = 5
upload_external_image_storage = false

Create Grafana Alert Rules

{
  "uid": "cpu-alert",
  "title": "High CPU Usage Alert",
  "condition": "A",
  "data": [
    {
      "refId": "A",
      "queryType": "",
      "model": {
        "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "refId": "A"
      },
      "datasourceUid": "prometheus-uid",
      "conditions": [
        {
          "evaluator": {
            "params": [80],
            "type": "gt"
          },
          "operator": {
            "type": "and"
          },
          "query": {
            "params": ["A"]
          },
          "reducer": {
            "params": [],
            "type": "avg"
          },
          "type": "query"
        }
      ],
      "reducer": "last",
      "expression": "A"
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "5m",
  "annotations": {
    "description": "CPU usage is above 80% on {{ $labels.instance }}",
    "runbook_url": "https://wiki.example.com/runbooks/cpu-high",
    "summary": "High CPU usage detected"
  },
  "labels": {
    "severity": "warning",
    "team": "ops"
  }
}

Service Discovery

Consul Integration

# Prometheus configuration for Consul
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        token: 'your-consul-token'
        datacenter: 'dc1'
        tag_separator: ','
        scheme: 'http'
        services: []  # All services
        
    relabel_configs:
      # Keep only services with 'prometheus' tag
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
        
      # Use service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
        
      # Use node name as instance label
      - source_labels: [__meta_consul_node]
        target_label: instance
        
      # Extract custom metrics path from tags
      - source_labels: [__meta_consul_tags]
        regex: '.*,metrics_path=([^,]+),.*'
        target_label: __metrics_path__
        replacement: '${1}'

Kubernetes Service Discovery

# Kubernetes pods discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production
            
    relabel_configs:
      # Only scrape pods with prometheus annotations
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
        
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
        
      # Add kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
        
      # Add namespace
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
        
      # Add pod name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
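
For the relabeling above to pick up a pod, the pod (or its template in a Deployment) carries annotations along these lines:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```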

File-based Service Discovery

# Create file SD directory
sudo mkdir -p /etc/prometheus/file_sd

# Example targets file
sudo nano /etc/prometheus/file_sd/webservers.json
[
  {
    "targets": ["web1.example.com:9100", "web2.example.com:9100"],
    "labels": {
      "env": "production",
      "role": "webserver",
      "datacenter": "us-east-1"
    }
  },
  {
    "targets": ["web3.example.com:9100"],
    "labels": {
      "env": "staging",
      "role": "webserver",
      "datacenter": "us-west-2"
    }
  }
]
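
Prometheus skips malformed file SD entries at runtime, so a quick structural check before dropping a file in place can save debugging time. A minimal sketch (it validates only the shape shown above, not reachability):

```python
import json

def validate_file_sd(text):
    """Check that file SD content is a list of {targets, labels} groups."""
    groups = json.loads(text)
    assert isinstance(groups, list), "top level must be a list"
    for group in groups:
        targets = group.get("targets")
        assert isinstance(targets, list) and targets, \
            "each group needs a non-empty 'targets' list"
        for target in targets:
            assert isinstance(target, str) and ":" in target, \
                f"target should look like host:port, got {target!r}"
        labels = group.get("labels", {})
        assert all(isinstance(k, str) and isinstance(v, str)
                   for k, v in labels.items()), "labels must map strings to strings"
    return True

with_labels = '[{"targets": ["web1.example.com:9100"], "labels": {"env": "production"}}]'
print(validate_file_sd(with_labels))  # True
```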

DNS Service Discovery

scrape_configs:
  - job_name: 'dns-srv-records'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: 'SRV'
        refresh_interval: 30s
        
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
        regex: '([^.]+)\..*'
        replacement: '${1}'

Security Hardening

Prometheus Security

# Enable basic authentication
sudo dnf install -y httpd-tools

# Generate password hash
htpasswd -nBC 10 "" | tr -d ':\n'

# Configure Prometheus
sudo nano /etc/prometheus/web.yml
basic_auth_users:
  admin: $2y$10$V2RmZ2wKC7cPiE3o/h3gLuUBUPGIM2Qm0x0W8X0gAB3sLNkVE3tEq
  prometheus: $2y$10$93m/Gk5HzNxwGqDG3zSJxuYCKNneOU5W.AXFyiKJhDRIAHsQBGtFa

tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ca.crt
# Update Prometheus service
sudo nano /etc/systemd/system/prometheus.service

# Add to ExecStart:
--web.config.file=/etc/prometheus/web.yml

# Validate the web config, then restart Prometheus
promtool check web-config /etc/prometheus/web.yml
sudo systemctl daemon-reload
sudo systemctl restart prometheus

Grafana Security

# /etc/grafana/grafana.ini
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
# Change secret_key from this well-known shipped default
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict
strict_transport_security = true
strict_transport_security_max_age_seconds = 86400
strict_transport_security_preload = true
strict_transport_security_subdomains = true
x_content_type_options = true
x_xss_protection = true
content_security_policy = true

[auth]
disable_login_form = false
disable_signout_menu = false
oauth_auto_login = false

[auth.anonymous]
enabled = false

[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true

[auth.proxy]
enabled = false

[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
viewers_can_edit = false
editors_can_admin = false

[database]
ssl_mode = require
ca_cert_path = /etc/grafana/ca.crt
client_key_path = /etc/grafana/client.key
client_cert_path = /etc/grafana/client.crt
server_cert_name = grafana.example.com
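Rather than committing `admin_password` to grafana.ini, note that Grafana lets any option be overridden with a `GF_<SECTION>_<KEY>` environment variable. A systemd drop-in (file path is an example) keeps the secret out of the config file:

```ini
# /etc/systemd/system/grafana-server.service.d/secrets.conf
[Service]
Environment="GF_SECURITY_ADMIN_PASSWORD=StrongAdminPassword123!"
```

Run `sudo systemctl daemon-reload && sudo systemctl restart grafana-server` for the override to take effect.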

Network Security

# Configure firewall
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --reload

# Restrict access by source (remove the broad 9090 rule first,
# otherwise it still allows connections from anywhere)
sudo firewall-cmd --permanent --remove-port=9090/tcp
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="9090" protocol="tcp" accept'
sudo firewall-cmd --reload

SSL/TLS Configuration

# Generate certificates
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
  -keyout prometheus.key -out prometheus.crt \
  -subj "/C=US/ST=State/L=City/O=Organization/CN=prometheus.example.com"

# Configure Nginx reverse proxy
sudo dnf install -y nginx

sudo nano /etc/nginx/conf.d/monitoring.conf
# Prometheus
server {
    listen 443 ssl http2;
    server_name prometheus.example.com;
    
    ssl_certificate /etc/nginx/ssl/prometheus.crt;
    ssl_certificate_key /etc/nginx/ssl/prometheus.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    
    location / {
        proxy_pass http://localhost:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}

# Grafana
server {
    listen 443 ssl http2;
    server_name grafana.example.com;
    
    ssl_certificate /etc/nginx/ssl/grafana.crt;
    ssl_certificate_key /etc/nginx/ssl/grafana.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Performance Optimization

Prometheus Optimization

# Storage and query tuning uses command-line flags, not prometheus.yml.
# Add these to the ExecStart line in /etc/systemd/system/prometheus.service:

# Retention settings
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=100GB

# WAL compression
--storage.tsdb.wal-compression

# Block durations (only change these for Thanos-style setups)
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=48h

# Concurrent queries
--query.max-concurrency=20

# Query timeout
--query.timeout=2m

# Max samples a single query can load into memory
--query.max-samples=50000000

# Lookback delta
--query.lookback-delta=5m

# Scrape optimization
scrape_configs:
  - job_name: 'optimized'
    # Increase scrape interval for less critical metrics
    scrape_interval: 60s
    
    # Reduce scrape timeout
    scrape_timeout: 10s
    
    # Limit sample size
    sample_limit: 10000
    
    # Limit label count
    label_limit: 30
    
    # Limit label name length
    label_name_length_limit: 200
    
    # Limit label value length
    label_value_length_limit: 200
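Beyond limits, dropping metrics you never query cuts ingestion cost at the source: `metric_relabel_configs` runs after the scrape and before storage. A sketch (the dropped metric name is illustrative):

```yaml
scrape_configs:
  - job_name: 'optimized'
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Drop verbose per-CPU guest metrics we never chart
      - source_labels: [__name__]
        regex: 'node_cpu_guest_seconds_total'
        action: drop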

Recording Rules for Performance

groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate expensive queries
      - record: job:node_cpu:avg_rate5m
        expr: avg by(job) (rate(node_cpu_seconds_total[5m]))
        
      - record: job:node_memory:usage_percentage
        expr: |
          100 * (1 - (
            sum by(job) (node_memory_MemAvailable_bytes)
            /
            sum by(job) (node_memory_MemTotal_bytes)
          ))
          
      - record: instance:node_filesystem:usage_percentage
        expr: |
          100 - (
            100 * node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}
          )
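Recording rules only take effect once the rule file is referenced from prometheus.yml and the server is reloaded. Assuming the rules live under /etc/prometheus/rules/:

```yaml
rule_files:
  - /etc/prometheus/rules/*.yml
```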

Grafana Performance

# Database optimization
[database]
max_open_conn = 100
max_idle_conn = 100
conn_max_lifetime = 14400

# Data source query caching (Grafana Enterprise / Cloud feature)
[caching]
enabled = true

# Data proxy
[dataproxy]
timeout = 30
keep_alive_seconds = 30
tls_handshake_timeout_seconds = 10
expect_continue_timeout_seconds = 1
max_idle_connections = 100
idle_conn_timeout_seconds = 90

# Rendering
[rendering]
concurrent_render_limit = 5

# Query caching
[feature_toggles]
enable = queryCaching

Query Optimization Tips

  1. Use Recording Rules

    • Pre-calculate expensive queries
    • Aggregate data at collection time
    • Reduce query complexity
  2. Optimize PromQL

    # These two forms are equivalent; the second is the preferred
    # style, with the grouping clause next to the operator
    avg(rate(http_requests_total[5m])) by (job)
    avg by (job) (rate(http_requests_total[5m]))
    
    # Prefer exact matchers over regex matchers where possible
    http_requests_total{job="api"}      # cheap index lookup
    http_requests_total{job=~"api.*"}   # scans matching label values
  3. Limit Time Ranges

    • Use appropriate time ranges
    • Avoid querying old data unnecessarily
    • Use downsampling for historical data
  4. Index Labels Properly

    • Keep cardinality in check
    • Use meaningful label names
    • Avoid high-cardinality labels
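Cardinality multiplies: a metric's worst-case series count is the product of its label-value counts, which is why a single high-cardinality label (user IDs, raw request paths) can overwhelm the TSDB. A quick back-of-the-envelope check with made-up counts:

```shell
# 200 instances x 5 status codes x 7 methods = 7000 potential series
instances=200; status_codes=5; methods=7
echo $((instances * status_codes * methods))
```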

High Availability Setup

Prometheus HA Configuration

# prometheus-1.yml
global:
  scrape_interval: 15s
  external_labels:
    replica: '1'
    cluster: 'prod'

# prometheus-2.yml  
global:
  scrape_interval: 15s
  external_labels:
    replica: '2'
    cluster: 'prod'

Using Thanos for HA

# Install Thanos
THANOS_VERSION="0.32.0"
wget https://github.com/thanos-io/thanos/releases/download/v${THANOS_VERSION}/thanos-${THANOS_VERSION}.linux-amd64.tar.gz
tar xvf thanos-${THANOS_VERSION}.linux-amd64.tar.gz
sudo cp thanos-${THANOS_VERSION}.linux-amd64/thanos /usr/local/bin/

# Configure Thanos Sidecar
sudo nano /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=Thanos Sidecar
After=prometheus.service

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos sidecar \
  --tsdb.path=/var/lib/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

[Install]
WantedBy=multi-user.target
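To get a single deduplicated view over both replicas, a Thanos Query component fans out to the sidecars and collapses series that differ only in the replica label. A sketch of a unit file (hostnames and ports are assumptions):

```ini
# /etc/systemd/system/thanos-query.service
[Unit]
Description=Thanos Query
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos query \
  --http-address=0.0.0.0:10904 \
  --grpc-address=0.0.0.0:10903 \
  --endpoint=prometheus-1.example.com:10901 \
  --endpoint=prometheus-2.example.com:10901 \
  --query.replica-label=replica

[Install]
WantedBy=multi-user.target
```

Grafana then points at the Thanos Query HTTP address instead of an individual Prometheus replica.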

Grafana HA with Database

# Use external database for HA
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = SecurePassword123!
ssl_mode = require
ca_cert_path = /etc/grafana/ca.pem

# [session] applies only to older Grafana releases; recent versions
# use token-based auth and the [remote_cache] section below
[session]
provider = postgres
provider_config = user=grafana password=SecurePassword123! host=postgres.example.com port=5432 dbname=grafana sslmode=require

[remote_cache]
type = redis
connstr = redis.example.com:6379

Troubleshooting

Common Prometheus Issues

# Check Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml

# Check rule files
promtool check rules /etc/prometheus/rules/*.yml

# Test service discovery
curl http://localhost:9090/api/v1/targets

# Check metrics ingestion
curl http://localhost:9090/api/v1/query?query=up

# Debug scraping issues
curl http://localhost:9090/api/v1/targets/metadata

# Check storage
du -sh /var/lib/prometheus/*

# Analyze cardinality (count distinct metric names)
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'

# Top metrics by series count
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

Common Grafana Issues

# Check Grafana logs
sudo journalctl -u grafana-server -f

# Test data source
curl -u admin:password http://localhost:3000/api/datasources

# Check plugin installation
grafana-cli plugins ls

# Re-encrypt data source secrets after changing secret_key
grafana-cli admin data-migration encrypt-datasource-passwords

# Reset admin password
grafana-cli admin reset-admin-password newpassword

Performance Diagnostics

# Prometheus metrics about itself
curl http://localhost:9090/metrics | grep prometheus_

# Key metrics to check:
# - prometheus_tsdb_head_series
# - prometheus_tsdb_head_samples_appended_total
# - prometheus_tsdb_symbol_table_size_bytes
# - prometheus_tsdb_head_chunks
# - prometheus_engine_query_duration_seconds
# - prometheus_http_request_duration_seconds

# Grafana performance metrics
curl http://localhost:3000/metrics | grep grafana_

# System resource usage
top -p $(pgrep prometheus)
top -p $(pgrep grafana)

Best Practices

Monitoring Best Practices

  1. Label Management

    • Keep label cardinality under control
    • Use consistent label naming
    • Avoid dynamic label values
    • Document label meanings
  2. Query Optimization

    • Use recording rules for dashboards
    • Limit query time ranges
    • Avoid regex where possible
    • Cache frequently used queries
  3. Alert Design

    • Alert on symptoms, not causes
    • Include runbook links
    • Set appropriate thresholds
    • Test alerts regularly
  4. Dashboard Design

    • Group related metrics
    • Use consistent layouts
    • Include documentation
    • Version control dashboards

Operational Best Practices

  1. Backup Strategy

    # Snapshot Prometheus data (requires --web.enable-admin-api on the server;
    # archiving a live TSDB directory directly is not crash-consistent)
    curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
    tar -czf prometheus-backup-$(date +%Y%m%d).tar.gz /var/lib/prometheus/snapshots
    
    # Backup Grafana (stop grafana-server first for a consistent copy)
    cp /var/lib/grafana/grafana.db grafana-backup-$(date +%Y%m%d).db
    
    # Export dashboards through the HTTP API (/api/dashboards/uid/<uid>)
  2. Monitoring the Monitors

    • Monitor Prometheus with another instance
    • Set up alerts for monitoring stack
    • Track resource usage
    • Monitor scrape performance
  3. Capacity Planning

    • Monitor storage growth
    • Track cardinality increases
    • Plan for retention needs
    • Scale before hitting limits
  4. Documentation

    • Document architecture
    • Maintain runbooks
    • Record configuration decisions
    • Keep dashboard documentation

Security Best Practices

  1. Access Control

    • Use strong authentication
    • Implement RBAC
    • Audit access logs
    • Regular permission reviews
  2. Network Security

    • Use TLS everywhere
    • Restrict network access
    • Implement firewall rules
    • Use VPN for remote access
  3. Data Protection

    • Encrypt data at rest
    • Secure backups
    • Limit data retention
    • Anonymize sensitive data

Conclusion

Deploying Prometheus and Grafana on Rocky Linux provides a powerful, scalable monitoring solution for modern infrastructure. This guide has covered:

  • Complete installation and configuration
  • Integration and dashboard creation
  • Advanced features like service discovery
  • Security hardening and best practices
  • Performance optimization techniques
  • High availability configurations

Remember that monitoring is an iterative process. Start with basic metrics, gradually add more sophisticated monitoring, and continuously refine based on your needs.