AlmaLinux Monitoring Tools: Complete System Oversight Guide
Want to know exactly what's happening inside your AlmaLinux system? System monitoring is the key to maintaining healthy, performant servers and workstations! This comprehensive guide takes you from basic resource monitoring to professional-grade observability platforms. Whether you're tracking CPU usage or building complete monitoring dashboards, let's master the art of system oversight!
Why Is System Monitoring Essential?
Monitoring transforms reactive firefighting into proactive management! Here's why it's crucial:
- Early Problem Detection: Spot issues before users notice
- Performance Optimization: Identify and fix bottlenecks
- Resource Planning: Know when to upgrade hardware
- Security Monitoring: Detect suspicious activities
- Capacity Planning: Predict future resource needs
- Troubleshooting: Quickly diagnose system problems
- Compliance: Meet audit and reporting requirements
- Peace of Mind: Always know your system's health
Well-monitored systems tend to suffer far less unplanned downtime!
What You Need
Let's prepare for monitoring mastery!
- AlmaLinux system with root or sudo access
- Basic understanding of system resources
- Terminal access for command-line tools
- Network connectivity for remote monitoring
- 60 minutes to explore all monitoring tools
- Some test workloads to monitor
- Curiosity about system internals
- Excitement to see everything happening!
Let's unlock complete system visibility!
Step 1: Basic Command-Line Monitoring
Master essential monitoring commands!
System Load and Uptime:
# Check system uptime and load average:
uptime
# Output: 10:23:45 up 5 days, 3:15, 2 users, load average: 0.15, 0.12, 0.09
# Load average = 1-minute, 5-minute, 15-minute averages
# Detailed uptime information:
uptime -p # Pretty format: up 5 days, 3 hours, 15 minutes
uptime -s # System start time: 2025-09-12 07:08:30
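# Understanding load average:
# Load counts tasks that are running, waiting for a CPU, or blocked in uninterruptible (usually disk I/O) sleep
# Load of 1.0 = roughly one fully busy core on a single-core system
# Load of 4.0 = roughly all cores busy on a quad-core system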
# Check CPU count:
nproc # Number of processing units
# W command - who's logged in and what they're doing:
w
# Shows users, TTY, login time, idle time, JCPU, PCPU, and current command
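To make the per-core rule concrete, here is a minimal sketch that divides the 1-minute load by the CPU count. The helper name load_per_core is made up for this example, and it assumes /proc/loadavg is readable and nproc is installed.

```shell
#!/bin/bash
# load_per_core LOAD CORES: hypothetical helper that normalizes a load value by core count.
load_per_core() {
    local load="$1" cores="$2"
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# On a live system the first field of /proc/loadavg is the 1-minute load.
if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    echo "Load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

A value near or above 1.00 suggests every core is busy. Because Linux load also includes tasks blocked on I/O, it is worth confirming with top or iostat before blaming the CPU.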
Memory Usage Monitoring:
# Free memory display:
free -h # Human-readable format
# Output:
# total used free shared buff/cache available
# Mem: 15Gi 3.2Gi 8.1Gi 245Mi 4.2Gi 11Gi
# Swap: 8.0Gi 0B 8.0Gi
# Continuous memory monitoring:
free -h -s 2 # Update every 2 seconds
# Detailed memory information:
cat /proc/meminfo | head -20
# Memory usage by process:
ps aux --sort=-%mem | head -10
# Show memory statistics:
vmstat 2 5 # Update every 2 seconds, 5 times
# Columns: r=running, b=blocked, swpd=swap used, free=free memory
# buff=buffers, cache=cache, si=swap in, so=swap out
# Memory pressure information:
cat /proc/pressure/memory
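The "available" column in free comes from MemAvailable in /proc/meminfo, and it is usually a better health signal than "free" because cache can be reclaimed. The sketch below turns it into a percentage; mem_available_pct is a hypothetical helper name, not a standard command.

```shell
#!/bin/bash
# mem_available_pct: hypothetical helper that reads meminfo-style "Key: value kB" lines
# on stdin and prints MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# On a live system:
if [ -r /proc/meminfo ]; then
    echo "Available memory: $(mem_available_pct < /proc/meminfo)%"
fi
```

Reading from stdin keeps the function easy to try against saved meminfo snapshots as well as the live file.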
CPU Usage Tracking:
# Real-time CPU usage:
top # Interactive process viewer
# Key commands in top:
# 1 - Show individual CPU cores
# M - Sort by memory usage
# P - Sort by CPU usage
# k - Kill process
# r - Renice process
# q - Quit
# CPU information:
lscpu # Detailed CPU architecture information
# Per-core CPU usage:
mpstat -P ALL 2 # All CPUs, update every 2 seconds
# Install if missing: sudo dnf install sysstat
# Process CPU usage:
ps aux --sort=-%cpu | head -10
# CPU frequency monitoring:
watch -n 1 "grep MHz /proc/cpuinfo"
# CPU temperature (if sensors available):
sudo dnf install lm_sensors # Provides sensors and sensors-detect
sudo sensors-detect # Run once to detect sensor chips
sensors # Show temperatures
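Tools like top and mpstat derive their percentages from the counters in /proc/stat. As a rough sketch of that calculation (cpu_busy_pct is a hypothetical helper, and idle time is taken here as idle plus iowait), you can sample the first line twice and compare:

```shell
#!/bin/bash
# cpu_busy_pct LINE1 LINE2: hypothetical helper. Each argument is the aggregate
# "cpu ..." line from /proc/stat, sampled some time apart.
cpu_busy_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, " "); split(b, y, " ")
        # Fields 2..9: user nice system idle iowait irq softirq steal
        for (i = 2; i <= 9; i++) { t1 += x[i]; t2 += y[i] }
        idle1 = x[5] + x[6]; idle2 = y[5] + y[6]
        dt = t2 - t1
        if (dt > 0) printf "%.1f\n", (1 - (idle2 - idle1) / dt) * 100
    }'
}

# On a live system: two samples one second apart.
if [ -r /proc/stat ]; then
    s1=$(head -n 1 /proc/stat); sleep 1; s2=$(head -n 1 /proc/stat)
    echo "CPU busy: $(cpu_busy_pct "$s1" "$s2")%"
fi
```

Working from raw counters avoids parsing the top header, whose column positions shift when a value such as 100.0 fills its field.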
Disk Usage and I/O:
# Disk space usage:
df -h # Human-readable disk usage
df -i # Inode usage
# Directory disk usage:
du -sh /var/* # Size of directories in /var
du -h --max-depth=1 / | sort -hr # Sorted by size
# Disk I/O statistics:
iostat -x 2 # Extended stats, update every 2 seconds (part of sysstat)
# Real-time I/O monitoring:
sudo iotop # Interactive I/O monitor (needs root)
# Install: sudo dnf install iotop
# Disk I/O by process:
sudo iotop -o # Only show processes doing I/O
# Check disk health:
sudo smartctl -a /dev/sda # Install: sudo dnf install smartmontools
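Once you know the df columns, a threshold check is only a few lines. This sketch assumes the POSIX layout from df -P (capacity in column 5, mount point in column 6) and mount points without spaces; disks_over is a hypothetical helper name.

```shell
#!/bin/bash
# disks_over LIMIT: hypothetical helper that reads `df -P` output on stdin and prints
# every mount point whose usage is above LIMIT percent.
disks_over() {
    local limit="$1"
    awk -v limit="$limit" 'NR > 1 {
        use = $5; sub(/%/, "", use)
        if (use + 0 > limit) print $6, use "%"
    }'
}

# On a live system (the -x filters are GNU df options that skip RAM-backed filesystems):
df -P -x tmpfs -x devtmpfs 2>/dev/null | disks_over 85
```

No output means every filesystem is under the limit, which makes the function easy to drop into a cron job.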
Perfect! Basic monitoring commands mastered!
Step 2: Advanced Monitoring Tools
Deploy professional monitoring utilities!
htop - Enhanced Process Monitor:
# Install htop (packaged in EPEL, so enable that repository first):
sudo dnf install epel-release
sudo dnf install htop
# Launch htop:
htop
# htop features:
# - Color-coded resource bars
# - Tree view of processes (F5)
# - Search processes (F3)
# - Filter processes (F4)
# - Kill processes (F9)
# - Sort by columns (F6)
# - Setup/customize (F2)
# htop configuration (~/.config/htop/htoprc):
# Customize colors, meters, columns
# Useful htop shortcuts:
# H - Show/hide user threads
# K - Show/hide kernel threads
# F - Follow process
# Space - Tag process
# U - Untag all
# c - Tag process and its children
Glances - Comprehensive System Monitor:
# Install Glances:
sudo dnf install glances
# Basic usage:
glances
# Glances modes:
glances -w # Web server mode (http://localhost:61208)
glances -1 # Show all CPU cores
glances -2 # Disable left sidebar
glances -3 # Disable quick look
glances -4 # Full quick look: show only quick look and load
# Export to file:
glances --export csv --export-csv-file /tmp/glances.csv
glances --export json --export-json-file /tmp/glances.json
# Monitor remote system:
# On server:
glances -s # Server mode
# On client:
glances -c server_ip
# Glances with Docker monitoring:
glances --enable-plugin docker
# Configuration file (~/.config/glances/glances.conf):
mkdir -p ~/.config/glances # The directory may not exist yet
cat > ~/.config/glances/glances.conf << 'EOF'
[cpu]
user_careful=50
user_warning=70
user_critical=90
[mem]
careful=50
warning=70
critical=90
[network]
hide=lo,docker.*
EOF
nmon - Performance Monitor:
# Install nmon:
sudo dnf install nmon
# Launch nmon:
nmon
# nmon interactive commands:
# c - CPU statistics
# m - Memory statistics
# d - Disk statistics
# n - Network statistics
# t - Top processes
# h - Help menu
# Capture data for analysis:
nmon -f -t -s 5 -c 120
# -f: Spreadsheet output
# -t: Include top processes
# -s 5: 5-second intervals
# -c 120: 120 snapshots (10 minutes)
# Analyze with nmonchart:
# Creates HTML reports from nmon data
atop - Advanced System Monitor:
# Install atop:
sudo dnf install atop
# Start atop:
atop
# atop features:
# - Process accounting
# - Historical data
# - Disk I/O per process
# - Network activity per process
# View historical data:
atop -r /var/log/atop/atop_20250917
# atop shortcuts:
# g - Generic info
# m - Memory info
# d - Disk info
# n - Network info
# c - Full command lines
# v - Various process info
# Configure atop logging (/etc/sysconfig/atop):
# LOGINTERVAL is the sampling interval in seconds; LOGGENERATIONS is the number of daily logs to keep
LOGINTERVAL=60
LOGGENERATIONS=28
Amazing! Advanced monitoring tools deployed!
Step 3: Network and Service Monitoring
Monitor network traffic and services!
Network Monitoring Tools:
# Install network monitoring tools (iftop and nethogs are packaged in EPEL):
sudo dnf install epel-release
sudo dnf install net-tools iptraf-ng nethogs iftop
# Monitor network connections:
ss -tuln # Show listening ports
ss -tan # Show all TCP connections
netstat -tuln # Legacy alternative
# Real-time bandwidth monitoring:
sudo iftop -i eth0 # Monitor a specific interface (list names with: ip -br link)
# iftop commands:
# p - Toggle port display
# n - Toggle DNS resolution
# s/d - Toggle source/destination
# 1/2/3 - Sort by different columns
# Bandwidth usage by process:
sudo nethogs eth0
# Shows which processes are using bandwidth
# Detailed network statistics:
ip -s link # Interface statistics
nstat # Network statistics
sar -n DEV 2 # Network device statistics
# Monitor specific port:
sudo tcpdump -i eth0 port 80
sudo tcpdump -i any host 192.168.1.100
# Network performance testing:
iperf3 -s # Server mode
iperf3 -c server_ip # Client mode
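The connection list from ss becomes more useful once it is summarized. As a small sketch (tcp_state_counts is a hypothetical helper), counting connections per TCP state quickly shows things like a pile-up of TIME-WAIT or CLOSE-WAIT sockets:

```shell
#!/bin/bash
# tcp_state_counts: hypothetical helper that reads `ss -tan` output on stdin
# (first column is the TCP state) and prints one "STATE count" line per state.
tcp_state_counts() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# On a live system:
if command -v ss >/dev/null 2>&1; then
    ss -tan | tcp_state_counts
fi
```

Run it a few times during an incident; a state whose count keeps climbing is usually the one to investigate.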
Service Monitoring:
# Systemd service monitoring:
systemctl status
systemctl list-units --failed
systemctl list-units --state=running
# Monitor service logs:
journalctl -u nginx -f # Follow nginx logs
journalctl -p err -b # Show errors since boot
# Service resource usage:
systemd-cgtop # Top-like view for systemd services
# Monitor specific service:
systemctl show nginx --property=MainPID,MemoryCurrent,CPUUsageNSec
# Create service monitor script:
cat > ~/monitor-services.sh << 'EOF'
#!/bin/bash
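# Adjust to the units on your system (on AlmaLinux, MySQL runs as mysqld and MariaDB as mariadb)
SERVICES="nginx mysqld sshd firewalld"
echo "=== Service Status Check ==="
for service in $SERVICES; do
    if systemctl is-active --quiet "$service"; then
        echo "✅ $service: Running"
        mem=$(systemctl show "$service" --property=MemoryCurrent --value)
        # MemoryCurrent reads "[not set]" when memory accounting is off
        [[ "$mem" =~ ^[0-9]+$ ]] && echo " Memory: $(numfmt --to=iec "$mem")"
    else
        echo "❌ $service: Not running"
    fi
done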
EOF
chmod +x ~/monitor-services.sh
Application Performance Monitoring:
# Monitor web server:
# Apache status module (mod_status ships with the httpd package, so there is nothing extra to install):
# Enable in Apache config:
# ExtendedStatus On
# <Location /server-status>
# SetHandler server-status
# Require local
# </Location>
# Nginx status:
# Add to nginx.conf:
# location /nginx_status {
# stub_status;
# allow 127.0.0.1;
# deny all;
# }
# Monitor with curl:
curl http://localhost/server-status
curl http://localhost/nginx_status
# MySQL monitoring:
mysql -e "SHOW STATUS LIKE 'Threads_connected'"
mysql -e "SHOW PROCESSLIST"
# Redis monitoring:
redis-cli INFO
redis-cli MONITOR # Real-time command monitoring
# Monitor application logs:
tail -f /var/log/nginx/access.log | grep --line-buffered -E '" 5[0-9]{2} ' # 5xx server errors (status field only)
Excellent! Network and service monitoring ready!
Step 4: Enterprise Monitoring Solutions
Deploy production-grade monitoring systems!
Prometheus and Grafana Setup:
# Install Prometheus:
sudo useradd --no-create-home --shell /bin/false prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar -xvf prometheus-2.40.0.linux-amd64.tar.gz
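sudo mv prometheus-2.40.0.linux-amd64 /opt/prometheus
sudo chown -R prometheus:prometheus /opt/prometheus # The service user must be able to write /opt/prometheus/data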
# Configure Prometheus:
sudo tee /opt/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF
# Create systemd service:
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
--config.file=/opt/prometheus/prometheus.yml \
--storage.tsdb.path=/opt/prometheus/data
[Install]
WantedBy=multi-user.target
EOF
# Install Node Exporter:
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar -xvf node_exporter-1.5.0.linux-amd64.tar.gz
sudo mv node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
# Create Node Exporter service:
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Start services:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus node_exporter
# Install Grafana:
sudo dnf install grafana
sudo systemctl enable --now grafana-server
# Access:
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin)
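Node Exporter can also publish values produced by your own scripts through its textfile collector, which reads *.prom files from the directory passed with --collector.textfile.directory. The sketch below is one way to write such a file safely. The helper names and the /var/lib/node_exporter/textfile path are assumptions for illustration, and the flag has to be added to the ExecStart line shown above before anything is collected.

```shell
#!/bin/bash
# format_metric NAME VALUE [LABELS]: hypothetical helper that prints one sample
# in the Prometheus text exposition format.
format_metric() {
    local name="$1" value="$2" labels="$3"
    if [ -n "$labels" ]; then
        printf '%s{%s} %s\n' "$name" "$labels" "$value"
    else
        printf '%s %s\n' "$name" "$value"
    fi
}

# write_textfile FILE: hypothetical helper that writes stdin to FILE via a temp file
# and a rename, so the collector never reads a half-written file.
write_textfile() {
    local target="$1" tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$target"
}

# Example (directory is an assumption; it must match --collector.textfile.directory):
#   format_metric backup_last_success_timestamp_seconds "$(date +%s)" 'job="nightly"' \
#       | write_textfile /var/lib/node_exporter/textfile/backup.prom
```

The rename step matters because Node Exporter may scrape the directory at any moment, and a partially written file would produce parse errors.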
Zabbix Monitoring Platform:
# Install Zabbix repository:
sudo rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/9/x86_64/zabbix-release-6.0-4.el9.noarch.rpm
# Install Zabbix server, frontend, agent:
sudo dnf install zabbix-server-mysql zabbix-web-mysql zabbix-apache-conf zabbix-sql-scripts zabbix-agent
# Configure database:
mysql -u root -p << 'EOF'
CREATE DATABASE zabbix CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
CREATE USER 'zabbix'@'localhost' IDENTIFIED BY 'zabbix_password';
GRANT ALL PRIVILEGES ON zabbix.* TO 'zabbix'@'localhost';
FLUSH PRIVILEGES;
EXIT
EOF
# Import initial schema:
zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql -u zabbix -p zabbix
# Configure Zabbix server:
sudo nano /etc/zabbix/zabbix_server.conf
# Set: DBPassword=zabbix_password
# Start services:
sudo systemctl restart zabbix-server zabbix-agent httpd php-fpm
sudo systemctl enable zabbix-server zabbix-agent httpd php-fpm
# Access: http://localhost/zabbix
Custom Monitoring Scripts:
# Create comprehensive monitoring script:
cat > ~/system-monitor.sh << 'EOF'
#!/bin/bash
# Configuration
# /var/log is writable only by root; point LOG_DIR under $HOME when running from a user crontab
LOG_DIR="/var/log/monitoring"
ALERT_EMAIL="admin@example.com" # Placeholder address: replace with your own
CPU_THRESHOLD=80
MEM_THRESHOLD=90
DISK_THRESHOLD=85
mkdir -p "$LOG_DIR"
# Remember how many alerts existed before this run so only new ones are mailed
ALERTS_BEFORE=0
[ -f "$LOG_DIR/alerts.log" ] && ALERTS_BEFORE=$(wc -l < "$LOG_DIR/alerts.log")
# Functions
check_cpu() {
    # 100 minus the idle column (15) of vmstat's second sample, which covers the last second
    CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
    if (( CPU_USAGE > CPU_THRESHOLD )); then
        echo "⚠️ HIGH CPU: ${CPU_USAGE}%"
        echo "$(date): CPU Alert - ${CPU_USAGE}%" >> "$LOG_DIR/alerts.log"
    else
        echo "✅ CPU: ${CPU_USAGE}%"
    fi
}
check_memory() {
    # Used memory as a whole-number percentage of total
    MEM_USAGE=$(free | awk '/^Mem:/ {printf "%.0f", $3 / $2 * 100}')
    if (( MEM_USAGE > MEM_THRESHOLD )); then
        echo "⚠️ HIGH MEMORY: ${MEM_USAGE}%"
        echo "$(date): Memory Alert - ${MEM_USAGE}%" >> "$LOG_DIR/alerts.log"
    else
        echo "✅ Memory: ${MEM_USAGE}%"
    fi
}
check_disk() {
    # df -P keeps every filesystem on one line; column 5 is usage, the rest is the mount point
    df -P | tail -n +2 | while read -r _ _ _ _ USAGE MOUNT; do
        USAGE=${USAGE%\%}
        [[ "$USAGE" =~ ^[0-9]+$ ]] || continue
        if [ "$USAGE" -gt "$DISK_THRESHOLD" ]; then
            echo "⚠️ HIGH DISK: $MOUNT at ${USAGE}%"
            echo "$(date): Disk Alert - $MOUNT at ${USAGE}%" >> "$LOG_DIR/alerts.log"
        fi
    done
}
check_services() {
    # Adjust to your units (MySQL runs as mysqld and MariaDB as mariadb on AlmaLinux)
    SERVICES="nginx mysqld sshd"
    for service in $SERVICES; do
        if ! systemctl is-active --quiet "$service"; then
            echo "❌ Service Down: $service"
            echo "$(date): Service Alert - $service is down" >> "$LOG_DIR/alerts.log"
        else
            echo "✅ Service Up: $service"
        fi
    done
}
# Main monitoring loop
echo "=== System Monitor Report - $(date) ==="
check_cpu
check_memory
check_disk
check_services
# Mail only the alerts written during this run
# (the mail command comes from the s-nail package on AlmaLinux 9, mailx on AlmaLinux 8)
if [ -f "$LOG_DIR/alerts.log" ]; then
    ALERTS_AFTER=$(wc -l < "$LOG_DIR/alerts.log")
    if [ "$ALERTS_AFTER" -gt "$ALERTS_BEFORE" ]; then
        tail -n "$((ALERTS_AFTER - ALERTS_BEFORE))" "$LOG_DIR/alerts.log" | mail -s "System Alerts" "$ALERT_EMAIL"
    fi
fi
EOF
chmod +x ~/system-monitor.sh
# Schedule monitoring:
(crontab -l 2>/dev/null; echo "*/5 * * * * $HOME/system-monitor.sh") | crontab -
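Threshold checks like the ones in this script do not strictly need bc, which may be missing on minimal installs. Here is a small sketch that does the floating-point comparison in awk and maps a value onto a severity level. Both helper names are hypothetical.

```shell
#!/bin/bash
# exceeds VALUE LIMIT: hypothetical helper, succeeds when VALUE is numerically greater than LIMIT.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# classify VALUE WARN CRIT: hypothetical helper, prints OK, WARNING or CRITICAL.
classify() {
    local value="$1" warn="$2" crit="$3"
    if exceeds "$value" "$crit"; then
        echo CRITICAL
    elif exceeds "$value" "$warn"; then
        echo WARNING
    else
        echo OK
    fi
}

# Example: classify 91.5 70 90
```

Adding 0 to each operand forces awk to compare numbers rather than strings, so 9 is correctly treated as smaller than 10.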
Perfect! Enterprise monitoring deployed!
Quick Examples
Real-world monitoring scenarios!
Example 1: Performance Troubleshooting Dashboard
#!/bin/bash
# Interactive performance troubleshooting dashboard
echo "Creating performance dashboard..."
# Create dashboard script
cat > ~/perf-dashboard.sh << 'EOF'
#!/bin/bash
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Functions
show_header() {
clear
echo "═══════════════════════════════════════════════════════"
echo " AlmaLinux Performance Dashboard - $(date +%H:%M:%S)"
echo "═══════════════════════════════════════════════════════"
}
show_cpu() {
echo -e "\n${GREEN}▶ CPU Usage:${NC}"
top -bn1 | head -5
echo ""
mpstat 1 1 | tail -2
}
show_memory() {
echo -e "\n${GREEN}▶ Memory Usage:${NC}"
free -h
echo ""
echo "Top Memory Consumers:"
ps aux --sort=-%mem | head -5 | awk '{printf " %-10s %6s %s\n", $1, $4"%", $11}'
}
show_disk() {
echo -e "\n${GREEN}▶ Disk Usage:${NC}"
df -h | grep -v tmpfs
echo ""
echo "Disk I/O:"
iostat -x 1 2 | tail -n +4
}
show_network() {
echo -e "\n${GREEN}▶ Network Activity:${NC}"
ip -s link | awk '/^[0-9]/{print $2} /RX:/{getline; print " RX: "$1" bytes"} /TX:/{getline; print " TX: "$1" bytes"}'
echo ""
echo "Active Connections:"
ss -tun | tail -5
}
show_processes() {
echo -e "\n${GREEN}▶ Top Processes:${NC}"
ps aux --sort=-%cpu | head -10 | awk '{printf " %-10s %6s %6s %s\n", $1, $3"%", $4"%", $11}'
}
check_issues() {
    echo -e "\n${YELLOW}▶ Potential Issues:${NC}"
    # Check CPU (column 15 of vmstat is idle time; the second sample covers the last second)
    CPU_IDLE=$(vmstat 1 2 | tail -1 | awk '{print $15}')
    if (( CPU_IDLE < 20 )); then
        echo -e " ${RED}❌ High CPU usage detected${NC}"
    fi
    # Check Memory
    MEM_AVAIL=$(free | awk '/^Mem:/ {print $7/$2 * 100.0}')
    if (( $(echo "$MEM_AVAIL < 20" | bc -l) )); then
        echo -e " ${RED}❌ Low memory available${NC}"
    fi
    # Check Swap
    SWAP_USED=$(free | awk '/^Swap:/ {if($2>0) print $3/$2 * 100.0; else print 0}')
    if (( $(echo "$SWAP_USED > 50" | bc -l) )); then
        echo -e " ${RED}❌ High swap usage${NC}"
    fi
    # Check Load (first field of /proc/loadavg is the 1-minute average)
    read -r LOAD _ < /proc/loadavg
    CORES=$(nproc)
    if (( $(echo "$LOAD > $CORES" | bc -l) )); then
        echo -e " ${RED}❌ System load above CPU count${NC}"
    fi
}
# Main loop
while true; do
show_header
show_cpu
show_memory
show_disk
show_network
show_processes
check_issues
echo -e "\n${GREEN}Press [Enter] to refresh, [Q] to quit${NC}"
read -t 5 -n 1 key
if [[ $key = "q" ]] || [[ $key = "Q" ]]; then
break
fi
done
EOF
chmod +x ~/perf-dashboard.sh
echo "Dashboard created! Run with: ~/perf-dashboard.sh"
Example 2: Automated Alert System
#!/bin/bash
# Comprehensive alerting system
echo "Setting up automated alert system..."
# Create alert monitor
cat > ~/alert-monitor.sh << 'EOF'
#!/bin/bash
# Configuration
ALERT_LOG="/var/log/system-alerts.log"
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
EMAIL="admin@example.com" # Placeholder address: replace with your own
# Thresholds
CPU_CRITICAL=90
CPU_WARNING=70
MEM_CRITICAL=95
MEM_WARNING=85
DISK_CRITICAL=95
DISK_WARNING=85
LOAD_MULTIPLIER=2
# Functions
send_alert() {
local severity=$1
local component=$2
local message=$3
local value=$4
# Log alert
echo "$(date '+%Y-%m-%d %H:%M:%S') [$severity] $component: $message (Value: $value)" >> "$ALERT_LOG"
# Send to Slack (if configured)
if [ -n "$WEBHOOK_URL" ]; then
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"🚨 [$severity] $component: $message (Value: $value)\"}" \
"$WEBHOOK_URL" >/dev/null 2>&1
fi
# Send email for critical alerts
if [ "$severity" = "CRITICAL" ]; then
echo "$message (Value: $value)" | mail -s "[$severity] System Alert: $component" "$EMAIL"
fi
}
check_cpu() {
local cpu_usage=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}') # 100 minus idle, from vmstat's second (1-second) sample
if (( $(echo "$cpu_usage > $CPU_CRITICAL" | bc -l) )); then
send_alert "CRITICAL" "CPU" "CPU usage critically high" "${cpu_usage}%"
elif (( $(echo "$cpu_usage > $CPU_WARNING" | bc -l) )); then
send_alert "WARNING" "CPU" "CPU usage high" "${cpu_usage}%"
fi
}
check_memory() {
local mem_usage=$(free | grep Mem | awk '{print ($3/$2) * 100.0}')
if (( $(echo "$mem_usage > $MEM_CRITICAL" | bc -l) )); then
send_alert "CRITICAL" "MEMORY" "Memory usage critically high" "${mem_usage}%"
elif (( $(echo "$mem_usage > $MEM_WARNING" | bc -l) )); then
send_alert "WARNING" "MEMORY" "Memory usage high" "${mem_usage}%"
fi
}
check_disk() {
df -h | tail -n +2 | while read line; do
local usage=$(echo $line | awk '{print $5}' | sed 's/%//')
local mount=$(echo $line | awk '{print $6}')
if [ "$usage" -gt "$DISK_CRITICAL" ]; then
send_alert "CRITICAL" "DISK" "Disk space critically low on $mount" "${usage}%"
elif [ "$usage" -gt "$DISK_WARNING" ]; then
send_alert "WARNING" "DISK" "Disk space low on $mount" "${usage}%"
fi
done
}
check_load() {
local load=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | xargs)
local cores=$(nproc)
local threshold=$(echo "$cores * $LOAD_MULTIPLIER" | bc)
if (( $(echo "$load > $threshold" | bc -l) )); then
send_alert "WARNING" "LOAD" "System load high" "$load (${cores} cores)"
fi
}
check_services() {
local critical_services="sshd firewalld"
local important_services="nginx mysql redis"
for service in $critical_services; do
if ! systemctl is-active --quiet $service; then
send_alert "CRITICAL" "SERVICE" "$service is down" "stopped"
fi
done
for service in $important_services; do
if ! systemctl is-active --quiet $service 2>/dev/null; then
send_alert "WARNING" "SERVICE" "$service is down" "stopped"
fi
done
}
check_network() {
# Check network interfaces
for interface in $(ip -o link show | awk -F': ' '{print $2}' | cut -d@ -f1 | grep -vx lo); do
if ! ip link show $interface | grep -q "state UP"; then
send_alert "WARNING" "NETWORK" "Interface $interface is down" "DOWN"
fi
done
# Check connectivity
if ! ping -c 1 8.8.8.8 &>/dev/null; then
send_alert "CRITICAL" "NETWORK" "No internet connectivity" "FAILED"
fi
}
# Main monitoring
check_cpu
check_memory
check_disk
check_load
check_services
check_network
# Cleanup old alerts (keep 30 days)
find /var/log -name "system-alerts.log.*" -mtime +30 -delete
EOF
chmod +x ~/alert-monitor.sh
# Schedule monitoring
# Append to the existing crontab instead of replacing it
(crontab -l 2>/dev/null; echo "*/5 * * * * $HOME/alert-monitor.sh") | crontab -
echo "Alert system configured!"
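A check that runs every five minutes will repeat the same alert until the problem is fixed. One common remedy is a cooldown per alert key. The sketch below is an assumption about how you might add that, not part of the script above: should_alert and the /tmp/alert-state default directory are made-up names.

```shell
#!/bin/bash
# should_alert KEY COOLDOWN_SECONDS [STATE_DIR]: hypothetical helper.
# Succeeds (and records the time) only if KEY has not alerted within the cooldown.
should_alert() {
    local key="$1" cooldown="$2" dir="${3:-/tmp/alert-state}"
    local stamp="$dir/$key.last" now last
    mkdir -p "$dir" || return 1
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null || echo 0)
    if [ $((now - last)) -ge "$cooldown" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}

# Example: notify about CPU at most once per hour
#   should_alert cpu 3600 && send_alert "WARNING" "CPU" "CPU usage high" "87%"
```

Keeping one timestamp file per key means a disk alert is not silenced by a recent CPU alert.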
Example 3: Container Monitoring Setup
#!/bin/bash
# Docker/Podman container monitoring
echo "Setting up container monitoring..."
# Create container monitor
cat > ~/container-monitor.sh << 'EOF'
#!/bin/bash
# Detect container runtime
if command -v docker &>/dev/null; then
RUNTIME="docker"
elif command -v podman &>/dev/null; then
RUNTIME="podman"
else
echo "No container runtime found"
exit 1
fi
# Container monitoring dashboard
show_containers() {
echo "=== Container Status ==="
$RUNTIME ps --format "table {{.Names}}\t{{.Status}}\t{{.Size}}"
}
show_container_stats() {
echo -e "\n=== Container Resource Usage ==="
$RUNTIME stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
}
show_container_logs() {
echo -e "\n=== Recent Container Logs ==="
for container in $($RUNTIME ps -q); do
name=$($RUNTIME inspect -f '{{.Name}}' $container | sed 's/^\/*//')
echo "▶ $name:"
$RUNTIME logs --tail 5 $container 2>&1 | sed 's/^/ /'
echo ""
done
}
check_container_health() {
echo -e "\n=== Container Health Checks ==="
for container in $($RUNTIME ps -q); do
name=$($RUNTIME inspect -f '{{.Name}}' $container | sed 's/^\/*//')
health=$($RUNTIME inspect -f '{{.State.Health.Status}}' $container 2>/dev/null)
if [ "$health" = "healthy" ]; then
echo "✅ $name: Healthy"
elif [ "$health" = "unhealthy" ]; then
echo "❌ $name: Unhealthy"
else
echo "⚠️ $name: No health check"
fi
done
}
monitor_container_resources() {
echo -e "\n=== Detailed Resource Monitoring ==="
for container in $($RUNTIME ps -q); do
name=$($RUNTIME inspect -f '{{.Name}}' $container | sed 's/^\/*//')
echo "Container: $name"
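# Get cgroup stats (cgroup v2 files first, the default on AlmaLinux 9; cgroup v1 files as fallback)
if [ "$RUNTIME" = "docker" ]; then
# CPU usage: usage_usec (microseconds) on cgroup v2, nanoseconds on cgroup v1
cpu_usage=$($RUNTIME exec $container sh -c 'grep usage_usec /sys/fs/cgroup/cpu.stat 2>/dev/null || cat /sys/fs/cgroup/cpuacct/cpuacct.usage' 2>/dev/null)
echo " CPU time: $cpu_usage"
# Memory usage
mem_usage=$($RUNTIME exec $container sh -c 'cat /sys/fs/cgroup/memory.current 2>/dev/null || cat /sys/fs/cgroup/memory/memory.usage_in_bytes' 2>/dev/null)
echo " Memory bytes: $mem_usage"
fi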
# Process count
proc_count=$($RUNTIME exec $container ps aux 2>/dev/null | wc -l)
echo " Processes: $proc_count"
# Network connections
net_conn=$($RUNTIME exec $container netstat -tan 2>/dev/null | grep ESTABLISHED | wc -l)
echo " Network connections: $net_conn"
echo ""
done
}
# Prometheus metrics exporter
export_metrics() {
echo -e "\n=== Prometheus Metrics ==="
for container in $($RUNTIME ps -q); do
name=$($RUNTIME inspect -f '{{.Name}}' $container | sed 's/^\/*//')
stats=$($RUNTIME stats --no-stream --format "{{json .}}" $container)
cpu=$(echo $stats | jq -r '.CPUPerc' | sed 's/%//')
mem=$(echo $stats | jq -r '.MemPerc' | sed 's/%//')
echo "container_cpu_usage_percent{name=\"$name\"} $cpu"
echo "container_memory_usage_percent{name=\"$name\"} $mem"
done
}
# Main monitoring loop
while true; do
clear
echo "Container Monitoring Dashboard - $(date)"
echo "========================================"
show_containers
show_container_stats
check_container_health
monitor_container_resources
if [ "$1" = "--export" ]; then
export_metrics > /tmp/container_metrics.prom
fi
echo -e "\nPress Ctrl+C to exit"
sleep 5
done
EOF
chmod +x ~/container-monitor.sh
# Create container alert script
cat > ~/container-alerts.sh << 'EOF'
#!/bin/bash
RUNTIME=$(command -v docker || command -v podman)
ALERT_LOG="/var/log/container-alerts.log"
# Check for stopped containers
for container in $($RUNTIME ps -a -q); do
status=$($RUNTIME inspect -f '{{.State.Status}}' $container)
name=$($RUNTIME inspect -f '{{.Name}}' $container | sed 's/^\/*//')
if [ "$status" != "running" ]; then
echo "$(date): Container $name is $status" >> "$ALERT_LOG"
# Try to restart
$RUNTIME start $container
if [ $? -eq 0 ]; then
echo "$(date): Successfully restarted $name" >> "$ALERT_LOG"
else
echo "$(date): Failed to restart $name" >> "$ALERT_LOG"
fi
fi
done
# Check container resource usage
$RUNTIME stats --no-stream --format "{{json .}}" | while read stats; do
name=$(echo $stats | jq -r '.Name')
cpu=$(echo $stats | jq -r '.CPUPerc' | sed 's/%//')
mem=$(echo $stats | jq -r '.MemPerc' | sed 's/%//')
if (( $(echo "$cpu > 80" | bc -l) )); then
echo "$(date): High CPU usage in $name: ${cpu}%" >> "$ALERT_LOG"
fi
if (( $(echo "$mem > 80" | bc -l) )); then
echo "$(date): High memory usage in $name: ${mem}%" >> "$ALERT_LOG"
fi
done
EOF
chmod +x ~/container-alerts.sh
echo "Container monitoring configured!"
echo "Run dashboard: ~/container-monitor.sh"
echo "Schedule alerts: */5 * * * * ~/container-alerts.sh"
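On cgroup v2 hosts the CPU counter lives in cpu.stat as usage_usec, so a percentage needs two samples. The following sketch shows the arithmetic. The helper names and the container name "web" are placeholders, and it assumes the container image provides cat.

```shell
#!/bin/bash
# cgroup_usage_usec: hypothetical helper that extracts usage_usec from cgroup v2 cpu.stat text on stdin.
cgroup_usage_usec() {
    awk '$1 == "usage_usec" { print $2 }'
}

# cgroup_cpu_pct START_USEC END_USEC ELAPSED_SECONDS: hypothetical helper,
# CPU time consumed as a percentage of one core.
cgroup_cpu_pct() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / (s * 1000000) * 100 }'
}

# Example with a container named "web" (placeholder):
#   u1=$(podman exec web cat /sys/fs/cgroup/cpu.stat | cgroup_usage_usec)
#   sleep 5
#   u2=$(podman exec web cat /sys/fs/cgroup/cpu.stat | cgroup_usage_usec)
#   cgroup_cpu_pct "$u1" "$u2" 5
```

A result above 100 simply means the container used more than one core during the interval.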
Example 4: Database Performance Monitoring
#!/bin/bash
# Database monitoring for MySQL/MariaDB and PostgreSQL
echo "Setting up database monitoring..."
# MySQL/MariaDB monitoring
cat > ~/mysql-monitor.sh << 'EOF'
#!/bin/bash
DB_USER="monitor"
DB_PASS="monitor_password"
ALERT_LOG="/var/log/mysql-monitor.log"
# One-time setup (run manually as a MySQL admin, not on every monitoring run):
# mysql -u root -p -e "CREATE USER IF NOT EXISTS 'monitor'@'localhost' IDENTIFIED BY 'monitor_password'; GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'monitor'@'localhost';"
# Monitor function
monitor_mysql() {
echo "=== MySQL Monitoring Report - $(date) ==="
# Connection statistics
echo -e "\n▶ Connection Statistics:"
mysql -u $DB_USER -p$DB_PASS -e "SHOW STATUS WHERE Variable_name IN ('Threads_connected','Max_used_connections','Aborted_connects');" 2>/dev/null
# Query performance
echo -e "\n▶ Query Performance:"
mysql -u $DB_USER -p$DB_PASS -e "SHOW STATUS WHERE Variable_name IN ('Slow_queries','Questions','Queries');" 2>/dev/null
# InnoDB statistics
echo -e "\n▶ InnoDB Buffer Pool:"
mysql -u $DB_USER -p$DB_PASS -e "SHOW STATUS WHERE Variable_name LIKE 'Innodb_buffer_pool%';" 2>/dev/null | head -10
# Current processes
echo -e "\n▶ Active Processes:"
mysql -u $DB_USER -p$DB_PASS -e "SHOW PROCESSLIST;" 2>/dev/null
# Table sizes
echo -e "\n▶ Largest Tables:"
mysql -u $DB_USER -p$DB_PASS << SQL 2>/dev/null
SELECT
table_schema AS 'Database',
table_name AS 'Table',
ROUND((data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.TABLES
ORDER BY (data_length + index_length) DESC
LIMIT 10;
SQL
# Slow query log
echo -e "\n▶ Recent Slow Queries:"
if [ -f /var/log/mysql/slow.log ]; then
tail -10 /var/log/mysql/slow.log
fi
}
# Performance metrics
check_performance() {
# Check slow queries
SLOW_QUERIES=$(mysql -u $DB_USER -p$DB_PASS -se "SHOW STATUS LIKE 'Slow_queries';" | awk '{print $2}')
if [ "$SLOW_QUERIES" -gt 100 ]; then
echo "$(date): WARNING - High number of slow queries: $SLOW_QUERIES" >> "$ALERT_LOG"
fi
# Check connections
CURRENT_CONN=$(mysql -u $DB_USER -p$DB_PASS -se "SHOW STATUS LIKE 'Threads_connected';" | awk '{print $2}')
MAX_CONN=$(mysql -u $DB_USER -p$DB_PASS -se "SHOW VARIABLES LIKE 'max_connections';" | awk '{print $2}')
USAGE=$(echo "scale=2; $CURRENT_CONN / $MAX_CONN * 100" | bc)
if (( $(echo "$USAGE > 80" | bc -l) )); then
echo "$(date): WARNING - High connection usage: ${USAGE}%" >> "$ALERT_LOG"
fi
}
# Run monitoring
monitor_mysql
check_performance
EOF
chmod +x ~/mysql-monitor.sh
# PostgreSQL monitoring
cat > ~/postgres-monitor.sh << 'EOF'
#!/bin/bash
export PGUSER="postgres"
ALERT_LOG="/var/log/postgres-monitor.log"
monitor_postgres() {
echo "=== PostgreSQL Monitoring Report - $(date) ==="
# Connection statistics
echo -e "\n▶ Connection Statistics:"
psql -c "SELECT count(*) as connections, state FROM pg_stat_activity GROUP BY state;"
# Database sizes
echo -e "\n▶ Database Sizes:"
psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) as size FROM pg_database ORDER BY pg_database_size(datname) DESC;"
# Long running queries
echo -e "\n▶ Long Running Queries:"
psql << SQL
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state != 'idle';
SQL
# Cache hit ratio
echo -e "\n▶ Cache Hit Ratio:"
psql << SQL
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) as ratio
FROM pg_statio_user_tables;
SQL
# Table bloat (pg_stat_user_tables exposes the table name as relname)
echo -e "\n▶ Table Bloat:"
psql << SQL
SELECT
schemaname,
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS size,
n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 10;
SQL
}
# Run monitoring
monitor_postgres
EOF
chmod +x ~/postgres-monitor.sh
echo "Database monitoring scripts created!"
echo "MySQL: ~/mysql-monitor.sh"
echo "PostgreSQL: ~/postgres-monitor.sh"
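Passing -p$DB_PASS on the command line exposes the password to anyone who can list processes. A common alternative is a private option file read with --defaults-extra-file, which must be the first option given to mysql. The sketch below writes such a file; write_client_cnf and the ~/.monitor.cnf path are assumptions for illustration.

```shell
#!/bin/bash
# write_client_cnf FILE USER PASSWORD: hypothetical helper that creates a [client]
# option file. The umask in the subshell makes a newly created file mode 600.
write_client_cnf() {
    local file="$1" user="$2" pass="$3"
    (
        umask 077
        printf '[client]\nuser=%s\npassword=%s\n' "$user" "$pass" > "$file"
    )
}

# Example (path is an assumption):
#   write_client_cnf "$HOME/.monitor.cnf" monitor 'monitor_password'
#   mysql --defaults-extra-file="$HOME/.monitor.cnf" -e "SHOW STATUS LIKE 'Threads_connected'"
```

With the file in place, the monitoring functions no longer need DB_USER or DB_PASS in their command lines.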
Fix Common Problems
Monitoring troubleshooting guide!
Problem 1: High CPU Usage
Solution:
# Identify CPU consumers:
# 1. Find top CPU processes
top -bn1 | head -20
ps aux --sort=-%cpu | head -10
# 2. Check for runaway processes
ps aux | awk '$3 > 50 {print $0}'
# 3. Analyze specific process
PID=12345 # Replace with actual PID
strace -p $PID -c # Count system calls
lsof -p $PID # Open files
pmap $PID # Memory map
# 4. Check for CPU intensive services
systemd-cgtop
# 5. Temporary fixes
# Limit CPU usage with nice/renice:
renice +10 -p $PID # Lower priority
# Or use cpulimit:
sudo dnf install cpulimit
cpulimit -p $PID -l 50 # Limit to 50%
# 6. Investigate cause
journalctl -u service_name --since "1 hour ago"
dmesg | tail -50
# 7. Permanent solutions
# Update software
sudo dnf update
# Optimize configuration
# Add more CPU cores
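The runaway-process check in step 2 can be wrapped into a reusable filter. This is a minimal sketch with a hypothetical helper name. Keep in mind that the %CPU value from ps is averaged over the lifetime of each process, so a short spike in a long-running daemon may not show up here even though top displays it.

```shell
#!/bin/bash
# procs_over_cpu LIMIT: hypothetical helper that reads "PID %CPU COMMAND" lines on stdin
# and prints those whose %CPU is above LIMIT.
procs_over_cpu() {
    awk -v limit="$1" '$2 + 0 > limit + 0 { print $1, $2, $3 }'
}

# On a live system (procps ps):
ps -eo pid,pcpu,comm --no-headers 2>/dev/null | procs_over_cpu 50
```

The three-column ps format keeps the parsing simple because the command name, not the full command line, is printed.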
Problem 2: Memory Leaks
Solution:
# Diagnose memory issues:
# 1. Find memory hogs
ps aux --sort=-%mem | head -10
smem -r -k | head -10 # Install: sudo dnf install smem
# 2. Track memory growth
while true; do
ps aux | grep process_name
sleep 60
done
# 3. Analyze memory usage
cat /proc/$PID/status | grep -E "VmSize|VmRSS|VmSwap"
pmap -x $PID | tail -1
# 4. Memory leak detection
# Install valgrind
sudo dnf install valgrind
valgrind --leak-check=full ./program
# 5. Clear caches (frees page cache only; it does not fix a leak and can slow I/O until the cache warms up again)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# 6. Configure swap
# Check swap usage
free -h
swapon --show
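# Add swap file if needed (dd avoids the holes that fallocate can leave on XFS, the AlmaLinux default filesystem)
sudo dd if=/dev/zero of=/swapfile bs=1M count=4096 status=progress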
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
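# 7. Set memory limits
# Add to /etc/security/limits.conf to cap a user's address space (values in KB):
# username soft as 1048576
# username hard as 1048576
# (memlock, by contrast, only limits memory locked with mlock)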
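For step 2, grepping ps output works, but a timestamped series of the resident set size is easier to compare later. Here is a sketch that samples VmRSS from /proc. The helper names are hypothetical and the PID in the example is a placeholder.

```shell
#!/bin/bash
# rss_kb_from_status: hypothetical helper that extracts VmRSS (kB) from
# /proc/<pid>/status-style text on stdin.
rss_kb_from_status() {
    awk '/^VmRSS:/ { print $2 }'
}

# watch_rss PID INTERVAL COUNT: hypothetical helper that prints COUNT timestamped RSS samples.
watch_rss() {
    local pid="$1" interval="$2" count="$3" i
    for ((i = 0; i < count; i++)); do
        [ -r "/proc/$pid/status" ] || { echo "process $pid is gone" >&2; return 1; }
        echo "$(date +%T) $(rss_kb_from_status < "/proc/$pid/status") kB"
        sleep "$interval"
    done
}

# Example: one sample per minute for ten minutes
#   watch_rss 12345 60 10
```

A value that climbs steadily under a constant workload is the typical signature of a leak, while a sawtooth pattern usually points to normal allocation and release.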
Problem 3: Disk I/O Bottlenecks
Solution:
# Identify I/O issues:
# 1. Check I/O statistics
iostat -x 2
iotop -o
# 2. Find heavy I/O processes
pidstat -d 2
# 3. Check disk health
sudo smartctl -H /dev/sda
sudo smartctl -a /dev/sda | grep -E "Reallocated|Pending|Uncorrectable"
# 4. Analyze specific process I/O
cat /proc/$PID/io
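# 5. Optimize I/O
# List the available schedulers (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler
# Change I/O scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# Options on AlmaLinux 8/9: none, mq-deadline, kyber, bfq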
# 6. Mount options optimization
# Edit /etc/fstab
# Add: noatime (this already implies nodiratime)
# 7. Move high I/O to different disk
# Identify heavy directories
du -sh /* | sort -hr | head -10
# Move to SSD or separate disk
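Before changing a scheduler it helps to see what each disk currently uses. The kernel marks the active scheduler with square brackets in /sys/block/<device>/queue/scheduler, and this sketch extracts it for every block device. active_scheduler is a hypothetical helper name.

```shell
#!/bin/bash
# active_scheduler: hypothetical helper that prints the bracketed (active) entry
# from scheduler text on stdin.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# On a live system: report the active scheduler per block device.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler < "$f")"
done
```

Devices that print an empty value expose no selectable scheduler, which is common for some virtual devices.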
Problem 4: Network Performance Issues
Solution:
# Diagnose network problems:
# 1. Check interface statistics
ip -s link show
netstat -i
# 2. Monitor bandwidth (both tools need root)
sudo iftop -i eth0
sudo nethogs
# 3. Check for packet loss
ping -c 100 google.com | grep loss
mtr google.com
# 4. TCP tuning
# Check current settings
sysctl net.ipv4.tcp_congestion_control
sysctl net.core.rmem_max
# Optimize settings
sudo tee /etc/sysctl.d/99-network.conf > /dev/null << EOF
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = bbr
EOF
sudo sysctl -p /etc/sysctl.d/99-network.conf
# 5. DNS issues
# Test DNS performance
dig google.com | grep "Query time"
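# Use faster DNS (NetworkManager regenerates /etc/resolv.conf, so set the server on the connection)
# "System eth0" is a placeholder; list your connections with: nmcli connection show
sudo nmcli connection modify "System eth0" ipv4.dns "1.1.1.1"
sudo nmcli connection up "System eth0"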
# 6. Check for network errors
ethtool -S eth0 | grep -E "errors|drops"
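The packet-loss check in step 3 is easy to automate once the percentage is pulled out of ping's summary line. This sketch assumes the iputils summary format ("N% packet loss"). packet_loss_pct is a hypothetical helper, and 192.0.2.1 in the example is a documentation address to replace with a real target.

```shell
#!/bin/bash
# packet_loss_pct: hypothetical helper that reads ping output on stdin and prints
# the packet-loss percentage from the summary line.
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example (replace the address with a host you want to probe):
#   loss=$(ping -c 20 -q 192.0.2.1 | packet_loss_pct)
#   echo "Packet loss: ${loss}%"
```

Because the function only reads stdin, the same parser can be reused on ping output saved from a remote host.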
Monitoring Tools Quick Reference
| Tool | Purpose | Command |
|---|---|---|
| top | Process monitor | top |
| htop | Enhanced process monitor | htop |
| free | Memory usage | free -h |
| df | Disk usage | df -h |
| iostat | I/O statistics | iostat -x 2 |
| iftop | Network bandwidth | sudo iftop -i eth0 |
| glances | System overview | glances |
| nmon | Performance monitor | nmon |
| sar | System activity | sar -u 2 |
| vmstat | Virtual memory | vmstat 2 |
Tips for Success
Master system monitoring like a professional!
- Baseline Normal: Know what normal looks like
- Regular Monitoring: Check systems daily
- Document Patterns: Record recurring issues
- Set Thresholds: Define alert trigger points
- Track Trends: Monitor long-term patterns
- Automate Alerts: Don't rely on manual checks
- Store Metrics: Keep historical data
- Correlate Events: Look for related issues
- Mobile Access: Monitor from anywhere
- Share Dashboards: Keep team informed
What You Learned
Congratulations! You're now a monitoring expert!
- Mastered essential command-line monitoring tools
- Deployed advanced monitoring utilities
- Configured network and service monitoring
- Implemented enterprise monitoring solutions
- Built custom monitoring scripts and dashboards
- Created automated alerting systems
- Solved common performance problems
- Gained professional system observability skills
Why This Matters
Your monitoring expertise ensures system reliability!
- Proactive Management: Fix issues before they escalate
- Cost Savings: Optimize resource usage
- Performance: Maintain peak system efficiency
- Security: Detect anomalies and threats
- Professional Value: Essential operations skill
- Rapid Response: Quickly identify root causes
- Data-Driven: Make informed decisions
- System Health: Ensure continuous availability
You now have complete visibility into your Linux systems!
Monitor everything, miss nothing!