Monitoring with Prometheus and Grafana: A Guide to Setting Up a DevOps Dashboard

In today's DevOps world, monitoring your systems is no longer optional; it is a hard requirement. An effective monitoring stack helps you catch problems before users complain, tune performance, and protect service uptime.

This article walks you through setting up Prometheus and Grafana, one of the most widely used open-source monitoring stacks today, to build a professional DevOps dashboard.

What is monitoring? Why monitor your systems?

Monitoring is the process of collecting, analyzing, and displaying metrics about systems and applications. It covers CPU, RAM, disk, network, application performance, and many other indicators.

Running without monitoring is like driving a car with no gauges: you don't know your speed, fuel level, or engine temperature. When the engine dies you will find out, but by then it is too late.

Monitoring type	Metrics tracked	Tools
Infrastructure	CPU, RAM, Disk, Network, Load Average	Prometheus + Node Exporter
Application	Request rate, Error rate, Latency	Prometheus + client libraries
Database	Query performance, Connections, Cache hit	MySQL/PostgreSQL exporters
Kubernetes	Pod status, Resource usage, API server	Prometheus Operator

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit originally developed at SoundCloud that later became a Cloud Native Computing Foundation (CNCF) project. Its key features are:

Pull-based model

Multi-dimensional data model

Powerful PromQL

Alerting rules

Service discovery

What is Grafana?

Grafana is an open-source platform for visualization and analytics. It connects to many data sources, such as Prometheus, InfluxDB, and Elasticsearch, and lets you build interactive dashboards with charts, graphs, and alerts.

Grafana acts as the frontend for Prometheus: Prometheus stores and queries the data, while Grafana presents that data in a visual, readable form.

Installing Prometheus on Ubuntu 22.04

Prometheus can be installed via Docker, from the release binary, or through a package manager. The steps below use the release binary, a common approach for production hosts.

# Download Prometheus
PROMETHEUS_VERSION="2.50.0"
wget "https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"

# Extract the archive
tar xvf "prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
cd "prometheus-${PROMETHEUS_VERSION}.linux-amd64"

# Create a dedicated user and the directories Prometheus needs (production)
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

# Copy the binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Copy the console files to the paths referenced by the systemd unit
sudo mkdir -p /usr/local/share/prometheus
sudo cp -r consoles console_libraries /usr/local/share/prometheus/

Configuring Prometheus

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093  # assumes Alertmanager runs on this same host

rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter for system metrics
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  # Custom application metrics
  - job_name: "myapp"
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]
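
The `myapp` job above expects something to answer on `localhost:8080/metrics`. As a rough sketch of what such a target could look like, the Python snippet below serves one counter in the Prometheus text exposition format using only the standard library. The metric name `myapp_requests_total`, the `path` label, and the in-memory counter are illustrative assumptions; a real application would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter: requests seen per path.
REQUEST_COUNTS = {"/": 0, "/login": 0}


def render_metrics(counts):
    """Render a dict of {path: count} in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_requests_total Requests handled, by path.",
        "# TYPE myapp_requests_total counter",
    ]
    for path, count in sorted(counts.items()):
        lines.append(f'myapp_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served, matching metrics_path in the scrape config.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, answering Prometheus scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus then pulls from it every `scrape_interval`, which is the pull model in practice.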

Creating a systemd service for Prometheus

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.libraries=/usr/local/share/prometheus/console_libraries \
    --web.console.templates=/usr/local/share/prometheus/consoles

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Start Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus

# Check the status
sudo systemctl status prometheus

# Open the Prometheus UI
# http://your-server-ip:9090

Installing Grafana on Ubuntu 22.04

Grafana can be installed from its APT repository, which makes later updates easier.

# Install from the APT repository
# (apt-key is deprecated on Ubuntu 22.04, so the key goes into a keyring file)
sudo apt install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt update
sudo apt install -y grafana

# Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

# Open the Grafana UI
# http://your-server-ip:3000
# Default login: admin / admin (change the password right after logging in)

Connecting Grafana to Prometheus

After installing Grafana, add Prometheus as a data source:

Configuration → Data Sources → Add data source → Prometheus → URL: http://localhost:9090 → Access → Save & Test
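
The same data source can also be registered without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below is one possible way to do that, assuming Grafana listens on `http://localhost:3000` and still uses the default `admin` / `admin` login; the function names are hypothetical helpers, not part of Grafana.

```python
import base64
import json
import urllib.request


def datasource_payload(name="Prometheus", url="http://localhost:9090"):
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",   # Grafana's backend performs the queries
        "isDefault": True,
    }


def build_request(grafana_url, user, password, payload):
    """Prepare (but do not send) the POST request to /api/datasources."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def add_datasource(grafana_url="http://localhost:3000", user="admin", password="admin"):
    """Send the request; urlopen raises HTTPError on an error response."""
    request = build_request(grafana_url, user, password, datasource_payload())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `add_datasource()` against a running Grafana should have the same effect as the Save & Test step, minus the connectivity test.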

Installing Node Exporter for system metrics

Node Exporter is an agent that collects system metrics such as CPU, memory, disk, and network. It runs on every machine you want to monitor.

# Download and install Node Exporter
NODE_EXPORTER_VERSION="1.8.0"
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar xvf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
cd "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64"

# Copy the binary
sudo cp node_exporter /usr/local/bin/

# Create the systemd service
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Create the service user, then start the service
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Verify: check the metrics endpoint
curl http://localhost:9100/metrics
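
The endpoint returns plain text with one sample per line, optionally carrying labels in braces. To make that format concrete, here is a deliberately simplified parser sketch; it assumes no escaped quotes inside label values and no trailing timestamps, and the `EXAMPLE` text is made up rather than captured from a real host.

```python
import re

# Metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:                      # ignore lines we cannot parse
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


EXAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and labels is its own time series, which is what the multi-dimensional data model refers to.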

Building a DevOps dashboard in Grafana

A dashboard is where you visualize your metrics. Grafana offers many pre-built dashboards that you can import from Grafana Dashboards.

Creating a basic dashboard

Create → Dashboard → Add new panel → Data source → Query: node_cpu_seconds_total → Visualization

Some useful queries for a DevOps dashboard:

# CPU Usage (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (percentage)
100 - (avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Disk Usage (percentage)
100 - (avg by (device, instance) (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / avg by (device, instance) (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}) * 100)

# Network I/O (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

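# Load Average per CPU core (1m, 5m, 15m)
node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
node_load5 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
node_load15 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})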
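
To see why the CPU query works: `node_cpu_seconds_total{mode="idle"}` counts the seconds each core has spent idle, so its per-second rate is the fraction of time that core was idle. The sketch below redoes that arithmetic by hand with made-up counter values. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the window edges.

```python
def simple_rate(first_value, last_value, window_seconds):
    """Per-second increase of a counter over a window (no reset handling)."""
    return (last_value - first_value) / window_seconds


def cpu_usage_percent(idle_rates):
    """Mirror 100 - avg(rate(idle)) * 100 for the cores of one instance."""
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100 - average_idle * 100


# Made-up idle counters for two cores, sampled 300 s (5m) apart.
core0 = simple_rate(10_000.0, 10_240.0, 300)   # 0.8 idle seconds per second
core1 = simple_rate(20_000.0, 20_180.0, 300)   # 0.6 idle seconds per second
print(round(cpu_usage_percent([core0, core1]), 1))   # 30.0
```

An average idle fraction of 0.7 across the two cores means the instance was busy 30% of the time over the window.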

Setting up alerting with Alertmanager

Alerting is a key part of monitoring: it notifies you when something goes wrong, so you don't have to watch a dashboard around the clock.

# Create alerting rules
# /etc/prometheus/rules/webapp.yml
groups:
- name: webapp_alerts
  rules:
  - alert: HighCPU
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "Instance {{ $labels.instance }} has CPU usage above 80% for 5 minutes"

  - alert: HighMemory
    expr: 100 - (avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Memory usage detected"
      description: "Instance {{ $labels.instance }} has memory usage above 85%"

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Disk space is running low"
      description: "Instance {{ $labels.instance }} has less than 10% disk space remaining"
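
The `for:` clause in these rules means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is only pending. The following sketch simulates that behavior under simplified assumptions: a 60-second evaluation interval (not the 15s configured earlier) and a hand-written list of expression results.

```python
def alert_states(expr_results, interval_seconds, for_seconds):
    """Return the alert state after each rule evaluation.

    expr_results: list of booleans, one per evaluation (True = expr matched).
    """
    states = []
    active_since = None          # time at which expr first became true
    for index, matched in enumerate(expr_results):
        now = index * interval_seconds
        if not matched:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 60 s with for: 5m (300 s); one dip resets the clock.
results = [True, True, False] + [True] * 6
print(alert_states(results, interval_seconds=60, for_seconds=300))
```

In this run the alert only reaches `firing` on the last evaluation, because the single false result in the third slot restarted the five-minute wait.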

Configuring Alertmanager (email/Slack notifications)

# /etc/prometheus/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourcompany.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
    continue: true

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'devops@yourcompany.com'
    send_resolved: true

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@yourcompany.com'
    send_resolved: true
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    channel: '#alerts-critical'
    send_resolved: true
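
It can help to trace how an alert travels through the `route` tree above. The sketch below models the routing decision only (no grouping or timing) and assumes exact-match `match` blocks and that every child route names its own receiver, as in this config; the `ROUTE` dict simply mirrors the YAML.

```python
ROUTE = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-alerts", "continue": True},
    ],
}


def matches(route, labels):
    """A route matches when every label in its `match` block is equal."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def receivers_for(route, labels):
    """Walk the routing tree and return the receivers that get the alert."""
    selected = []
    for child in route.get("routes", []):
        if not matches(child, labels):
            continue
        selected.extend(receivers_for(child, labels))
        if not child.get("continue", False):
            break                      # first match wins unless continue is set
    # No child matched: the alert stays with the current node's receiver.
    return selected or [route["receiver"]]


print(receivers_for(ROUTE, {"alertname": "DiskSpaceLow", "severity": "critical"}))
print(receivers_for(ROUTE, {"alertname": "HighCPU", "severity": "warning"}))
```

Under this model a critical alert goes to `critical-alerts` only, not to `default-receiver` as well; `continue: true` affects later sibling routes, and this config has none.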

Importing a ready-made dashboard

Instead of building a dashboard yourself, you can import one from Grafana Labs, where the community shares thousands of dashboards:

Node Exporter Full

Prometheus 2.0 Overview

Kubernetes Cluster

To import: Dashboards → Import → enter the Dashboard ID → Load → choose the Prometheus data source → Import.

Best practices for the monitoring stack

Do not over-collect

Set appropriate scrape interval

Use labels wisely

Set retention policy

Alert on symptoms not causes

Frequently asked questions (FAQ)

Are Prometheus and Grafana free?

Yes. Both are open source or offer a free version. Grafana additionally offers Grafana Cloud (with a free tier of 10k metrics), while the self-hosted edition is completely free. Prometheus has no enterprise version and is entirely free.

What is the difference between the push and pull models?

Pull model (Prometheus): Prometheus actively calls its targets to fetch metrics. Advantages: it is easier to manage, the targets need no configuration beyond exposing an endpoint, and there is a single central collection point. Push model: applications push metrics to a central server. This suits batch jobs and short-lived services.

How do I monitor multiple servers?

Install Node Exporter on each server and list every target in the scrape config in prometheus.yml. Alternatively, use service discovery (for Kubernetes, EC2, Azure, and others) so that Prometheus discovers targets automatically.
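
Between a hand-edited target list and cloud service discovery there is a middle option that this article does not otherwise cover: Prometheus's file-based discovery (`file_sd_configs`), which reads target groups from a JSON or YAML file. The sketch below generates such a file's content; the host names and the `env` label are hypothetical.

```python
import json


def node_targets(hosts, port=9100, labels=None):
    """Build a file_sd document: a list of target groups."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


def write_targets(path, hosts, labels=None):
    """Write the target groups as JSON for a file_sd_configs entry to read."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(node_targets(hosts, labels=labels), handle, indent=2)


# Hypothetical hosts; printed rather than written so the sketch has no side effects.
print(json.dumps(node_targets(["web-1", "web-2", "db-1"], labels={"env": "prod"}), indent=2))
```

The `node` job would then point a `file_sd_configs` entry at the generated file instead of using `static_configs`, and Prometheus picks up changes to the file without a restart.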

Can I monitor Docker containers?

Yes. Use cAdvisor (Container Advisor), which exposes container metrics in a format Prometheus can scrape directly. Alternatively, use a Docker metrics exporter for Docker Engine metrics.

My Grafana dashboard is empty and shows no data. What should I check?

Check that: (1) Prometheus is scraping the right targets (in the Prometheus UI, go to Status → Targets); (2) the data source URL in Grafana is correct; (3) the dashboard's time range is appropriate (the default is the last 6 hours); (4) the queries return data when run directly in the Prometheus UI.
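
Check (1) can also be scripted against the Prometheus HTTP API, which serves the same information as the Targets page at `/api/v1/targets`. A minimal sketch, assuming Prometheus is reachable at `http://localhost:9090`; the `SAMPLE` response is trimmed and made up to show the shape only.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((
                target.get("labels", {}).get("job", "?"),
                target.get("scrapeUrl", "?"),
                target.get("lastError", ""),
            ))
    return problems


def fetch_targets(prometheus_url="http://localhost:9090"):
    """Call GET /api/v1/targets and decode the JSON response."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets") as response:
        return json.load(response)


# Trimmed, made-up response in the shape the endpoint returns.
SAMPLE = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node"}, "scrapeUrl": "http://localhost:9100/metrics",
         "health": "up", "lastError": ""},
        {"labels": {"job": "myapp"}, "scrapeUrl": "http://localhost:8080/metrics",
         "health": "down", "lastError": "connection refused"},
    ]},
}
print(unhealthy_targets(SAMPLE))
```

On a live server, `unhealthy_targets(fetch_targets())` lists the jobs whose scrapes are failing along with the last error Prometheus recorded.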

How do I back up and restore Prometheus data?

Prometheus data lives in storage.tsdb.path (typically /var/lib/prometheus/). To back up, stop Prometheus and copy that directory. To restore, copy the backup back to the same location. Alternatively, use the remote write/remote read integration with Thanos or Cortex for long-term storage.
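
As one possible way to script the stop-and-copy approach, the sketch below packs the data directory into a timestamped archive and unpacks it again. The archive naming scheme and both function names are assumptions made for illustration, and the script expects Prometheus to be stopped before it runs.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_tsdb(data_dir, backup_dir):
    """Pack data_dir into a timestamped .tar.gz inside backup_dir.

    Run this only while Prometheus is stopped, so the files are not changing.
    """
    data_dir = Path(data_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = backup_dir / f"prometheus-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname stores paths relative to the data directory's own name
        tar.add(str(data_dir), arcname=data_dir.name)
    return archive


def restore_tsdb(archive, target_parent):
    """Unpack an archive created by backup_tsdb under target_parent."""
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(target_parent))
```

For example, `backup_tsdb("/var/lib/prometheus", "/var/backups")` followed later by `restore_tsdb(archive, "/var/lib")` would recreate `/var/lib/prometheus`; file ownership may need to be reset to the `prometheus` user afterwards.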

Conclusion

Prometheus and Grafana are a de facto standard monitoring stack in the DevOps world: powerful, free, and backed by a large community. With the steps above, you now have a complete monitoring stack for your infrastructure.

To learn more about containerization and how to deploy these services with Docker, see the complete Docker guide on vnhte.com.