What Are Monitoring and Logging?
In production environments, monitoring and logging are essential for understanding system health, troubleshooting issues, and improving performance. Monitoring is the collection and analysis of metrics (CPU, memory, requests per second). Logging is the capture of discrete events (errors, user actions, system changes).
Together they provide visibility into the system, helping you spot issues before users notice them and diagnose problems when they do occur.
Why Does Monitoring Matter?
1. Proactive vs Reactive
Monitoring lets you detect problems before they become outages. Alerting when metrics exceed thresholds gives you time to fix issues before users are affected.
2. Performance Optimization
Data-driven decisions. Which endpoints are slow? Where is the bottleneck? Is scaling needed? Monitoring data answers these questions.
3. Capacity Planning
Forecast future needs based on usage trends. Avoid both over-provisioning (waste) and under-provisioning (performance issues).
4. Incident Response
When issues occur, monitoring data helps quickly identify the root cause, and logs provide an audit trail for forensics.
Monitoring Architecture
Metrics Collection
- Pull model: Prometheus scrapes metrics from targets (see the Go sketch below)
- Push model: Applications push metrics to an aggregator (e.g., StatsD)
- Agent-based: A DaemonSet collects node metrics (node_exporter)
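A minimal Go sketch of the pull model, using the Prometheus Go client library (prometheus/client_golang); the counter name, labels, and port are illustrative:

// Expose application metrics on /metrics so a Prometheus server can scrape them.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative request counter; promauto registers it with the default registry.
var httpRequests = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests.",
    },
    []string{"path", "status"},
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues(r.URL.Path, "200").Inc()
        w.Write([]byte("ok"))
    })
    // Prometheus scrapes this endpoint on the port named in scrape_configs.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}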
Time Series Database
A specialized database optimized for time-stamped data; examples include Prometheus, InfluxDB, and TimescaleDB. These systems support downsampling, retention policies, and efficient range queries.
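For example, the Prometheus server bounds its local TSDB retention with startup flags (the values here are illustrative):

# Keep 15 days of samples in the local TSDB
prometheus --storage.tsdb.retention.time=15d --storage.tsdb.path=/prometheus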
Visualization Layer
Dashboards display metrics over time. Common tools: Grafana, Datadog, AWS CloudWatch dashboards.
Key Metrics Categories
1. Infrastructure Metrics
| Metric | Description |
|---|---|
| CPU Usage | % CPU utilized |
| Memory Usage | % RAM utilized, available |
| Disk I/O | Read/write bytes per second |
| Disk Space | Used vs available |
| Network | Bytes in/out, packet drops |
2. Application Metrics
| Metric | Description |
|---|---|
| Request Rate | Requests per second |
| Latency (p50, p95, p99) | Response time distribution |
| Error Rate | % requests resulting in errors |
| Throughput | MB/s or requests/s |
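The latency percentiles above are usually derived from a histogram recorded inside the application. A hedged Go sketch, using the same client library as the earlier sketch; the metric name matches the PromQL examples later in this post, and the buckets are illustrative:

// imports: "net/http" plus prometheus and promauto from client_golang, as above

// Request durations recorded as a histogram; p50/p95/p99 are then computed
// with histogram_quantile() in PromQL.
var requestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets, // illustrative; tune per service
    },
    []string{"path"},
)

func handle(w http.ResponseWriter, r *http.Request) {
    timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.URL.Path))
    defer timer.ObserveDuration()
    w.Write([]byte("ok"))
}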
3. Business Metrics
| Metric | Description |
|---|---|
| Active Users | Concurrent users |
| Conversion Rate | % visitors completing action |
| Revenue | $$ per hour/day |
| API Calls | External API usage |
Prometheus – Monitoring System
Prometheus is an open-source monitoring and alerting system and a graduated CNCF project. It pulls metrics from targets, stores them as time series, and supports a powerful query language, PromQL.
Prometheus Architecture
- Prometheus Server: Scrapes targets, stores time series, serves queries
- Exporters: Expose metrics from existing systems (node_exporter, mysqld_exporter)
- Alertmanager: Deduplicates, groups, and routes alert notifications
- Grafana: Visualization
Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'web-app'
    static_configs:
      - targets: ['web-app:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
PromQL Examples
# CPU usage of web servers
avg(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance)
# Request rate
rate(http_requests_total[5m])
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate: 5xx responses per second
rate(http_requests_total{status=~"5.."}[5m])

# Error ratio: share of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Grafana – Dashboards
Grafana is an open-source visualization and analytics platform. It connects to Prometheus, InfluxDB, Elasticsearch, and many other data sources.
Grafana Dashboard Panels
- Graph: Time series visualization
- Stat: Single value display
- Gauge: Current value vs max
- Table: Tabular data
- Alert List: Current firing alerts
Kubernetes Monitoring Stack
# kube-prometheus-stack (Helm)
helm install prometheus prometheus-community/kube-prometheus-stack

# Includes:
# - Prometheus Operator
# - Grafana dashboards
# - node_exporter
# - kube-state-metrics
# - Alertmanager
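Once the stack is installed, the bundled Grafana is usually reached with a port-forward; the service name below assumes the Helm release is named prometheus and may differ in your cluster:

# Forward Grafana to http://localhost:3000
kubectl port-forward svc/prometheus-grafana 3000:80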
Logging Architecture
Log Levels
| Level | Use Case |
|---|---|
| DEBUG | Detailed debugging info (dev only) |
| INFO | General operational events |
| WARN | Potential issues, degradation |
| ERROR | Errors, failures |
| FATAL | Critical, system down |
Structured Logging
# Unstructured (hard to query)
echo "User login failed for user john at 2024-01-15"
# Structured (JSON - queryable)
{
"level": "warn",
"timestamp": "2024-01-15T10:30:00Z",
"message": "User login failed",
"user": "john",
"ip": "192.168.1.1",
"reason": "invalid_password"
}
Log Fields
- timestamp: When event occurred (ISO8601)
- level: Log level
- service: Which service generated the event
- trace_id: For request tracing across services
- user_id: If applicable
- message: Human-readable message
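A minimal Go sketch that emits logs with these fields, using the standard library's log/slog JSON handler (Go 1.21+); the field values are illustrative:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler -> structured, queryable log lines on stdout
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // timestamp, level, and message are added automatically;
    // service, trace_id, and user_id are illustrative extra fields
    logger.Warn("User login failed",
        "service", "auth-api",
        "trace_id", "abc123",
        "user_id", "john",
        "reason", "invalid_password",
    )
}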
ELK Stack – Elasticsearch, Logstash, Kibana
Components
- Elasticsearch: Stores and indexes logs; horizontally scalable
- Logstash: Parses and transforms logs before indexing
- Kibana: Web UI for searching and visualizing logs
- Beats: Lightweight shippers (Filebeat, Metricbeat)
Elasticsearch Index Pattern
# Index naming convention
myapp-2024.01.15
myapp-2024.01.16

# Index lifecycle management (ILM)
# - Hot: frequently written, most resources
# - Warm: less frequent queries
# - Cold: archive, rarely accessed
# - Delete: after retention period
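A hedged sketch of an ILM policy implementing such a lifecycle (the policy name myapp-policy and the age thresholds are illustrative):

PUT _ilm/policy/myapp-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d" } } },
      "warm":   { "min_age": "7d",  "actions": { "set_priority": { "priority": 50 } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}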
Filebeat Configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+yyyy.MM.dd}"

# Or ship to Logstash instead (only one output may be enabled)
output.logstash:
  hosts: ["logstash:5044"]
Loki – Log Aggregation
Loki is a log aggregation system developed by Grafana Labs. Unlike Elasticsearch, Loki does not index log contents, only labels, which makes it much cheaper to run and easier to scale.
Loki vs Elasticsearch
| Aspect | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only | Full-text |
| Cost | Lower | Higher |
| Scaling | Better | Good |
| Use case | Logs alongside metrics and traces | Search-heavy workloads |
Promtail (Loki Agent)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          env: production
          __path__: /var/log/myapp/*.log
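Once Promtail ships these logs, they can be queried from Grafana with LogQL, for example:

# All production log lines for the app that contain "error"
{job="myapp", env="production"} |= "error"

# Error log lines per second over the last 5 minutes
sum(rate({job="myapp"} |= "error" [5m]))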
Alerting
Alerting Rules
groups:
  - name: myapp-alerts
    rules:
      # High error rate (share of requests returning 5xx)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value | humanizePercentage }}"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
Alert Routing
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'default'   # the 'default' receiver posts to Slack

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ''
Distributed Tracing
Distributed tracing follows a request through multiple services. Each service creates spans, which are linked together by a shared trace_id.
Jaeger – OpenTracing
// Jaeger client in Go (github.com/uber/jaeger-client-go, OpenTracing API)
import (
    "log"

    "github.com/uber/jaeger-client-go"
    jaegerconfig "github.com/uber/jaeger-client-go/config"
)

cfg := jaegerconfig.Configuration{
    ServiceName: "myapp",
    // constant sampler with Param 1 samples every request
    Sampler: &jaegerconfig.SamplerConfig{
        Type:  jaeger.SamplerTypeConst,
        Param: 1,
    },
    // report spans to the local Jaeger agent
    Reporter: &jaegerconfig.ReporterConfig{
        LocalAgentHostPort: "jaeger:6831",
    },
}

tracer, closer, err := cfg.NewTracer()
if err != nil {
    log.Fatal(err)
}
defer closer.Close()
_ = tracer // typically registered via opentracing.SetGlobalTracer(tracer)
OpenTelemetry
OpenTelemetry is a vendor-neutral standard for observability signals (traces, metrics, logs). It supports multiple backends: Jaeger, Zipkin, Tempo, Datadog.
# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
Logging Best Practices
- Use structured logging: JSON format, searchable fields
- Don’t log sensitive data: Passwords, tokens, PII
- Include correlation IDs: trace_id for request tracking
- Log at appropriate levels: DEBUG for dev, WARN+ for prod
- Centralize logs: Aggregate from all services to one place
- Set retention policies: Balance cost vs compliance needs
Common Monitoring Patterns
RED Method (Rate, Errors, Duration)
# Rate - requests per second
rate(http_requests_total[5m])
# Errors - error rate
rate(http_requests_total{status=~"5.."}[5m])
# Duration - latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
USE Method (Utilization, Saturation, Errors)
- Utilization: % of time the resource is busy
- Saturation: how "full" the resource is (work queued but not yet serviced)
- Errors: count of error events
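A hedged PromQL sketch of USE for CPU, using standard node_exporter metric names:

# Utilization: share of CPU time not spent idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation (rough proxy): 1-minute load average per CPU
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])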
SLO and SLA
SLO (Service Level Objective)
An internal reliability target, for example: "99.9% of requests complete in under 200 ms." The team commits to it internally.
SLA (Service Level Agreement)
A contractual commitment to customers. It is usually less strict than the SLO (e.g., 99.5%), and missing it has financial consequences.
Error Budget
The amount of failure allowed by the SLO. A 99.9% SLO permits 0.1% downtime, roughly 43 minutes per month. If the budget is unused, the team can ship faster; if it is exhausted, the focus shifts to reliability work.
# Error budget calculation
# 99.9% SLO -> allowed error fraction = 0.001
# 30-day month = 43,200 minutes
# Allowed downtime = 43,200 * 0.001 = 43.2 minutes/month

# Burn-rate alerting: if 100% of the budget is consumed in 50% of the window,
# the budget is being burned too fast
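A hedged burn-rate sketch in PromQL for the 99.9% case; metric names match the earlier examples, and 14.4 is the commonly used fast-burn threshold (at that pace the monthly budget is gone in about two days):

# Burn rate = observed error ratio / allowed error ratio (1 - SLO = 0.001)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001

# Alert when the 1h burn rate exceeds 14.4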