Monitoring and logging are the two core components of observability in production. Monitoring is the collection and analysis of metrics (CPU, memory, requests per second). Logging is the capture of discrete events (errors, user actions, system changes). Together they provide visibility into the system, helping you identify issues before users notice them and diagnose problems when they occur.

Why Is Monitoring Important?

  • Early problem detection: Monitoring lets you detect issues before they become outages. Alerts that fire when metrics exceed thresholds give you time to fix a problem before users are affected.
  • Performance optimization: Monitoring data supports data-driven decisions. Which endpoints are slow? Where is the bottleneck? Is scaling needed?
  • Capacity planning: Forecast future needs from usage trends, avoiding both over-provisioning (waste) and under-provisioning (performance issues).
  • Incident response: When issues occur, monitoring data helps identify the root cause quickly, and logs provide an audit trail for forensics.

Monitoring Architecture

An effective monitoring system consists of several layers: collection, storage, visualization, and alerting.

Metrics Collection

  • Pull model: Prometheus scrapes metrics from targets
  • Push model: Applications push metrics to aggregator (StatsD)
  • Agent-based: An agent on each node (node_exporter, typically deployed as a DaemonSet on Kubernetes) collects node metrics
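
To make the pull and push models concrete, here is a minimal Python sketch of the payload each one carries. The metric names (`app_requests_total`, `app.requests`), the labels, and both helper functions are hypothetical. The line formats follow the Prometheus text exposition format and the StatsD counter format, but a real application would normally use a client library instead of formatting lines by hand, and this sketch does not escape special characters in label values.

```python
def prometheus_line(name: str, labels: dict[str, str], value: float) -> str:
    """Format one sample as a scrape target would expose it on /metrics (pull model)."""
    if labels:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"


def statsd_counter(name: str, increment: int = 1) -> bytes:
    """Format a counter increment as a client would send it to StatsD over UDP (push model)."""
    return f"{name}:{increment}|c".encode("ascii")


# Hypothetical metric names, used only for illustration.
pull_sample = prometheus_line("app_requests_total", {"method": "GET", "status": "200"}, 42)
push_packet = statsd_counter("app.requests")
print(pull_sample)
print(push_packet)
```

In the pull model the application only exposes such lines and the server decides when to read them; in the push model the application decides when to send.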

Time Series Database

A specialized database optimized for time-stamped data. Examples: Prometheus, InfluxDB, TimescaleDB. Typical features include downsampling, retention policies, and efficient range queries.
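
As an illustration of what downsampling and a retention policy do to the data, here is a small Python sketch. The function names and the sample values are made up for the example, and real time series databases implement both far more efficiently on disk.

```python
from collections import defaultdict


def downsample(samples: list[tuple[int, float]], window_seconds: int) -> list[tuple[int, float]]:
    """Average (timestamp, value) samples into fixed windows aligned to window_seconds."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(samples: list[tuple[int, float]], now: int, max_age_seconds: int) -> list[tuple[int, float]]:
    """Keep only samples that are at most max_age_seconds old."""
    return [(ts, value) for ts, value in samples if now - ts <= max_age_seconds]


# Six raw samples taken every 15 seconds (illustrative values).
raw = [(0, 10.0), (15, 20.0), (30, 30.0), (45, 40.0), (60, 50.0), (75, 70.0)]
per_minute = downsample(raw, 60)
recent = apply_retention(raw, now=75, max_age_seconds=30)
print(per_minute)
print(recent)
```

Downsampling trades resolution for storage: the six raw points collapse into two one-minute averages.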

Visualization Layer

Dashboards display metrics over time. Common tools: Grafana, Datadog, AWS CloudWatch dashboards.

Key Metrics Categories

1. Infrastructure Metrics

Metric | Description
CPU Usage | % CPU utilized
Memory Usage | % RAM utilized, available
Disk I/O | Read/write bytes per second
Disk Space | Used vs available GB
Network | Bytes in/out, packet loss

2. Application Metrics

Metric | Description
Request Rate | Requests per second
Latency | p50, p90, p99 response time
Error Rate | % 4xx, 5xx responses
Throughput | Bytes/second processed
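
The latency percentiles and error rate listed here can be computed from raw request data as in the following Python sketch. It uses the nearest-rank percentile definition, which is one of several common definitions, and the sample latencies and status codes are invented for the example.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate_percent(status_codes: list[int]) -> float:
    """Percentage of responses with a 4xx or 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 400)
    return errors * 100 / len(status_codes)


# Invented sample data: ten requests.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 900, 17]
statuses = [200, 200, 500, 404, 200, 200, 503, 200, 200, 200]

for p in (50, 90, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
print(f"error rate = {error_rate_percent(statuses)}%")
```

The gap between p50 and p99 in this data shows why tail percentiles matter: the median hides the two slow requests entirely.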

3. Business Metrics

  • Active Users: Concurrent users
  • Conversion Rate: % of users who complete a target action
  • Revenue: $/hour, $/day
  • Error Budget: Downtime the service can still incur without missing its availability target
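
An error budget is simple arithmetic on an availability target. The Python sketch below assumes a 30-day window and a 99.9% target purely as example values; both function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window while still meeting the SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget left; negative once the SLO has been missed."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(99.9)
print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
print(f"after 21.6 minutes down, {budget_remaining(99.9, 21.6):.0%} of the budget is left")
```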

Prometheus – A Popular Monitoring Tool

Prometheus is an open-source monitoring and alerting toolkit designed for reliable collection and querying of metrics. It is a core component of the cloud-native landscape.

Prometheus Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Applications│     │   Push      │     │   Exporter  │
│ with client │────▶│   Gateway   │     │   (Node,    │
│   library   │     │             │     │   MySQL...) │
└─────────────┘     └─────────────┘     └─────────────┘
                       │                      │
                       │      ┌───────────────┘
                       ▼      ▼
                  ┌─────────────────┐
                  │   Prometheus    │
                  │     Server      │
                  │  - Scrapes      │
                  │  - Stores       │
                  │  - Evaluates    │
                  └────────┬────────┘
                           │
                  ┌────────▼────────┐
                  │     Grafana     │
                  │   Dashboards    │
                  └─────────────────┘

Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

PromQL – Prometheus Query Language

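# CPU usage percentage (100 minus the idle share, averaged per instance)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory in use (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Request rate per second
rate(http_requests_total[5m])

# 99th percentile latency (bucket rates summed across series, keeping le)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate: percentage of requests answered with 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100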
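
Most of the queries above rely on `rate()`. The Python sketch below shows the core idea under simplifying assumptions: it takes the per-second average increase of a counter over a window and treats any drop in value as a counter reset. Prometheus's real implementation also extrapolates to the window boundaries, which this sketch does not attempt, and `simple_rate` is a hypothetical name.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second average increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the value after the reset counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value >= previous:
            increase += value - previous
        else:
            increase += value  # counter restarted from zero
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# A counter scraped every 15 seconds; it resets between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(f"{simple_rate(scrapes):.3f} requests/second")
```

Handling resets is the reason to apply `rate()` to counters directly instead of subtracting raw values.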

Grafana – Visualization and Dashboards

Grafana is an open-source observability platform. It lets you query, visualize, alert on, and understand metrics regardless of where they are stored.

Key Grafana Features

  • Dashboards: Visualize metrics with graphs, tables, heatmaps
  • Alerting: Create alerts based on metric thresholds
  • Data Sources: Prometheus, InfluxDB, Elasticsearch, Loki, SQL…
  • Templating: Reusable dashboard variables
  • Annotations: Mark events on timeline

Sample Grafana Dashboard JSON

{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
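
Dashboards with many similar panels are often generated rather than written by hand. The Python sketch below builds JSON in the same simplified shape as the sample above. The `panel` and `dashboard` helpers and the "Targets Up" panel are hypothetical, and a real Grafana export contains many more fields (IDs, grid positions, data source references) that are omitted here.

```python
import json


def panel(title: str, expr: str, legend: str = "{{instance}}") -> dict:
    """Build one panel in the same shape as the sample dashboard JSON."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr, "legendFormat": legend}],
    }


def dashboard(title: str, panels: list[dict]) -> str:
    """Wrap panels in the top-level "dashboard" object and serialise it."""
    return json.dumps({"dashboard": {"title": title, "panels": panels}}, indent=2)


doc = dashboard(
    "System Overview",
    [
        panel("Targets Up", "up"),
        panel("Request Rate", "rate(http_requests_total[5m])"),
    ],
)
print(doc)
```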

ELK Stack – Logging Solution

The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular open-source solution for centralized logging. It lets you search, analyze, and visualize logs from multiple sources.

ELK Components

Component | Purpose
Elasticsearch | Distributed search and analytics engine
Logstash | Data processing pipeline (parse, transform)
Kibana | Visualization interface
Beats | Lightweight shippers (Filebeat, Metricbeat…)

ELK Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Log Files   │    │   Services   │    │   Systemd    │
│  (App logs)  │    │   (stdout)   │    │   (journal)  │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│                        Beats                         │
│ ┌──────────┐ ┌──────────┐  ┌──────────┐ ┌──────────┐ │
│ │ Filebeat │ │Metricbeat│  │Heartbeat │ │Packetbeat│ │
│ └────┬─────┘ └────┬─────┘  └────┬─────┘ └────┬─────┘ │
└──────┼────────────┼─────────────┼────────────┼───────┘
       │            │             │            │
       └─────┬──────┴─────────────┴──────┬─────┘
             │                           │
             ▼                           ▼
     ┌───────────────┐           ┌───────────────┐
     │   Logstash    │           │ Elasticsearch │
     │  (Pipeline)   │──────────▶│   (Storage)   │
     └───────────────┘           └───────┬───────┘
                                         │
                                         ▼
                                 ┌───────────────┐
                                 │    Kibana     │
                                 │(Visualization)│
                                 └───────────────┘

Logstash Pipeline Example

input {
  beats {
    port => 5044
  }
}

filter {
  if [log_type] == "nginx" {
    grok {
      match => { "message" => '%{IPORHOST:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:bytes} %{QS:referrer} %{QS:agent}' }
    }
    mutate {
      add_field => { "log_type" => "processed" }
    }
  }
  
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}
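
To see what the grok and date filters extract, the following Python sketch applies an equivalent regular expression to one log line in nginx's combined format. It is an approximation for illustration, not a replacement for the Logstash pipeline: the regex is hand-written, the sample line is invented, and the field names simply mirror the grok captures above.

```python
import re
from datetime import datetime
from typing import Optional

# Hand-written approximation of the grok pattern for the nginx combined log format.
NGINX_COMBINED = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_nginx_line(line: str) -> Optional[dict]:
    """Return the extracted fields, or None when the line does not match."""
    match = NGINX_COMBINED.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Same layout as the date filter's "dd/MMM/yyyy:HH:mm:ss Z".
    event["@timestamp"] = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return event


sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
event = parse_nginx_line(sample)
print(event["client_ip"], event["method"], event["request"], event["status"])
```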

Loki – Log Aggregation for Prometheus

Loki is a log aggregation system designed to work with Prometheus and Grafana. Unlike ELK, Loki does not build a full-text index of log content; it indexes only the labels attached to each log stream, which lets it scale well and stay cost-effective.

Loki vs ELK

Aspect | Loki | ELK
Indexing | Label-based (cheap) | Full-text (expensive)
Storage | Object storage (S3) | Elasticsearch cluster
Scale | Excellent | Good (need sharding)
Query Speed | Fast for label queries | Fast for full-text
Cost | Lower | Higher
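
The trade-off in this comparison follows from what gets indexed. The toy Python store below indexes only label sets and scans log lines by brute force inside the selected streams. It is a conceptual sketch with a hypothetical class name, not how Loki is implemented internally.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: label sets are indexed, log lines are stored unindexed."""

    def __init__(self) -> None:
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        """Append a line to the stream identified by its full label set."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        """Select streams by label, then filter their lines by substring."""
        wanted = set(selector.items())
        results: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap step: small index of label sets
                # costly step: linear scan, but only over the selected streams
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedStore()
store.push({"job": "nginx", "env": "production"}, "GET /index.html 200")
store.push({"job": "nginx", "env": "production"}, "GET /missing 404")
store.push({"job": "systemlogs", "env": "production"}, "disk check 404 sectors")

print(store.query({"job": "nginx"}, contains="404"))
print(store.query({"env": "production"}, contains="404"))
```

A narrow label selector keeps the scan small, which is why label queries are fast while free-text searches across all streams are comparatively slow.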

Promtail Configuration

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: 'system'
    static_configs:
      - targets:
          - localhost
        labels:
          job: 'systemlogs'
          env: 'production'
          __path__: '/var/log/*.log'
  
  - job_name: 'nginx'
    static_configs:
      - targets:
          - localhost
        labels:
          job: 'nginx'
          __path__: '/var/log/nginx/*.log'
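
Promtail tails the files matched by `__path__` and ships their lines to the push URL. As a rough picture of what arrives there, the Python sketch below builds the JSON form of a push request: one stream identified by its labels, with each entry holding a nanosecond timestamp (as a string) and the log line. The helper names and sample lines are hypothetical, and the network call is left commented out because it needs a reachable Loki instance.

```python
import json
import time
import urllib.request
from typing import Optional


def build_push_payload(labels: dict[str, str], lines: list[str],
                       now_ns: Optional[int] = None) -> dict:
    """Build a JSON push body: one stream, one [timestamp, line] entry per line."""
    start = time.time_ns() if now_ns is None else now_ns
    # Offsetting by the index keeps entries in order with distinct timestamps.
    values = [[str(start + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


def push_to_loki(url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_push_payload(
    {"job": "nginx", "env": "production"},
    ["GET /index.html 200", "GET /missing 404"],
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
# push_to_loki("http://loki:3100/loki/api/v1/push", payload)  # requires a running Loki
```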

Alerting – Notifying When Something Goes Wrong

Alerting is a critical component of monitoring. Alerts should be configured to fire on real issues, not on noise.

Alerting Best Practices

  • Alert on symptoms, not causes: “High error rate” rather than “Database down”
  • Set appropriate thresholds: Use p95 rather than p50 for latency alerts
  • Reduce noise: Use multi-window alerts (fire only if persistent)
  • Route to the right people: PagerDuty, Slack, or email depending on severity
  • Document runbooks: Every alert needs response instructions

Prometheus Alert Rules

groups:
- name: example
  rules:
  # Alert when an instance is down
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.job }} has been down for more than 5 minutes."
  
  # Alert when CPU usage is high
  - alert: HighCPU
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 10 minutes."
  
  # Alert when the disk is almost full
  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"
      description: "Disk space is below 10%."
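
The `for:` clause in each rule is what keeps short spikes from paging anyone: the expression must stay true for the whole duration before the alert fires. The Python sketch below models that state machine in simplified form. `PendingAlert` is a hypothetical name and the timestamps are plain seconds chosen for the example.

```python
from typing import Optional


class PendingAlert:
    """Fire only after the condition has held continuously for `for_seconds`."""

    def __init__(self, for_seconds: float) -> None:
        self.for_seconds = for_seconds
        self.active_since: Optional[float] = None

    def evaluate(self, condition: bool, now: float) -> str:
        """Return "inactive", "pending", or "firing" for this evaluation."""
        if not condition:
            self.active_since = None  # any recovery resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Equivalent of `for: 5m`, with the condition flapping once at t=120.
alert = PendingAlert(for_seconds=300)
timeline = [(0, True), (60, True), (120, False), (180, True), (480, True)]
states = [alert.evaluate(condition, now) for now, condition in timeline]
print(states)
```

The brief recovery at t=120 restarts the timer, so the alert fires only once the condition has held for a full five minutes from t=180.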

Distributed Tracing - OpenTelemetry

Distributed tracing tracks requests as they pass through multiple services. OpenTelemetry is an open-source standard for collecting traces, metrics, and logs.

Tracing Concepts

  • Trace: Complete end-to-end request journey
  • Span: Individual operation within a trace
  • Context: TraceId and SpanId, propagated across services
  • Attributes: Key-value metadata (user_id, endpoint, etc)
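
Context propagation usually happens through a request header. The Python sketch below uses the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags) to show how a downstream service keeps the trace ID and creates its own span ID. The helper names are hypothetical, and in practice an OpenTelemetry SDK does this for you.

```python
import re
import secrets

# traceparent: version "00", 32 hex trace ID, 16 hex span ID, 2 hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace ID and fresh root span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent: str) -> str:
    """Continue an incoming trace: same trace ID and flags, new span ID."""
    match = TRACEPARENT.match(parent)
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print("received:", incoming)
print("forwarded:", outgoing)
```

Because every service forwards the same trace ID, the backend can stitch the spans from all services into one trace.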

OpenTelemetry Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Service A │     │   Service B │     │   Service C │
│   (Span 1)  │────▶│   (Span 2)  │────▶│   (Span 3)  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │   OTel Collector        │
              │   (Receive, Process,    │
              │    Export traces)       │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Backend               │
              │   (Jaeger, Zipkin,      │
              │    Tempo, etc)          │
              └─────────────────────────┘

Best Practices for Monitoring and Logging

  • Start with SLOs: Define Service Level Objectives before setting up monitoring
  • Three pillars: Metrics, logs, and traces; all three matter
  • Correlate data: Link metrics to logs to traces so you can debug faster
  • Retention policies: Hot storage for recent data, cold storage for historical data
  • Cost management: Sample logs and aggregate metrics to reduce storage costs
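
Two of these practices, correlating data and sampling logs, can be combined in a few lines. The Python sketch below writes structured JSON log lines that carry a trace ID, and samples low-severity lines by hashing that ID so that all lines from one request are kept or dropped together. The function names, the level names, and the 1-in-10 ratio are assumptions made for the example.

```python
import json
import zlib


def should_keep(level: str, trace_id: str, sample_one_in: int = 10) -> bool:
    """Keep every warning and error; keep a deterministic 1-in-N of the rest."""
    if level in ("WARNING", "ERROR"):
        return True
    # Hashing the trace ID keeps or drops all lines of one request together.
    return zlib.crc32(trace_id.encode("utf-8")) % sample_one_in == 0


def log_line(level: str, message: str, trace_id: str) -> str:
    """Structured JSON log line carrying the trace ID for correlation."""
    return json.dumps({"level": level, "message": message, "trace_id": trace_id}, sort_keys=True)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for level, message in [("INFO", "request started"), ("ERROR", "upstream timeout")]:
    if should_keep(level, trace_id):
        print(log_line(level, message, trace_id))
```

With the trace ID in every log line, a slow span found in the tracing backend leads directly to the matching log entries.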

FAQ - Frequently Asked Questions

  • How do monitoring and logging differ? Monitoring is the collection and analysis of continuous metrics (CPU, memory, latency). Logging is the capture of discrete events (errors, user actions). Monitoring enables proactive alerting; logging enables retrospective analysis.
  • Should I use Prometheus or Datadog? Prometheus is open-source, self-hosted, and free. Datadog is a SaaS product with many integrations and pricing based on data volume. Prometheus suits teams with the capacity to manage their own infrastructure; Datadog suits teams that want quick setup and advanced features.
  • How long should logs be kept? It depends on compliance requirements. Development: 7-30 days. Production: 30-90 days. Compliance: months to years. Use tiered storage: hot (SSD) for recent data, cold (S3) for historical data.
  • How do I debug issues in a distributed system? Use distributed tracing (Jaeger, Zipkin). Trace the journey from request to response across all services, and identify the bottleneck by looking for spans with high latency.
  • How do I avoid alert fatigue? Tune thresholds based on actual traffic patterns. Use multi-window alerts (fire only if persistent). Separate warning alerts from critical alerts. Review and prune alerts regularly.

Hi everyone, I'm Quốc Hùng. I was born under the sign of Gemini; Geminis always assert themselves and always strive to move forward. I was born and raised in the land of traditional martial arts. My passion is coding: I study by day and write my blog at night ...