Monitoring and logging are two core components of observability in production. Monitoring is the collection and analysis of metrics (CPU, memory, requests per second). Logging is the capture of discrete events (errors, user actions, system changes). Together they provide visibility into the system, helping you identify issues before users notice them and diagnose problems when they occur.
Why does monitoring matter?
- Early problem detection: Monitoring lets you detect issues before they become outages. Alerts that fire when metrics exceed thresholds give you time to fix a problem before users are affected.
- Performance optimization: Decisions are driven by data. Which endpoints are slow? Where is the bottleneck? Is scaling needed? Monitoring data answers these questions.
- Capacity planning: Forecast future needs from usage trends, avoiding both over-provisioning (waste) and under-provisioning (performance issues).
- Incident response: When issues occur, monitoring data helps identify the root cause quickly, and logs provide an audit trail for forensics.
Monitoring Architecture
An effective monitoring system has several layers: collection, storage, visualization, and alerting.
Metrics Collection
- Pull model: Prometheus scrapes metrics from targets
- Push model: Applications push metrics to aggregator (StatsD)
- Agent-based: A per-node agent such as node_exporter (typically deployed as a DaemonSet on Kubernetes) collects node metrics
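To make the pull model concrete, here is a minimal sketch using only the Python standard library. It keeps counters in memory and renders them in the Prometheus text exposition format, which is the payload a scraper would fetch from a target's metrics endpoint. The helper names (`inc`, `render_metrics`) and the metric `app_requests_total` are hypothetical, and label-value escaping is omitted; a real service would normally rely on an official client library instead.

```python
from typing import Dict, Tuple

# Hypothetical in-process registry: (metric name, sorted label pairs) -> value.
Counters = Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float]


def inc(counters: Counters, name: str, **labels: str) -> None:
    """Increment the counter identified by its name and label set."""
    key = (name, tuple(sorted(labels.items())))
    counters[key] = counters.get(key, 0.0) + 1.0


def render_metrics(counters: Counters) -> str:
    """Render all counters as Prometheus text exposition lines."""
    lines = []
    for (name, labels), value in sorted(counters.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value:g}")
        else:
            lines.append(f"{name} {value:g}")
    return "\n".join(lines) + "\n"


counters: Counters = {}
inc(counters, "app_requests_total", method="GET", status="200")
inc(counters, "app_requests_total", method="GET", status="200")
inc(counters, "app_requests_total", method="POST", status="500")
print(render_metrics(counters), end="")
```

In the pull model the application only exposes this text; the scraper decides when to collect it, which is why a target that stops responding is itself a signal.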
Time Series Database
A specialized database optimized for time-stamped data. Examples: Prometheus, InfluxDB, TimescaleDB. Depending on the product, these databases support downsampling, retention policies, and efficient range queries.
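Downsampling and retention can be illustrated with a short Python sketch. It is not how any particular database implements them; it only shows the idea of averaging raw samples into fixed-width buckets and dropping samples older than a retention window. The function names and sample data are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def downsample(samples: List[Sample], bucket_seconds: int) -> List[Sample]:
    """Average samples into fixed-width time buckets, keyed by bucket start."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(samples: List[Sample], now: int, max_age_seconds: int) -> List[Sample]:
    """Keep only samples that are at most max_age_seconds old."""
    return [(ts, value) for ts, value in samples if now - ts <= max_age_seconds]


raw = [(0, 10.0), (15, 20.0), (30, 30.0), (45, 40.0), (60, 50.0), (75, 70.0)]
print(downsample(raw, 60))                               # [(0, 25.0), (60, 60.0)]
print(apply_retention(raw, now=80, max_age_seconds=30))  # [(60, 50.0), (75, 70.0)]
```

Averaging is only one possible aggregation; keeping min and max per bucket as well preserves spikes that an average would hide.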
Visualization Layer
Dashboards display metrics over time. Common tools: Grafana, Datadog, AWS CloudWatch dashboards.
Key Metrics Categories
1. Infrastructure Metrics
| Metric | Description |
|---|---|
| CPU Usage | % CPU utilized |
| Memory Usage | % RAM utilized, available |
| Disk I/O | Read/write bytes per second |
| Disk Space | Used vs available GB |
| Network | Bytes in/out, packet loss |
2. Application Metrics
| Metric | Description |
|---|---|
| Request Rate | Requests per second |
| Latency | p50, p90, p99 response time |
| Error Rate | % 4xx, 5xx responses |
| Throughput | Bytes/second processed |
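As a rough illustration of the latency and error-rate rows above, the following Python sketch computes nearest-rank percentiles and the share of 4xx/5xx responses from raw samples. The sample values are made up, and production systems usually derive these figures from histograms rather than from raw lists.

```python
import math
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: List[int]) -> float:
    """Percentage of responses whose status is 4xx or 5xx."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 400)
    return errors * 100 / len(status_codes)


latencies_ms = [12, 15, 11, 250, 14, 13, 900, 16, 12, 18]
statuses = [200, 200, 404, 500, 200, 200, 200, 503, 200, 200]

print(percentile(latencies_ms, 50))  # 14
print(percentile(latencies_ms, 90))  # 250
print(percentile(latencies_ms, 99))  # 900
print(error_rate(statuses))          # 30.0
```

The gap between p50 and p99 in this sample shows why averages or medians alone can hide the slow requests that users actually complain about.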
3. Business Metrics
- Active Users: Concurrent users
- Conversion Rate: % of users who complete a target action
- Revenue: $/hour, $/day
- Error Budget: Amount of unreliability (for example downtime) the SLO still allows
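The error budget follows directly from the availability target. The sketch below, with hypothetical function names, converts an SLO percentage and a window length into allowed minutes of downtime and reports how much of that budget remains.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Allowed minutes of unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(99.9, 30)
print(f"{budget:.1f} minutes")                                        # 43.2 minutes
print(f"{budget_remaining(99.9, 30, 10.8):.0%} of the budget left")   # 75% of the budget left
```

A 99.9% target over 30 days therefore leaves about 43 minutes of downtime; tightening the target to 99.99% shrinks that to a little over 4 minutes.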
Prometheus – A Popular Monitoring Tool
Prometheus is an open-source monitoring and alerting toolkit designed for reliable collection and querying of metrics. It is a core component of the cloud-native landscape.
Prometheus Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Applications│     │    Push     │     │  Exporter   │
│ with client │────▶│   Gateway   │     │   (Node,    │
│ library     │     │             │     │  MySQL...)  │
└─────────────┘     └──────┬──────┘     └──────┬──────┘
                           │                   │
                           │   ┌───────────────┘
                           ▼   ▼
                    ┌─────────────────┐
                    │   Prometheus    │
                    │     Server      │
                    │   - Scrapes     │
                    │   - Stores      │
                    │   - Evaluates   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Grafana     │
                    │   Dashboards    │
                    └─────────────────┘
Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "alerts.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
PromQL – Prometheus Query Language
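# CPU usage percentage (100 minus the idle share, averaged per instance)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory in use (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Request rate per second
rate(http_requests_total[5m])
# 99th percentile latency (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate in % (both sides are summed so their label sets match)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100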
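To show what a quantile estimate from histogram buckets involves, here is a simplified Python sketch of the idea behind `histogram_quantile`: find the cumulative bucket that contains the target rank, then interpolate linearly inside it. It assumes classic cumulative buckets sorted by upper bound and ending with a +Inf bucket, and it is an approximation of the concept rather than a copy of the Prometheus implementation. The bucket values are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or empty bucket): fall back to the lower edge.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 requests: 50 finished within 0.1s, 90 within 0.5s, 99 within 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.50, buckets))
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(0.99, buckets))
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a p99 that falls into a wide bucket can be off by a large margin, which is one reason to place bucket bounds near the latency targets you care about.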
Grafana – Visualization and Dashboards
Grafana is an open-source observability platform. It lets you query, visualize, alert on, and understand metrics regardless of where they are stored.
Key Grafana Features
- Dashboards: Visualize metrics with graphs, tables, heatmaps
- Alerting: Create alerts based on metric thresholds
- Data Sources: Prometheus, InfluxDB, Elasticsearch, Loki, SQL…
- Templating: Reusable dashboard variables
- Annotations: Mark events on timeline
Sample Grafana Dashboard JSON
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
ELK Stack – Logging Solution
The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular open-source solution for centralized logging. It lets you search, analyze, and visualize logs from multiple sources.
ELK Components
| Component | Purpose |
|---|---|
| Elasticsearch | Distributed search and analytics engine |
| Logstash | Data processing pipeline (parse, transform) |
| Kibana | Visualization interface |
| Beats | Lightweight shippers (Filebeat, Metricbeat…) |
ELK Architecture
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Log Files   │    │   Services   │    │   Systemd    │
│  (App logs)  │    │   (stdout)   │    │  (journal)   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────┐
│                            Beats                            │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │  Filebeat  │ │ Metricbeat │ │ Heartbeat  │ │ Packetbeat │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
└───────┼──────────────┼──────────────┼──────────────┼────────┘
        │              │              │              │
        └──────────────┼──────────────┼──────────────┘
                       │              │
                       ▼              ▼
                ┌─────────────┐  ┌───────────────┐
                │  Logstash   │  │ Elasticsearch │
                │ (Pipeline)  │──│   (Storage)   │
                └─────────────┘  └───────┬───────┘
                                         │
                                         ▼
                                 ┌───────────────┐
                                 │    Kibana     │
                                 │(Visualization)│
                                 └───────────────┘
Logstash Pipeline Example
input {
  beats {
    port => 5044
  }
}
filter {
  if [log_type] == "nginx" {
    grok {
      # Combined access log format; the literal brackets around the timestamp are escaped
      match => { "message" => '%{IPORHOST:client_ip} %{HTTPDUSER:ident} %{HTTPDUSER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:bytes} %{QS:referrer} %{QS:agent}' }
    }
    mutate {
      # Tag the event rather than appending to log_type, which the conditional above relies on
      add_tag => [ "processed" ]
    }
  }
  # Use the timestamp from the log line as the event time
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}
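The parsing step of the pipeline above can be sketched in Python to show what the grok pattern and date filter do to one line. The regular expression below is a hand-written approximation of the combined access log format, not a translation of the grok patterns, and the sample line is invented.

```python
import re
from datetime import datetime
from typing import Dict, Optional

# Approximation of the nginx combined access log format.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Turn one access log line into a structured event, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event: Dict[str, object] = dict(match.groupdict())
    event["status"] = int(match["status"])
    event["bytes"] = int(match["bytes"])
    # Same role as the date filter: use the log's own timestamp as the event time.
    event["@timestamp"] = datetime.strptime(match["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return event


sample = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /api/users?id=1 HTTP/1.1" 200 512 "-" "curl/8.0"'
event = parse_line(sample)
print(event["method"], event["request"], event["status"], event["@timestamp"].isoformat())
```

Lines that do not match return `None` here; in a real pipeline such events are usually tagged and kept so that parsing failures remain visible.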
Loki – Log Aggregation for Prometheus
Loki is a log aggregation system designed to work with Prometheus and Grafana. Unlike ELK, Loki does not build a full-text index of log content; it indexes only the labels attached to each log stream, which helps it scale well and stay cost-effective.
Loki vs ELK
| Aspect | Loki | ELK |
|---|---|---|
| Indexing | Label-based (cheap) | Full-text (expensive) |
| Storage | Object storage (S3) | Elasticsearch cluster |
| Scale | Excellent | Good (need sharding) |
| Query Speed | Fast for label queries | Fast for full-text |
| Cost | Lower | Higher |
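The difference between the two indexing approaches can be sketched with a toy store in Python. Only the label sets are indexed; the log lines of matching streams are then scanned for a substring. The class and its methods are hypothetical and only illustrate the trade-off in the table above, not Loki's actual storage format.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


class LabelIndexedStore:
    """Toy log store: one stream per unique label set, no index on line content."""

    def __init__(self) -> None:
        self.streams: Dict[LabelSet, List[str]] = defaultdict(list)

    def push(self, labels: Dict[str, str], line: str) -> None:
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: Dict[str, str], contains: str = "") -> List[str]:
        wanted = set(selector.items())
        results: List[str] = []
        for labels, lines in self.streams.items():
            if wanted <= labels:  # cheap step: compare label sets only
                # Expensive step: brute-force scan of the selected streams.
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedStore()
store.push({"job": "nginx", "env": "production"}, "GET /health 200")
store.push({"job": "nginx", "env": "production"}, "GET /api/orders 500")
store.push({"job": "systemlogs", "env": "production"}, "kernel: out of memory 500")

print(store.query({"job": "nginx"}, contains="500"))  # ['GET /api/orders 500']
```

The index stays small because it grows with the number of label combinations, not with log volume; the cost is that text searches must scan every line in the selected streams, so narrow label selectors matter.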
Promtail Configuration
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: 'system'
    static_configs:
      - targets:
          - localhost
        labels:
          job: 'systemlogs'
          env: 'production'
          __path__: '/var/log/*.log'
  - job_name: 'nginx'
    static_configs:
      - targets:
          - localhost
        labels:
          job: 'nginx'
          __path__: '/var/log/nginx/*.log'
Alerting – Notifying When Something Goes Wrong
Alerting is a critical component of monitoring. Alerts should be configured to fire on real issues, not on noise.
Alerting Best Practices
- Alert on symptoms, not causes: “High error rate” rather than “Database down”
- Set appropriate thresholds: use p95 rather than p50 for latency alerts
- Reduce noise: Use multi-window alerts (fire only if the condition persists)
- Route to the right people: PagerDuty, Slack, or email depending on severity
- Document runbooks: Every alert needs response instructions
Prometheus Alert Rules
groups:
  - name: example
    rules:
      # Alert when an instance is down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.job }} has been down for more than 5 minutes."
      # Alert when CPU usage is high
      - alert: HighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 10 minutes."
      # Alert when the disk is almost full
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk space is below 10%."
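The `for` clause in the rules above is what keeps short blips from paging anyone. The Python sketch below models that behavior as a small state machine: an alert is pending while its condition holds and only starts firing once the condition has held for the full duration. It is a simplified model with hypothetical names, evaluated at fixed timestamps, and is not taken from the Prometheus source.

```python
from typing import List, Optional


def alert_states(condition: List[bool], timestamps: List[int], for_seconds: int) -> List[str]:
    """Return the alert state at each evaluation: inactive, pending, or firing."""
    states: List[str] = []
    active_since: Optional[int] = None
    for ts, is_true in zip(timestamps, condition):
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= for_seconds else "pending")
    return states


# Evaluations every 60s with a 120s hold; the condition drops once at t=120.
timestamps = [0, 60, 120, 180, 240, 300, 360]
condition = [True, True, False, True, True, True, True]
print(alert_states(condition, timestamps, for_seconds=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```

The brief recovery at t=120 resets the timer, so the alert fires only after the second, sustained stretch; a longer hold trades detection speed for fewer false alarms.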
Distributed Tracing - OpenTelemetry
Distributed tracing tracks requests as they pass through multiple services. OpenTelemetry is an open-source standard for collecting traces, metrics, and logs.
Tracing Concepts
- Trace: Complete end-to-end request journey
- Span: Individual operation within a trace
- Context: The TraceId and SpanId that are propagated across service boundaries
- Attributes: Key-value metadata (user_id, endpoint, etc)
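Context propagation can be sketched without any tracing library. The Python example below passes a trace ID and span ID between two services through a `traceparent` header in the W3C Trace Context layout (`version-traceid-spanid-flags`). The helper names are hypothetical, and an instrumented service would normally let the OpenTelemetry SDK handle this.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)


def new_span_id() -> str:
    return secrets.token_hex(8)


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Read (trace_id, parent_span_id) from incoming headers, if present and well formed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


# Service A starts a trace and calls service B.
trace_id, span_a = new_trace_id(), new_span_id()
outgoing: Dict[str, str] = {}
inject(outgoing, trace_id, span_a)

# Service B continues the same trace with its own span, parented to A's span.
context = extract(outgoing)
assert context is not None
incoming_trace_id, parent_span_id = context
span_b = new_span_id()
print(incoming_trace_id == trace_id, parent_span_id == span_a)
```

Every service keeps the trace ID unchanged and creates a new span ID, which is what allows a backend to reassemble the spans into one end-to-end trace.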
OpenTelemetry Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Service A  │     │  Service B  │     │  Service C  │
│  (Span 1)   │────▶│  (Span 2)   │────▶│  (Span 3)   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │     OTel Collector      │
              │   (Receive, Process,    │
              │    Export traces)       │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │         Backend         │
              │    (Jaeger, Zipkin,     │
              │       Tempo, etc)       │
              └─────────────────────────┘
Best Practices for Monitoring and Logging
- Start with SLOs: Define Service Level Objectives before setting up monitoring
- Three pillars: Metrics, logs, and traces; all three matter
- Correlate data: Link metrics to logs to traces so you can debug faster
- Retention policies: Hot storage for recent data, cold storage for historical data
- Cost management: Sample logs and aggregate metrics to reduce storage costs
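One practical way to correlate logs with traces is to emit structured log lines that carry the trace ID. The sketch below uses Python's standard `logging` module with a small JSON formatter; the field names, the logger name `checkout`, and the trace ID value are illustrative assumptions.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object, including a trace_id if one was attached."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# The trace ID is passed via `extra`, which sets it as an attribute on the record.
logger.error("payment failed", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
print(stream.getvalue(), end="")
```

With the trace ID stored as a field, a log search for that ID returns every line produced while handling the request, and the same ID opens the matching trace in the tracing backend.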
FAQ - Frequently Asked Questions
- How do monitoring and logging differ? Monitoring is the collection and analysis of continuous metrics (CPU, memory, latency). Logging is the capture of discrete events (errors, user actions). Monitoring enables proactive alerting; logging enables retrospective analysis.
- Should I use Prometheus or Datadog? Prometheus is open-source and self-hosted, with no license fee, although you still pay for the infrastructure and the time needed to run it. Datadog is SaaS, offers many integrations, and has usage-based pricing. Prometheus suits teams that have the capacity to manage infrastructure; Datadog suits teams that want quick setup and advanced features.
- How long should logs be kept? It depends on compliance requirements. Typical ranges are 7-30 days for development, 30-90 days for production, and months to years where compliance demands it. Use tiered storage: hot (SSD) for recent data, cold (S3) for historical data.
- How do I debug issues in a distributed system? Use distributed tracing (Jaeger, Zipkin). Follow the request from entry to response across all services, and find the bottleneck by looking for spans with high latency.
- How do I avoid alert fatigue? Tune thresholds based on actual traffic patterns. Use multi-window alerts (fire only if the condition persists). Separate warning alerts from critical ones. Review and prune alerts regularly.