What Are Monitoring and Logging?
In production environments, monitoring and logging are essential for understanding system health, troubleshooting issues, and improving performance. Monitoring is the collection and analysis of metrics (CPU, memory, requests per second). Logging is the capture of discrete events (errors, user actions, system changes).
Together they provide visibility into the system, helping you spot issues before users notice them and diagnose problems when they do occur.
Why Does Monitoring Matter?
1. Proactive vs Reactive
Monitoring lets you detect problems before they become outages. Alerting when metrics exceed thresholds gives you time to fix issues before users are affected.
2. Performance Optimization
Data-driven decisions. Which endpoints are slow? Where is the bottleneck? Is scaling needed? Monitoring data answers these questions.
3. Capacity Planning
Forecast future needs based on usage trends. Avoid both over-provisioning (waste) and under-provisioning (performance issues).
4. Incident Response
When issues occur, monitoring data helps quickly identify the root cause, and logs provide an audit trail for forensics.
Monitoring Architecture
Metrics Collection
- Pull model: Prometheus scrapes metrics from targets (see the Go sketch below)
- Push model: Applications push metrics to an aggregator (e.g., StatsD)
- Agent-based: A DaemonSet collects node metrics (node_exporter)
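A minimal Go sketch of the pull model, using the Prometheus Go client library (prometheus/client_golang); the counter name, labels, and port are illustrative:

// Expose application metrics on /metrics so a Prometheus server can scrape them.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative request counter; promauto registers it with the default registry.
var httpRequests = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests.",
    },
    []string{"path", "status"},
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues(r.URL.Path, "200").Inc()
        w.Write([]byte("ok"))
    })
    // Prometheus scrapes this endpoint on the port named in scrape_configs.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}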
Time Series Database
A specialized database optimized for time-stamped data; examples include Prometheus, InfluxDB, and TimescaleDB. These systems support downsampling, retention policies, and efficient range queries.
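For example, the Prometheus server bounds its local TSDB retention with startup flags (the values here are illustrative):

# Keep 15 days of samples in the local TSDB
prometheus --storage.tsdb.retention.time=15d --storage.tsdb.path=/prometheus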
Visualization Layer
Dashboards display metrics over time. Common tools: Grafana, Datadog, AWS CloudWatch dashboards.
Key Metrics Categories
1. Infrastructure Metrics
| Metric | Description |
|---|---|
| CPU Usage | % CPU utilized |
| Memory Usage | % RAM utilized, available |
| Disk I/O | Read/write bytes per second |
| Disk Space | Used vs available |
| Network | Bytes in/out, packet drops |
2. Application Metrics
| Metric | Description |
|---|---|
| Request Rate | Requests per second |
| Latency (p50, p95, p99) | Response time distribution |
| Error Rate | % requests resulting in errors |
| Throughput | MB/s or requests/s |
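The latency percentiles above are usually derived from a histogram recorded inside the application. A hedged Go sketch, using the same client library as the earlier sketch; the metric name matches the PromQL examples later in this post, and the buckets are illustrative:

// imports: "net/http" plus prometheus and promauto from client_golang, as above

// Request durations recorded as a histogram; p50/p95/p99 are then computed
// with histogram_quantile() in PromQL.
var requestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets, // illustrative; tune per service
    },
    []string{"path"},
)

func handle(w http.ResponseWriter, r *http.Request) {
    timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.URL.Path))
    defer timer.ObserveDuration()
    w.Write([]byte("ok"))
}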
3. Business Metrics
| Metric | Description |
|---|---|
| Active Users | Concurrent users |
| Conversion Rate | % visitors completing action |
| Revenue | $$ per hour/day |
| API Calls | External API usage |
Prometheus – Monitoring System
Prometheus is an open-source monitoring and alerting system and a graduated CNCF project. It pulls metrics from targets, stores them as time series, and supports a powerful query language, PromQL.
Prometheus Architecture
- Prometheus Server: Scrapes targets, stores time series, serves queries
- Exporters: Expose metrics from existing systems (node_exporter, mysqld_exporter)
- Alertmanager: Deduplicates, groups, and routes alert notifications
- Grafana: Visualization
Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'web-app'
    static_configs:
      - targets: ['web-app:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
PromQL Examples
# CPU usage of web servers
avg(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance)
# Request rate
rate(http_requests_total[5m])
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate: 5xx responses per second
rate(http_requests_total{status=~"5.."}[5m])

# Error ratio: share of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Grafana – Dashboards
Grafana is an open-source visualization and analytics platform. It connects to Prometheus, InfluxDB, Elasticsearch, and many other data sources.
Grafana Dashboard Panels
- Graph: Time series visualization
- Stat: Single value display
- Gauge: Current value vs max
- Table: Tabular data
- Alert List: Current firing alerts
Kubernetes Monitoring Stack
# kube-prometheus-stack (Helm)
helm install prometheus prometheus-community/kube-prometheus-stack

# Includes:
# - Prometheus Operator
# - Grafana dashboards
# - node_exporter
# - kube-state-metrics
# - Alertmanager
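Once the stack is installed, the bundled Grafana is usually reached with a port-forward; the service name below assumes the Helm release is named prometheus and may differ in your cluster:

# Forward Grafana to http://localhost:3000
kubectl port-forward svc/prometheus-grafana 3000:80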
Logging Architecture
Log Levels
| Level | Use Case |
|---|---|
| DEBUG | Detailed debugging info (dev only) |
| INFO | General operational events |
| WARN | Potential issues, degradation |
| ERROR | Errors, failures |
| FATAL | Critical, system down |
Structured Logging
# Unstructured (hard to query)
echo "User login failed for user john at 2024-01-15"
# Structured (JSON - queryable)
{
"level": "warn",
"timestamp": "2024-01-15T10:30:00Z",
"message": "User login failed",
"user": "john",
"ip": "192.168.1.1",
"reason": "invalid_password"
}
Log Fields
- timestamp: When event occurred (ISO8601)
- level: Log level
- service: Which service generated the event
- trace_id: For request tracing across services
- user_id: If applicable
- message: Human-readable message
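A minimal Go sketch that emits logs with these fields, using the standard library's log/slog JSON handler (Go 1.21+); the field values are illustrative:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler -> structured, queryable log lines on stdout
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // timestamp, level, and message are added automatically;
    // service, trace_id, and user_id are illustrative extra fields
    logger.Warn("User login failed",
        "service", "auth-api",
        "trace_id", "abc123",
        "user_id", "john",
        "reason", "invalid_password",
    )
}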
ELK Stack – Elasticsearch, Logstash, Kibana
Components
- Elasticsearch: Stores and indexes logs; horizontally scalable
- Logstash: Parses and transforms logs before indexing
- Kibana: Web UI for searching and visualizing logs
- Beats: Lightweight shippers (Filebeat, Metricbeat)
Elasticsearch Index Pattern
# Index naming convention
myapp-2024.01.15
myapp-2024.01.16

# Index lifecycle management (ILM)
# - Hot: frequently written, most resources
# - Warm: less frequent queries
# - Cold: archive, rarely accessed
# - Delete: after retention period
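A hedged sketch of an ILM policy implementing such a lifecycle (the policy name myapp-policy and the age thresholds are illustrative):

PUT _ilm/policy/myapp-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d" } } },
      "warm":   { "min_age": "7d",  "actions": { "set_priority": { "priority": 50 } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}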
Filebeat Configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+yyyy.MM.dd}"

# Or ship to Logstash instead (only one output may be enabled)
output.logstash:
  hosts: ["logstash:5044"]
Loki – Log Aggregation
Loki is a log aggregation system developed by Grafana Labs. Unlike Elasticsearch, Loki does not index log contents, only labels, which makes it much cheaper to run and easier to scale.
Loki vs Elasticsearch
| Aspect | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only | Full-text |
| Cost | Lower | Higher |
| Scaling | Better | Good |
| Use case | Logs alongside metrics and traces | Search-heavy workloads |
Promtail (Loki Agent)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          env: production
          __path__: /var/log/myapp/*.log
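Once Promtail ships these logs, they can be queried from Grafana with LogQL, for example:

# All production log lines for the app that contain "error"
{job="myapp", env="production"} |= "error"

# Error log lines per second over the last 5 minutes
sum(rate({job="myapp"} |= "error" [5m]))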
Alerting
Alerting Rules
groups:
  - name: myapp-alerts
    rules:
      # High error rate (share of requests returning 5xx)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value | humanizePercentage }}"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
Alert Routing
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'default'   # the 'default' receiver posts to Slack

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ''
Distributed Tracing
Distributed tracing follows a request through multiple services. Each service creates spans, which are linked together by a shared trace_id.
Jaeger – OpenTracing
// Jaeger client in Go (github.com/uber/jaeger-client-go, OpenTracing API)
import (
    "log"

    "github.com/uber/jaeger-client-go"
    jaegerconfig "github.com/uber/jaeger-client-go/config"
)

cfg := jaegerconfig.Configuration{
    ServiceName: "myapp",
    // constant sampler with Param 1 samples every request
    Sampler: &jaegerconfig.SamplerConfig{
        Type:  jaeger.SamplerTypeConst,
        Param: 1,
    },
    // report spans to the local Jaeger agent
    Reporter: &jaegerconfig.ReporterConfig{
        LocalAgentHostPort: "jaeger:6831",
    },
}

tracer, closer, err := cfg.NewTracer()
if err != nil {
    log.Fatal(err)
}
defer closer.Close()
_ = tracer // typically registered via opentracing.SetGlobalTracer(tracer)
OpenTelemetry
OpenTelemetry is a vendor-neutral standard for observability signals (traces, metrics, logs). It supports multiple backends: Jaeger, Zipkin, Tempo, Datadog.
# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
Logging Best Practices
- Use structured logging: JSON format, searchable fields
- Don’t log sensitive data: Passwords, tokens, PII
- Include correlation IDs: trace_id for request tracking
- Log at appropriate levels: DEBUG for dev, WARN+ for prod
- Centralize logs: Aggregate from all services to one place
- Set retention policies: Balance cost vs compliance needs
Common Monitoring Patterns
RED Method (Rate, Errors, Duration)
# Rate - requests per second
rate(http_requests_total[5m])
# Errors - error rate
rate(http_requests_total{status=~"5.."}[5m])
# Duration - latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
USE Method (Utilization, Saturation, Errors)
- Utilization: % of time the resource is busy
- Saturation: how "full" the resource is (work queued but not yet serviced)
- Errors: count of error events
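A hedged PromQL sketch of USE for CPU, using standard node_exporter metric names:

# Utilization: share of CPU time not spent idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation (rough proxy): 1-minute load average per CPU
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])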
SLO and SLA
SLO (Service Level Objective)
An internal reliability target, for example: "99.9% of requests complete in under 200 ms." The team commits to it internally.
SLA (Service Level Agreement)
A contractual commitment to customers. It is usually less strict than the SLO (e.g., 99.5%), and missing it has financial consequences.
Error Budget
The amount of failure allowed by the SLO. A 99.9% SLO permits 0.1% downtime, roughly 43 minutes per month. If the budget is unused, the team can ship faster; if it is exhausted, the focus shifts to reliability work.
# Error budget calculation
# 99.9% SLO -> allowed error fraction = 0.001
# 30-day month = 43,200 minutes
# Allowed downtime = 43,200 * 0.001 = 43.2 minutes/month

# Burn-rate alerting: if 100% of the budget is consumed in 50% of the window,
# the budget is being burned too fast
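A hedged burn-rate sketch in PromQL for the 99.9% case; metric names match the earlier examples, and 14.4 is the commonly used fast-burn threshold (at that pace the monthly budget is gone in about two days):

# Burn rate = observed error ratio / allowed error ratio (1 - SLO = 0.001)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001

# Alert when the 1h burn rate exceeds 14.4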