What Are Monitoring and Logging?

In production environments, monitoring and logging are essential for understanding system health, troubleshooting issues, and improving performance. Monitoring is the collection and analysis of metrics (CPU, memory, requests/second). Logging is the capture of discrete events (errors, user actions, system changes).

Together, they provide visibility into the system: they help identify issues before users notice, and diagnose problems when they do occur.

Why Does Monitoring Matter?

1. Proactive vs Reactive

Monitoring lets you detect problems before they become outages. Alerting when metrics exceed thresholds gives you a chance to fix issues before users are affected.

2. Performance Optimization

It enables data-driven decisions. Which endpoints are slow? Where is the bottleneck? Is more capacity needed? Monitoring data answers these questions.

3. Capacity Planning

Forecast future needs based on usage trends. Avoid over-provisioning (wasted money) and under-provisioning (performance issues).

4. Incident Response

When issues occur, monitoring data helps quickly identify the root cause, and logs provide an audit trail for forensics.

Monitoring Architecture

Metrics Collection

  • Pull model: the Prometheus server scrapes metrics from targets (see the sketch below)
  • Push model: applications push metrics to an aggregator (StatsD)
  • Agent-based: an agent on each node collects node metrics (node_exporter, typically run as a Kubernetes DaemonSet)
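
To make the pull model concrete, here is a minimal sketch in Go using the prometheus/client_golang library (the metric name, handler, and port are illustrative): the application exposes a /metrics endpoint, and the Prometheus server scrapes it on a fixed interval.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter incremented on every request; Prometheus reads its
// current value each time it scrapes /metrics.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests served.",
})

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Inc()
        w.Write([]byte("ok"))
    })
    // Expose metrics for the Prometheus server to pull.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":9090", nil)
}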

Time Series Database

A specialized database optimized for time-stamped data. Examples: Prometheus, InfluxDB, TimescaleDB. These systems support downsampling, retention policies, and efficient range queries.

Visualization Layer

Dashboards display metrics over time. Common tools: Grafana, Datadog, AWS CloudWatch dashboards.

Key Metrics Categories

1. Infrastructure Metrics

Metric        Description
CPU Usage     % of CPU utilized
Memory Usage  % of RAM utilized, amount available
Disk I/O      Read/write bytes per second
Disk Space    Used vs available
Network       Bytes in/out, packet drops

2. Application Metrics

Metric                   Description
Request Rate             Requests per second
Latency (p50, p95, p99)  Response time distribution
Error Rate               % of requests resulting in errors
Throughput               MB/s or requests/s

3. Business Metrics

Metric           Description
Active Users     Concurrent users
Conversion Rate  % of visitors completing an action
Revenue          $ per hour/day
API Calls        External API usage

Prometheus – Monitoring System

Prometheus is an open-source monitoring and alerting system and a graduated CNCF project. Prometheus pulls metrics from targets, stores time series data, and supports a powerful query language (PromQL).

Prometheus Architecture

  • Prometheus Server: scrapes targets, stores time series, answers queries
  • Exporters: expose existing metrics (node_exporter, mysql_exporter)
  • Alertmanager: handles alerts, routes notifications
  • Grafana: visualization

Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'web-app'
    static_configs:
      - targets: ['web-app:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

PromQL Examples

# CPU usage of web servers
avg(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance)

# Request rate
rate(http_requests_total[5m])

# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate (fraction of requests returning 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Grafana – Dashboards

Grafana is an open-source visualization and analytics platform. It connects to Prometheus, InfluxDB, Elasticsearch, and many other data sources.

Grafana Dashboard Panels

  • Graph: Time series visualization
  • Stat: Single value display
  • Gauge: Current value vs max
  • Table: Tabular data
  • Alert List: Current firing alerts

Kubernetes Monitoring Stack

# kube-prometheus-stack (Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

# Includes:
# - Prometheus operator
# - Grafana dashboards
# - node_exporter
# - kube-state-metrics
# - Alertmanager
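
Once installed, one way to reach the bundled Grafana is a port-forward. The service name below assumes the release is named prometheus as in the command above; admin / prom-operator is the chart's default login.

# Access Grafana locally
kubectl port-forward svc/prometheus-grafana 3000:80
# Then open http://localhost:3000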

Logging Architecture

Log Levels

Level  Use Case
DEBUG  Detailed debugging info (dev only)
INFO   General operational events
WARN   Potential issues, degradation
ERROR  Errors, failures
FATAL  Critical failure, system down

Structured Logging

# Unstructured (hard to query)
echo "User login failed for user john at 2024-01-15"

# Structured (JSON - queryable)
{
  "level": "warn",
  "timestamp": "2024-01-15T10:30:00Z",
  "message": "User login failed",
  "user": "john",
  "ip": "192.168.1.1",
  "reason": "invalid_password"
}
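
As a sketch of producing the structured record above from application code, Go's standard log/slog package emits one JSON object per event (field names mirror the example; the values are illustrative):

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler writes one structured JSON object per log line.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    logger.Warn("User login failed",
        slog.String("user", "john"),
        slog.String("ip", "192.168.1.1"),
        slog.String("reason", "invalid_password"),
    )
    // {"time":"...","level":"WARN","msg":"User login failed","user":"john",...}
}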

Log Fields

  • timestamp: when the event occurred (ISO 8601)
  • level: log level
  • service: which service generated the event
  • trace_id: for tracing a request across services
  • user_id: if applicable
  • message: human-readable message

ELK Stack – Elasticsearch, Logstash, Kibana

Components

  • Elasticsearch: stores and indexes logs, scalable search
  • Logstash: parses and transforms logs before indexing
  • Kibana: web UI for searching and visualizing logs
  • Beats: lightweight shippers (Filebeat, Metricbeat)

Elasticsearch Index Pattern

# Index naming convention
myapp-2024.01.15
myapp-2024.01.16

# Index lifecycle management (ILM)
# - Hot: frequently written, most resources
# - Warm: less frequent queries
# - Cold: archive, rarely accessed
# - Delete: after retention period
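
These tiers map directly onto an ILM policy. A minimal sketch (the policy name and thresholds are illustrative, and the cold phase is omitted for brevity):

PUT _ilm/policy/myapp-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "set_priority": { "priority": 50 } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}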

Filebeat Configuration

filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+yyyy.MM.dd}"

# Or output to Logstash
output.logstash:
  hosts: ["logstash:5044"]

Loki – Log Aggregation

Loki is a log aggregation system developed by Grafana Labs. Unlike Elasticsearch, Loki does not index log contents, only labels. This makes Loki much cheaper to run and easier to scale.

Loki vs Elasticsearch

Aspect    Loki                   Elasticsearch
Indexing  Labels only            Full-text
Cost      Lower                  Higher
Scaling   Better                 Good
Use case  Logs, metrics, traces  Search-heavy workloads
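
Loki is queried with LogQL, which selects streams by labels first and only then filters log contents. Two illustrative queries against the labels defined in the Promtail config below:

# All production logs for myapp containing "error"
{job="myapp", env="production"} |= "error"

# Per-second rate of error lines, Prometheus-style
rate({job="myapp"} |= "error" [5m])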

Promtail (Loki Agent)

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          env: production
          __path__: /var/log/myapp/*.log

Alerting

Alerting Rules

groups:
  - name: myapp-alerts
    rules:
      # High error rate: more than 5% of requests failing
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m

Alert Routing

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ''

Distributed Tracing

Distributed tracing follows a request through multiple services. Each service creates spans, linked by a shared trace_id.

Jaeger – OpenTracing

// Jaeger client in Go (uber/jaeger-client-go packages)
import (
    "log"

    "github.com/uber/jaeger-client-go"
    jaegerconfig "github.com/uber/jaeger-client-go/config"
)

cfg := jaegerconfig.Configuration{
    ServiceName: "myapp",
    Sampler: &jaegerconfig.SamplerConfig{
        Type:  jaeger.SamplerTypeConst, // sample every request
        Param: 1,
    },
    Reporter: &jaegerconfig.ReporterConfig{
        LocalAgentHostPort: "jaeger:6831",
    },
}

tracer, closer, err := cfg.NewTracer()
if err != nil {
    log.Fatal(err)
}
defer closer.Close()
// Register tracer globally with opentracing.SetGlobalTracer(tracer).

OpenTelemetry

OpenTelemetry is a vendor-neutral standard for observability (traces, metrics, logs). It supports multiple backends: Jaeger, Zipkin, Tempo, Datadog.

# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
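
On the application side, a minimal Go sketch that sends traces to this collector over OTLP gRPC (the endpoint, tracer name, and span name are illustrative):

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()

    // Export spans to the collector's OTLP gRPC receiver (port 4317).
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    defer tp.Shutdown(ctx)
    otel.SetTracerProvider(tp)

    // Create a span; child spans started from ctx share its trace_id.
    ctx, span := otel.Tracer("myapp").Start(ctx, "handle-request")
    defer span.End()
}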

Logging Best Practices

  • Use structured logging: JSON format, searchable fields
  • Don’t log sensitive data: Passwords, tokens, PII
  • Include correlation IDs: trace_id for request tracking
  • Log at appropriate levels: DEBUG for dev, WARN+ for prod
  • Centralize logs: Aggregate from all services to one place
  • Set retention policies: Balance cost vs compliance needs

Common Monitoring Patterns

RED Method (Rate, Errors, Duration)

# Rate - requests per second
rate(http_requests_total[5m])

# Errors - error rate
rate(http_requests_total{status=~"5.."}[5m])

# Duration - latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

USE Method (Utilization, Saturation, Errors)

  • Utilization: % of time the resource is busy
  • Saturation: how "full" the resource is (queued work)
  • Errors: error rate (see the node-level PromQL sketch below)
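
A sketch of USE applied to a node's CPU and network, using standard node_exporter metrics:

# Utilization: fraction of time each node's CPUs are busy (not idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average per CPU
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network interface receive errors per second
rate(node_network_receive_errs_total[5m])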

SLO and SLA

SLO (Service Level Objective)

An internal target for reliability. Example: "99.9% of requests complete in < 200ms". The team commits to this internally.

SLA (Service Level Agreement)

A contractual obligation to customers. Usually less strict than the SLO (e.g., 99.5%). Missing an SLA has financial consequences.

Error Budget

The failure time allowed by the SLO. A 99.9% SLO allows 0.1% downtime, about 43.2 minutes in a 30-day month (see the calculation below). If the budget is unused, the team can ship changes faster; if it is exhausted, focus shifts to reliability work.

# Error budget calculation
# 99.9% = 0.001
# 30-day month = 43,200 minutes
# Allowed downtime = 43.2 minutes/month

# Burn rate alert (if 100% of budget consumed in 50% of time)
# Means you're burning budget too fast
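
The burn-rate idea above can be expressed in PromQL: divide the observed error ratio by the error ratio the SLO allows (0.001 for 99.9%). A sketch, reusing the request metrics from earlier:

# Burn rate = observed error ratio / allowed error ratio (0.001 for 99.9%)
# 1 means the budget lasts exactly the full window;
# 14.4 over 1h means a 30-day budget gone in ~2 days (a common fast-burn alert)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001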

Conclusion

Monitoring and logging are complementary: metrics tell you that something is wrong, while logs and traces tell you why. Collect the RED and USE signals, centralize structured logs, define SLOs with error budgets, and alert only on conditions that require human action.

Frequently Asked Questions (FAQ)

1. Why do you need both monitoring and logging?

Metrics show that something is wrong (symptoms and trends); logs show why (discrete events with context). Detection needs metrics, diagnosis needs logs.

2. Prometheus vs Datadog?

Prometheus is open source and self-hosted: no license cost, but you operate it yourself. Datadog is a managed SaaS: less operational work, usage-based pricing.

3. How do you handle log storage costs?

Set retention policies, tier older indices with ILM (hot/warm/cold/delete), and consider Loki, which indexes only labels rather than full log contents.

4. How do you avoid alert fatigue?

Alert on symptoms users actually feel, require a for: duration before an alert fires, route by severity, and page humans only for critical alerts.

5. How do you monitor Kubernetes?

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, node_exporter, kube-state-metrics, and Alertmanager in one install.
