What Are Monitoring and Logging?

In production environments, monitoring and logging are essential for understanding system health, troubleshooting issues, and improving performance. Monitoring is the collection and analysis of metrics (CPU, memory, requests/second). Logging is the capture of discrete events (errors, user actions, system changes).

Together, they provide visibility into the system: they help identify issues before users notice, and diagnose problems when they do occur.

Why Does Monitoring Matter?

1. Proactive vs Reactive

Monitoring lets you detect problems before they become outages. Alerting when metrics exceed thresholds gives you a chance to fix issues before users are affected.

2. Performance Optimization

It enables data-driven decisions. Which endpoints are slow? Where is the bottleneck? Is more capacity needed? Monitoring data answers these questions.

3. Capacity Planning

Forecast future needs based on usage trends. Avoid over-provisioning (wasted money) and under-provisioning (performance issues).

4. Incident Response

When issues occur, monitoring data helps quickly identify the root cause, and logs provide an audit trail for forensics.

Monitoring Architecture

Metrics Collection

  • Pull model: the Prometheus server scrapes metrics from targets (see the sketch below)
  • Push model: applications push metrics to an aggregator (StatsD)
  • Agent-based: an agent on each node collects node metrics (node_exporter, typically run as a Kubernetes DaemonSet)
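
To make the pull model concrete, here is a minimal sketch in Go using the prometheus/client_golang library (the metric name, handler, and port are illustrative): the application exposes a /metrics endpoint, and the Prometheus server scrapes it on a fixed interval.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter incremented on every request; Prometheus reads its
// current value each time it scrapes /metrics.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests served.",
})

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Inc()
        w.Write([]byte("ok"))
    })
    // Expose metrics for the Prometheus server to pull.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":9090", nil)
}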

Time Series Database

A specialized database optimized for time-stamped data. Examples: Prometheus, InfluxDB, TimescaleDB. These systems support downsampling, retention policies, and efficient range queries.

Visualization Layer

Dashboards display metrics over time. Common tools: Grafana, Datadog, AWS CloudWatch dashboards.

Key Metrics Categories

1. Infrastructure Metrics

Metric        Description
CPU Usage     % of CPU utilized
Memory Usage  % of RAM utilized, amount available
Disk I/O      Read/write bytes per second
Disk Space    Used vs available
Network       Bytes in/out, packet drops

2. Application Metrics

Metric                   Description
Request Rate             Requests per second
Latency (p50, p95, p99)  Response time distribution
Error Rate               % of requests resulting in errors
Throughput               MB/s or requests/s

3. Business Metrics

Metric           Description
Active Users     Concurrent users
Conversion Rate  % of visitors completing an action
Revenue          $ per hour/day
API Calls        External API usage

Prometheus – Monitoring System

Prometheus is an open-source monitoring and alerting system and a graduated CNCF project. Prometheus pulls metrics from targets, stores time series data, and supports a powerful query language (PromQL).

Prometheus Architecture

  • Prometheus Server: scrapes targets, stores time series, answers queries
  • Exporters: expose existing metrics (node_exporter, mysql_exporter)
  • Alertmanager: handles alerts, routes notifications
  • Grafana: visualization

Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'web-app'
    static_configs:
      - targets: ['web-app:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

PromQL Examples

# CPU usage of web servers
avg(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance)

# Request rate
rate(http_requests_total[5m])

# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate (fraction of requests returning 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Grafana – Dashboards

Grafana is an open-source visualization and analytics platform. It connects to Prometheus, InfluxDB, Elasticsearch, and many other data sources.

Grafana Dashboard Panels

  • Graph: Time series visualization
  • Stat: Single value display
  • Gauge: Current value vs max
  • Table: Tabular data
  • Alert List: Current firing alerts

Kubernetes Monitoring Stack

# kube-prometheus-stack (Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

# Includes:
# - Prometheus operator
# - Grafana dashboards
# - node_exporter
# - kube-state-metrics
# - Alertmanager
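
Once installed, one way to reach the bundled Grafana is a port-forward. The service name below assumes the release is named prometheus as in the command above; admin / prom-operator is the chart's default login.

# Access Grafana locally
kubectl port-forward svc/prometheus-grafana 3000:80
# Then open http://localhost:3000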

Logging Architecture

Log Levels

Level  Use Case
DEBUG  Detailed debugging info (dev only)
INFO   General operational events
WARN   Potential issues, degradation
ERROR  Errors, failures
FATAL  Critical failure, system down

Structured Logging

# Unstructured (hard to query)
echo "User login failed for user john at 2024-01-15"

# Structured (JSON - queryable)
{
  "level": "warn",
  "timestamp": "2024-01-15T10:30:00Z",
  "message": "User login failed",
  "user": "john",
  "ip": "192.168.1.1",
  "reason": "invalid_password"
}
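
As a sketch of producing the structured record above from application code, Go's standard log/slog package emits one JSON object per event (field names mirror the example; the values are illustrative):

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler writes one structured JSON object per log line.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    logger.Warn("User login failed",
        slog.String("user", "john"),
        slog.String("ip", "192.168.1.1"),
        slog.String("reason", "invalid_password"),
    )
    // {"time":"...","level":"WARN","msg":"User login failed","user":"john",...}
}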

Log Fields

  • timestamp: when the event occurred (ISO 8601)
  • level: log level
  • service: which service generated the event
  • trace_id: for tracing a request across services
  • user_id: if applicable
  • message: human-readable message

ELK Stack – Elasticsearch, Logstash, Kibana

Components

  • Elasticsearch: stores and indexes logs, scalable search
  • Logstash: parses and transforms logs before indexing
  • Kibana: web UI for searching and visualizing logs
  • Beats: lightweight shippers (Filebeat, Metricbeat)

Elasticsearch Index Pattern

# Index naming convention
myapp-2024.01.15
myapp-2024.01.16

# Index lifecycle management (ILM)
# - Hot: frequently written, most resources
# - Warm: less frequent queries
# - Cold: archive, rarely accessed
# - Delete: after retention period
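
These tiers map directly onto an ILM policy. A minimal sketch (the policy name and thresholds are illustrative, and the cold phase is omitted for brevity):

PUT _ilm/policy/myapp-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "set_priority": { "priority": 50 } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}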

Filebeat Configuration

filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+yyyy.MM.dd}"

# Or output to Logstash
output.logstash:
  hosts: ["logstash:5044"]

Loki – Log Aggregation

Loki is a log aggregation system developed by Grafana Labs. Unlike Elasticsearch, Loki does not index log contents, only labels. This makes Loki much cheaper to run and easier to scale.

Loki vs Elasticsearch

Aspect    Loki                   Elasticsearch
Indexing  Labels only            Full-text
Cost      Lower                  Higher
Scaling   Better                 Good
Use case  Logs, metrics, traces  Search-heavy workloads
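
Loki is queried with LogQL, which selects streams by labels first and only then filters log contents. Two illustrative queries against the labels defined in the Promtail config below:

# All production logs for myapp containing "error"
{job="myapp", env="production"} |= "error"

# Per-second rate of error lines, Prometheus-style
rate({job="myapp"} |= "error" [5m])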

Promtail (Loki Agent)

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          env: production
          __path__: /var/log/myapp/*.log

Alerting

Alerting Rules

groups:
  - name: myapp-alerts
    rules:
      # High error rate: more than 5% of requests failing
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m

Alert Routing

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ''

Distributed Tracing

Distributed tracing follows a request through multiple services. Each service creates spans, linked by a shared trace_id.

Jaeger – OpenTracing

// Jaeger client in Go (uber/jaeger-client-go packages)
import (
    "log"

    "github.com/uber/jaeger-client-go"
    jaegerconfig "github.com/uber/jaeger-client-go/config"
)

cfg := jaegerconfig.Configuration{
    ServiceName: "myapp",
    Sampler: &jaegerconfig.SamplerConfig{
        Type:  jaeger.SamplerTypeConst, // sample every request
        Param: 1,
    },
    Reporter: &jaegerconfig.ReporterConfig{
        LocalAgentHostPort: "jaeger:6831",
    },
}

tracer, closer, err := cfg.NewTracer()
if err != nil {
    log.Fatal(err)
}
defer closer.Close()
// Register tracer globally with opentracing.SetGlobalTracer(tracer).

OpenTelemetry

OpenTelemetry is a vendor-neutral standard for observability (traces, metrics, logs). It supports multiple backends: Jaeger, Zipkin, Tempo, Datadog.

# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
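
On the application side, a minimal Go sketch that sends traces to this collector over OTLP gRPC (the endpoint, tracer name, and span name are illustrative):

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()

    // Export spans to the collector's OTLP gRPC receiver (port 4317).
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    defer tp.Shutdown(ctx)
    otel.SetTracerProvider(tp)

    // Create a span; child spans started from ctx share its trace_id.
    ctx, span := otel.Tracer("myapp").Start(ctx, "handle-request")
    defer span.End()
}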

Logging Best Practices

  • Use structured logging: JSON format, searchable fields
  • Don’t log sensitive data: Passwords, tokens, PII
  • Include correlation IDs: trace_id for request tracking
  • Log at appropriate levels: DEBUG for dev, WARN+ for prod
  • Centralize logs: Aggregate from all services to one place
  • Set retention policies: Balance cost vs compliance needs

Common Monitoring Patterns

RED Method (Rate, Errors, Duration)

# Rate - requests per second
rate(http_requests_total[5m])

# Errors - error rate
rate(http_requests_total{status=~"5.."}[5m])

# Duration - latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

USE Method (Utilization, Saturation, Errors)

  • Utilization: % of time the resource is busy
  • Saturation: how "full" the resource is (queued work)
  • Errors: error rate (see the node-level PromQL sketch below)
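
A sketch of USE applied to a node's CPU and network, using standard node_exporter metrics:

# Utilization: fraction of time each node's CPUs are busy (not idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average per CPU
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network interface receive errors per second
rate(node_network_receive_errs_total[5m])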

SLO and SLA

SLO (Service Level Objective)

An internal target for reliability. Example: "99.9% of requests complete in < 200ms". The team commits to this internally.

SLA (Service Level Agreement)

A contractual obligation to customers. Usually less strict than the SLO (e.g., 99.5%). Missing an SLA has financial consequences.

Error Budget

The failure time allowed by the SLO. A 99.9% SLO allows 0.1% downtime, about 43.2 minutes in a 30-day month (see the calculation below). If the budget is unused, the team can ship changes faster; if it is exhausted, focus shifts to reliability work.

# Error budget calculation
# 99.9% = 0.001
# 30-day month = 43,200 minutes
# Allowed downtime = 43.2 minutes/month

# Burn rate alert (if 100% of budget consumed in 50% of time)
# Means you're burning budget too fast
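
The burn-rate idea above can be expressed in PromQL: divide the observed error ratio by the error ratio the SLO allows (0.001 for 99.9%). A sketch, reusing the request metrics from earlier:

# Burn rate = observed error ratio / allowed error ratio (0.001 for 99.9%)
# 1 means the budget lasts exactly the full window;
# 14.4 over 1h means a 30-day budget gone in ~2 days (a common fast-burn alert)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001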

Conclusion

Monitoring and logging are complementary: metrics tell you that something is wrong, while logs and traces tell you why. Collect the RED and USE signals, centralize structured logs, define SLOs with error budgets, and alert only on conditions that require human action.

Frequently Asked Questions (FAQ)

1. Why do you need both monitoring and logging?

Metrics show that something is wrong (symptoms and trends); logs show why (discrete events with context). Detection needs metrics, diagnosis needs logs.

2. Prometheus vs Datadog?

Prometheus is open source and self-hosted: no license cost, but you operate it yourself. Datadog is a managed SaaS: less operational work, usage-based pricing.

3. How do you handle log storage costs?

Set retention policies, tier older indices with ILM (hot/warm/cold/delete), and consider Loki, which indexes only labels rather than full log contents.

4. How do you avoid alert fatigue?

Alert on symptoms users actually feel, require a for: duration before an alert fires, route by severity, and page humans only for critical alerts.

5. How do you monitor Kubernetes?

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, node_exporter, kube-state-metrics, and Alertmanager in one install.
