Monitoring & Logging #
Monitoring tells you that something is wrong. Observability helps explain why.
The observability pillars #
Metrics #
Use metrics for trend analysis, SLO tracking, and alerting.
- request rate, latency, error rate
- saturation (CPU, memory, queue depth)
- business metrics (checkout success, signup conversion)
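
As a rough illustration of how the first group of metrics can be derived from raw request samples, the following sketch computes request rate, error rate, and p95 latency for a single window. The `Request` type, the `summarize` helper, and the sample data are hypothetical and not taken from any particular metrics library. In practice a metrics backend would usually do this aggregation for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # latency of one request
    status: int         # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Compute request rate, error rate, and p95 latency for one window."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "request_rate_per_s": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "p95_latency_ms": percentile(durations, 95) if total else None,
    }


# Hypothetical sample: 20 requests in a 10-second window, one of which failed.
sample = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 20)]
sample.append(Request(duration_ms=900.0, status=503))

summary = summarize(sample, window_seconds=10)
print(summary)
```

A percentile is used here instead of an average because a single slow request (the 900 ms one above) would distort a mean, while still being hidden from most users' experience.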
Logs #
Use structured logs for diagnostic depth.
- standardize fields (timestamp, service, env, trace_id)
- keep sensitive data (credentials, tokens, personal data) out of logs
- implement retention and archival policies
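
One possible way to apply the first two points with Python's standard `logging` module is sketched below. The `JsonFormatter` class, the `build_logger` helper, the `context` key, and the `SENSITIVE_KEYS` deny-list are assumptions made for this example, not part of any standard. Retention and archival are normally handled by the log backend and are not shown.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Assumed deny-list; a real one would be agreed on per organization.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of fields."""

    def __init__(self, service, env):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record):
        # Redact values whose key is on the deny-list before they are written.
        context = {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
            for key, value in getattr(record, "context", {}).items()
        }
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "env": self.env,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "context": context,
        }
        return json.dumps(payload)


def build_logger(service, env, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, env))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Usage with hypothetical values; the buffer stands in for a real log sink.
buffer = io.StringIO()
log = build_logger("checkout", "staging", stream=buffer)
log.info(
    "payment authorized",
    extra={"trace_id": "abc123", "context": {"order_id": 42, "token": "secret"}},
)
entry = json.loads(buffer.getvalue())
print(entry)
```

Carrying `trace_id` on every line is what lets a log entry be joined to the matching trace later.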
Traces #
Use distributed tracing to analyze end-to-end request paths.
- identify bottlenecks in service chains
- expose retry storms and timeout cascades
- speed up root-cause analysis
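
To make the bottleneck idea concrete, here is a minimal sketch that takes the spans of one trace and finds the span with the largest self time (its own duration minus time spent in direct children). The `Span` type and the sample trace are hypothetical, and the calculation assumes child spans do not overlap. A real tracing backend records far more detail than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span_id to duration minus time spent in direct children.

    Assumes children of the same parent run sequentially (no overlap).
    """
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def bottleneck(spans):
    """Return the span with the largest self time, and that self time."""
    times = self_times(spans)
    by_id = {s.span_id: s for s in spans}
    worst_id = max(times, key=times.get)
    return by_id[worst_id], times[worst_id]


# Hypothetical trace: gateway -> checkout -> (inventory, payments)
trace = [
    Span("a", None, "gateway", 0, 500),
    Span("b", "a", "checkout", 20, 480),
    Span("c", "b", "inventory", 30, 90),
    Span("d", "b", "payments", 100, 460),
]

slowest, self_ms = bottleneck(trace)
print(slowest.service, self_ms)
```

Self time is used instead of total duration because the root span is always the longest one, which says nothing about where the time was actually spent.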
Alerting principles #
- Alert only on symptoms that affect users or key systems
- Route alerts to clear owners
- Include runbook links in every alert
- Tune for signal-to-noise ratio; avoid alert fatigue
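
The principles above can be sketched as a small rule evaluator: the rule carries an owner and a runbook link, and it fires only after several consecutive breaching windows, which is one simple way to cut noise from short spikes. The `AlertRule` type, the `evaluate` function, the owner name, and the runbook URL are all hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    owner: str        # team or rotation that receives the alert
    runbook_url: str  # link included in every notification
    threshold: float  # error-rate threshold, e.g. 0.05 for 5%
    for_windows: int  # consecutive breaching windows required (>= 1)


def evaluate(rule, error_rates):
    """Return a notification dict if the most recent `for_windows` samples
    all breach the threshold; otherwise return None."""
    recent = error_rates[-rule.for_windows:]
    if len(recent) < rule.for_windows:
        return None  # not enough data yet
    if all(rate > rule.threshold for rate in recent):
        return {
            "alert": rule.name,
            "route_to": rule.owner,
            "runbook": rule.runbook_url,
            "current_error_rate": recent[-1],
        }
    return None


rule = AlertRule(
    name="CheckoutHighErrorRate",
    owner="payments-oncall",                                      # hypothetical owner
    runbook_url="https://runbooks.example.com/checkout-errors",  # hypothetical URL
    threshold=0.05,
    for_windows=3,
)

spike = [0.01, 0.20, 0.01, 0.02]      # one bad window: should stay quiet
sustained = [0.01, 0.08, 0.09, 0.12]  # three bad windows in a row: should fire

print(evaluate(rule, spike))
fired = evaluate(rule, sustained)
print(fired)
```

Requiring a sustained breach trades a little detection delay for fewer false pages; the right window count depends on how quickly the symptom harms users.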
Recommended stack (example) #
- Prometheus for metrics scraping
- Grafana for dashboards and alerting views
- Loki or ELK/OpenSearch for log search
- OpenTelemetry for vendor-neutral instrumentation
Operational maturity checklist #
- Service-level dashboards defined
- SLOs tracked for critical services
- Structured logging baseline implemented
- Tracing enabled on high-value transactions
- On-call alerts tied to runbooks
- Post-incident observability gaps tracked