Cloud & DevOps · Intermediate

Observability: Logging, Metrics, and Tracing (ELK/Prometheus)

Simha Infobiz
December 22, 2025

In the old days of monoliths, debugging was easy: you SSH'd into the server and checked /var/log/syslog. Today, a single user request might touch 20 microservices, 5 databases, and 3 queues. If the user says "It's slow," where do you look?

Monitoring vs. Observability

  • Monitoring tells you when something is wrong. "Alert: CPU is at 99%." It answers "Known Unknowns" (things you knew to look for).
  • Observability allows you to ask why it is wrong. "Why is payment processing taking 5 seconds only for iOS users in Germany?" It answers "Unknown Unknowns."

The Three Pillars (and how they connect)

1. Metrics (The "What")

Numeric data measured over time.

  • Use for: Holistic health, SLIs/SLOs, and triggering alerts.
  • Golden Signals: Traffic, Latency, Errors, Saturation.
  • Tools: Prometheus, Grafana, Datadog.
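To make the Golden Signals concrete, here is a minimal sketch that summarizes one time window of requests into the four numbers. It is plain Python, not the Prometheus client; `RequestSample`, `golden_signals`, the 5xx-means-error rule, and the `in_flight / capacity` definition of saturation are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def golden_signals(samples, window_seconds, in_flight, capacity):
    """Summarize one non-empty window of requests into the four golden signals."""
    if not samples:
        raise ValueError("need at least one sample in the window")
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "traffic_rps": len(samples) / window_seconds,
        "latency_p95_ms": percentile([s.latency_ms for s in samples], 95),
        "error_rate": errors / len(samples),
        # Assumption: saturation is in-flight requests over a fixed capacity.
        "saturation": in_flight / capacity,
    }


# Invented data: 19 fast successful requests and one slow failure.
samples = [RequestSample(latency_ms=10.0 * i, status=200) for i in range(1, 20)]
samples.append(RequestSample(latency_ms=900.0, status=503))

signals = golden_signals(samples, window_seconds=10, in_flight=40, capacity=50)
print(signals)
```

In a real setup the application would only expose counters and histograms, and a system such as Prometheus would compute the rates and percentiles at query time.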

2. Tracing (The "Where")

The lifecycle of a request as it traverses your distributed system.

  • Context Propagation: A "Trace ID" is generated at the ingress (for example, the load balancer or API gateway) and passed to every downstream service, typically in a request header such as the W3C traceparent header.
  • Spans: Each unit of work (e.g., "Query Database", "Call Auth Service") is timed.
  • Tools: Jaeger, Tempo, Honeycomb.
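Context propagation and spans can be sketched in a few lines of standard-library Python. This is a toy, not a real tracing SDK: the header value follows the W3C traceparent layout (version, trace ID, parent span ID, flags), while `span`, `finished_spans`, and the service functions are hypothetical names, and the in-memory list stands in for an exporter to a backend such as Jaeger or Tempo.

```python
import secrets
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter to a tracing backend


def new_traceparent():
    """Create a traceparent value at the ingress: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


@contextmanager
def span(name, headers):
    """Time one unit of work and hand updated headers to downstream calls."""
    _version, trace_id, parent_id, flags = headers["traceparent"].split("-")
    span_id = secrets.token_hex(8)
    # Same trace ID, new span ID: this is what gets propagated downstream.
    child_headers = {**headers, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    start = time.perf_counter()
    try:
        yield child_headers
    finally:
        finished_spans.append({
            "name": name,
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def auth_service(headers):
    with span("Call Auth Service", headers):
        time.sleep(0.001)  # pretend to do work


def handle_request(headers):
    with span("Handle Checkout", headers) as child:
        auth_service(child)
        with span("Query Database", child):
            time.sleep(0.002)  # pretend to do work


incoming = {"traceparent": new_traceparent()}
handle_request(incoming)
for s in finished_spans:
    print(f'{s["name"]:<20} trace={s["trace_id"][:8]} {s["duration_ms"]:.1f} ms')
```

Every span carries the same trace ID, and each child records its parent's span ID, which is what lets a backend reassemble the request as a tree.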

3. Logging (The "Why")

Detailed, discrete events.

  • Use for: Forensic details when you've already narrowed down the problem via metrics and tracing.
  • Best Practice: Structured Logging (JSON) to make them queryable.
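As a rough illustration of structured logging, the sketch below uses Python's standard `logging` module with a small custom formatter that emits one JSON object per event. The `JsonFormatter` class, the field names, the `payment-service` logger name, and the sample trace ID are assumptions for the example, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # in-memory sink so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

logger.error(
    "Database connection timeout",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)

event = json.loads(stream.getvalue())
print(event["level"], event["trace_id"], event["message"])
```

Because every event is a JSON object with a `trace_id` field, a log store can filter on that field directly instead of relying on free-text search.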

The Holy Grail: Correlation

True observability isn't just having these tools; it's linking them.

  1. Alert fires: "High Error Rate" (Metric).
  2. Click graph: Jump to the specific Trace IDs that failed (Exemplars).
  3. View Trace: See the request failed at the "Payment Service".
  4. View Logs: Jump to logs for that specific service and Trace ID.
  5. Root Cause: "Database Connection Timeout."
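The workflow above boils down to joining three data sets on the Trace ID. Here is a small sketch with invented in-memory data; the record shapes, the `root_causes` function, and the trace IDs are hypothetical and only show the join, not how any particular tool stores exemplars, spans, or logs.

```python
# All data below is invented for illustration.
error_rate_exemplars = [
    {"value": 1, "trace_id": "trace-a1"},
    {"value": 1, "trace_id": "trace-b2"},
]

spans = [
    {"trace_id": "trace-a1", "service": "API Gateway", "status": "ok"},
    {"trace_id": "trace-a1", "service": "Payment Service", "status": "error"},
    {"trace_id": "trace-b2", "service": "API Gateway", "status": "ok"},
    {"trace_id": "trace-b2", "service": "Payment Service", "status": "error"},
]

logs = [
    {"trace_id": "trace-a1", "service": "Payment Service",
     "message": "Database Connection Timeout"},
    {"trace_id": "trace-a1", "service": "API Gateway",
     "message": "request received"},
    {"trace_id": "trace-b2", "service": "Payment Service",
     "message": "Database Connection Timeout"},
]


def root_causes(exemplars, spans, logs):
    """Follow metric exemplars to failed spans, then to the matching logs."""
    findings = []
    for exemplar in exemplars:
        trace_id = exemplar["trace_id"]
        # Step 3: which services failed within this trace?
        failed = [s["service"] for s in spans
                  if s["trace_id"] == trace_id and s["status"] == "error"]
        for service in failed:
            # Step 4: logs for that service and that Trace ID only.
            messages = [entry["message"] for entry in logs
                        if entry["trace_id"] == trace_id
                        and entry["service"] == service]
            findings.append((trace_id, service, messages))
    return findings


findings = root_causes(error_rate_exemplars, spans, logs)
for trace_id, service, messages in findings:
    print(trace_id, service, messages)
```

The join only works if every signal carries the same Trace ID, which is why propagation and structured logs matter before any dashboard does.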

OpenTelemetry (OTel)

The industry is coalescing around OpenTelemetry, a vendor-neutral standard for generating and collecting this data. Instead of locking yourself into a specific vendor's agent, you instrument once with OTel and can send the data to the backend of your choice (Prometheus, Splunk, New Relic) by changing exporter configuration rather than rewriting application code.
