2025-05-14

The 3 Pillars of Observability

Building a Robust Observability Strategy

What is Observability?

This post defines observability, its three pillars (logs, metrics, traces), and why it matters in distributed systems.

Observability in IT and cloud computing refers to the processes and associated tools that let teams collect, aggregate, and correlate real-time data and events. With that data, teams can analyze what is happening in a network, system, or environment and achieve better overall service outcomes.

Observability allows teams to better visualize and understand large computing networks (systems). This allows them to:

  • Identify root causes of performance bottlenecks or issues
  • Discover anomalous data patterns
  • Help the business minimize downtime while maximizing reliability

Observability depends on system events to determine the what, when and why.

3 pillars of Observability (Logs, Metrics, and Traces)

  1. Logs: Archival or historical records of system events and errors, stored as plain text or structured data (for example, JSON).
  2. Metrics: Numerical measurements of system performance and behavior.
  3. Traces: Representations of individual requests or transactions flowing through a system. Traces help identify dependency bottlenecks and root causes of issues.

Combining all three gives a holistic view of a system.

Logs

Logs are immutable records of discrete events occurring in a system.

Logs provide detailed information about the system, such as:

  • Event timestamps, transaction IDs, IP addresses, user IDs, the event/request itself, process details, error messages.

Observability tools aggregate log data to help dev teams understand system failures and errors.

Log data can be voluminous and may require sophisticated log management tools.

Excessive logging can become noise.
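
To make the fields listed above concrete, here is a minimal sketch of structured (JSON) logging using only Python's standard library. The field names `transaction_id`, `user_id`, and `client_ip`, and the helper `build_logger`, are assumptions chosen for illustration; they are not a standard schema. Setting the level to INFO is one simple way to keep debug chatter from turning into noise.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, attached through logging's `extra` argument.
    CONTEXT_FIELDS = ("transaction_id", "user_id", "client_ip")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy over only the context fields that were actually supplied.
        for name in self.CONTEXT_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # DEBUG records are dropped to limit noise
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.error(
        "payment declined",
        extra={"transaction_id": "tx-123", "user_id": "u-42", "client_ip": "203.0.113.7"},
    )
```

Because every record is a single JSON object, a log management tool can filter or aggregate on fields such as `transaction_id` instead of parsing free text.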

Metrics

Metrics are quantitative values that help analyze performance over time, for example:

  • KPIs
  • Host Metrics: memory, disk and CPU usage
  • Network Performance: uptime, latency, throughput
  • Application metrics: response times, request and error rates
  • Server pool metrics: number of instances

Metrics are often summarized in time-series graphs to help assess health and trends.

Metric thresholds tied to alerts can help staff stay on top of current and impending issues.
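
As a rough illustration of a threshold tied to an alert, the sketch below keeps a sliding window of metric samples and reports whether their average exceeds a limit. The class name `ThresholdAlert`, the 5% error-rate threshold, and the 60-second window are all assumptions made for the example; real monitoring systems offer far richer alert rules.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


@dataclass
class ThresholdAlert:
    """Fire when the average of recent samples exceeds a threshold."""

    name: str
    threshold: float
    window_seconds: float

    def __post_init__(self) -> None:
        # Each sample is a (timestamp, value) pair, oldest first.
        self._samples: Deque[Tuple[float, float]] = deque()

    def record(self, timestamp: float, value: float) -> None:
        """Add a sample and drop samples older than the window."""
        self._samples.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def average(self) -> Optional[float]:
        """Mean of the samples in the window, or None if there are none."""
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)

    def is_firing(self) -> bool:
        avg = self.average()
        return avg is not None and avg > self.threshold


if __name__ == "__main__":
    alert = ThresholdAlert("error_rate", threshold=0.05, window_seconds=60)
    alert.record(timestamp=0, value=0.01)
    alert.record(timestamp=30, value=0.02)
    print(alert.is_firing())  # False: the windowed average is below 5%
    alert.record(timestamp=70, value=0.20)
    print(alert.is_firing())  # True: the oldest sample expired and the average rose
```

Averaging over a window rather than alerting on a single sample is a design choice that trades a little detection delay for fewer alerts on momentary spikes.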

Traces

Logs and metrics help teams understand the behavior of individual systems, while traces give visibility into the lifetime (journey) of a request through a system.

  • End-to-end journey of a request through a system or network
  • Captures the path of a request and the time spent in each component involved in processing it
  • Especially useful for large distributed systems
  • Requires instrumentation across systems (coordination)
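
The idea of a trace can be sketched in a few lines: every unit of work becomes a span that shares a trace ID with its parent and records how long it took. The `Tracer` and `Span` classes below are hypothetical, in-process stand-ins written for illustration, and the operation names are made up. In a real distributed system the trace ID has to be passed between services (for example in request headers), which is where the cross-system coordination comes in.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One timed step in the journey of a request."""

    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: Optional[float] = None

    @property
    def duration(self) -> float:
        """Elapsed seconds, or 0.0 while the span is still open."""
        return 0.0 if self.end is None else self.end - self.start


class Tracer:
    """Collects spans; nested spans inherit the trace ID of their parent."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, trace_id: Optional[str] = None) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            # An incoming trace_id (e.g. from an upstream service) is reused
            # for root spans; otherwise a new trace is started.
            trace_id=parent.trace_id if parent else (trace_id or uuid.uuid4().hex),
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            start=time.perf_counter(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("GET /checkout"):
        with tracer.span("auth-service"):
            time.sleep(0.01)
        with tracer.span("payment-service"):
            time.sleep(0.02)
    for s in tracer.finished:
        print(f"{s.trace_id[:8]} {s.name:<16} parent={s.parent_id} {s.duration * 1000:.1f} ms")
```

Sorting the finished spans by start time and grouping them by `parent_id` reconstructs the path of the request, and the per-span durations point at the slowest dependency.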
