parikshan

System: Observability Stack

Reading time: ~24 minutes  ·  Prerequisites: Intervals, Math, Matrix & Heaps, Sliding Window  ·  Capstone: observability-stack design problem (parikshan systems track). Ingest 1M metrics/sec from 10K hosts, serve last-hour p99 across 200 services in under 500 ms.

Observability is the discipline of being able to ask any question of a running production system from the outside. Three signals make up the stack: metrics (numbers over time), logs (text events), and traces (causal chains across services). The platforms that serve them (Datadog, Honeycomb, Grafana Cloud, Splunk, New Relic) all converge on the same architectural shape, with deeply different bets on how to store the data cheaply. This primer walks the shape, the trade-offs, and where the algorithms from the bank actually do the heavy lifting.

The product behind the system

Datadog is the market leader, with around 28,000 paying customers and tens of trillions of events ingested per day. The agent runs on every host and pushes metrics every 15 seconds via the StatsD/DogStatsD protocol. Datadog publishes that a single host makes about 4 requests/min to its intake API, and it meters custom metrics at 100 per host (Pro tier) or 200 (Enterprise).

Honeycomb built around high-cardinality structured events instead of pre-aggregated metrics. Their pitch is "you cannot ask the question you did not pre-aggregate for", and their storage engine (Retriever) is designed to scan billions of events per query. Honeycomb publicly handles workloads in the hundreds of TB ingested per day.

Grafana Cloud / Grafana Labs runs the open-source Grafana, Prometheus, Loki, Tempo, Mimir, and Pyroscope stack as a managed service. Each tool specialises: Mimir for metrics (a sharded Prometheus), Loki for logs, Tempo for traces, Pyroscope for profiles. The fact that the stack is split into four pieces tells you something important about the storage trade-offs.

Splunk is the older platform, dominant in security and enterprise. It indexes everything as text (heavyweight, expensive, query-flexible) and has been the price reference customers compare every other vendor against.

New Relic, AppDynamics, Dynatrace round out the commercial APM space. Architecturally they look like Datadog with different proprietary storage and a stronger on-prem/installer story.

What the requirements actually are

Functional:

  1. Ingest metrics (counters, gauges, histograms), logs (structured or free text), traces (spans with parent IDs).
  2. Store for the configured retention window: typically 15 months for metrics and 30 days for logs and traces.
  3. Query: dashboard queries (p99 latency over 200 services, last hour), ad-hoc investigations (show me all errors from this user's request).
  4. Alerting: detect when a metric crosses a threshold or a derivative shifts; page the right team.
  5. Correlation: jump from a metric anomaly to the logs from the affected hosts to the traces of the slow requests.
  6. High-cardinality dimensions: support tagging by user_id, request_id, build_id; not just by host and service.

Non-functional:

  1. Ingest throughput: millions of events per second per customer for the largest tenants. At tens of trillions of events per day, the total Datadog firehose averages on the order of hundreds of millions of events per second.
  2. Query latency: sub-second dashboards on hot data, seconds for ad-hoc queries over a day, accept minutes for queries over a year.
  3. Cost: the industry-defining constraint. Customers churn on observability bills more than on anything else.
  4. Durability: do not lose data even when the customer's own systems are on fire (you are needed most during outages).
  5. Cardinality limits: the system must survive a customer accidentally tagging every metric with a unique request_id (this happens monthly to every vendor).
  6. Real-time alerting: from event-time to page in under 60 seconds is table stakes; under 10 seconds is competitive.

The architect's framing

Six components, each scaling separately:

  1. Agent / SDK: runs on every host or in every process. Buffers, batches, compresses, ships. The agent's reliability is the whole product's reliability.
  2. Ingest gateway: receives the firehose, authenticates, applies per-tenant rate limits, dispatches by signal type to the appropriate storage pipeline.
  3. Stream processor: extracts derived signals (errors per second from a log stream, span durations into a histogram), enriches with metadata, fans out to storage.
  4. Time-series store (for metrics): columnar, downsampled, optimised for range scans across time. Prometheus's TSDB, InfluxDB, Datadog's "Husky" (their internal evolution), VictoriaMetrics.
  5. Event store (for logs and traces): row-oriented or columnar, indexed by tags. Examples: Loki (indexes only labels), ClickHouse (columnar, used for full-text log search), Honeycomb's Retriever, and S3-backed scanning architectures.
  6. Query layer: parses the query language (PromQL, LogQL, NRQL, TraceQL, SQL), plans across stores, returns to the caller. The query layer is where the three signals get unified.

A seventh component, the alerting engine, runs queries on a schedule and fires notifications.

hosts / pods / lambdas
     |     |     |
     v     v     v
  +-----------------+
  |   Agents (SDK)  |   batch, compress, ship every 10s
  +--------+--------+
           |
           v
  +------------------+
  |   Ingest gateway |   auth, rate-limit, dispatch by signal
  +--------+---------+
           |
   +---+---+---+----+
   |   |   |   |    |
   v   v   v   v    v
+--+ +--+ +--+ +--+ +-----+
|M | |L | |T | |E |  ...
+--+ +--+ +--+ +--+
metrics logs traces events
   |     |     |
   v     v     v
+----+ +----+ +----+
|TSDB| |Log | |Span|       different stores, different compression,
|cols| |stor| |stor|       different retention. They share metadata.
+----+ +----+ +----+
   |     |     |
   +-----+-----+
         |
         v
+------------------+      +------------------+
|   Query layer    |<---->|   Alerting       |
|  (PromQL, etc)   |      |   (scheduled q)  |
+--------+---------+      +------------------+
         |
         v
  Dashboards, alerts, APIs

The asymmetry to notice: ingest is one-way, structured, and bursty. Query is two-way, ad-hoc, and concurrent. Modern stacks aggressively split the two paths, often deploying separate clusters: an ingest cluster sized for throughput and a query cluster sized for memory.
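The per-tenant rate limit applied at the ingest gateway is commonly built as a token bucket. The following is a minimal sketch under stated assumptions: the class name, the rate and burst parameters, and the tenant name are all hypothetical, and the clock is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Per-tenant limiter: refill `rate` tokens/sec, hold at most `burst`."""
    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now, cost=1):
        # Refill for the time elapsed since the last call, capped at burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets = {"tenant-a": TokenBucket(rate=2, burst=3)}  # hypothetical tenant
decisions = [buckets["tenant-a"].allow(now=0.0) for _ in range(4)]
later = buckets["tenant-a"].allow(now=1.0)
print(decisions, later)
```

The burst absorbs a short spike, the fourth request in the same instant is refused, and one second later the bucket has refilled enough to admit traffic again.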

The trade-offs we will name

Metrics: pre-aggregated vs raw events. Datadog and Prometheus pre-aggregate at the agent (the agent emits a count, a sum, and a histogram for the last 10 seconds, not the raw events). This is cheap to store and fast to query. The cost is that you cannot ask new questions of old data: if you did not record the dimension at ingest time, you cannot break it down later. Honeycomb made the opposite bet: store the raw events at full cardinality, accept higher storage cost, get full flexibility. The pre-aggregated approach scales to billions of points; the raw approach scales to billions of dollars in storage bills. Most teams mix: pre-aggregate the high-volume metrics, keep raw events for traces.
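To make the pre-aggregation trade-off concrete, here is a rough sketch of what an agent might do with raw events inside one 10-second window. The function name, the event shape (timestamp, service, latency in ms), and the bucket bounds are illustrative assumptions, not any vendor's wire format.

```python
from collections import defaultdict

def pre_aggregate(events, window=10, bounds=(50, 100, 250, 500, 1000)):
    """Collapse raw (ts, service, latency_ms) events into per-window summaries.

    Returns {(window_start, service): {"count", "sum", "buckets"}}. Slot i of
    buckets counts latencies in (bounds[i-1], bounds[i]]; the last slot is
    overflow for anything above the largest bound.
    """
    out = defaultdict(lambda: {"count": 0, "sum": 0.0,
                               "buckets": [0] * (len(bounds) + 1)})
    for ts, service, latency in events:
        key = (int(ts // window) * window, service)
        agg = out[key]
        agg["count"] += 1
        agg["sum"] += latency
        # First bucket whose upper bound covers the latency; else overflow.
        idx = next((i for i, b in enumerate(bounds) if latency <= b), len(bounds))
        agg["buckets"][idx] += 1
    return dict(out)

raw = [(0.5, "checkout", 40), (3.2, "checkout", 120), (9.9, "checkout", 2000),
       (10.1, "checkout", 60)]
summary = pre_aggregate(raw)
print(summary[(0, "checkout")])
```

Four events become two small rows. Any dimension that was not part of the key (a user ID, say) is gone after this step, which is exactly the "cannot ask new questions of old data" cost.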

Logs: indexed vs scanned. Splunk indexes every term; expensive but searches are fast on any field. Loki indexes only the labels (host, service, level) and scans the log bodies on demand; cheap storage, slow ad-hoc text search. Datadog Logs falls between: index on a configurable set of fields, scan the rest. Pick based on whether the access pattern is "look up known structured field" (Loki wins on cost) or "free-text search across everything" (Splunk-style indexed wins on latency).
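The label-index-plus-scan approach can be illustrated with a toy store. This is a sketch of the idea only, with hypothetical class and method names; it is not Loki's actual chunk format or API.

```python
from collections import defaultdict

class LabelIndexedLogs:
    """Index labels only; scan log bodies at query time."""
    def __init__(self):
        self.chunks = defaultdict(list)   # frozenset(labels) -> [line, ...]

    def write(self, labels, line):
        self.chunks[frozenset(labels.items())].append(line)

    def query(self, label_filter, needle):
        wanted = set(label_filter.items())
        hits, scanned = [], 0
        for labels, lines in self.chunks.items():
            if not wanted <= labels:      # index check: skip the whole chunk
                continue
            for line in lines:            # brute-force scan of the bodies
                scanned += 1
                if needle in line:
                    hits.append(line)
        return hits, scanned

logs = LabelIndexedLogs()
logs.write({"service": "api", "level": "error"}, "timeout calling db")
logs.write({"service": "api", "level": "info"}, "request ok")
logs.write({"service": "web", "level": "error"}, "timeout rendering")
hits, scanned = logs.query({"service": "api"}, "timeout")
print(hits, scanned)
```

The label filter prunes the "web" chunk without reading it; everything that survives the filter is scanned line by line, so query cost tracks how selective the labels are rather than how rare the search term is.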

Storage: columnar vs row-oriented. Time-series queries are almost always "give me one column across many rows", which is where columnar storage (Parquet, custom columnar formats like Datadog's Husky) wins by orders of magnitude on scan cost. Row-oriented (Splunk's original architecture) is more flexible but pays for it on every range query. The industry has converged on columnar with row-oriented as a legacy.

Sampling: head-based vs tail-based. For traces at scale, you cannot store everything. Head-based sampling decides at the start of a request whether to keep its trace (cheap, decision based on no signal). Tail-based decides after the trace completes whether it was interesting (expensive, decision based on duration, errors, anomaly). Honeycomb popularised tail-based sampling for the industry; it captures the 0.1% slow tail that head-based usually misses, at the cost of a buffer that holds every in-flight trace.
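The two sampling strategies differ mainly in when the decision is made and what it can see. A small sketch, assuming hypothetical span dictionaries with duration_ms and error fields and an arbitrary 1-second "slow" threshold:

```python
import zlib
from collections import defaultdict

def head_keep(trace_id, rate=0.01):
    """Head-based: decide from the trace ID alone, before any span exists."""
    return (zlib.crc32(trace_id.encode()) % 10_000) < rate * 10_000

class TailSampler:
    """Tail-based: buffer spans per trace, decide once the trace is complete."""
    def __init__(self, slow_ms=1000):
        self.slow_ms = slow_ms
        self.buffer = defaultdict(list)   # trace_id -> [span, ...]

    def add_span(self, trace_id, span):
        self.buffer[trace_id].append(span)

    def finish(self, trace_id):
        # Release the buffer either way; keep the spans only if interesting.
        spans = self.buffer.pop(trace_id, [])
        interesting = any(s["error"] or s["duration_ms"] >= self.slow_ms
                          for s in spans)
        return spans if interesting else []

sampler = TailSampler()
sampler.add_span("t1", {"duration_ms": 20, "error": False})
sampler.add_span("t2", {"duration_ms": 1800, "error": False})
kept_fast = sampler.finish("t1")
kept_slow = sampler.finish("t2")
print(len(kept_fast), len(kept_slow))
```

Hashing the trace ID keeps the head decision consistent across services without coordination. The tail sampler's buffer is the expensive part: it holds every span of every open trace until finish is called.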

Cardinality budget. Pre-aggregated metrics blow up with high-cardinality tags: 100 metrics × 200 services × 1000 instances × 100 customers = 2 billion time series. Each series carries a fixed memory cost in the TSDB. Vendors enforce a cardinality budget per tenant; exceeding it rejects the writes (bad), drops new series silently (worse), or charges you (Datadog's chosen path). Every team that adopts an observability platform eventually has the "we tagged by user_id, what now" conversation.
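The multiplication above, and one way a budget could be enforced at write time, can be sketched as follows. The class name and the tiny budget of 2 series are illustrative assumptions chosen to keep the example short.

```python
class CardinalityBudget:
    """Admit a write only if its series already exists or budget remains."""
    def __init__(self, max_series):
        self.max_series = max_series
        self.seen = set()

    def admit(self, name, tags):
        key = (name, frozenset(tags.items()))
        if key in self.seen:
            return True
        if len(self.seen) >= self.max_series:
            return False          # reject visibly; the caller should log it
        self.seen.add(key)
        return True

# The worst-case product from the text: every dimension fully crossed.
worst_case = 100 * 200 * 1000 * 100

budget = CardinalityBudget(max_series=2)
results = [budget.admit("latency", {"service": "api", "request_id": str(i)})
           for i in range(4)]
still_ok = budget.admit("latency", {"service": "api", "request_id": "0"})
print(worst_case, results, still_ok)
```

Existing series keep flowing after the budget is exhausted; only new tag combinations are refused, which is what lets a tenant survive an accidental request_id tag.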

Real-time vs cost. Sub-10-second alerting requires keeping a hot, in-memory index of recent data. Cheap storage requires pushing to S3 or equivalent object storage. The architectural solution is a tiered system: hot tier (last 1-6 hours) in memory or NVMe, warm tier (last 7-30 days) on local SSD, cold tier (anything older) on object storage. Query latency reflects which tier the data lives in.
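Tier selection at query time is a lookup on data age. A minimal sketch, assuming the upper ends of the ranges in the paragraph above as boundaries (6 hours for hot, 30 days for warm); the function names are hypothetical.

```python
HOUR, DAY = 3600, 86400

# (max age in seconds, tier name); boundaries are illustrative assumptions.
TIERS = [(6 * HOUR, "hot"), (30 * DAY, "warm"), (float("inf"), "cold")]

def tier_for(age_seconds):
    """Tier holding data of the given age."""
    for max_age, tier in TIERS:
        if age_seconds <= max_age:
            return tier

def tiers_for_query(now, t0, t1):
    """Tiers a query over [t0, t1] must touch, newest first."""
    newest, oldest = tier_for(now - t1), tier_for(now - t0)
    names = [name for _, name in TIERS]
    return names[names.index(newest):names.index(oldest) + 1]

now = 100 * DAY
last_hour = tiers_for_query(now, now - HOUR, now)
older_range = tiers_for_query(now, now - 45 * DAY, now - 2 * DAY)
print(last_hour, older_range)
```

A last-hour dashboard stays entirely in the hot tier, while a range reaching back 45 days has to fan out to warm and cold storage, which is where the extra latency comes from.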

Where the algorithms from this bank actually appear

Intervals (primer 13). A time-series query is fundamentally an interval scan: "give me the values where timestamp in [t1, t2]". Range merging, overlap detection, and "next event after this point" are exactly the patterns in merge-intervals and meeting-rooms. Compaction (merging old samples into coarser buckets, downsampling) is interval merge run on the storage layer.
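Downsampling as an interval merge can be sketched in a few lines. The row layout (bucket start, count, sum, min, max) is an assumption made for this example; it is chosen so that coarser buckets can later be rebuilt from finer ones.

```python
def downsample(points, bucket):
    """Merge time-sorted (ts, value) samples into fixed-width buckets.

    Each output row is (bucket_start, count, total, min, max), which is enough
    to recompute averages after a further merge into coarser buckets.
    """
    merged = {}   # dicts preserve insertion order, so output stays time-sorted
    for ts, v in points:
        start = int(ts // bucket) * bucket
        if start in merged:
            _, c, s, lo, hi = merged[start]
            merged[start] = (start, c + 1, s + v, min(lo, v), max(hi, v))
        else:
            merged[start] = (start, 1, v, v, v)
    return list(merged.values())

samples = [(0, 1.0), (15, 3.0), (45, 2.0), (60, 10.0), (75, 20.0)]
per_minute = downsample(samples, 60)
print(per_minute)
```

Storing count and sum rather than a precomputed average matters: averages of averages are wrong when buckets hold different numbers of samples.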

Heaps and t-digest (primer 14). Computing p99 latency over a stream cannot keep every sample (a million events/sec × 30 days is roughly 2.6 trillion samples). The standard answer is a t-digest or HDR histogram: a streaming approximation built on heap-like bucket structures. The intuition is the same as the median-of-stream problem using two heaps: maintain summaries that let you answer percentile queries with bounded error. Datadog, Grafana, and Prometheus all rely on mergeable summaries of this kind (HDR-style histograms, t-digests, or close relatives such as Datadog's DDSketch).

Sliding window (primer 03). "Rate of errors over the last 5 minutes" is a sliding-window aggregation. maximum-average-subarray-i and minimum-size-subarray-sum are tiny versions of the same pattern: a window over a time series with a running statistic. Streaming engines (Flink, Kafka Streams) implement this with explicit window state.
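A trailing-window rate is the smallest useful instance of this pattern. The sketch below assumes timestamps arrive in order and uses a hypothetical WindowedRate class; streaming engines keep equivalent state per key.

```python
from collections import deque

class WindowedRate:
    """Events per second over the trailing `window` seconds."""
    def __init__(self, window=300):
        self.window = window
        self.events = deque()     # timestamps, oldest on the left

    def record(self, ts):
        self.events.append(ts)

    def rate(self, now):
        # Evict everything that has slid out of (now - window, now].
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

errors = WindowedRate(window=300)
for ts in (0, 10, 200, 290, 310):
    errors.record(ts)
rate_at_310 = errors.rate(now=310)
print(rate_at_310)
```

Each timestamp is appended once and evicted once, so the amortised cost per event is constant regardless of window length.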

Top-K (primer 14 again). "Top 10 slowest endpoints in the last hour" is a heap-based top-K over a windowed aggregation. The exact pattern in top-k-frequent-elements extended to a time window.

Hash maps and counters (primer 01). The Count-Min Sketch is a probabilistic frequency counter that uses multiple hash functions; it answers "approximately how many times did this key appear?" in a fixed amount of space, and its errors only ever overcount. Used in Datadog's tag indexing and in many open-source TSDBs to find the heaviest tags; counting distinct series (cardinality estimation proper) is usually done with a HyperLogLog instead. Same pattern as top-k-frequent-elements, made approximate for streaming.
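A Count-Min Sketch fits in a short class. This is a sketch for illustration: the width, depth, and the choice of salted BLAKE2b as the hash family are assumptions, not what any particular TSDB uses.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency estimator: never undercounts, may overcount."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One hash per row, derived by salting the same function with the row.
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), digest_size=8,
                                     salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate cells, so the smallest cell is the best guess.
        return min(self.table[row][col] for row, col in self._columns(key))

cms = CountMinSketch()
for _ in range(500):
    cms.add("host:web-1")
cms.add("host:web-2", 3)
est = cms.estimate("host:web-1")
print(est)
```

Memory is width × depth counters no matter how many distinct keys arrive. The estimate for a key is at least its true count and at most the total of everything added.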

Bit manipulation (primer 15). Roaring bitmaps are the standard way to store "which trace IDs contain an error in service X". A boolean column compressed by run-length and bitset hybrid. Trace stores at scale (Tempo, Jaeger) lean on them heavily.

Binary search (primer 05). Time-aligned joins ("for each metric sample, find the closest trace") are interval binary search. So is downsampled query routing ("if the query is for an hour, hit the 1-minute resolution; if for a year, hit the 1-hour resolution").
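Both uses of binary search mentioned above fit on one screen with the standard bisect module. The resolution ladder here (1-minute up to an hour of range, 5-minute up to a day, 1-hour beyond) is an illustrative assumption, as are the function names.

```python
import bisect

# Query-span thresholds (seconds) and the resolution served for each band.
SPANS = [3600, 86400]
RESOLUTIONS = [60, 300, 3600]   # last entry covers anything longer than a day

def resolution_for(span_seconds):
    """Pick the stored resolution for a query covering span_seconds."""
    return RESOLUTIONS[bisect.bisect_left(SPANS, span_seconds)]

def closest(sorted_ts, target):
    """Index of the timestamp in a non-empty sorted list nearest to target."""
    i = bisect.bisect_left(sorted_ts, target)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return i - 1
    # Compare the neighbours on either side of the insertion point;
    # ties go to the earlier timestamp.
    return i if sorted_ts[i] - target < target - sorted_ts[i - 1] else i - 1

trace_ts = [100, 160, 230]
hour_res = resolution_for(3600)
year_res = resolution_for(365 * 86400)
nearest = closest(trace_ts, 150)
print(hour_res, year_res, trace_ts[nearest])
```

Routing by span keeps the number of points scanned roughly constant: a year at 1-hour resolution is about 8,760 points per series, in the same range as a day at 1-minute resolution.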

Graphs (primer 09). A distributed trace is a DAG of spans (one root, many children, possibly cross-service edges). Service maps (Datadog's "Service Map", Honeycomb's "BubbleUp") are graph aggregations over millions of traces. The patterns from clone-graph and course-schedule transfer directly.
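Reassembling a trace from parent IDs and rolling it up into service-to-service edges might look like the sketch below. The span fields (id, parent, service) and the function names are hypothetical; real span formats carry many more attributes.

```python
from collections import defaultdict

def build_tree(spans):
    """Index spans by parent; returns (root_id, children map)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent"] is None:
            root = span["id"]
        else:
            children[span["parent"]].append(span["id"])
    return root, dict(children)

def service_edges(spans):
    """Caller -> callee service pairs, the raw material for a service map."""
    by_id = {s["id"]: s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get(s["parent"])
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)

def depth(root, children):
    """Longest root-to-leaf chain, via iterative DFS."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, d = stack.pop()
        best = max(best, d)
        stack.extend((c, d + 1) for c in children.get(node, ()))
    return best

spans = [
    {"id": "a", "parent": None, "service": "gateway"},
    {"id": "b", "parent": "a", "service": "checkout"},
    {"id": "c", "parent": "b", "service": "db"},
    {"id": "d", "parent": "a", "service": "gateway"},
]
root, children = build_tree(spans)
edges = service_edges(spans)
print(root, depth(root, children), edges)
```

Summing the edge counts across millions of traces yields the weighted graph a service map draws.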

Sketch implementation

A minimal metrics ingest + query backend, about 75 lines:

import bisect
import heapq
import math
from collections import defaultdict

class TimeSeriesStore:
    """In-memory store; one time-sorted list per (metric_name, tag_set) series."""
    def __init__(self):
        self.series = defaultdict(list)  # (name, frozenset(tags)) -> [(t, v), ...]
        # In production each value list is replaced by a compressed columnar block.

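    def write(self, name, tags, ts, value):
        key = (name, frozenset(tags.items()))
        # insort keeps each series time-sorted even for out-of-order writes,
        # which range() relies on for its binary search.
        bisect.insort(self.series[key], (ts, value))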

    def range(self, name, tag_filter, t0, t1):
        """Return all points within [t0, t1] for series matching tag_filter."""
        out = []
        for (n, tags), points in self.series.items():
            if n != name or not tag_filter.items() <= dict(tags).items():
                continue
            # Binary search the time-sorted list. Primer 05 in action.
            lo = bisect.bisect_left(points, (t0, -math.inf))
            hi = bisect.bisect_right(points, (t1, math.inf))
            out.extend(points[lo:hi])
        return out

class TDigest:
    """Tiny t-digest stand-in for percentile estimation over a stream."""
    def __init__(self, k=100):
        self.k = k
        self.buckets = []  # sorted list of (mean, count) centroids

    def add(self, value):
        self.buckets.append((value, 1))
        if len(self.buckets) > self.k * 10:
            self._compact()

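    def _compact(self):
        self.buckets.sort()
        # Cap each centroid at total_count / k so the digest stays near k
        # centroids. Using len(self.buckets) as the basis would stop merging
        # once centroids carry more than a few samples each.
        total = sum(c for _, c in self.buckets)
        limit = max(1, total // self.k)
        merged = []
        for mean, count in self.buckets:
            if merged and merged[-1][1] + count <= limit:
                m, c = merged[-1]
                merged[-1] = ((m * c + mean * count) / (c + count), c + count)
            else:
                merged.append((mean, count))
        self.buckets = merged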

    def percentile(self, q):
        self.buckets.sort()
        total = sum(c for _, c in self.buckets)
        target = q * total
        running = 0
        for m, c in self.buckets:
            running += c
            if running >= target:
                return m
        return self.buckets[-1][0]

def query_p99_per_service(store, t0, t1):
    """Top-10 services by p99 latency in [t0, t1]. Primer 14 in production."""
    digests = defaultdict(TDigest)
    for (name, tags), pts in store.series.items():
        if name != "request_latency":
            continue
        service = dict(tags).get("service")
        if not service:
            continue
        for t, v in pts:
            if t0 <= t <= t1:
                digests[service].add(v)
    scores = [(d.percentile(0.99), s) for s, d in digests.items()]
    return heapq.nlargest(10, scores)  # top-K via heap

What this sketch ignores: a real TSDB compresses by orders of magnitude using delta-of-delta encoding for timestamps and XOR encoding for floats (Facebook's Gorilla paper introduced this). Cardinality limits, retention enforcement, downsampling pipelines, multi-tenant isolation, replication, the agent protocol, and the query language are each their own production system. A real t-digest is more sophisticated (it keeps centroids small near the tails so extreme percentiles stay accurate); this version is enough to land the punchline.
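The timestamp half of that compression scheme is easy to show. This sketch covers only the delta-of-delta transform on integers; the variable-width bit packing that turns the small values into actual space savings is left out.

```python
def delta_of_delta_encode(timestamps):
    """First value, first delta, then second-order differences."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def delta_of_delta_decode(encoded):
    """Inverse of delta_of_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    ts = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        ts.append(ts[-1] + delta)
    return ts

scrapes = [1000, 1010, 1020, 1030, 1041, 1050]
encoded = delta_of_delta_encode(scrapes)
print(encoded)
```

With a regular scrape interval almost every encoded value is zero, and a zero can be stored in a single bit. Jitter shows up as small integers around zero rather than as full timestamps.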

What breaks at scale

Tier 0 (one host, one service): a Prometheus instance scraping itself. Grafana drawing the dashboard. Free, works fine, terrible for incident response (it goes down when the host goes down). Acceptable for hobbyist projects.

Tier 1 (a small cluster): Prometheus per cluster, federation to a central instance. Loki for logs. Tempo for traces. The first cardinality problem: someone tagged a metric with build_sha or pod_uid, and the TSDB is now storing a new series per build. Fix is label hygiene, or moving to remote-write into Mimir which absorbs higher cardinality.

Tier 2 (a real product): hosted service (Datadog, Grafana Cloud, Honeycomb) or self-hosted Mimir/Cortex. Per-team budgets, retention tiers, alerting that does not page every engineer. The first storage cost shock arrives here.

Tier 3 (hundreds of services, multi-region): replicated ingest per region with cross-region query federation. Tail-based sampling for traces because storing 100% is impossible. Service-level dashboards built from a service catalogue, not from individuals' Grafana boards. Datadog's "Service Catalog" exists for exactly this scale.

Tier 4 (the vendor itself): you are now Datadog or Honeycomb. Custom storage engine, custom query language, sharded everything. The economics of running observability for thousands of customers is a different business than running it for yourself.

Permanent failure modes:

  • Cardinality explosion: someone tags by user_id, the TSDB doubles in memory overnight, queries slow to a crawl. Fix: enforce a cardinality budget at the ingest gateway, reject offending tags, log the offender.
  • Self-observation failure: your observability stack goes down right when you need it. Solution: a second, independent system watching the first. Honeycomb famously uses a smaller separate Honeycomb to watch the main one.
  • Alert fatigue: 10,000 noisy alerts make engineers ignore alerts entirely. Fix is technical (deduplicate, route by service, escalate on persistence) and cultural (alert review process, kill rate metric on every alert).
  • Cost surprise: a customer's bill triples after a deploy enabled a verbose log. Solution: cost dashboards as a first-class feature, dry-run cost estimation in the agent SDK.

What an interview / staff-engineer review will ask

Q1: A customer tags every metric with request_id. Cardinality explodes. What do you do? A: Reject the tag at the ingest gateway and emit a warning to the customer. Cardinality is a contract: high-cardinality data goes to the trace or log stream, not to the metric stream. The architectural mistake is mixing them, and a good observability platform refuses to let customers do it silently.

Q2: How do you compute p99 latency over the last hour for 500 services in under 500 ms? A: Pre-aggregate histograms at the agent (HDR or t-digest). Store the histogram buckets, not the raw events. At query time, scan the histograms across the time range and merge them bucket by bucket; the percentile is then a binary search over the merged histogram's cumulative counts. 500 services × 60 minutes × ~50 buckets = 1.5M numbers, easily under 500 ms.
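The merge-then-search step from Q2 can be sketched with fixed-bound histograms. The bucket bounds and the per-minute counts below are made-up illustrative data, and the answer is reported as a bucket upper bound, so its precision is limited by bucket width.

```python
import bisect
from itertools import accumulate

BOUNDS = [10, 25, 50, 100, 250, 500, 1000]   # hypothetical upper bounds, ms

def merge(histograms):
    """Element-wise sum; valid because every histogram shares BOUNDS."""
    return [sum(col) for col in zip(*histograms)]

def percentile(counts, q):
    """Upper bound of the bucket containing the q-th quantile."""
    cumulative = list(accumulate(counts))
    target = q * cumulative[-1]
    idx = bisect.bisect_left(cumulative, target)
    return BOUNDS[idx] if idx < len(BOUNDS) else float("inf")

# Two one-minute histograms for one service; last slot is overflow (>1000 ms).
minute_1 = [50, 30, 10, 5, 3, 1, 1, 0]
minute_2 = [40, 40, 10, 5, 2, 1, 1, 1]
merged = merge([minute_1, minute_2])
p99 = percentile(merged, 0.99)
p50 = percentile(merged, 0.50)

numbers_scanned = 500 * 60 * 50   # services x minutes x buckets, from Q2
print(merged, p50, p99, numbers_scanned)
```

Merging is a plain sum, so it parallelises across services and time ranges; the binary search runs once per service on a list of a few dozen cumulative counts.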

Q3: How do you store traces at scale when you have 1 million spans/second? A: Tail-based sampling: hold every in-progress trace in a buffer, decide at the end whether to keep it (slow, error, randomly sampled, or matches an interesting tag set). The buffer is sized by how long a trace stays open × span rate (5-10 seconds × 1M spans/second = 5-10M spans in memory). You keep 1-5% of traces; the rest are dropped. Honeycomb's whole platform is built around this trade-off.

Q4: A spike at 2:47 AM caused an outage. How does the platform help debug? A: From the alert, jump to the metric chart at the spike time. From the chart's hover, jump to the slowest traces in that minute (the observability platform pre-indexes "traces with errors" and "slow traces" so this is a constant-time lookup). From the trace, jump to the logs of the affected span. Three clicks from alert to root cause is the platform's promise; if it takes more, the user churns.

Q5: How do you keep the observability stack up when the customer's infrastructure is on fire? A: Independent infrastructure from the customer (run on a different cloud, or at least different account and region). Ingest pipelines must absorb 10x normal load without backpressure to the customer (the customer's outage is when they need you most). Stale-while-revalidate on dashboards: serve cached data if the query backend is slow.

In the AI-integrated workspace

LLM-powered observability is the hottest area in the space right now. Three concrete things are landing in production:

Natural-language query. "Show me the slowest endpoints in the last hour" becomes a generated PromQL or TraceQL query. Honeycomb shipped Query Assistant in 2023; Datadog's Bits AI does the same. The trick is grounding: the LLM must know the schema and the data model of the customer's specific tags. Hallucinated query syntax is worse than no query at all.

Anomaly summarisation. When an alert fires, the platform invokes an LLM to write a one-paragraph summary: "p99 latency on /checkout spiked at 02:47, correlated with a deploy of v2.1.3 to us-east-1, 30% of traces show timeouts on the database client". This summary is grounded in the actual telemetry, not generated freely.

Tracing for AI agents. As agentic systems multiply, the trace becomes a tree of LLM calls instead of a tree of microservice calls. The same architecture (span trees, parent IDs, distributed propagation) works without modification; the differences are span attributes (model name, token counts, latency-per-token) and what counts as an interesting outlier (low-information generations, mode collapse, prompt injection attempts). LangSmith, Helicone, and Datadog's LLM Observability are competing in this space; the architectural shape is identical to APM tracing.

For parikshan, the observability stack is what makes the proctoring product trustworthy. Every exam session emits a stream of events (window-focus changes, AI invocations, hint requests, dispute submissions) that gets stored as time-series and event data. The dispute flow's audit trail is an observability query: "show me everything that happened in session X between t1 and t2". The integrity score is a derived aggregation over the same data. Without an observability stack, the product's integrity story has no evidence; with one, every claim is verifiable.

Variants and adjacent systems

Application Performance Monitoring (APM): traces plus latency metrics plus error tracking, packaged for application developers. Datadog APM, New Relic, AppDynamics. Same architecture, marketing focus.

Real User Monitoring (RUM): browser and mobile SDK shipping every page load. Same pipelines, smaller per-event size, much higher volume. Datadog RUM, Sentry, Google Analytics 4.

Profiling (Pyroscope, Datadog Profiler): stack samples instead of metrics. Same storage shape (time-series of stacks with weights), different query language (flame graphs). Pyroscope brought continuous profiling at scale to commodity hardware as an open-source project.

Security observability / SIEM (Splunk, Sumo Logic, Elastic SIEM): same architecture, different retention policy (years for compliance) and different query patterns (search-heavy). Most modern SIEMs are observability stacks with a security UI layer.

Synthetic monitoring (Datadog Synthetics, Catchpoint): emit events from probe locations rather than from production hosts. Same ingest, lower volume.

eBPF-based observability (Pixie, Cilium Hubble): a kernel-level agent emits events without code changes. New ingest source, same downstream architecture.

The thread across all variants: the architectural problem of "ingest a torrent of small events, store them cheaply, query them flexibly" is the same. The differences live in the SDK and the query UI; the heart is one of a few storage engines and the algorithms above. Once you can defend the shape, you can move between vendors and tools without re-learning the system design.