System: Search Infrastructure

Reading time: ~24 minutes · Prerequisites: Arrays & Hashing, Math, Matrix & Heaps, Binary Search · Capstone: search-infrastructure design problem (parikshan systems track). Index 10M product documents and serve sub-100 ms autocomplete with typo tolerance.

Search is the system every product eventually rediscovers it needs and almost always builds wrong the first time. The naive "SELECT * WHERE title LIKE '%query%'" works at 10,000 rows and dies at 10 million. The architecture that replaces it, an inverted index served by a distributed retrieval engine, is one of the most reused designs in the industry. This primer covers how Elasticsearch, OpenSearch, and Algolia structure that work, what trade-offs they made differently, and where the algorithm bank's patterns appear at every layer.

The product behind the system

Elasticsearch / Elastic Stack powers logs, observability search, e-commerce catalogues, and security analytics. Wikipedia, GitHub's code search (pre-Blackbird), and Uber's marketplace search all ran on Elasticsearch or its fork OpenSearch. Recommended shard sizing is 10-50 GB per shard, with a per-shard document ceiling of about 200 million before query latency degrades; the underlying Lucene index has a hard limit of 2^31 - 1 documents per shard.

OpenSearch is Amazon's fork of Elasticsearch 7.10 (after Elastic relicensed). It is the default search backend on AWS and powers OpenSearch Service across thousands of customers. Architecturally identical to Elasticsearch through the fork point; divergent since 2021.

Algolia is a managed search-as-a-service used by Stripe Docs, Twitch, Lacoste, and around 17,000 customer applications. They publish that their median query latency is around 20 ms globally, achieved by serving from 70+ points of presence with the entire index held in RAM. Algolia made a different architectural bet: keep the index small enough to fit in memory, then push it to the edge.

Google Search is the only one of these that does true web-scale search (trillions of pages), but its architecture is private and atypical. The three above are the systems most engineers actually integrate with.

What the requirements actually are

Functional:

Full-text search: tokenise, stem, lowercase, accent-fold, return documents containing the query terms ranked by relevance.
Filters and facets: "shoes, red, size 10, under $100, in stock". Often more important to users than the free-text query.
Typo tolerance: "iphne" returns iPhone results. Algolia made this their entire pitch.
Autocomplete / instant search: results appear as the user types, sub-100 ms keystroke-to-paint.
Synonyms: "couch" matches "sofa". Manually curated or learned from query logs.
Highlighting: show the matching snippet in context.
Geo search: "restaurants within 2 km of me". Treated as a special filter, not a free-text term.
Sorting and ranking customisation: not just by relevance, also by price, date, popularity, or a learned ranker.

Non-functional:

Latency: p99 under 100 ms for autocomplete; under 200 ms for full search. Algolia targets 20 ms median; Elasticsearch on default settings sits around 50-200 ms.
Throughput: thousands to millions of queries per second at large customers (Wikipedia handles around 1,500 QPS just for search).
Index freshness: e-commerce sites need a new product to be searchable within seconds; observability tools need logs queryable within a minute of ingestion.
Index size: Elasticsearch clusters in the 100 TB+ range are common; the largest publicly described clusters are in the multi-petabyte range.
High availability: replication, automatic failover, no read downtime during reindex.
Multi-tenancy: one cluster serving thousands of independent customer indexes (Algolia's whole business model).

The architect's framing

Five core components plus an analyzer:

Indexer: takes raw documents (a product, a log line, a wiki page), runs them through the analyzer (tokenise, lowercase, stem, dedupe), and writes them into the inverted index.
Analyzer: the pipeline that turns "Running Shoes Size 10" into the terms ["run", "shoe", "size", "10"]. Language-specific, customer-customisable, the source of most search-quality gains.
Inverted index: the on-disk data structure that maps each term to the list of documents containing it. Lucene segments are the canonical implementation.
Query coordinator: receives a query, decides which shards to scatter it to, gathers responses, merges, applies pagination, returns to the caller. Elasticsearch calls this "scatter-gather"; the pattern is universal.
Ranker: scores documents using BM25 (the default), TF-IDF (older), or a learned model (modern, Algolia and Elasticsearch's ML plugin both do this).

A sixth piece, vector retrieval, is now standard for semantic search; we cover it under variants.

Dataflow for a single query:

                +--------------------------+
   query  ----->|   Query Coordinator      |
                |   (parses, plans, fans)  |
                +-----------+--------------+
                            |
       scatter to N shards (parallel fan-out)
       |          |          |          |
       v          v          v          v
  +--------+ +--------+ +--------+ +--------+
  |Shard 0 | |Shard 1 | |Shard 2 | |Shard 3 |
  | (term  | | (term  | | (term  | | (term  |
  |  ->    | |  ->    | |  ->    | |  ->    |
  |  docs) | |  docs) | |  docs) |  | docs) |
  +---+----+ +---+----+ +---+----+ +---+----+
      |          |          |          |
      +----------+----+-----+----------+
                      |
                      v top-K from each shard
              +------------------+
              | merge + global   |
              | top-K via heap   |  <-- this is where 14-math-matrix-heap.md lives
              +--------+---------+
                       |
                       v
                +--------------+
                |  rank model  |
                |  (BM25 / ML) |
                +------+-------+
                       |
                       v
                  response

The shape is identical to MapReduce: scatter, local compute, merge. The merge step uses a heap of size K to pick the global top-K from the shard-local top-Ks, which is exactly the pattern in the kth-largest-element and top-k-frequent-elements problems.

The trade-offs we will name

Inverted index vs vector index. Inverted indexes are lexical: they find documents that contain the exact tokens you asked for. Vector indexes (FAISS, HNSW, ScaNN) are semantic: they find documents whose embedding is near your query's embedding. Modern systems use both ("hybrid search"). Inverted indexes are deterministic, debuggable, fast, and cheap; vector indexes catch "auto" matches "car" but cost more compute and require an embedding model. Choosing one over the other is wrong; the trade-off is how to combine the scores, which is an open research area.

Shard count. Too few shards and you cannot parallelise queries; one slow query takes a whole shard with it. Too many shards and the merge cost dominates: every query has to talk to every shard, and the coordinator's heap merge becomes the bottleneck. Elastic's guidance: 10-50 GB per shard, 200 M documents per shard maximum. Algolia avoids the trade-off by keeping each customer's index small enough to fit on one machine, paying for it with restrictions on per-customer index size.

Strong vs eventual consistency on the index. Elasticsearch's refresh interval (default 1 second) means new documents are not searchable for up to a second. You can refresh on every write but throughput drops by 10x. Algolia takes the opposite position: every write is immediately searchable, but writes are slower because the index is rebuilt and pushed globally. The trade-off is throughput vs freshness. E-commerce sites mostly choose Algolia's bet; observability tools mostly choose Elasticsearch's.

Memory vs disk. Algolia keeps the entire customer index in RAM, which is why their latencies are 20 ms but their per-GB-stored cost is 30-100x higher than Elasticsearch's. Elasticsearch happily serves from disk via OS page cache, accepting higher tail latencies in exchange for cheaper storage. The right answer depends on whether you serve 1 GB of catalogue or 100 TB of logs.

Ranking: BM25 vs learned-to-rank. BM25 is a 1990s probabilistic relevance model that is shockingly hard to beat for the simple cases. Learned-to-rank (LightGBM, XGBoost, or a neural reranker) can give 20-30% better NDCG when you have query logs to train on. The trade-off is operational: BM25 has zero training pipeline, learned-to-rank has a feedback loop, an embedding service, and a retraining cadence. Most teams start with BM25 plus hand-tuned boosts and only graduate to ML when they have the data and the headcount.

Where the algorithms from this bank actually appear

Hashing (primer 01). Every term in the inverted index is keyed by its hash. Algolia's "tries plus hashes" structure for prefix lookup is the same Map<term, List<docId>> you build in group-anagrams. Bloom filters (a hash variant) are used to short-circuit "this shard has no documents matching this term", saving a disk read.

Top-K / heaps (primer 14). The merge phase of scatter-gather is a min-heap of size K populated by each shard's top results. This is kth-largest-element and k-closest-points-to-origin applied at production scale. Every search engine in this primer uses a heap for this exact reason.

Binary search (primer 05). The posting list for each term is sorted by document ID. Joining two terms ("red AND shoe") is an intersection of sorted lists, done in O(min(n, m) log max(n, m)) by binary-searching the smaller list's elements in the larger list, or by walking both with two pointers. Lucene uses both depending on list size ratio.

Two pointers (primer 02). AND queries across posting lists are the canonical two-pointer problem: walk two sorted lists in tandem, emit positions where both advance.

Skip lists / B-trees (named here for completeness; not in the bank). Posting lists in Lucene use skip lists to support fast positional skipping for phrase queries. The intuition is the same as binary search but on a linked-list structure.

Tries (primer 01 and onward). Autocomplete uses a finite-state transducer (Lucene) or a prefix trie (Algolia). The "give me all words starting with 'pho'" query is a depth-first walk of a prefix subtree. Same primitive as the word ladder problem's BFS, only smaller alphabet.

Sliding window (primer 03). Highlighting (find the smallest passage containing all query terms) is minimum-window-substring on the document, with terms as the multiset.

1-D DP (primer 10). Typo tolerance uses Levenshtein edit distance, computed as a DP over the token's character grid. The edit-distance problem in the bank is the exact algorithm.

Sketch implementation

A minimal inverted-index search engine, ~55 lines:

from collections import defaultdict
import heapq
import re

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(list)        # term -> sorted list of (doc_id, tf)
        self.doc_lengths = {}                  # doc_id -> term count
        self.docs = {}                         # doc_id -> raw doc (for highlighting)

    def analyze(self, text):
        # Real systems: lowercase, accent-fold, stem, drop stopwords.
        return [t for t in re.findall(r'\w+', text.lower()) if len(t) > 1]

    def add(self, doc_id, text):
        terms = self.analyze(text)
        self.doc_lengths[doc_id] = len(terms)
        self.docs[doc_id] = text
        freqs = defaultdict(int)
        for t in terms:
            freqs[t] += 1
        for t, tf in freqs.items():
            # Insert keeping the posting list sorted by doc_id.
            self.index[t].append((doc_id, tf))

    def search(self, query, top_k=10):
        terms = self.analyze(query)
        if not terms:
            return []

        # AND across terms: intersect posting lists by walking with two pointers.
        candidates = self._intersect([self.index[t] for t in terms])

        # Score each candidate with BM25.
        avg_dl = sum(self.doc_lengths.values()) / max(1, len(self.doc_lengths))
        N = len(self.doc_lengths)
        scores = []
        for doc_id in candidates:
            score = 0
            for t in terms:
                df = len(self.index[t])
                idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
                tf = next(tf for (d, tf) in self.index[t] if d == doc_id)
                dl = self.doc_lengths[doc_id]
                k1, b = 1.2, 0.75
                norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))
                score += idf * norm
            scores.append((score, doc_id))

        # Top-K via heap: this is the primer-14 pattern in production.
        return heapq.nlargest(top_k, scores)

    def _intersect(self, lists):
        # Sort by smallest first to minimise work.
        lists = sorted(lists, key=len)
        result = {d for d, _ in lists[0]}
        for lst in lists[1:]:
            ids = {d for d, _ in lst}
            result &= ids
            if not result:
                break
        return result

What this sketch ignores: shard scatter-gather, segment merges (Lucene compacts on a schedule), refresh interval semantics, soft deletes, replicas, the analyzer pipeline's full complexity (stemming, synonyms, language detection), positional posting lists for phrase queries, faceting (a separate inverted-like structure keyed by field value), and vector search. Each of those is the size of this whole sketch again.

What breaks at scale

Tier 0 (single host, 100K docs): a single Elasticsearch node, one shard, one replica. Latency is fine, indexing is fine. No drama.

Tier 1 (millions of docs, single index): shard the index into 3-5 primaries. Add replicas for read availability. The "refresh interval" trade-off appears here: developers expect writes to be immediately searchable; they are not. Multi-second freshness lags surprise teams.

Tier 2 (tens of millions, multiple indices): hot/warm/cold tiers. Time-series data (logs, metrics) rolls daily; old indices live on cheaper storage. The Elastic ILM (Index Lifecycle Management) controls this. Aggregations across many indices become the bottleneck.

Tier 3 (hundreds of millions, multi-tenant): noisy-neighbour problems. One customer's bad query (a wildcard scan of a 10 TB index) eats the cluster's heap and breaks every other customer's latency. Solution: per-tenant query budgets, circuit breakers, and dedicated nodes for the largest tenants. Algolia avoids this entirely by giving each customer their own machines.

Tier 4 (web-scale, billions to trillions): only Google, Bing, and Yandex live here, and the architecture diverges. Distributed query planning, custom hardware (Google's TPUs for retrieval), and learned indexes. The patterns above stop being enough.

Common permanent failure modes:

Hot shard: one shard gets all the queries (often because one customer or one keyword dominates). Solution is routing keys and per-tenant shards.
Heavy aggregations: counting unique users across 1 TB of logs allocates so much memory that the JVM OOMs. Cardinality estimation (HyperLogLog) is the fix; Elasticsearch's cardinality agg uses HLL by default.
Mapping explosion: customers POST documents with new fields on every request, and the index mapping grows past 10K fields. Lucene degrades sharply past this. Solution: strict mapping mode and field flattening.
Reindex pain: any analyzer change requires reindexing every document. At hundreds of terabytes this takes weeks. Operate the reindex via the _reindex API with throttling, and run old and new indexes in parallel behind an alias.

What an interview / staff-engineer review will ask

Q1: Why an inverted index instead of a B-tree on the text column? A: A B-tree on a text column lets you find documents whose entire text equals a value, or starts with a prefix. It cannot find documents containing a word in the middle without a full scan. The inverted index pre-computes that mapping at indexing time, trading write cost (slower inserts) for read cost (orders of magnitude faster queries).

Q2: How would you support typo tolerance? A: Build the index with n-grams (e.g., 3-grams: "iphne" -> "iph", "phn", "hne") so misspellings share grams with the correct word. At query time, look up grams, then re-rank candidates by Levenshtein edit distance using the edit-distance DP. Algolia uses this approach and exposes a typoTolerance parameter; Elasticsearch's fuzzy query is the same idea.

Q3: How do you avoid one slow query taking down the cluster? A: Query timeouts at the coordinator. A "circuit breaker" on per-query heap allocation (Elasticsearch has these built in: fielddata circuit breaker, request circuit breaker). Separate node pools for ingest vs query vs coordination. Per-tenant rate limits at the API layer. The principle is bulkhead isolation: a bad actor in one bulkhead never sinks the whole ship.

Q4: How do you keep the index in sync with the source of truth (a Postgres database)? A: Change Data Capture (Debezium reading the WAL) into Kafka, with a consumer that applies inserts/updates/deletes to the index. Idempotent application (use document version, last-write-wins on the index). Re-sync from a snapshot if the consumer lags too far. Stripe and Shopify both run this pattern.

Q5: When would you choose Algolia over self-hosted Elasticsearch? A: When the team is small, the search experience is a product surface (not just a backend feature), and the index size is under 10 GB per customer. The 30-100x cost premium pays for itself in engineering time saved on relevance tuning, ranking, autocomplete, A/B testing, and global edge deployment. Past 10 GB or in regulated environments, self-host.

In the AI-integrated workspace

The biggest shift in search infrastructure 2023-2026 is the rise of retrieval-augmented generation (RAG). An LLM-backed assistant cannot fit a 10 GB knowledge base in its context window, so it calls a retrieval service to pull the relevant chunks first. Every RAG implementation reuses the search infrastructure above, often with a vector index added.

A correct RAG retrieval stack looks like:

Lexical retrieval (BM25, inverted index) returns the top-200 candidate chunks. Cheap, deterministic, catches exact-match terms.
Vector retrieval (HNSW over embeddings) returns the top-200 semantically similar chunks. Catches paraphrases the lexical layer misses.
Reciprocal Rank Fusion merges the two lists into a top-50.
A reranker (a cross-encoder model) scores those 50 in context of the query, returns the top-5.
The LLM gets those 5 chunks as context.

Step 1 is exactly the search engine above. The vector step at scale uses the same scatter-gather pattern (partitions instead of shards, ANN instead of inverted index lookup), and the merge is the same top-K heap.

Practical pitfalls AI agents stumble on when wiring this up: (a) embedding model drift requires reindexing the entire vector store on upgrade, just like an analyzer change; (b) hybrid scoring is a research problem, not a config knob (RRF is the boring default); (c) chunk boundaries matter more than the embedding model, and most teams overinvest in the latter and underinvest in the former.

For parikshan, the same search infrastructure underlies the problem bank lookup, the hint retrieval, the AI-tutor's RAG layer (when a student asks "I don't get how arrays differ from lists"), and the proctoring system's behavioural query ("show me sessions where the camera signal was lost for >5 seconds and the answer was correct").

Variants and adjacent systems

Code search (GitHub Blackbird, Sourcegraph, ripgrep): token-stream search over code. Different tokenisers (split on _, ., camelCase, do not stem), often regex-based. Sourcegraph indexes 6+ million repos using Zoekt, a custom Go-based engine.

Log search (Loki, Splunk, Datadog Logs): full-text on log messages. Loki famously inverts the model: index only labels, scan log content lazily. Cheaper than Elasticsearch, slower per-query.

Vector databases (Pinecone, Weaviate, Milvus, pgvector): the same architecture but vectors all the way down. HNSW or IVF-PQ instead of inverted index, but same scatter-gather and top-K merge.

E-commerce search (Algolia, Coveo, Bloomreach): catalogue search with merchandising, A/B testing, and personalisation built in. Higher per-doc cost, dramatically higher conversion impact.

Geo search (Elasticsearch geo queries, Postgres PostGIS, Uber's H3): spatial index (R-tree, geohash, or H3 hex grid) instead of inverted index. Often integrated with text search ("Italian restaurants near me").

Semantic search-as-a-service (Vertex AI Search, AWS Kendra, Coveo): managed RAG. Same building blocks, marketed as turnkey.

The constant across every variant: the pattern is scatter-gather-merge, the merge is a top-K heap, and the retrieval is keyed by some pre-computed structure that maps query features to document candidates. Whether the structure is an inverted index, a tree, or a vector graph, the architectural shape is shared.