System: Feed Ranking

Reading time: ~20 minutes · Prerequisites: Math, Matrix & Heaps, Linked List, Arrays & Hashing · Capstone: feed-ranking design problem (parikshan systems track)

The product behind the system

When you open Twitter (now X), Instagram, LinkedIn, or TikTok, the screen you see is the output of a ranking pipeline that decided, within roughly 100 ms, which roughly 30 posts to show you out of a candidate pool that may be hundreds of millions. The candidate pool is itself the output of a retrieval pipeline that decided which roughly 1,000 posts you might care about out of the billions posted in the recent window.

LinkedIn published the "Brewing FeedMix" architecture in the mid-2010s and has documented its ranking stack in subsequent talks at QCon and Strata. Twitter open-sourced the "For You" ranking pipeline in 2023, exposing the candidate-generation and heavy-ranker stages. Meta's "Instagram Explore" and Pinterest's "Pixie" share a near-identical staged-ranking shape. The architect's understanding of feed ranking is the understanding of that staged pipeline: retrieve, rank, re-rank, blend, with each stage trading recall for precision.

Discord and Slack are not feed-ranked in the same sense (chronological dominates), but their search and notification ranking is the same problem at smaller scale. YouTube's home feed and Spotify's "Made For You" are the same problem in different media. The algorithms transfer across; the constraints differ.

What the requirements actually are

Functional requirements:

Given a user opening the app, deliver a personalised feed of N items (typically 20-50 for the first page, infinite-scroll thereafter).
The feed must be fresh (recent activity from followed accounts appears within seconds, not hours).
The feed must be personalised (different users see different orderings of the same candidate set).
Support negative feedback (hide, mute, report) and update the ranking within the session.
Support multiple feed surfaces (home, explore, search, notifications) with shared infrastructure.

Non-functional requirements:

Throughput: 500M daily active users at peak hour. Roughly 5M feed requests per minute at peak; 100K per second sustained.
Latency: p95 of feed-page render under 500 ms; p99 under 1 second. Of that budget, ranking gets ~150 ms.
Durability: posts are durable in a primary store; feed materialisations are derived state and can be rebuilt.
Consistency: eventually consistent across regions; a tweet posted in Tokyo may appear in São Paulo 5 seconds later. Strong consistency would kill latency and is not what users perceive anyway.
Cost ceiling: the most expensive query in the social graph at most platforms. Cost-per-feed-render is a tracked KPI; halving it is a multi-million-dollar engineering effort.

The architect's framing

Feed ranking decomposes into six components. The famous architectural split is fan-out-on-write vs fan-out-on-read, and most production systems run a hybrid.

                +-----------------+
   write  -->   |  ingest         |--- mirror to ranking features ---+
                |  (post create)  |                                   |
                +--------+--------+                                   |
                         |                                            v
                         | fan-out worker                    +-----------------+
                         v                                   |  feature store  |
                +-----------------+                          |  (post + user   |
                |  follower fan   |                          |   embeddings)   |
                |  out queue      |                          +--------+--------+
                +--------+--------+                                   |
                         |                                            v
                         v                                   +-----------------+
                +-----------------+                          |  retrieval      |
                |  user feed      |                          |  (candidate     |
                |  (Redis list /  | <------ blend ---------- |   generation)   |
                |   timeline)     |                          +--------+--------+
                +--------+--------+                                   |
                         |                                            v
                         |                                   +-----------------+
                         |                                   |  ranker         |
                         |                                   |  (DNN / GBDT)   |
                         |                                   +--------+--------+
                         |                                            |
                         |  read time                                 |
                         +------------- blend & re-rank --------------+
                                              |
                                              v
                                          user device

The architect's mental model: a candidate set comes from the fan-out queue (cheap, biased toward who-you-follow) and from retrieval (expensive, biased toward what-you-might-like). They are merged, ranked by an ML model that scores each (user, post) pair, re-ranked for diversity, and served.

The trade-offs we will name

1. Fan-out-on-write vs fan-out-on-read.

Choice: hybrid. Fan-out-on-write for users with under ~100K followers; fan-out-on-read for celebrities and bots.
Alternative considered: pure fan-out-on-write (Facebook's original approach in the 2000s) or pure fan-out-on-read.
Why hybrid wins: pure fan-out-on-write is catastrophic when a celebrity posts (Cristiano Ronaldo has ~600M Instagram followers; one post would write 600M Redis entries in seconds). Pure fan-out-on-read forces every feed render to do a SELECT * FROM posts WHERE author IN (your_followees) query, which is expensive and slow for normal users. The hybrid is the production answer at Twitter, Instagram, and LinkedIn; Twitter documented this explicitly in the 2013 "Timelines at Scale" talk.
When pure fan-out-on-write would win: small platforms with no celebrity accounts and uniform follower-count distributions (rare; the power law is the default).
When pure fan-out-on-read would win: read-heavy systems where writes are cheap and reads need maximum personalisation (Reddit's home feed leans this way).

2. Pre-computed feed vs query-time computed feed.

Choice: pre-computed candidate set, query-time ranked.
Alternative considered: fully pre-computed (rank at fan-out time and cache the result).
Why partial pre-compute wins: ranking features change every minute (your last 5 likes, your current dwell pattern). A pre-ranked feed cached overnight would be stale and wrong. Pre-computing the candidate set is cheap (it changes at the rate posts are made); pre-computing the ordering is wasteful (it changes every page load). Twitter and Instagram pre-compute candidates and rank at request time.
When full pre-compute would win: digest emails (LinkedIn's daily digest, Substack notifications) where the feed is computed once per user per day and shipped.

3. Top-K via heap vs top-K via sort.

Choice: min-heap of size K (the Math, Matrix & Heaps pattern).
Alternative considered: full sort of the candidate set, take top K.
Why heap wins: candidate sets are 1,000 to 10,000 items; K is 30-50. Full sort is O(N log N) = 1,000 × 10 = 10,000 comparisons. Heap is O(N log K) = 1,000 × 6 = 6,000 comparisons. The cache locality of the heap is also better because the working set is K elements. For very large candidate sets (a million), the win grows.
When sort wins: very small N (under 100) where the constant factor of the heap loses to a vectorised sort.

4. Recency vs relevance.

Choice: score = relevance × recency_decay, with the decay calibrated per-surface.
Alternative considered: chronological (recency only) or pure relevance (no recency penalty).
Why blended wins: pure chronological is what Twitter looked like in 2010 and what users say they want; users return less when they get it (the platform's revenue and engagement metrics drop). Pure relevance ignores freshness; a brilliant tweet from yesterday outranks a breaking-news tweet from now. The blend is the answer everyone arrived at, with the blend ratio tuned per surface (home feed is more relevance-weighted; search is more relevance-weighted; notifications are recency-weighted).
When pure recency wins: messaging surfaces (Slack, Discord, WhatsApp) where the user expects strict chronological order.

5. Single-stage ranker vs multi-stage ranker.

Choice: multi-stage (cheap retrieval → mid-tier ranker → heavy ranker → re-ranker).
Alternative considered: a single neural network that scores every (user, post) pair end-to-end.
Why multi-stage wins: a heavy ranker (a transformer over 100 features) costs ~1 ms per (user, post) pair. With a 10M-post candidate space, that is 10,000 seconds per request, impossible. Retrieval cuts the candidate set to 1,000; the mid-tier ranker cuts it to 100; the heavy ranker cuts it to 30. Each stage is the right cost/quality trade-off for its volume. This is the information-retrieval pipeline going back to web search; feed ranking adopted it directly.
When single-stage would win: small candidate spaces (a notifications list of 200 items) where the cheap stages are unnecessary.

Where the algorithms from this bank actually appear

The top-K selection at the end of every ranking stage is the Math, Matrix & Heaps pattern in its purest form. top-k-frequent-elements builds the muscle memory; the production version replaces "frequency" with "predicted engagement score" but the heap discipline is identical. kth-largest-element is the streaming variant: imagine the ranker is producing scored posts in a stream and you want the top 30 without materialising the whole stream.

k-closest-points-to-origin is the geometric analogue and the candidate-retrieval primitive. Embedding-based retrieval (Twitter's two-tower model, Pinterest's PinSage) converts each post to a vector and finds the K nearest neighbours to the user's vector. The structure inside the vector database (HNSW, IVF, ScaNN) is heap-and-graph at the bottom; the data structure transfers directly.

The user timeline as a doubly-linked list is the Linked List pattern at production scale. Twitter's Redis-backed timelines are conceptually a list-per-user with capped length (typically 800 most-recent items, evicting the tail). The lru-cache primer's sentinel + hash-map + doubly-linked-list discipline is what makes "insert at head, evict tail, update by id" all O(1) at the volume Twitter sees.

For deduplication (you must not show the same post twice across multiple retrieval channels), the Arrays & Hashing primer's set-based dedupe is the answer, with one production tweak: at the scale of a billion items per day, the set is a Bloom filter or a HyperLogLog, not a Python set. The asymptotic discipline is the same; the data structure shifts under the constant-factor pressure.

Sketch implementation

The blend-and-rank step, the heart of the request-time pipeline:

from heapq import heappush, heappushpop
from time import time

def rank_feed(user, candidates, k=30, lam=0.05):
    """
    candidates: list of (post_id, author_id, posted_at, retrieval_score)
    Returns top-k post ids by blended score.
    """
    now = time()
    seen_authors = {}                       # author -> count, for diversity cap
    heap = []                               # min-heap of (score, post_id)

    features = batch_lookup_features(user, [c[0] for c in candidates])

    for (post_id, author_id, posted_at, retrieval_score) in candidates:
        f = features[post_id]
        relevance = heavy_ranker.predict(user.embedding, f)   # 0..1
        age_hours = (now - posted_at) / 3600
        recency = pow(2.71828, -lam * age_hours)              # exponential decay
        score = relevance * recency * retrieval_score

        # diversity: penalise the 3rd+ post from the same author
        cnt = seen_authors.get(author_id, 0)
        if cnt >= 2:
            score *= 0.3

        if len(heap) < k:
            heappush(heap, (score, post_id, author_id))
        elif score > heap[0][0]:
            popped = heappushpop(heap, (score, post_id, author_id))
            # decrement seen count for the post we just kicked out
            seen_authors[popped[2]] = seen_authors.get(popped[2], 1) - 1
        else:
            continue
        seen_authors[author_id] = cnt + 1

    return [pid for (_, pid, _) in sorted(heap, reverse=True)]

What the code commits to:

A min-heap of size K (the Math, Matrix & Heaps idiom).
Feature lookup is batched: one round trip to the feature store, not one per post. This is the difference between a 50 ms ranker and a 5 second ranker.
The diversity penalty is multiplicative, not a hard cap. Hard caps cause ugly edges (the 3rd post by your favourite author disappears entirely); multiplicative penalties degrade gracefully.

What it omits but production needs: per-post negative-feedback signal, A/B test bucket selection, watch-time prediction, calibration of the model output to a probability scale, fairness constraints across protected attributes, and a fallback to a chronological feed if the ranker times out. Each is its own staff-engineer-week of work.

What breaks at scale

10K daily active users: a single Postgres ORDER BY score LIMIT 30 works. Premature ranking infrastructure is the bigger risk than missing it.

100K DAU: fan-out-on-write works for everyone (no celebrity problem yet). The ranker is a logistic regression or GBDT, served from a single ML server. Feed list lives in Redis.

1M DAU: the first power-law accounts appear. A user with 50K followers is a write storm at fan-out time. The hybrid (fan-out for normal users, query-time merge for celebrities) becomes necessary. The ranker is a small DNN behind a model server with batched inference.

50M DAU: the candidate space is too large for a single-stage ranker. Two-stage retrieval (lexical + embedding-based) plus a multi-stage ranker is the architecture. Feature store is a real product (Feast, Tecton, or in-house). Embedding-based retrieval requires a vector DB (FAISS-derived, Pinecone, Vespa). p95 budget is now tight; every millisecond is fought for.

500M+ DAU (TikTok, Instagram, X): model training is a continuous job (online learning), the feature store ingests events at firehose rate, the ranker is sharded across thousands of GPUs, the candidate retrieval uses learned indexes. Engineering organisations of 200+ people work on this single pipeline. The architect's job is to keep the pipeline understandable at the team boundaries: who owns retrieval, who owns ranking, who owns re-ranking, who owns the model platform.

Named failure modes:

Thundering herd on celebrity post: a celebrity tweets, the fan-out worker queue spikes, downstream feature lookups stall. Mitigation: rate-limit fan-out per author, fall back to query-time merge above a follower threshold.
Stale embeddings: a user's embedding is computed nightly; their interests shifted today. The feed feels off until tomorrow. Mitigation: incremental online updates to the user embedding from the current session's interactions.
Feedback-loop collapse: the ranker over-promotes what users click, users click only what the ranker shows, the feed converges to a single topic. Mitigation: exploration (epsilon-greedy or Thompson sampling), diversity constraints, a regularly retrained model that sees user actions outside the recommended set.
Cold start (new user, new post): no interaction history. Mitigation: bootstrap with content embeddings for posts and demographic priors for users, then learn from session signals quickly.

What an interview / staff-engineer review will ask

Q: Walk me through what happens between a user opening the app and seeing the first post. A: DNS + TCP + TLS to the closest edge POP. Edge routes to the feed service in the user's home region. Feed service issues parallel requests: (1) the pre-computed timeline from Redis (fan-out-on-write side), (2) celebrity authors' recent posts (fan-out-on-read side), (3) retrieved candidates from the recommendation engine. The three streams merge into a candidate pool of ~1,000 items. The ranker scores each (user, post) pair. The top 30 are re-ranked for diversity and freshness. Feed is paginated and streamed to the client. p95 budget: 500 ms end-to-end.

Q: Why not fan-out-on-write everywhere? A: Celebrities. Cristiano Ronaldo has ~600M Instagram followers. One post is 600M writes; at 10K writes/second per Redis shard, that is 60,000 seconds (17 hours) of fan-out lag. By the time the last follower's timeline is updated, the post is no longer fresh. The hybrid breaks the math: read-side merge for high-follower authors is bounded by the reader's QPS, not the celebrity's follower count.

Q: How do you handle the "I muted this account" signal? A: A user-level filter applied after ranking, not before. Pre-filtering changes the candidate set per user and breaks caching; post-filtering is a cheap predicate over the top-K. The filter set lives in a user-state hot cache, refreshed on mute action.

Q: How do you A/B test a new ranking model? A: Online traffic is bucketed by hash(user_id, experiment_id) mod 100. Treatment group gets the new model; control gets the production model. The same user always falls in the same bucket within an experiment to avoid leakage. Metrics are tracked per bucket: engagement rate, retention, time-spent, negative-feedback rate. A single experiment runs 1-2 weeks; bad ones are killed in hours. The infrastructure is sometimes called "experimentation platform" and at Meta and Netflix is a team of dozens.

Q: What is the failure mode if the ranker times out? A: Fall back to chronological feed from the pre-computed timeline. The user gets a reasonable-quality feed; the rate of fall-backs is a paged metric. Hard cutoff at p99 of the ranking budget (typically 200 ms); the ranker returns whatever it has scored when the timer fires. Partial ranking is better than no feed.

In the AI-integrated workspace

AI coding assistants are surprisingly good at the mechanical parts of feed ranking (the heap, the decay function, the diversity penalty) and surprisingly bad at the product parts (when to use fan-out-on-write, how to bound the ranker's latency, how to design the feature contract).

What the architect uses AI for:

Boilerplate transformation: converting a model output into a calibrated probability, writing the batching wrapper around the feature store, generating the test fixtures for the heap. AI saves hours here.
Cross-language ports: the heap-based blend in Python can be ported to Go or Rust with AI in minutes; the test cases come along.
Edge-case enumeration: "list the failure modes of a feed ranker under partial feature-store outage". AI will produce a useful starting list; the architect prunes it.

What the architect audits personally:

The score blend (relevance * recency * retrieval_score). If AI generates an additive blend, it is usually wrong; products typically multiply (independence assumption), not add. The architect checks the form and the calibration.
The latency budget per stage. AI happily writes a ranker that calls the feature store inside the loop; the architect demands batching.
The fallback path. If the ranker fails, what does the user see? AI will not generate the fallback unless asked. The architect asks every time.

What the architect refuses from AI:

Any ranker that promises "fairness across protected attributes" without naming the fairness criterion (group fairness, individual fairness, equal opportunity). These criteria are mathematically incompatible; you must pick one. AI will hand-wave.
Any candidate generator that "uses semantic similarity" without specifying the embedding model and its training data. The embedding is the recommender; the data choice is a product decision, not an implementation detail.
Any A/B test design that does not control for novelty effects. New features always look good for the first week because users try them; AI rarely warns about this.

Variants and adjacent systems

Search ranking: query → candidates → rank. Same staged pipeline, different signals (query relevance instead of personalisation). Elasticsearch is the open-source skeleton; Vespa is the open-source full stack.
Notification ranking: which push notification to send, when. The ranker decides "is this worth interrupting the user?" Latency budget is looser (hours), value is higher (a missed notification is a lost session).
Ad ranking: same pipeline, score is predicted_CTR × bid. The auction adds a second-price layer. Google's Ad Auction and Meta's ad system are the canonical examples.
Music / video recommendation: the same pipeline with a longer item-half-life (a song is fresh for weeks, a tweet for hours). Spotify's "Discover Weekly" and YouTube's home feed are the analogues.
Email digest: the slow-pace cousin. Ranker runs once a day per user; output is shipped via email. LinkedIn's daily digest and Substack's recommendations live here.
Friend / connection recommendation: rank potential edges in the social graph, not posts. The candidate generation is graph-traversal-heavy; the ranking is the same.