parikshan

System: Recommendation Engine

Reading time: ~25 minutes  ·  Prerequisites: Math, Matrix & Heaps, Arrays & Hashing, 1-D DP  ·  Capstone: recommendation-engine design problem (parikshan systems track). Music recommender for 100K users and 1M tracks returning a personalised top-30 list under 100 ms.

A recommendation engine answers one question: "of all the millions of items I could show this user right now, which dozen should I pick?" Every consumer product that grew past a single page of content runs one. The good ones (Spotify Discover Weekly, Netflix's homepage, TikTok For You) are described in research papers; the rest reinvent the same architecture badly. This primer covers what the shape looks like, why the patterns from the algorithm bank turn up at every layer, and what breaks when you move from a hobby recommender to one serving a billion users.

The product behind the system

Spotify Discover Weekly ships a fresh 30-track playlist every Monday to roughly 713 million users across 100+ million tracks. It runs on three complementary models: collaborative filtering on listening behaviour, NLP over music blogs and metadata, and raw audio analysis. The candidates from all three are merged, deduped against the user's listening history, and ranked. The 30-track output is small but the candidate pool per user is in the hundreds of thousands.

Netflix homepage ranks every row and every item in every row for every user, every visit. Netflix paid $1 million for the original Prize in 2009 to push collaborative filtering forward; the production system has long since left matrix factorisation behind for deep ranking models. Netflix has stated publicly that recommendations drive ~80% of viewing.

TikTok For You feed is the most aggressive interactive recommender shipped. It ranks an effectively infinite candidate pool per swipe, with sub-second feedback loops. ByteDance's "Monolith" recommender (publicly described in a 2022 paper) handles billions of users with online learning, updating embeddings as users interact in real time. The aggressive learning loop is what makes TikTok feel uncannily good at predicting taste.

YouTube and Instagram Reels are architecturally similar to TikTok with different feature sets. Both publicly describe two-tower retrieval plus deep reranking.

All four solve the same problem at scale: from a corpus of millions to billions of items, surface the right dozen per user per moment.

What the requirements actually are

Functional:

  1. Candidate generation: from millions of items, reduce to a manageable pool (~1000) of items the user might plausibly enjoy.
  2. Ranking: order those candidates by predicted relevance for this user, at this moment.
  3. Diversity and freshness: do not show 10 versions of the same item; mix old favourites with new arrivals.
  4. Cold start: serve reasonable recommendations to a new user with no history, and surface a new item with no engagement.
  5. Feedback ingestion: every click, skip, like, share, watch-time signal must update the user's profile.
  6. Explainability (increasing as a requirement): "because you watched X" or "because users like you enjoyed Y".

Non-functional:

  1. Latency: <100 ms from request to ranked list, end to end. TikTok needs <50 ms per swipe.
  2. Throughput: TikTok ranks ~600 billion items per day across its user base. Netflix ranks every homepage view.
  3. Freshness of the model: TikTok and Reels rebuild user embeddings within seconds of a behavioural event. Spotify Discover Weekly only retrains weekly because the product surface is weekly.
  4. A/B testing: every change is shipped behind a randomised experiment; the platform supports hundreds of concurrent experiments.
  5. Fairness and abuse resistance: no echo chambers, no exploitable feedback loops where bots can promote items.
  6. Compliance: explainability for the EU Digital Services Act, opt-out for personalised recommendation in some jurisdictions.

The architect's framing

Five components, plus a feature store that sits across all of them.

  1. Candidate generator: retrieves the top-N (often 500-1000) items per user from a corpus of millions. Approximate, fast, recall-focused. This is typically a two-tower neural network or an embedding-based retrieval index.
  2. Feature store: serves user features (recent history, demographics, embeddings) and item features (popularity, embeddings, freshness) with low latency. Feast, Tecton, and in-house systems (Uber's Michelangelo) all live here.
  3. Ranker: scores the ~1000 candidates per user with a heavier model that uses cross-features (user x item, context). Precision-focused, slower per-candidate but operating on a small set.
  4. Filter / business logic layer: removes items the user has already seen, dedupes by artist or franchise, enforces editorial rules (no explicit content for under-18, no competitor brands for sponsored slots).
  5. Diversification: the final pass that ensures the output is not 30 copies of one genre. MMR (Maximal Marginal Relevance), DPP (Determinantal Point Processes), or rule-based reranking are common.
                +----------------------------+
   user --->    |  Online Service / API      |
                |  (assembles features,      |
                |   calls retrieval + rank)  |
                +-------+----------+---------+
                        |          |
       +----------------+          +----------------+
       |                                            |
       v                                            v
+------------------+                       +------------------+
| Candidate Gen    |                       |  Feature Store   |
| (2-tower ANN)    |<--------------------- |  (user + item    |
| Returns ~1000    |     features          |   features, low  |
| items in ~10 ms  |                       |   latency)       |
+--------+---------+                       +------------------+
         |
         v
+------------------+
| Ranker           |
| (DLRM, transformer
|  cross-tower,    |    <- heavy model, ~10 ms for 1000 items
|  trained offline,|
|  served on GPU)  |
+--------+---------+
         |
         v
+------------------+        +------------------+
| Filter / business|------->| Diversification  |
| rules            |        | (MMR / DPP)      |
+------------------+        +------------------+
                                     |
                                     v
                              top-30 to user

   <-- async path -->

+------------------+    +------------------+    +------------------+
| Behavioural events |->| Kafka / Kinesis  |->| Stream processor |
| (clicks, watches)  |  |                  |  |  (Flink, Spark)  |
+------------------+    +------------------+    +--------+---------+
                                                        |
                                                        v
                                              +------------------+
                                              | Feature updates  |
                                              | Model retraining |
                                              +------------------+

The two paths matter: the serving path is synchronous and must hit 100 ms, the learning path is asynchronous and can take seconds to hours to update the model. The cleanest architectures keep them separate; the buggy ones blur the line and end up with a serving system that is also trying to train.

The trade-offs we will name

Collaborative filtering vs content-based vs hybrid. Collaborative filtering ("users like you watched X") needs interaction data and fails for cold-start. Content-based ("this song sounds like that song") works for cold-start but cannot capture surprising taste connections (the "people who liked metal also liked Bach" effect). Every production system is hybrid; the question is which signal gets which weight, and that weight is learned, not set.

Offline trained vs online learning. Spotify's Discover Weekly trains offline once a week and serves the result; the model is static within the week. TikTok updates user embeddings in real time as interactions stream in; the system is closer to online reinforcement learning. The trade-off is engineering complexity (online learning is much harder to deploy, monitor, and debug) versus product responsiveness (TikTok's signature "it learned what I like in 10 swipes" is impossible offline).

Retrieval recall vs precision. The candidate generator is allowed to be sloppy (recall@1000 matters; the ranker filters) but cannot be too sloppy (if the right item is not in the top 1000, it cannot be ranked). The ranker is allowed to be expensive per item (it sees only 1000) but cannot be too expensive (it has to score 1000 in 10 ms). Splitting these two stages is the single most important architectural decision in the system; combining them into one heavy model gives worse latency and worse quality.

Personalisation vs diversity vs editorial. Pure personalisation collapses to a filter bubble. Pure diversity is random. Pure editorial does not scale. Every product picks a position on this triangle; Spotify's Editorial team curates playlists that get recommended; Netflix's "Because you watched" rows are personalisation; TikTok injects exploration items into the top of the feed deliberately. The product team owns this knob, not engineering.

Latency vs model size. A transformer ranker with 1 billion parameters is more accurate than a logistic regression. It is also 1000x slower. Production systems use model distillation (train the big model offline, distill to a small model for serving) and feature crossing precomputed at index time to shrink the inference cost. Netflix's two-stage architecture exists precisely to manage this trade-off.

Where the algorithms from this bank actually appear

Heaps and top-K (primer 14). The candidate generator's output is a top-K retrieval over an approximate index; this is k-closest-points-to-origin applied to high-dimensional embedding space. The ranker's output is the top-30 by score; another heap pass. The kth-largest-element pattern shows up at both stages.

Hashing (primer 01). The user's "items already seen" set is a hash set; checking "have they seen this in the last 14 days?" is the same as contains-duplicate with a time-bound. Locality-sensitive hashing (LSH) is the original approximate retrieval method, predating modern vector indexes.

1-D DP (primer 10). Sequence-aware ranking ("after a user watched A, B, C, what is the probability they enjoy D?") is closer to language modelling than to classic CF. The Viterbi-like DP under the hood is the same pattern as longest-increasing-subsequence extended to weighted transitions. Spotify's Sequential Skip Prediction Challenge in 2018 was explicit about this framing.

Graphs (primer 09). The user-item interaction matrix is a bipartite graph. Many CF algorithms (PinSage at Pinterest, the GraphSAGE family) explicitly run graph convolutions over this structure. The "are these two users connected through fewer than three hops of shared items" question is BFS.

Two pointers and sliding window (primers 02, 03). The session-based features ("last 50 items the user touched") are a sliding window over the event stream. Computing recency-weighted averages is a windowed aggregation.

Bit manipulation (primer 15). Sparse feature vectors are often hashed into bitmaps for cheap intersection (count of users who watched both X and Y is a bitmap AND + popcount). Bloom filters are the production form of this.

Greedy (primer 12). MMR diversification is greedy: at each step, pick the item with the best (relevance - λ·similarity-to-already-picked). The structure is identical to partition-labels (greedy local choice with a global invariant).

Sketch implementation

A two-stage recommender for a music app, ~55 lines:

import numpy as np
import heapq
from collections import defaultdict

class TwoTowerRetriever:
    """Stage 1: candidate generation via nearest-neighbour in embedding space."""
    def __init__(self, item_embeddings):  # item_id -> 64-dim vector
        self.item_emb = item_embeddings
        # In production: replace this loop with a FAISS or ScaNN ANN index.
        self.items = list(item_embeddings.keys())
        self.matrix = np.array([item_embeddings[i] for i in self.items])

    def candidates(self, user_emb, k=1000):
        # Cosine similarity, top-K via a heap. This is primer-14 in practice.
        scores = self.matrix @ user_emb
        # heapq.nlargest is heap-based top-K, O(n log k).
        top = heapq.nlargest(k, range(len(self.items)), key=lambda i: scores[i])
        return [(self.items[i], scores[i]) for i in top]

class DeepRanker:
    """Stage 2: precise scoring with cross-features."""
    def __init__(self, model):
        self.model = model  # opaque ML model: predict_proba(features) -> score

    def score(self, user_features, candidates, item_features):
        out = []
        for item_id, retrieval_score in candidates:
            feat = self._cross(user_features, item_features[item_id], retrieval_score)
            out.append((self.model.predict(feat), item_id))
        return out  # (score, item_id) pairs

    def _cross(self, u, i, retrieval_score):
        # Real systems do hundreds of cross-features. Sketch: concat + dot products.
        return np.concatenate([u, i, [np.dot(u, i), retrieval_score]])

def recommend(user_id, retriever, ranker, feature_store, history, k=30, lam=0.5):
    user_emb = feature_store.user_embedding(user_id)
    user_feat = feature_store.user_features(user_id)

    # Stage 1: 1000 candidates in ~10 ms.
    candidates = retriever.candidates(user_emb, k=1000)

    # Filter already-seen.
    candidates = [(i, s) for (i, s) in candidates if i not in history]

    # Stage 2: rank.
    item_feats = feature_store.batch_item_features([i for i, _ in candidates])
    scored = ranker.score(user_feat, candidates, item_feats)

    # Diversification: greedy MMR. Pick best, then pick best - lam*sim-to-chosen.
    chosen = []
    chosen_set = set()
    while len(chosen) < k and scored:
        if not chosen:
            best = max(scored, key=lambda x: x[0])
        else:
            def mmr(item):
                score, iid = item
                sim = max(cosine(feature_store.item_embedding(iid),
                                 feature_store.item_embedding(c)) for c in chosen)
                return score - lam * sim
            best = max(scored, key=mmr)
        chosen.append(best[1])
        chosen_set.add(best[1])
        scored = [s for s in scored if s[1] != best[1]]
    return chosen

What this sketch leaves out: the actual two-tower training pipeline, online feature freshness, candidate dedup across multiple retrieval sources, exploration injection (epsilon-greedy or bandit-based), the A/B testing harness, the audit log for explainability, model serving infrastructure (Triton, KServe), and the fallback to a popularity baseline when any of the above fails. Each is its own team's job at a large company.

What breaks at scale

Tier 0 (10K users, 10K items): a sklearn NearestNeighbors index over 64-dim embeddings rebuilt nightly works fine. p99 latency is dominated by Python overhead.

Tier 1 (1M users, 100K items): ANN library required (FAISS, Annoy, ScaNN). Feature store appears as a Redis cluster. The user embedding becomes the bottleneck if you store it server-side per request, so it gets cached.

Tier 2 (10M users, 1M items): model serving must be on dedicated infrastructure (TensorFlow Serving, Triton). Cold-start problems appear because a meaningful fraction of users have <10 interactions; this is when popularity baselines and content-based features earn their slot. A/B testing infrastructure becomes essential because every change ships with an experiment.

Tier 3 (100M users, 100M items): real-time feature updates from a streaming pipeline (Kafka + Flink). Model retraining shifts from nightly to hourly. Sharding the ANN index by item population (Pinterest's "shard by board" approach, Spotify's "shard by genre") because a single index does not fit in any one machine's RAM.

Tier 4 (billion users, billion items): TikTok-scale. Online learning with gradient updates pushed in real time. Model serving on GPU clusters. Multi-stage retrieval (a "pre-candidate" generator before the candidate generator). Architectural complexity that requires a dedicated platform team.

Common permanent failure modes:

  • Popularity-bias collapse: ranker learns "popular items get clicked", recommends only popular items, makes them more popular. Fixed by exposing the model to a controlled fraction of randomly ranked impressions for training.
  • Filter bubbles: user gets stuck in one cluster, never sees anything new. Fixed by explicit exploration injection in the diversification stage.
  • Feedback loops with bots: an adversary's bot farm clicks an item to promote it; the model learns the false signal. Fixed by trust scoring of interactions and downweighting suspect sessions.
  • Embedding drift: re-trained model produces embeddings that are not in the same coordinate system as the old. Every cached vector is now wrong. Solution: align embeddings or recompute everything; never deploy a new model without one of these.

What an interview / staff-engineer review will ask

Q1: Cold-start: a new user with no history just signed up. What do you recommend? A: Three signals in order of preference. (1) Demographic-based popular items (most-streamed in your age/country/language). (2) Signup wizard preferences if you collect them. (3) Item-content features (if the user clicks one item, recommend similar items by content even before any behavioural signal exists). Avoid pure global popularity; it makes every new user's experience identical.

Q2: How do you serve recommendations at 100 ms p99? A: Pre-compute per-user candidates offline if possible (Spotify Discover Weekly does this; the playlist is computed before Monday and just served). For real-time, two-stage architecture: ANN retrieval in 10 ms, ranker on 1000 items in 10-30 ms, business logic in 5 ms. Feature lookups are the silent killer; they must be in a low-latency store (Redis, in-memory) not a database.

Q3: How do you A/B test a recommender model? A: Split users by stable hash of user_id into control and treatment, route to different model versions. Measure online metrics: CTR, watch-time, retention, not offline accuracy (AUC). Offline metrics often disagree with online (the "offline-online gap"). Run for at least one full weekly cycle. Be honest about novelty effect: a new model is sometimes better just because it shows different things.

Q4: A model retrain just produced a worse offline NDCG. Do you ship it? A: Offline metrics are a noisy proxy. Ship behind A/B, measure online, decide on online metrics. But "offline got worse and online is the same" is a real warning; the model may be overfit to one slice or under-fit to another. Investigate before shipping; do not blind-trust the A/B alone.

Q5: A high-profile user complains the recommendations are biased / wrong / inappropriate. How do you respond? A: First, audit log: pull the full feature set, candidates, scores, and rules that led to the offending recommendation. Reproduce the decision deterministically. Then decide if the fault is data (a bad signal), model (it generalised wrongly), or rules (a missing filter). Patch the rule layer first (fast), retrain the model if the signal was bad (slow), and add a regression test to the offline evaluation set. Recommender bugs almost always live in the data and the rules, not the model.

In the AI-integrated workspace

Recommenders pre-date LLMs and will outlive them. But LLMs are showing up in two places:

LLM as feature generator. Use an LLM to produce a natural-language summary of each item ("a 2023 indie folk track with introspective lyrics, female vocalist, slow tempo"). Embed these summaries with a sentence-transformer; you now have content-based features for the long tail of items with no interaction history. This is one of the cheapest cold-start wins available in 2025-2026.

LLM as the ranker explainer. "We are recommending this because..." is generated by an LLM at request time, conditioned on the feature contributions of the actual ranker. Netflix is reportedly experimenting with this; the trick is keeping the explanation faithful to the model rather than letting the LLM hallucinate plausible-sounding but wrong reasons.

LLM-driven retrieval ("conversational recommender"). Spotify's AI DJ, YouTube's "search by description", and ChatGPT-with-tools all let the user describe what they want in natural language. The LLM emits a structured retrieval query that the recommender executes. The recommender architecture is unchanged; only the input changed.

For parikshan, the same retrieval-and-rank pattern recommends the next problem to a student. Candidates are the unsolved problems in the bank; features are the student's success rate per pattern, recency of attempt, and ELO-like difficulty rating. The ranker is a much simpler model than Netflix's, but the architecture is identical.

Variants and adjacent systems

Ad ranking (Google Ads, Facebook Ads, TikTok Ads): same architecture, optimising for predicted click rate * bid price. The ranker is more elaborate (auction theory, second-price calibration). Same retrieval, same heap-based top-K, same diversification.

E-commerce ranking (Amazon, Shopify, Etsy): same architecture, optimising for predicted purchase probability times margin. Heavy attention to merchandising rules (sponsored items, new arrivals, sale items).

Job recommendation (LinkedIn, Indeed): same architecture, longer feedback cycle (job application -> interview -> hire), so the system relies more on graph features and less on click-time signals.

News feed (Facebook News Feed, Twitter Home): same architecture, with an extra temporal-decay stage because content freshness matters. Twitter's "Heavy Ranker" public description matches this pattern almost line for line.

Search ranking: technically a sibling, not a variant; the candidate generator is replaced by inverted-index retrieval, but the ranker is structurally identical. Google's search ranking has used learning-to-rank models since 2005.

Notification ranking: same architecture deciding which notifications to send and which to suppress. Apple, Google, and Slack all run this internally.

Game matchmaking: same architecture deciding who plays whom, optimising for predicted match quality and queue time. Riot Games and Blizzard have published their versions.

The common thread: any time a system must pick K from a corpus of millions for one user, the architecture above is the right starting point. The differences are in the features, the model, and the business rules; the shape is the same.