parikshan

top-k-frequent-elements

Difficulty: Medium  ·  Topic: Arrays & Hashing  ·  Practice it on parikshan: /practice/top-k-frequent-elements/start

The story

An analyst at YouTube is building the daily trending-hashtags widget. Every 15 minutes, the pipeline aggregates the last hour of tagged uploads, of which there can be tens of millions, and surfaces the top 50 hashtags by count. The query has a hard 200 ms p95 budget because the widget renders on the front page. The aggregation runs on Dataflow, but the inner "top-k by count" routine runs in a single process and must be tight.

That is top-k-frequent-elements. It is the workhorse of every "trending", "most-clicked", "hottest", "highest-volume" widget you have ever seen on a tech product. The pattern is two-stage: count, then select.

What's actually being asked

Given an integer array nums and an integer k, return the k integers that appear most frequently. If multiple integers tie for the k-th frequency slot, prefer the smaller integer. Print the chosen k integers in ascending order, separated by single spaces. The tie-break and the output order make the answer deterministic.

The naive approach

For each distinct value, scan the whole array to count it; then pick the top k. If d is the number of distinct values, that is O(n * d) counting plus O(d log d) sorting. When n = 10^5 and d = 10^5, counting is 10^10 operations.

A second naive variant: count distinct pairs (value, frequency) in a dict, then convert to a list and sort by count descending. That is O(n) counting plus O(d log d) sorting, which is fine. Most candidates write this and stop. It is correct and almost optimal. The next step matters only when k is very small relative to d.

The key insight

Counting is the cheapest part of the problem. The interesting question is how you pick the top k from the counts.

  • Sort entries by (-count, value) so larger counts come first, ties broken by smaller value: one O(d log d) sort, slice the first k, sort those k values ascending for output. Total O(n + d log d). This is the pragmatic answer.
  • Heap of size k: push (count, value) onto a min-heap, pop when size exceeds k. End with the k largest. O(d log k), better when k << d. The tie-break inside the heap needs care ((count, -value) so smaller value sorts earlier among equal counts).
  • Bucket sort by count: counts are bounded by n, so an array of buckets indexed by count gives O(n + d). Walk buckets high to low until you have k values, tie-break on value within a bucket.

For the problem's scale (n <= 10^5), the sort approach wins on simplicity and is fast enough. For the YouTube-scale version of the same problem, heaps or bucket sort matter.

Step-by-step approach

  1. Build counts = Counter(nums), mapping each distinct value to its frequency.
  2. Form a list of (value, count) pairs.
  3. Sort by key=(-count, value), so larger counts come first; on ties, smaller value comes first.
  4. Take the first k pairs. Extract their values.
  5. Sort the resulting values ascending. Print space-separated.
from collections import Counter

def top_k_frequent(nums, k):
    counts = Counter(nums)
    pairs = sorted(counts.items(), key=lambda vc: (-vc[1], vc[0]))
    chosen = [v for v, c in pairs[:k]]
    chosen.sort()
    return chosen

Worked example

nums = [1, 2, 3, 4, 5, 1, 2, 3]
k    = 2

Counter:

valuecount
12
22
32
41
51

Three values tie at count 2. Sort by (-count, value):

value-countsort key
1-2(-2, 1)
2-2(-2, 2)
3-2(-2, 3)
4-1(-1, 4)
5-1(-1, 5)

Take the first 2: [(1, 2), (2, 2)]. Extract values [1, 2]. Already ascending. Output: 1 2.

The tie-break rule made the answer deterministic: [1, 3] or [2, 3] would otherwise be equally "correct" but the spec forbids them.

Complexity

  • Time: O(n + d log d) for the sort variant; O(d log k) for the heap variant; O(n + d) for bucket sort.
  • Space: O(d) for the counter and O(k) for the output.

Brute force was O(n * d). All three optimised variants improve over brute force; they differ in which constants matter at scale.

Common pitfalls

  • Forgetting the tie-break: without the (-count, value) key, Python's Counter.most_common(k) returns insertion-order ties, which is non-deterministic across runs and fails the judge.
  • Heap mistakes: a max-heap by (count, value) returns the largest value among tied counts; the spec wants the smallest. Either use (count, -value) or a min-heap of size k with negated counts.
  • Returning the pairs instead of the values: the output is k integers, not k (value, count) pairs. Read the spec.
  • Sorting the chosen values descending: the spec says ascending. Read the spec twice.
  • Counting with a list of zeros indexed by value: when values can be negative or very large (-10^9 <= nums[i] <= 10^9), array-of-counts breaks. Use a hash map.
  • Confusing k with 0-indexed top: top-1 is one element, not zero elements; off-by-one bites in the slice.

Where this shows up in the enterprise

  • YouTube trending hashtags / Twitter trending topics: same shape, larger n, distributed over many shards but each shard runs this exact algorithm.
  • Spotify "Top Tracks" charts: the daily top-50 per country is top-k-frequent over play events.
  • Reddit hot posts: combines this pattern with an exponential decay function, but the substructure is unchanged.
  • Cloudflare Bot Management top offenders: the dashboard surfaces top-N IPs by request count over the last 5 minutes; identical algorithm.
  • Datadog top traces by error rate: top-k by a derived count.
  • Stripe top merchants by volume: in the daily report email, this is a join on charge counts followed by top-k.
  • Sumo Logic / Splunk "top values": every log analytics product has a built-in top operator implemented as some flavour of this.
  • Amazon "Customers who bought this also bought": count co-occurrences across user baskets, take top-k per item.

The algorithm shows up anywhere you see a leaderboard, a most-popular list, or a "biggest offenders" report.

Variants you'll see elsewhere

  • kth-largest-element: top-k where k=1; min-heap of size k is the canonical solution.
  • top-k-frequent-words: same shape, strings instead of ints, lexicographic tie-break.
  • sort-characters-by-frequency: full sort by count, no truncation to k.
  • reorganize-string: greedy assignment from a max-heap by count.
  • task-scheduler: greedy scheduling using a heap of remaining counts; same primitive.
  • find-the-most-competitive-subsequence: a different shape but the heap reasoning rhymes.
  • merge-k-sorted-lists: same heap-of-k pattern applied to streams.

How parikshan turns this into a learning loop

Read this article, then click into practice. The editor runs your code against public tests plus per-session dynamic private tests generated fresh each attempt, so memorising someone else's snippet does not pass. Hints are graduated: free first hint, small penalty for the second, larger for the third. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, Optimise, each with a small integrity cost so you spend AI deliberately. In practice mode, AI chat is free and unpenalised, the right place to debate sort vs heap vs bucket. After the verdict, one click in the dispute flow sends a contested result to a human reviewer. Attempt in practice mode first, then redo in exam mode untouched. When you can write the sort-based solution in three minutes and explain the heap variant in another two, the pattern is internalised.

In the AI-integrated workspace

An AI agent will write Counter(nums).most_common(k) on the first try. That is wrong here, because most_common does not implement the tie-break the spec requires. Your job is to:

  1. Read the spec and notice the tie-break rule. If the agent ignored it, the code passes the obvious test cases and fails the tie test. That single oversight is the difference between a correct review and a shipped bug.
  2. Steer the agent to the right tool. For n in the millions, ask for the heap. For n in the thousands, ask for the sort with explicit key. Tell it which.
  3. Verify the agent handled the output order. The spec says ascending; most_common returns descending by count.
  4. Stress-test on edge cases agents skip: k = d (return everything sorted), all values identical (one bucket), negative values, ties at the boundary.

A senior engineer reads "give me the top 50 trending hashtags with ties broken by alphabetic order" and immediately maps it to (value, count) sorting with (-count, value). The candidate who can do that productively directs the agent in one sentence. The candidate who cannot rubber-stamps most_common(50) and ships a non-deterministic leaderboard. parikshan trains the first kind.