kth-largest-element
Difficulty: Medium · Topic: Math, Matrix & Heaps · Practice it on parikshan: /practice/kth-largest-element/start
The story
A site reliability engineer at Datadog watches API response times in a live dashboard. They do not care about every request's latency; they care about the 99th-percentile latency, the threshold above which only 1% of requests fall. To compute it across a stream of 50 million requests per minute, the system never sorts the whole stream. It maintains, per service, the top 500,000 slowest requests in a heap. The smallest value in that heap is the running p99. Same pattern at Grafana, Bloomberg's risk monitor, Stripe's fraud anomaly detector, and the Twitter ranking pipeline that surfaces the K most engaged tweets per user.
That is kth-largest-element. The toy version asks for the K-th largest of a fixed array. The pattern generalises to streams, percentiles, and top-K queries on any orderable metric.
What's actually being asked
Given an integer array nums and an integer k, return the K-th largest element by value. Note "by value", not "K-th distinct": duplicates each count toward the ranking. The K-th largest of [7, 7, 7] with k = 2 is 7, not the second distinct value (there is no second distinct value).
The naive approach
Sort nums ascending and return nums[n - k]. O(n log n) time, O(1) extra space if the sort is in place. Comfortably under the time limit for the problem's bound of n <= 10^5. Correct, simple, and the right answer for the offline case. It loses work, but the work it loses does not matter at this scale.
The key insight
Sorting computes more information than the question requires. The K-th largest is one value; sorting reveals the full ordering. The clean optimisation: maintain only the top K seen so far, in a structure where the worst of those K is cheap to identify and replace.
That structure is a min-heap of size K. The smallest element of the heap is the weakest of the current top K, and it is also the K-th largest seen so far. When a new value arrives larger than this minimum, push it and pop the minimum; the heap now holds the K largest values seen so far.
After processing every value in nums, the root of the heap is the K-th largest of the entire array.
Time O(n log k), space O(k). Strictly faster than sorting when k << n, and the same memory regardless. The bigger win is that the heap approach generalises to streams where you cannot hold nums in memory.
Step-by-step approach
- Initialise an empty min-heap
h. - For each
xinnums:- Push
xontoh. - If
len(h) > k, pop the minimum.
- Push
- Return
h[0], the root, which is the K-th largest.
In Python, use heapq.heappush and heapq.heappop. Or, more idiomatically, heapq.nlargest(k, nums)[-1], which runs in O(n log k) internally.
Step-by-step approach (sort)
- Sort
numsascending. - Return
nums[n - k].
For interviews, present both. Lead with the heap because it generalises; mention sort for clarity.
Worked example
nums = [3, 2, 1, 5, 6, 4], k = 2
Min-heap walkthrough:
- x=3: heap=[3]. size 1 <= 2.
- x=2: heap=[2, 3]. size 2 <= 2.
- x=1: heap=[1, 3, 2]. size 3 > 2, pop 1. heap=[2, 3].
- x=5: heap=[2, 3, 5]. size 3 > 2, pop 2. heap=[3, 5].
- x=6: heap=[3, 5, 6]. size 3 > 2, pop 3. heap=[5, 6].
- x=4: heap=[4, 6, 5]. size 3 > 2, pop 4. heap=[5, 6].
Root is 5. The 2nd largest of [3, 2, 1, 5, 6, 4] is 5. Correct.
Now the duplicates case:
nums = [3, 2, 3, 1, 2, 4, 5, 5, 6], k = 4
Sorted descending: [6, 5, 5, 4, 3, 3, 2, 2, 1]. The 4th is 4. The heap approach finds the same answer in O(n log 4) work.
Complexity
Heap:
- Time: O(n log k). Each of
nvalues triggers a push (O(log k)) and at most one pop (O(log k)). - Space: O(k) for the heap.
Sort:
- Time: O(n log n).
- Space: O(1) in place, or O(n) for a stable sort that allocates.
QuickSelect (mentioned for completeness):
- Time: O(n) average, O(n^2) worst case unless the pivot is randomised or median-of-medians is used.
- Space: O(1) auxiliary, plus in-place mutation of the input.
- Best when
nis large andkis in the middle of the range. Less useful whenkis small (heap wins) or when stability matters.
Common pitfalls
- Sorting descending and returning
nums[k - 1]: equivalent to the ascending version and works, but if you are going to sort, ascending plusnums[n - k]is the conventional form. - Using a max-heap of size n instead of a min-heap of size k: correct but throws away the memory advantage. A max-heap is the right structure when you need the K largest in order, not just the K-th.
- Wrong heap polarity: Python's
heapqis a min-heap. Tracking the K largest with a min-heap is correct (the smallest of the K is the candidate to evict). Tracking the K largest with a max-heap means popping the wrong element. Agents reliably get this wrong; pin the polarity down in your instructions. - Off-by-one with K-indexing:
kis 1-indexed in the spec. The K-th largest of a length-N array sorted ascending is at indexn - k, notn - k - 1. - Counting distinct values: read the spec. Duplicates count.
Where this shows up in the enterprise
- Datadog, Grafana, New Relic: p95 / p99 / p999 latency over a stream is a K-th largest query. The exact computation uses a heap of size
n * (1 - p); the approximate version uses a t-digest. - Stripe, Adyen fraud detection: top-K riskiest transactions per minute, refreshed in a sliding window. Heap of size K.
- Twitter / X home feed: top-K most engaging candidate tweets per timeline. K is small (around 50); the candidate pool is large; heap dominates sort.
- Spotify Discover Weekly: top-K recommended tracks from a multi-million candidate pool. Heap-based selection over batched scores.
- Bloomberg risk monitor: top-K riskiest positions in a portfolio. Heap over a moving window.
- Cloudflare bot detection: top-K most aggressive client IPs by request rate. Heap of size K, with a TTL on entries.
- PagerDuty incident triage: top-K longest-running open incidents per team. Heap with a heartbeat update.
The pattern is universal: any time you want the K worst, K largest, K most expensive, K hottest items from a large set, the answer is a heap of size K.
Variants you'll see elsewhere
k-closest-points-to-origin: K-th largest by negated distance, with a tuple tie-break. See [[k-closest-points-to-origin]].top-k-frequent-elements: build a frequency map, then heap-of-K over the frequencies.kth-smallest-in-bst: in-order BST traversal stops at the K-th visit.find-median-from-data-stream: two heaps, one max-heap for the lower half and one min-heap for the upper half. The K-th largest at K = n/2.last-stone-weight: max-heap simulation. Different application, same priority queue underneath.
QuickSelect is its own family worth knowing: same problem, average-linear time, the foundation of numpy.partition and std::nth_element.
How parikshan turns this into a learning loop
After you read this article and click through to the practice link, the parikshan flow integrates four layers:
- The editor evaluates against public examples and dynamic private tests generated each session. The suite covers
k = 1(find the maximum),k = n(find the minimum), arrays with many duplicates, arrays with mixed positive and negative numbers, and the 10^5 upper bound. - Hints are graduated. The first observes that you only need the top K, not the full sort. The third hands you the min-heap-of-size-K pattern. The fourth gives the implementation skeleton with both heap and sort versions.
- The AI assistant can Explain, Review, Find Bugs, Add Comments, or Optimise at integrity costs in exam mode, free in practice. Use it to ask why the min-heap (not max-heap) is the right polarity for tracking the K largest, and to translate your code into a QuickSelect version for the discussion.
- The dispute flow lets you escalate a contested verdict to a human reviewer.
If you can ship the heap solution and articulate why k = 1 makes the heap version equivalent to a running max, the pattern is yours. Target: four minutes in exam mode, including handling duplicates.
In the AI-integrated workspace
When you build observability or recommendation features at work, an agent will write the K-th largest code. Your edge is:
- Recognising the top-K shape when a colleague describes "p95 latency", "top earners", "highest-risk accounts", "most active users". The trigger phrases vary; the shape is the same.
- Telling the agent precisely: "min-heap of size K, push then pop when size exceeds K, return the root." That one instruction prevents the agent from defaulting to a full sort or to the wrong heap polarity.
- Verifying the polarity. The Math, Matrix & Heaps primer for this cluster explicitly calls out the heap-polarity mistake; on this problem it is the single most common failure.
- For genuine streaming workloads, switch to a t-digest or a Greenwald-Khanna sketch. The K-heap approach is correct only when
kis fixed; for percentiles over an unbounded stream where you do not knownin advance, you need a different data structure entirely. Knowing when the heap stops working is the senior-engineer skill.
Engineers who can name "this is K-th largest, heap of size K, min-heap polarity" in five seconds of reading a spec direct AI agents productively. Engineers who reach for a full sort every time eventually ship code that times out in production when the input size grows.