parikshan

k-closest-points-to-origin

Difficulty: Medium  ·  Topic: Math, Matrix & Heaps  ·  Practice it on parikshan: /practice/k-closest-points-to-origin/start

The story

A user opens Google Maps in Bandra and searches "coffee near me". The backend has roughly 12,000 cafés in Mumbai indexed by latitude and longitude. The user wants the five closest. The system does not sort 12,000 cafés just to return five. It scans every candidate against a heap of size five, keeping only the current top-five-closest, and discards everything else along the way. Same logic powers Uber and Ola dispatch (the K closest drivers to a rider), Bloomberg's "K most active tickers near a target price", and your phone's "Find My" pinning the K closest devices.

That is k-closest-points-to-origin. The trick is that you do not need to know the order of every point, only which K are closest.

What's actually being asked

You are given n points on a 2D plane and an integer k. Return the k points closest to the origin (0, 0) measured by Euclidean distance.

Two subtleties about output. Tie-break for selection: when more points lie at the boundary distance than the k slots can hold, prefer the smaller x (then the smaller y). Output order: print the chosen k points sorted by (x, y) ascending, one per line.

The naive approach

Compute every distance, then for each of k passes, find the minimum that has not been taken yet. O(n * k) time. Fine for tiny inputs, quadratic in the worst case when k = n. Burns far more work than the problem needs.

The key insight

Squared distance is monotone with distance. Since both quantities are non-negative, comparing x^2 + y^2 preserves the ordering of sqrt(x^2 + y^2). That keeps everything in integer arithmetic, which means no floating-point error and no surprises around equidistant points on a circle. With |x|, |y| <= 10^5, the squared distance fits comfortably in a 64-bit integer (max 2 * 10^10).

Given integer squared distances, there are three reasonable algorithms:

  • Sort by (d, x, y), take the first k. Time O(n log n), space O(n). The cleanest code; one sort line and a slice.
  • Min-heap or max-heap of size k. Time O(n log k), space O(k). Beats sort when k << n.
  • QuickSelect on the squared-distance key. Average O(n), worst case O(n^2) (or O(n) with median-of-medians). The fastest in expectation, the most complex to implement.

For this problem's scale (n <= 10^5), all three pass comfortably. The reference solution uses sort because the tie-break is easier to encode as a key tuple than as a heap comparator. For an interview, mention the heap when k is small and the QuickSelect option when the interviewer pushes for O(n).

Step-by-step approach

  1. Read all points into a list pts.
  2. Sort pts by the key (x*x + y*y, x, y) ascending. The triple is the deterministic tie-break: smaller distance wins; on ties, smaller x wins; on further ties, smaller y wins.
  3. Take the first k points: chosen = pts[:k].
  4. Sort chosen by (x, y) ascending for the output ordering.
  5. Print each point on its own line as x y.

Step-by-step approach (heap-based)

  1. Initialise an empty max-heap. In Python, heapq is a min-heap, so we push negated keys: (-d, -x, -y, x, y). The first three negated values implement max-heap semantics; the last two are the actual point to recover later.
  2. For each point (x, y):
    • Compute d = x*x + y*y.
    • Push (-d, -x, -y, x, y) onto the heap.
    • If the heap size exceeds k, pop the root (the largest (d, x, y) among the current k+1).
  3. After processing all points, the heap holds the k chosen points.
  4. Extract, sort by (x, y), and print.

The heap version is O(n log k) and uses O(k) memory. It wins when k is much smaller than n, or when you cannot hold the whole point set in memory (a stream).

Worked example

n=3, k=2
points = [(1, 3), (-2, 2), (5, 8)]

Squared distances: 1 + 9 = 10, 4 + 4 = 8, 25 + 64 = 89.

Sort by (d, x, y): (-2, 2, d=8), (1, 3, d=10), (5, 8, d=89).

Take first 2: [(-2, 2), (1, 3)].

Sort by (x, y): (-2, 2) comes before (1, 3). Output:

-2 2
1 3

The tie-break case from the spec:

n=4, k=2
points = [(1, 0), (-1, 0), (0, 1), (0, -1)]

All four points are at squared distance 1. Sort by (d, x, y): (-1, 0, 1), (0, -1, 1), (0, 1, 1), (1, 0, 1). Take first two: (-1, 0) and (0, -1). Sort by (x, y): (-1, 0) before (0, -1). Output matches the spec exactly.

This is why the tie-break is encoded into the sort key, not added as post-processing.

Complexity

Sort-based:

  • Time: O(n log n) dominated by the sort.
  • Space: O(n) for the point list.

Heap-based:

  • Time: O(n log k). Each of n points triggers a push (O(log k)) and at most one pop (O(log k)).
  • Space: O(k) for the heap.

QuickSelect:

  • Time: O(n) average, O(n^2) worst case (or O(n) with median-of-medians).
  • Space: O(1) auxiliary, plus the input mutated in place.

Common pitfalls

  • Using math.sqrt and float comparisons: equidistant points become randomly ordered because of float rounding. Stay in integer arithmetic by comparing squared distance.
  • Ignoring the tie-break: the spec requires (d, x, y) for selection. Hand-rolled comparators that ignore the secondary key produce non-deterministic outputs when points are equidistant. Encode the tie-break in the sort key.
  • Confusing selection order with output order: the chosen k are by (d, x, y); the output is by (x, y). Two sorts, on two different keys. The reference solution makes the distinction explicit.
  • Wrong heap polarity: Python's heapq is a min-heap. To keep a max-heap of size k, push negated keys or use heapq.nsmallest(k, pts, key=...). AI agents reliably get the polarity wrong if you do not pin it down.

Where this shows up in the enterprise

  • Google Maps, Apple Maps: "K nearest restaurants / hospitals / petrol pumps" queries. Geospatial indexes (R-trees, S2 cells) shortlist candidates; a final K-closest pass refines the answer.
  • Uber, Ola, Lyft dispatch: when a rider requests, the system needs the K closest available drivers. K is small (5 to 20), the candidate pool is large; the heap version dominates.
  • Spotify Discover Weekly: K closest tracks by embedding distance to a user's taste vector. The candidate pool is millions; the heap version is what makes it tractable.
  • Twitter / X home feed ranking: top-K most engaging tweets per timeline refresh. Heap of size K.
  • Cloudflare top-talkers monitoring: top-K IPs by request volume per minute. Streamed input, fixed K, heap of size K.
  • Bloomberg trade prioritisation: K most urgent orders by a composite urgency score. Heap-based for the streaming case, sort-based for the batch nightly run.
  • Datadog and Grafana p95/p99 latency: not exactly K-closest, but the same heap intuition powers approximate quantile streaming with t-digests.

The pattern is universal: any "top-K by metric" question over a large candidate set is a K-closest problem.

Variants you'll see elsewhere

  • kth-largest-element and kth-smallest-element: same heap intuition, scalar instead of point. See [[kth-largest-element]].
  • top-k-frequent-elements: heap keyed by frequency. Build a frequency map first, then heap the keys by value.
  • find-k-closest-elements: K closest to a target value in a sorted array. Binary search to find the insertion point, then expand outward.
  • merge-k-sorted-lists: heap of list heads, pop the smallest, advance that list. The same priority-queue scaffolding.
  • median-of-stream: two heaps, one max-heap of the lower half and one min-heap of the upper half. K-closest with K=1, sort of.

How parikshan turns this into a learning loop

After you read this article and click through to the practice link, the parikshan flow integrates four layers:

  1. The editor evaluates against public examples and dynamic private tests generated each session. The suite covers k=1, k=n, all-equidistant points, points with negative coordinates, and the boundary case where the tie-break selects a specific point.
  2. Hints are graduated. The first introduces squared distance. The third hands you the composite sort key. The fourth gives the full skeleton with the two-sort approach.
  3. The AI assistant can Explain, Review, Find Bugs, Add Comments, or Optimise at integrity costs in exam mode, free in practice. Ask it to translate your sort solution into a heap version and to verify the polarity.
  4. The dispute flow lets you escalate a contested verdict to a human reviewer.

If you can ship both the sort-based and heap-based solutions and articulate when each is better, the pattern is yours.

In the AI-integrated workspace

When you build geo or recommendation features at work, an agent will write the K-closest code. Your edge is:

  1. Recognising the K-closest shape when a colleague describes a top-K query over a large candidate pool. The trigger phrases are "nearest", "top", "closest", "most relevant".
  2. Picking the right algorithm for the workload. Sort for batch jobs of moderate size; heap for streaming or small K; QuickSelect when an interviewer pushes for O(n).
  3. Verifying the agent stays in integer arithmetic. The primer for this cluster calls out the heap-polarity mistake explicitly; on this problem, the float-vs-int mistake is just as common.
  4. Insisting on a deterministic tie-break. AI-generated code often ignores ties, which produces flaky outputs and tests that pass locally but fail in CI when the input ordering changes.

Engineers who can spot the K-closest shape and pick the right algorithm in seconds direct AI agents productively. Engineers who reach for sort every time end up with O(n log n) code where a heap would have given O(n log k).