binary-search

Difficulty: Easy · Topic: Binary Search · Practice it on parikshan: /practice/binary-search/start

The story

A betting platform stores 10 million accepted bets sorted by timestamp. When a regulator asks "what was the first bet placed after 03:14:00 today?", the database cannot scan all 10 million; it has to land on the answer in under 25 comparisons. The trick has been the same since the 1940s: keep halving the search space.

That is binary search. The textbook framing ("find x in a sorted array") undersells it; the real value is that any monotone yes/no question can be answered in log₂(n) steps. Once you see this, you find binary searches in places you didn't expect: tuning a parameter, finding the breakpoint, deciding capacity.

What's actually being asked

Given a sorted array of integers nums and an integer target, return the index of target if it is present; otherwise return -1.

The naive approach

Walk the array. O(n). For n=10 million, that is 10 million comparisons. Binary search reduces it to 24.

The key insight

The array is sorted. So if you look at the middle element and it is less than the target, the target cannot be in the left half. Eliminate the left half. Repeat. Each step halves the remaining range. After ⌈log₂ n⌉ steps the range has length 1.

The trick is that the mental model is wrong until you see binary search as finding a boundary under a monotone predicate, not "find a number in a sorted array". The predicate "is this element ≥ target?" is True at the answer and False to its left. The False/True boundary is what we want.

Step-by-step approach

The template that does not lie:

def search(nums, target):
    lo, hi = 0, len(nums)              # hi is EXCLUSIVE; "answer is in [lo, hi)"
    while lo < hi:
        mid = (lo + hi) // 2           # left-biased midpoint
        if nums[mid] >= target:        # the monotone predicate
            hi = mid                   # mid might be the answer; keep it
        else:
            lo = mid + 1               # mid is definitely too small
    # lo == hi; "first index with nums[idx] >= target"
    if lo < len(nums) and nums[lo] == target:
        return lo
    return -1

Three commitments that make this template invincible:

hi is exclusive. The invariant is "answer is in [lo, hi)".
The predicate is "True if nums[mid] >= target". It is True at the answer and at every index to its right.
When the loop exits, lo == hi is the first index where the predicate is True. Check whether that index actually holds the target.

Worked example

nums = [-1, 0, 3, 5, 9, 12]
target = 9

lo=0, hi=6 (length 6)
mid=3, nums[3]=5, 5<9 → lo=4
lo=4, hi=6
mid=5, nums[5]=12, 12≥9 → hi=5
lo=4, hi=5
mid=4, nums[4]=9, 9≥9 → hi=4
lo=4, hi=4 → exit
nums[4]=9 == target → return 4. ✓

Counter-case: target=7. Same execution as above gets to lo=hi=4, nums[4]=9 ≠ 7 → return -1.

Complexity

Time: O(log n).
Space: O(1).

Common pitfalls

The four classics:

Inclusive hi confusion: if you set hi = n - 1 and use while lo <= hi, you must also write lo = mid + 1 and hi = mid - 1 (both adjust off the midpoint). The half-open variant in this article is harder to write off-by-one bugs into.
mid = (lo + hi) // 2 overflow in fixed-width integer languages: for very large lo + hi, the sum can wrap. Safe form: mid = lo + (hi - lo) // 2. Python is fine; C, Java, Rust, Go need care.
lo = mid causes an infinite loop: when lo == hi - 1, mid == lo, and you set lo = mid, you never advance. Always either lo = mid + 1 or hi = mid (one of the two strictly shrinks the range).
Forgetting the final equality check: the loop finds the boundary, not the match. After the loop you must check nums[lo] == target for the exact-match version.

Where this shows up in the enterprise

Database B-tree lookups: every primary-key index in PostgreSQL, MySQL, MongoDB, DynamoDB is a tree of binary searches.
git bisect: binary search over commit history to find the first broken commit.
Capacity planners: smallest server count that holds the load is binary search on a feasibility predicate.
Datadog / Grafana percentile queries: t-digest and HDR-histogram lookups are binary searches over centroids.
Kafka offset by timestamp: returning the first offset at or after a given time uses binary search over segment indexes.
Auto-tuning ML training batch size: largest batch that fits in GPU memory; binary-search the answer space.
Stripe rate-limit token allotment: largest quota that does not violate fairness; binary search.
CDN cache warm-up: smallest TTL that keeps hit rate above a target; binary search.
Pricing experiments: lowest price that keeps conversion above threshold.

When you see "smallest X such that ..." or "largest X such that ..." in a spec, the bell should ring.

Variants you'll see elsewhere

search-insert-position: same template; return lo even on miss.
first-bad-version: binary search on a predicate function, not an array.
sqrt-x: binary search the answer space (the integers from 0 to x).
find-min-in-rotated-sorted-array: binary search when the array is rotated; the predicate becomes "is this index past the rotation?".
search-in-rotated-sorted-array: harder predicate, same template.
find-peak-element: binary search on "is this slope going up?".

How parikshan turns this into a learning loop

Binary search is the topic where the AI training mode is most useful and most dangerous. Useful because if you ask the assistant for an explanation, it can describe the invariant cleanly. Dangerous because if you ask for code, the agent will routinely produce a working but inconsistent template (sometimes inclusive, sometimes exclusive, sometimes off by one).

The parikshan loop recommendation: solve binary-search, search-insert-position, first-bad-version, and sqrt-x in practice mode with no AI. Only after all four are AC do you turn on AI training and ask it to Optimise or Review a solution. The point is to lock the template into your fingers; only then can you tell when the agent's template is correct.

The proctor and integrity engine track which features you used. A submission that uses no AI carries no AI tag; a submission that used AI is annotated, so a future review can weigh that signal honestly.

In the AI-integrated workspace

In production, binary search is the algorithm an agent will write most often, because so much enterprise code is "find the first row that ..." or "find the largest X that fits". The agent will get most of them right and a few catastrophically wrong, and the wrong ones will look like the right ones.

Your verification ritual on every agent-generated binary search:

Ask: "what is the loop invariant?" The agent should state it in one sentence.
Ask: "is hi inclusive or exclusive?" The agent should know.
Trace the smallest non-trivial input (length 0, length 1, length 2). Generated code that handles length 3+ correctly often dies on length 0.
Trace an input where the target is not present. The -1 (or sentinel) return must be reached.

Engineers who can do that ritual in under a minute are the ones who ship correct binary-search code in seven languages despite each having different integer semantics. parikshan trains exactly that ritual.