sqrt-x

Difficulty: Easy · Topic: Binary Search · Practice it on parikshan: /practice/sqrt-x/start

The story

NVIDIA's CUDA scheduler decides how many threads per block fit into shared memory. The maximum threads-per-block is the largest integer t such that t * t * bytes_per_thread <= shared_memory_size. There is no array; there is a feasibility predicate on integer values, and the schedule must find the largest feasible t in microseconds across thousands of kernels.

That is sqrt-x in disguise. The integer square root of a number x is the largest r with r * r <= x. There is no sorted array; there is a predicate over the integers that flips from True to False at exactly one boundary. Binary search the predicate, not an array. This is the mental shift this problem teaches.

What's actually being asked

Given a non-negative integer x (up to 2^31 - 1), return the integer square root: the largest non-negative r such that r * r <= x. Do not call a built-in sqrt or isqrt; the spirit is to practice the binary-search loop.

The naive approach

Walk r = 0, 1, 2, ... and return the last r for which r * r <= x. O(sqrt(x)) work. For x = 2^31 - 1, that is about 46,000 iterations. Fast enough on a modern CPU, but it misses the entire point: the problem is here to make you binary-search the answer space, not the input.

Newton's method (r' = (r + x // r) // 2) converges in around 6 iterations for any x up to 2^31. It is faster than binary search in practice, but it is the wrong tool to learn here. Newton is for when you already speak binary search fluently and want a constant-factor win.

The key insight

There is no array. The "thing being searched" is the integer answer itself; the answer space is [0, x].

Define the predicate P(r) = "r * r <= x". P is monotone: True on [0, isqrt(x)], False on (isqrt(x), x]. The boundary between True and False is the answer. You can apply the first-bad-version template directly, treating each integer in [0, x] as a "version" and P(r) as the predicate.

Binary search the answer space. Whenever a spec asks "the largest X such that ..." or "the smallest X such that ...", and the predicate is monotone, you do not search an input array. You search the integer range of legal answers. The shift from "binary-search the array" to "binary-search the answer" is the single most generalisable insight in the Binary Search cluster.

Step-by-step approach

We want the largest r with P(r) = True. The half-open template, with the predicate negated for clarity, finds the smallest r with P(r) = False, and returns one less:

def isqrt(x):
    if x < 2:
        return x                       # 0 and 1 are their own roots
    lo, hi = 1, x // 2 + 1             # half-open: hi is one past the largest plausible root
    while lo < hi:
        mid = (lo + hi) // 2
        if mid * mid > x:              # "not feasible" predicate (monotone True on the right)
            hi = mid
        else:
            lo = mid + 1
    return lo - 1                      # last feasible r

Loop invariant in English: the boundary (first infeasible r) lies in [lo, hi). When the loop exits, lo is the first infeasible value, so lo - 1 is the last feasible one, which is isqrt(x).

The hi = x // 2 + 1 upper bound is correct because for x >= 4, isqrt(x) <= x // 2. We add 1 to keep hi exclusive of the actual answer.

A more direct "find the largest" variant uses a closed window and tracks the best feasible candidate:

def isqrt(x):
    lo, hi, best = 0, x, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid * mid <= x:
            best = mid                 # feasible; remember it and try larger
            lo = mid + 1
        else:
            hi = mid - 1
    return best

Both are correct. The half-open form composes with the rest of the cluster; the closed form is more obvious for "find the largest" framings. Pick one and commit; mixing them is what produces the most off-by-one bugs in generated code.

Worked example

x = 8. Expected answer: isqrt(8) = 2 (since 2*2 = 4 <= 8 < 9 = 3*3).

Using the half-open template:

lo=1, hi=5. mid=3, 3*3=9 > 8 → hi=3.
lo=1, hi=3. mid=2, 2*2=4 ≤ 8 → lo=3.
lo=3, hi=3 → exit. Return 3 - 1 = 2. ✓

Counter-case: x = 0. The guard returns 0 immediately. (The loop would also work, but the guard saves a multiplication and clarifies intent.)

Counter-case: x = 2^31 - 1 = 2147483647. The loop runs around 30 times; the largest mid * mid evaluated is roughly (46341)^2 ≈ 2.15 * 10^9. In Python this is one big-integer multiplication, no overflow. In C or Java with int, this overflows; you must compute mid * mid in 64-bit, or use mid <= x // mid.

Complexity

Time: O(log x) iterations; each iteration is one integer multiplication.
Space: O(1).

For x = 2^31, that is 31 iterations. Newton converges faster (~6 iterations for the same x) but binary search is asymptotically equivalent and dramatically easier to reason about.

Common pitfalls

Searching [0, x] literally. Fine on Python; on C, mid * mid for mid close to x overflows 64-bit signed. The fix is to bound hi to x // 2 + 1 (for x ≥ 4) or use mid <= x // mid as the predicate (no multiplication).
Returning lo instead of lo - 1 (half-open form) or forgetting to track best (closed form). The half-open form returns the first infeasible value; the answer is one less. Mismatching the return statement to the loop invariant is the classic Binary Search bug.
Forgetting x < 2. With hi = x // 2 + 1, the bound is 1 for x = 1 (a one-iteration loop that returns 0, wrong). The guard fixes this cleanly.
Floating-point sqrt. int(math.sqrt(x)) works for small x but loses precision near 2^31. Specifically, math.sqrt(2**31 - 1) rounds in a way that returns 46341 on some platforms, 46340 on others. The whole point of integer square root is to avoid this. Use math.isqrt if you are in production; binary-search it yourself if you are in the parikshan exam.

Where this shows up in the enterprise

NVIDIA / AMD GPU schedulers: thread-block sizing as in the story above. The largest t such that t*t*<bytes> <= shared_mem is binary-searched per kernel launch.
TensorFlow / PyTorch auto-tuning: the largest tile dimension that fits L2 cache during a matmul. Binary search the tile size.
AWS / GCP capacity planning: the smallest server count s such that s * 1/utilisation >= peak_qps. Same template, different predicate.
Stripe payout sizing: the largest batch size that fits Stripe's per-API-call payout limit. Binary-search the batch.
Cloudflare cache tier sizing: the largest cache TTL that keeps hit rate above a target. Binary-search TTL.
Bloomberg risk engines: the largest position size that keeps margin under a threshold. Binary-search the position.

The fingerprint is always the same: a monotone feasibility predicate over an integer range, with no input array. Once you see that fingerprint, the Binary Search template applies and the problem is solved before it is read.

Variants you'll see elsewhere

valid-perfect-square: same predicate, return whether mid * mid == x for some mid.
koko-eating-bananas: binary-search the eating speed; predicate is "can Koko finish in H hours?". The exact same answer-space binary search as sqrt-x at a higher level of abstraction.
capacity-to-ship-packages-within-d-days: binary-search the ship capacity.
split-array-largest-sum: binary-search the largest allowed sum.
find-the-smallest-divisor-given-a-threshold: binary-search the divisor.

Every problem in this list is the same loop with a different predicate. If you can write sqrt-x in 60 seconds, you can write all of these in 90.

How parikshan turns this into a learning loop

sqrt-x is the problem that converts "binary search the array" reflexes into "binary search the answer" reflexes. Recommended sequence on parikshan:

Solve binary-search, search-insert-position, and first-bad-version first.
In practice mode, AI off, write the half-open template with P(r) = mid * mid > x as the predicate. Run public tests.
The dynamic private tests will throw x = 0, x = 1, x = 2, x = 2^31 - 1, and perfect squares like x = 4, x = 2147395600. A template that returns lo instead of lo - 1 fails on every perfect square. The pattern of failures tells you which off-by-one you committed.
Re-solve with the closed-form best-tracking template. Compare the two. Decide which one you prefer; commit to it for the rest of the cluster.

This is one of the few problems where the parikshan AI assistant is genuinely useful in practice mode: ask "is my upper bound tight?" and you will get a careful answer. The integrity penalty is small in practice mode (zero); use it to dig, not to skip.

In the AI-integrated workspace

Cursor and Copilot will routinely write sqrt-x by linear scan ("walk r upward until r*r > x") because the linear version is shorter and equally correct on the small samples. The bug only shows up on x = 2^31 - 1 where the linear scan does 46,000 iterations per call. In a hot path that runs a million times a second, that is 46 billion operations per second; you will see CPU at 100% and have no idea why.

Verification ritual on every agent-generated isqrt:

Ask: "is this O(log x) or O(sqrt(x))?" The agent should be able to defend the answer.
Ask: "does the multiplication mid * mid overflow in C?" The agent should know that Python is exempt and C requires the mid <= x // mid rewrite.
Trace by hand on x = 0, 1, 2, 4, 8. Generated code that gets x = 8 → 2 right will sometimes get x = 1 → 0 wrong because of an over-eager upper bound.

The general lesson: when an interviewer asks "find the largest X such that ...", the correct answer is binary search the answer space, even when the spec does not say "binary search". Engineers who reach for binary search on those phrasings are the ones who write GPU schedulers, capacity planners, and pricing engines. parikshan trains exactly that reflex.