parikshan

missing-number

Difficulty: Easy  ·  Topic: Arrays & Hashing  ·  Practice it on parikshan: /practice/missing-number/start

The story

A platform engineer at PostgreSQL is debugging a sequence-corruption bug. A table uses an auto-increment ID generator that should produce IDs 0..n for n + 1 rows, but a recent failed migration committed n rows with one ID dropped on the floor. The on-call needs the missing ID, fast, to issue a targeted patch insert. The set of IDs that did make it to disk fits in memory (a few hundred thousand integers), but allocating a parallel hash set risks OOM on the already-stressed instance.

That is missing-number. The problem hands you a complete set {0, 1, ..., n} and a near-complete sample (exactly one missing). Your job is to identify the missing value without spending memory you do not have.

What's actually being asked

You are given an array nums of n distinct integers, each drawn from the range [0, n]. Exactly one number from that range is missing. Find and return it. The solution should run in linear time and use only constant extra space beyond the input.

The naive approach

Hash set: load every value into a set, walk i from 0 to n and return the first i not in the set. O(n) time, O(n) space. Works, violates the constant-space ask.

Sort and scan: sort the array; the first index i where nums[i] != i is the missing number; if all match, the answer is n. O(n log n) time, O(1) extra space if you sort in place. Acceptable, slower than necessary.

The key insight

You know the full set {0, 1, ..., n} and you know the array. The missing value is whatever the full set has that the array does not. "Whatever is in one but not the other" can be encoded as an invariant of each collection.

  • Sum invariant: sum({0, 1, ..., n}) = n * (n + 1) / 2. The missing number equals the closed-form expected sum minus the actual sum of the array. One pass to sum. O(n) time, O(1) space.
  • XOR invariant: XOR every number in {0, 1, ..., n} against every number in the array. Pairs cancel because x XOR x = 0, leaving exactly the missing value. O(n) time, O(1) space, and no overflow concerns in fixed-width languages.

Both approaches are linear in time, constant in extra space. In Python, sum is simpler. In C/C++/Java, XOR avoids overflow concerns when n is large.

Step-by-step approach

def missing_number_sum(nums):
    n = len(nums)
    expected = n * (n + 1) // 2
    actual = sum(nums)
    return expected - actual

XOR variant:

def missing_number_xor(nums):
    acc = len(nums)
    for i, v in enumerate(nums):
        acc ^= i ^ v
    return acc

The XOR variant seeds acc with n (because the array's indices only go up to n - 1, missing the index n), then folds in pairs (i, nums[i]) for each i. Every value in the range that appears once in indices and once in nums cancels itself; the single missing value survives.

Worked example

nums = [3, 0, 1]   # n = 3

Sum approach: expected = 3 * 4 / 2 = 6. actual = 3 + 0 + 1 = 4. Missing = 6 - 4 = 2.

XOR approach: seed acc = 3.

inums[i]stepacc
--seed3 (011)
033 ^ 0 ^ 3 = 00 (000)
100 ^ 1 ^ 0 = 11 (001)
211 ^ 2 ^ 1 = 22 (010)

Final acc = 2. Same answer.

Try nums = [3, 0, 1, 4] (n = 4). Expected = 10. Actual = 8. Missing = 2.

Complexity

  • Time: O(n), one pass.
  • Space: O(1) extra.

Hash-set was O(n) time, O(n) space. Sort-and-scan was O(n log n) time, O(1) extra space. The sum and XOR invariants hit both O(n) time and O(1) space at the same time.

Common pitfalls

  • Off-by-one in the sum formula: n * (n + 1) / 2 uses the value of n (the upper bound of the range), which equals the array length here. If you confuse "length minus one" with "length", you get the wrong expected sum.
  • Integer overflow in C++/Java with the sum approach: for n = 10^5, the sum fits comfortably, but at n = 10^9 (different problem) it overflows a 32-bit int. Use 64-bit accumulators or the XOR variant.
  • Forgetting that 0 is in the range: [0, n] includes 0. A common bug is using 1..n and computing a wrong expected sum.
  • Forgetting that the missing value can be 0 or n: scanning for "first index where nums[i] != i" handles in-between cases but misses the endpoints unless you check after the loop.
  • Trying to mutate the input as a marker: works but mutates caller data; if mutation is not allowed in your variant, this is a bug.
  • Floating-point division: n * (n + 1) / 2 in some languages produces a float and loses precision at large n. Always integer-divide (// in Python, / with integer operands in C++).

Where this shows up in the enterprise

  • PostgreSQL / MySQL sequence gap detection: the story above; on-call diagnostic for migration recovery.
  • AWS S3 multipart upload completion: the server must verify all part numbers from 1 to N were uploaded; gaps surface via sum or XOR over reported part numbers.
  • Kafka topic-partition assignment audit: brokers know the expected partition IDs; missing assignments are detected via the same invariant.
  • Cloudflare anycast PoP heartbeat: in a fleet of N PoPs reporting heartbeats every minute, the missing PoP IDs are the invariant difference; sum or XOR over reported IDs identifies which ones missed the window.
  • Stripe webhook delivery sequence: webhooks numbered 1..N; on customer complaints, detect which delivery never made it.
  • GitHub PR numbering audit: across migrations, verify no PR number was orphaned by hash-summing the actual vs expected IDs.
  • Sumo Logic / Splunk index integrity: each log block has a sequence number; the indexer detects gaps using the same arithmetic.
  • Akamai content-delivery chunk validation: chunked file delivery; missing chunk index is found via running XOR of received indices against expected.

Variants you'll see elsewhere

  • find-all-numbers-disappeared-in-an-array: same shape but more than one missing; switch to a marker-mutation approach or a hash set.
  • single-number: every element appears twice except one; the same XOR cancellation trick is the canonical answer.
  • single-number-ii: every element appears three times except one; bit-by-bit modular counting.
  • find-the-duplicate-number: the inverse, find the duplicate in 1..n; Floyd's tortoise-and-hare gives O(1) extra space.
  • gas-station: greedy invariant over an array; same "use the structure to avoid sorting" flavour.
  • xor-of-all-pairs: extends the XOR trick to pairs and triples.

The skill, "spot the closed-form invariant, compute the actual invariant, subtract or XOR", reaches well past missing-number.

How parikshan turns this into a learning loop

Read this article, click into practice. The editor runs your code against public tests plus per-session dynamic private tests generated fresh each attempt, so memorising a gist fails. Hints are graduated: free first hint, small penalty for the second, larger for the third. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, Optimise, each with a small integrity cost so you spend AI deliberately. In practice mode AI chat is free, the right place to compare sum vs XOR vs set. After the verdict, the dispute flow sends contested results to a human reviewer in one click. Attempt in practice, then redo in exam mode without AI. When you can write the two-line sum solution in 90 seconds and explain why XOR is the C++ choice, the pattern is internalised.

In the AI-integrated workspace

An AI agent will write the hash-set version by default. It is the most general and the most memory-wasteful. Your job is to:

  1. Read the constraints. If the problem says "constant extra space", the hash set is wrong. Push back.
  2. Decide between sum and XOR. In Python, sum is fine. In C++ or Java with large n, XOR avoids overflow. Generated code rarely makes this trade-off explicitly.
  3. Check the closed-form formula. The agent will sometimes write n * (n - 1) // 2 (off by one) and silently produce wrong answers.
  4. Stress-test on the endpoints: missing 0, missing n, single-element array (n = 1).
  5. If the variant has more than one missing value, the agent will copy this algorithm and get garbage. Demand a different approach (marker mutation, hash set, or bitmask) explicitly.

A senior engineer reads "find the missing value in [0, n], constant extra space" and immediately maps it to sum-invariant or XOR-invariant. The candidate who can do that directs AI agents productively. The candidate who cannot ships the hash-set version into an OOM-constrained production environment. parikshan trains the first kind.