parikshan

contains-duplicate

Difficulty: Easy  ·  Topic: Arrays & Hashing  ·  Practice it on parikshan: /practice/contains-duplicate/start

The story

A platform engineer at Shopify is debugging a payment retry storm. The mobile app, when network drops, fires the same checkout request multiple times. The backend assigns each attempt a UUID, and the dedup layer must answer one question, fast: in the batch of UUIDs that arrived in the last 30 seconds, did any UUID repeat? If yes, drop the charge; if no, process them all. The batch can be 100,000 UUIDs in a single Cloud Run instance. The dedup function has a 50 ms budget.

That is contains-duplicate. It is the smallest, sharpest version of "I have seen this before". Once you can write it in your sleep, every duplicate-detection feature in your future career, idempotency keys, dedup pipelines, fingerprint caches, becomes a five-minute task.

What's actually being asked

Given an integer array nums, return true if any value appears at least twice; return false if every value is distinct. That is it. No indices, no count, no first-duplicate location, just a yes/no.

The naive approach

Two nested loops. For every i, for every j > i, check nums[i] == nums[j]. If you ever find a match, return true. If both loops finish, return false. This is O(n^2) time, O(1) extra space. At n = 10^5, that is 5 billion comparisons in the worst case. With Python's per-comparison overhead, the timer will trip long before you get a verdict.

A second naive attempt: sort and check adjacent pairs. That works in O(n log n). It is acceptable here, mutates the input, and is the right move if you have no extra memory. But if you have memory, you can do strictly better.

The key insight

The question "have I seen this value already?" is a membership question. Membership in O(1) is exactly what a hash set is built for. Walk the array once. For each value, ask the set first: do you contain this? If yes, return true. If no, add it and move on. If you finish the loop without ever hitting yes, return false.

The whole correctness argument fits in one line: a duplicate (i, j) with i < j is found exactly when the loop reaches index j, because by then nums[i] is already in the set. You never miss one, you never report one twice.

Step-by-step approach

  1. Create an empty hash set seen.
  2. For each value x in nums:
    • If x is in seen, return true. Done.
    • Otherwise, add x to seen.
  3. If the loop completes, return false.
def contains_duplicate(nums):
    seen = set()
    for x in nums:
        if x in seen:
            return True
        seen.add(x)
    return False

That is the entire production algorithm.

Worked example

nums = [1, 2, 3, 1]
stepxseen beforehit?seen after
11{}no{1}
22{1}no{1, 2}
33{1, 2}no{1, 2, 3}
41{1, 2, 3}yesreturn true

Compare with nums = [1, 2, 3, 4]: the table has no hit, the loop ends, we return false.

Complexity

  • Time: O(n) average case (Python set lookup is amortised O(1)).
  • Space: O(n) for the hash set.

Brute force was O(n^2) time, O(1) space. Sort-and-scan is O(n log n) time, O(1) extra space (in-place sort). The hash-set approach buys linear time by paying linear memory. In every real system, that trade wins.

Common pitfalls

  • Adding before checking: writing seen.add(x); if x in seen: return True always returns true after the first element. Always check first, then add.
  • Comparing length to set size as the whole solution: len(nums) != len(set(nums)) is correct but builds the entire set even when the very first repeat would have answered the question. Wastes time and memory on the early-exit case. Fine as a one-liner in a Jupyter notebook, suboptimal in production.
  • Trying to do it in one expression with any and a list comprehension: builds a list even after the answer is decided. Use a generator (any(...)) or a plain loop.
  • Assuming nums is sortable in place when the caller does not allow mutation: silent mutation bites callers in production. Either copy first or use the hash set.
  • Forgetting that hash sets are not safe across processes: if dedup must hold across two pods, the set lives in Redis, not in Python memory. Same algorithm, different store.

Where this shows up in the enterprise

  • Stripe idempotency keys: every payment request carries an idempotency key; the API rejects the second request with the same key for 24 hours. The check is a contains-duplicate over Redis.
  • Shopify checkout retries: the dedup story above; thousands of mobile clients re-fire the same order on flaky LTE and a hash-set check at the edge prevents double charges.
  • AWS S3 multipart uploads: each part has an ETag; the server rejects duplicate part numbers within a single upload. contains-duplicate per upload session.
  • Cloudflare WAF token replay protection: short-lived JWTs are cached for their lifetime; second use within the TTL is rejected.
  • GitHub webhook delivery dedup: webhooks can deliver the same event multiple times. Receivers maintain a hash set of delivery IDs in the last hour.
  • Google Ads click fraud screening: a click ID arriving twice within the de-dup window is a strong fraud signal; the system flags and drops.
  • Postgres unique constraints: at the database layer, UNIQUE is contains-duplicate enforced before insert. The B-tree index does the membership lookup in O(log n).
  • Apache Kafka exactly-once semantics: the broker tracks producer IDs and sequence numbers in a hash table to reject duplicate batches on retry.

If you see "deduplicate", "exactly-once", "idempotent", or "replay protection" in a design doc, the underlying primitive is this problem.

Variants you'll see elsewhere

  • contains-duplicate-ii: a duplicate is only a duplicate if the two indices are within k of each other; replace the unbounded set with a sliding window.
  • contains-duplicate-iii: extends ii to value-range tolerance; needs a sorted structure like a TreeSet or bucket sort.
  • find-the-duplicate-number: array of n+1 values in [1, n] with exactly one duplicate; bonus is to solve in O(1) extra space using Floyd's cycle detection.
  • first-unique-character (in this bank): the inverse, "first non-repeating", needs a counter not a set.
  • valid-sudoku: 27 simultaneous contains-duplicate checks across rows, columns, and 3x3 boxes.

The whole family shares a single skill: noticing when a question can be answered by "have I seen this".

How parikshan turns this into a learning loop

After you read this article and click the practice link, four layers kick in. The editor runs your code against public tests instantly and against per-session dynamic private tests generated freshly each attempt, so memorising someone else's gist does not pass. Hints are graduated: take hint 1 free (a one-line nudge), hint 2 for a small integrity hit, hint 3 for a larger penalty that gives you the algorithmic skeleton. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, and Optimise, each with a small integrity cost, so you spend assistance deliberately. In practice mode, AI chat is free and unpenalised, the right place to ask "why does adding before checking break it?". Finally, the dispute flow sends a contested verdict to a human reviewer in one click. Read this article, attempt in practice mode, accept the integrity hit only when you genuinely need a hint, then redo in exam mode untouched. When you finish under three minutes with no AI, the pattern is muscle memory.

In the AI-integrated workspace

In your future job, no one will pay you to type contains-duplicate from scratch. An agent will. Your job is to:

  1. Recognise the pattern in a feature spec written by a product manager. "Drop duplicate clicks" reads as contains-duplicate over a sliding time window; you must spot it before the agent generates 200 lines of bespoke logic.
  2. Tell the agent precisely: "use a hash set with a per-element check-before-insert; reject the second seen value".
  3. Review the agent's code for the three classic mistakes: inserting before checking, building the full set when an early exit is correct, and hidden O(n^2) creeping in via if x in list_so_far.
  4. Verify the agent considered the distributed case. If two pods can both call dedup(), the set must live in Redis or a database, not in process memory. AI agents miss this constantly.

The candidates who direct agents productively are the ones who can read a spec and say "this is contains-duplicate over a 30-second window, store the keys in Redis with TTL". Candidates who cannot recognise the shape end up approving generated code that looks fine on the sample and ships a double-charge bug to production. parikshan is built to train the first kind.