longest-consecutive-sequence

Difficulty: Medium · Topic: Arrays & Hashing · Practice it on parikshan: /practice/longest-consecutive-sequence/start

The story

A platform engineer at GitHub is investigating "issue numbering gaps" inside a single repository. Issues and PRs share a numeric counter, but soft-deleted, spam-filtered, and admin-purged records leave gaps. The product team wants the longest contiguous run of present numbers in the database, because that run is the period of healthy contribution velocity to show on the repo insights page. The query returns the set of present issue numbers; the code must, in linear time, find the longest run of consecutive integers regardless of input order.

That is longest-consecutive-sequence. It is a classic in the "stop sorting, start hashing" school. The temptation is to sort the array and walk it; the discipline is to ask whether sorting is actually necessary.

What's actually being asked

Given an unsorted array of integers nums, return the length of the longest run of consecutive integers that appear in the array. Consecutive means values of the form k, k+1, k+2, ... for some starting k. The values do not need to be adjacent in the input. Duplicates count once. Aim for O(n) total time.

The naive approach

Sort, walk, count adjacent diffs of 1. O(n log n) time, O(1) extra space (in-place sort). At n = 10^5, this is fast enough to pass. But the problem specifies "aim for O(n)", and there is an elegant way to get there.

A second naive variant: for each x, walk outward from x checking x+1, x+2, ... and x-1, x-2, ... via linear scans of the input. That is O(n^2). Correct, terrible.

The key insight

Sorting wastes the structural opportunity. The question is: starting from some value k, how long is the run k, k+1, k+2, ... that appears in the array? You can answer this in O(1) per step if you have a hash set of the values. So the question becomes: for which starting points k do you bother to walk?

Only walk from a value x if x - 1 is not in the set. Otherwise, some smaller value will start the same run, and walking from x would re-count work the smaller start already does. With this guard, every value is part of at most one successful walk (the one starting at the smallest member of its run). All other "tries" are rejected by the x - 1 in s check in O(1). Total work is O(n).

That guard, "only start when you're the head of the run", is the whole insight. Without it, the algorithm is O(n^2) in the worst case.

Step-by-step approach

Build s = set(nums). This deduplicates and gives O(1) membership.
Initialise best = 0.
For each x in s:
- If x - 1 is in s, skip (not a run head).
- Otherwise, walk: cur = x, length = 0. While cur is in s, increment length and cur. Update best = max(best, length).
Return best.

def longest_consecutive(nums):
    s = set(nums)
    best = 0
    for x in s:
        if x - 1 in s:
            continue
        cur = x
        length = 0
        while cur in s:
            length += 1
            cur += 1
        best = max(best, length)
    return best

Worked example

nums = [100, 4, 200, 1, 3, 2]

Build s = {1, 2, 3, 4, 100, 200}.

x	x - 1 in s?	skip?	walk	length	best
100	no (99)	walk	100	1	1
4	yes (3)	skip	-	-	1
200	no (199)	walk	200	1	1
1	no (0)	walk	1, 2, 3, 4	4	4
3	yes (2)	skip	-	-	4
2	yes (1)	skip	-	-	4

Result: 4. The successful walk happened exactly once, from the head of the run. Everything else was rejected at the guard.

Complexity

Time: O(n) amortised. Each value enters at most one successful walk and is rejected in O(1) on every "not a head" attempt.
Space: O(n) for the hash set.

Sort-based was O(n log n) time, O(1) space (if in-place sort is allowed). Hash-set approach trades extra memory for linear time. At n = 10^5 the difference in wall time is small; the elegance is in being able to demonstrate the linear-time argument.

Common pitfalls

Walking from every value without the guard: turns the algorithm O(n^2). The x - 1 in s check is the entire correctness-of-complexity argument; without it the asymptotic analysis fails.
Iterating for x in nums instead of for x in s: causes duplicate starts on repeated values. With the guard, both still produce the correct answer but you do redundant work. Iterate over the set.
Off-by-one in the run length: starting length = 1 after the first match and then incrementing inside the while loop double-counts. Start at 0, increment inside the loop.
Forgetting empty input: n = 1 should return 1, not 0. n = 0 should return 0. Don't return best initialised wrong.
Using a sorted list instead of a set for the x - 1 in s check: that turns the O(1) lookup into O(log n), and total time into O(n log n). Defeats the point.
Trying to do it with sorted(set(nums)) and adjacent diffs: correct but O(n log n). Mention this if the interviewer asks for a simpler alternative; do not present it as the answer when "aim for O(n)" is on the wall.

Where this shows up in the enterprise

GitHub issue numbering gaps: the story above; the repo insights page literally shows this run.
Datadog metric continuity: longest run of consecutive 1-minute buckets where a metric was reported. Identifies the longest healthy reporting streak.
Stripe nightly settlement runs: longest sequence of days a merchant settled without a hold; used in trust scoring.
AWS CloudWatch alarm "ok streaks": monitoring dashboards show longest consecutive period an alarm stayed green.
Splunk index integrity checks: detect longest run of consecutively present block IDs to verify index health.
Google Photos memory grouping: longest run of consecutive days with photo uploads; powers the "you were on vacation for 12 days" story.
Strava streaks: longest consecutive days a user ran. Same shape, dates instead of integers.
Bloomberg pricing feed continuity: longest run of consecutive trading days a security has a price; gap-detection signal for data quality.

Whenever a system shows "longest streak" or "longest contiguous run" over a numeric or date-time axis, the underlying algorithm is this.

Variants you'll see elsewhere

longest-consecutive-sequence-ii: return the actual run, not just the length.
binary-tree-longest-consecutive-sequence: same shape, walked over a tree.
longest-increasing-subsequence: harder; values need not be consecutive integers and the DP is O(n log n).
find-all-numbers-disappeared-in-an-array: find what is missing inside [1, n].
summary-ranges: collapse an array of integers into a list of consecutive-range strings; same "walk while in-set" logic.
arithmetic-slices: consecutive index slices with a constant step; same idea generalised to step != 1.

How parikshan turns this into a learning loop

Read this article, click into practice. The editor runs your code against public tests plus per-session dynamic private tests generated fresh per attempt, so memorising a gist fails. Hints are graduated: free first hint, small penalty for the second, larger for the algorithmic skeleton. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, Optimise, each with a small integrity cost. In practice mode, AI chat is free, the place to argue with yourself about why the x - 1 in s guard makes the algorithm linear. After the verdict, the dispute flow sends contested results to a human reviewer in one click. Attempt in practice, redo in exam mode without AI. When you can deliver the eight-line solution in three minutes and explain the amortised-O(n) argument out loud, the pattern is learnt.

In the AI-integrated workspace

An AI agent will write the sorted(set(nums)) + adjacent-diff solution by default. It is correct and O(n log n). Your review job is:

Decide whether the linear-time variant matters. For n = 10^5 and a one-off batch job, the agent's answer is fine. For a hot path called inside a tight loop on millions of users, demand the hash-set version.
If the agent attempts the hash-set version, verify the guard is correct. Many generated implementations walk from every value and accidentally end up O(n^2), masked because the test cases happen to be short.
Check the iteration is over set(nums), not nums. Iterating over nums with duplicates still gives the right answer but does redundant work the agent does not realise it incurs.
Stress on adversarial inputs: a single value, all equal, dense ranges ([0..10^5]), and inputs with many small runs.

A senior engineer reads the spec and immediately maps it to "hash set + only-walk-from-run-head". The candidate who can do that productively turns a "make this O(n)" review comment into one line. The candidate who cannot wonders why the production batch job takes ten minutes longer than the prototype suggested. parikshan trains the first kind.