parikshan

longest-substring-without-repeating

Difficulty: Medium  ·  Topic: Sliding Window  ·  Practice it on parikshan: /practice/longest-substring-without-repeating/start

The story

A backend team at GitHub is auditing access tokens. Each token, when first issued, is supposed to use a non-repeating prefix as its short identifier. A migration script must scan every existing token and report the longest stretch of its body where no character repeats. Tokens with long "unique runs" are flagged as candidates for the new short-id field; tokens with no clean run are flagged for rotation. They send you the token string. You return one integer: the length of the longest substring with all distinct characters.

That is longest-substring-without-repeating. The same pattern powers de-duplication on log streams, gesture-input recognition, and DNA sequence k-mer profiling.

What's actually being asked

Given a string s, return the length of the longest contiguous substring of s in which every character is distinct.

"Substring" means contiguous. "Subsequence" would be a different (harder) problem. On pwwkew, the answer is 3 from wke (or kew), not 4 from the subsequence pwke.

The naive approach

Two nested loops. For every start index i, expand a substring forward while characters stay unique (using a set), recording the longest. Worst case O(n^2). For n = 5 * 10^4, that is 2.5 billion. Times out.

The key insight

A sliding window with two pointers left and right. The invariant: every character in the window s[left..right] is unique. Walk right forward one step at a time:

  • If s[right] is already in the window, advance left (kicking characters out of the window) until s[right] is no longer in the window.
  • Record right - left + 1 as a candidate length.
  • Maintain a seen structure (a set, or a dict mapping char to its latest index) to make membership and removal O(1).

The fastest version uses the last-seen index trick: when s[right] was previously seen at index prev, jump left to max(left, prev + 1) in O(1). No removal loop. One pass.

Step-by-step approach

  1. last = {} (char to its most recent index), left = 0, best = 0.
  2. For right from 0 to n-1:
    • If s[right] is in last and last[s[right]] >= left, set left = last[s[right]] + 1.
    • last[s[right]] = right.
    • best = max(best, right - left + 1).
  3. Print best.

The last[..] >= left check is the subtle bit: a character can be in the last map from a previous, already-jumped-past collision. Without that check, left would jump backwards, which is wrong.

Worked example

s = "abcabcbb"
rights[right]last (before)left (after)last (after)windowbest
0a{}0{a:0}"a"1
1b{a:0}0{a:0,b:1}"ab"2
2c{a:0,b:1}0{a:0,b:1,c:2}"abc"3
3alast[a]=0 >= 01{a:3,b:1,c:2}"bca"3
4blast[b]=1 >= 12{a:3,b:4,c:2}"cab"3
5clast[c]=2 >= 23{a:3,b:4,c:5}"abc"3
6blast[b]=4 >= 35{a:3,b:6,c:5}"cb"3
7blast[b]=6 >= 57{a:3,b:7,c:5}"b"3

Answer: 3.

Complexity

  • Time: O(n), single pass.
  • Space: O(min(n, A)) where A is the alphabet size. For ASCII strings, effectively O(1).

Each of left and right only moves forward. Together they move at most 2n times. That is the amortised argument that makes this O(n) even though the inner update looks like it might be O(window).

Common pitfalls

  • Forgetting the last[..] >= left guard: lets left jump backward, which over-counts the window.
  • Using a set without index info: forces you to remove characters one at a time as left advances, which is correct and still O(n) amortised but slower than the index-jump trick.
  • Confusing substring with subsequence: the problem says contiguous. A subsequence problem (longest unique-character subsequence) has a different, much smaller answer space.
  • Resetting the dictionary on every restart: do not. The dictionary is a single source of truth across the whole pass.

Where this shows up in the enterprise

  • GitHub / GitLab / Bitbucket: token-prefix uniqueness scans, short-id generation.
  • Twilio / SendGrid / Mailgun: detecting longest unique-character segments in email headers for spam classification.
  • Splunk / Datadog log dedup: longest run of unique event types within a session window.
  • Genomic pipelines at Illumina / 23andMe: longest k-mer with no repeated nucleotide, a building block for read-quality scoring.
  • Stripe fraud signals: longest unique-character run in a card-not-present descriptor field.
  • Captcha generation at Cloudflare / hCaptcha: ensuring the generated string has a minimum unique-character run.
  • DNA / RNA bioinformatics (BLAST, Bowtie): seed-extension that needs unique-character anchors.

Wherever "no repeat in a window" matters, this pattern is the right hammer.

Variants you'll see elsewhere

  • longest-substring-with-at-most-k-distinct: relax "all distinct" to "at most k distinct"; same sliding window, swap the set for a counter map.
  • longest-substring-with-exactly-k-distinct: compute "at most k" minus "at most k-1".
  • permutation-in-string: fixed window length, look for an anagram match.
  • minimum-window-substring: dual problem; minimum, not maximum, and the constraint is "contains all of T".
  • longest-repeating-character-replacement: variant of "at most k different from the majority".

Every "longest window with property P" problem has the same scaffold.

How parikshan turns this into a learning loop

This problem is where many candidates first see the while versus if distinction that the topic primer warns about. The dynamic private tests on parikshan will probe both:

  • The "all unique" case (abcdefg) where the inner shrink never fires.
  • The "all repeats" case (aaaa) where the inner shrink fires every iteration.
  • The "spaced repeat" case (abba) where the naïve "shrink one position" would be wrong and the last >= left guard is essential.

If you write if instead of while (or the equivalent max(left, prev+1) instead of a plain prev+1), the abba test will fail. Fix it once and you internalise the invariant for life. The AI sparring partner is genuinely useful here in practice mode to walk through abba keystroke by keystroke if you need it.

In the AI-integrated workspace

When an agent writes "longest unique" code, the four checks for code review:

  1. Is the inner shrink correct? Specifically, can left jump backwards? It must not.
  2. Does the membership data structure store indices, not just presence? Indices enable the O(1) jump; presence forces a slower loop.
  3. Is the answer recorded after the jump (when the window is valid), not during it?
  4. Is the alphabet assumption correct? ASCII versus full Unicode versus a custom byte alphabet changes the space bound (and the data-structure choice: a fixed-size array beats a hash map when the alphabet is small).

The pattern recognition is the skill. A senior engineer reading an agent's O(n^2) solution should be able to say "this is a sliding-window problem, switch to two pointers with a last-seen map" without re-deriving the algorithm. parikshan exists to make that recognition automatic.

longest-substring-without-repeating