parikshan

longest-repeating-character-replacement

Difficulty: Medium  ·  Topic: Sliding Window  ·  Practice it on parikshan: /practice/longest-repeating-character-replacement/start

The story

A bioinformatics team at Illumina sequences a DNA segment and gets a string of A, C, G, T nucleotides. The base-calling algorithm has a known error rate, so up to k positions in any read may be wrong. The team wants to ask: what is the longest stretch of this read that could plausibly have been a uniform run of a single nucleotide, allowing for at most k miscalls? That length feeds into a heuristic for repeat-region detection.

longest-repeating-character-replacement asks exactly this. Generalised: given a string of uppercase letters, you may change at most k characters; return the length of the longest substring you can turn into a single repeated letter.

What's actually being asked

Given a string s and an integer k, find the longest substring you can transform into one of repeating characters by changing at most k characters in that substring. Return its length.

Equivalent reframing (this is the key one): find the longest window in which window_length - count_of_most_frequent_letter <= k. The quantity on the left is "how many characters would I have to change to make the window uniform".

The naive approach

Triple nested: every start i, every end j, count the most frequent letter in s[i..j], check (j - i + 1) - max_count <= k. O(n^3) (or O(n^2 * 26) with a frequency array). For n = 10^5, far too slow.

The key insight

Slide a window. Maintain a frequency array count[26] and a running max_count = max frequency in the window. The window is valid iff (right - left + 1) - max_count <= k.

When the window becomes invalid, shrink from the left. Subtle but critical: when shrinking, you decrement the leaving letter's count but you do not need to recompute max_count. Why? You only care about the largest valid window seen so far. If max_count ever overestimates the true window max, the validity test only becomes looser, which means you might keep a slightly-too-large window for one extra step, but you can never report a window length larger than what max_count plus k allows for. Since max_count is bounded above by the true max frequency of the best window seen up to that point, the answer is correct.

This non-recomputation is the famous "trick" that makes the solution O(n) instead of O(n * 26). It is a beautiful argument and the kind of insight interviewers probe for.

Step-by-step approach

  1. count = [0] * 26, max_count = 0, left = 0, best = 0.
  2. For right from 0 to n-1:
    • count[s[right] - 'A'] += 1.
    • max_count = max(max_count, count[s[right] - 'A']).
    • While (right - left + 1) - max_count > k:
      • count[s[left] - 'A'] -= 1.
      • left += 1.
    • best = max(best, right - left + 1).
  3. Print best.

The inner while actually runs at most once per outer iteration (the window grows by one, then shrinks by at most one to restore validity). So the "while" is really an "if" in disguise. Either works; while is more defensive.

Worked example

s = "AABABBA", k = 1
rightcharcount[A]count[B]max_countwindow lengthvalid? (len - max_count <= 1)actionbest
0A1011yes (0 <= 1)-1
1A2022yes (0 <= 1)-2
2B2123yes (1 <= 1)-3
3A3134yes (1 <= 1)-4
4B3235no (2 > 1)shrink: drop s[0]=A, count[A]=2, left=1, len=4, valid (1 <= 1)4
5B2335no (2 > 1)shrink: drop s[1]=A, count[A]=1, left=2, len=4, valid (1 <= 1)4
6A2335no (2 > 1)shrink: drop s[2]=B, count[B]=2, left=3, len=4, valid (1 <= 1)4

Answer: 4 (e.g., AABA becomes AAAA with one B-to-A flip).

Complexity

  • Time: O(n). Each pointer moves at most n times.
  • Space: O(26) = O(1).

Common pitfalls

  • Recomputing max_count on shrink: harmless for correctness, but adds an O(26) factor. Skip it; the looser max_count is provably fine.
  • Using a hash map instead of a 26-int array: works, but the problem restricts to uppercase letters, so the array is cleaner and faster.
  • Treating k as the window size: k is the budget for replacements, not the window length. Read the spec twice.
  • Not understanding why the "loose max_count" argument works: this is the most common source of doubt. The argument is: the answer is the length of some valid window. While processing that window, the max_count will equal the true max of that window, so the answer is recorded correctly. After that window, the max_count may overestimate the current window's true max, but that only delays shrinking, which cannot inflate best.

Where this shows up in the enterprise

  • Illumina / Pacific Biosciences / Oxford Nanopore: repeat-region detection with bounded miscall budgets.
  • Spotify / YouTube / Netflix recommender features: longest "almost-uniform" session segment where the user mostly stuck to one category, allowing for a few exploratory clicks.
  • AdTech (Google Ads / Meta Ads / TheTradeDesk): longest "near-uniform-creative" window in an impression stream, accepting a budget of off-target impressions.
  • Telecom QoE monitoring: longest window with "mostly one error code" allowing for k stray codes.
  • Trading microstructure (Citadel / DRW): longest tick window dominated by a single direction (up or down), allowing for k contrary ticks.
  • GitHub Actions log triage: longest run of "passing" status with at most k transient flakes, used to declare a flake versus a real regression.
  • A/B test bucket integrity: longest run of a single variant with at most k assignment errors.

The shape is "one dominant value tolerating noise". Real life is full of it.

Variants you'll see elsewhere

  • longest-substring-with-at-most-k-distinct: a dual; window with at most k distinct characters.
  • max-consecutive-ones-iii: same problem restricted to a binary string with k allowed flips.
  • permutation-in-string: fixed-size window, anagram check.
  • subarray-with-at-most-k-distinct: count-based, not length-based.
  • Any "near-uniform with bounded edits" problem in stream analysis.

How parikshan turns this into a learning loop

This is the first sliding-window problem in the curriculum that uses a non-trivial invariant: "window_length - max_count <= k". The parikshan dynamic private tests will include:

  • k = 0 (no replacements allowed, answer is longest run of the same letter).
  • k = |s| (full budget, answer is |s|).
  • s of length 1 (answer is 1).
  • All same letter (answer is |s| regardless of k).
  • A pathological string where max_count over-tracks for a long stretch, to test that you did not recompute it.

The integrity-scored AI sparring partner is most useful here to debate the "loose max_count" argument before exam mode. Many candidates either (a) recompute it and pass with a slower O(26n) solution, or (b) skip the recompute and worry they have a bug. Both are correct, but the second deserves an integrity bonus for trusting the invariant.

In the AI-integrated workspace

Agent-generated solutions almost always recompute max_count on shrink because the proof for not recomputing is non-obvious. Your code review checks:

  1. Does the agent use the "window_length - max_count <= k" invariant, or does it brute-force the most-frequent-letter check?
  2. Is the frequency structure a fixed 26-int array (fast) or a hash map (slower constant factor)?
  3. Does the agent record best after the shrink loop, not inside it?
  4. Does the function correctly handle k = 0 (answer is longest run of one letter)?

You do not need to ask the agent to "optimise"; you need to recognise the slow recompute and tell the agent to remove it with the loose-max argument. That recognition is what separates a senior engineer from a junior one who would accept the working-but-slow code. parikshan trains the recognition by making you write the tight version yourself in exam mode.

longest-repeating-character-replacement