longest-repeating-character-replacement
Difficulty: Medium · Topic: Sliding Window · Practice it on parikshan: /practice/longest-repeating-character-replacement/start
The story
A bioinformatics team at Illumina sequences a DNA segment and gets a string of A, C, G, T nucleotides. The base-calling algorithm has a known error rate, so up to k positions in any read may be wrong. The team wants to ask: what is the longest stretch of this read that could plausibly have been a uniform run of a single nucleotide, allowing for at most k miscalls? That length feeds into a heuristic for repeat-region detection.
longest-repeating-character-replacement asks exactly this. Generalised: given a string of uppercase letters, you may change at most k characters; return the length of the longest substring you can turn into a single repeated letter.
What's actually being asked
Given a string s and an integer k, find the longest substring you can transform into one of repeating characters by changing at most k characters in that substring. Return its length.
Equivalent reframing (this is the key one): find the longest window in which window_length - count_of_most_frequent_letter <= k. The quantity on the left is "how many characters would I have to change to make the window uniform".
The naive approach
Triple nested: every start i, every end j, count the most frequent letter in s[i..j], check (j - i + 1) - max_count <= k. O(n^3) (or O(n^2 * 26) with a frequency array). For n = 10^5, far too slow.
The key insight
Slide a window. Maintain a frequency array count[26] and a running max_count = max frequency in the window. The window is valid iff (right - left + 1) - max_count <= k.
When the window becomes invalid, shrink from the left. Subtle but critical: when shrinking, you decrement the leaving letter's count but you do not need to recompute max_count. Why? You only care about the largest valid window seen so far. If max_count ever overestimates the true window max, the validity test only becomes looser, which means you might keep a slightly-too-large window for one extra step, but you can never report a window length larger than what max_count plus k allows for. Since max_count is bounded above by the true max frequency of the best window seen up to that point, the answer is correct.
This non-recomputation is the famous "trick" that makes the solution O(n) instead of O(n * 26). It is a beautiful argument and the kind of insight interviewers probe for.
Step-by-step approach
count = [0] * 26,max_count = 0,left = 0,best = 0.- For
rightfrom 0 ton-1:count[s[right] - 'A'] += 1.max_count = max(max_count, count[s[right] - 'A']).- While
(right - left + 1) - max_count > k:count[s[left] - 'A'] -= 1.left += 1.
best = max(best, right - left + 1).
- Print
best.
The inner while actually runs at most once per outer iteration (the window grows by one, then shrinks by at most one to restore validity). So the "while" is really an "if" in disguise. Either works; while is more defensive.
Worked example
s = "AABABBA", k = 1
| right | char | count[A] | count[B] | max_count | window length | valid? (len - max_count <= 1) | action | best |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 1 | 0 | 1 | 1 | yes (0 <= 1) | - | 1 |
| 1 | A | 2 | 0 | 2 | 2 | yes (0 <= 1) | - | 2 |
| 2 | B | 2 | 1 | 2 | 3 | yes (1 <= 1) | - | 3 |
| 3 | A | 3 | 1 | 3 | 4 | yes (1 <= 1) | - | 4 |
| 4 | B | 3 | 2 | 3 | 5 | no (2 > 1) | shrink: drop s[0]=A, count[A]=2, left=1, len=4, valid (1 <= 1) | 4 |
| 5 | B | 2 | 3 | 3 | 5 | no (2 > 1) | shrink: drop s[1]=A, count[A]=1, left=2, len=4, valid (1 <= 1) | 4 |
| 6 | A | 2 | 3 | 3 | 5 | no (2 > 1) | shrink: drop s[2]=B, count[B]=2, left=3, len=4, valid (1 <= 1) | 4 |
Answer: 4 (e.g., AABA becomes AAAA with one B-to-A flip).
Complexity
- Time: O(n). Each pointer moves at most
ntimes. - Space: O(26) = O(1).
Common pitfalls
- Recomputing
max_counton shrink: harmless for correctness, but adds anO(26)factor. Skip it; the loosermax_countis provably fine. - Using a hash map instead of a 26-int array: works, but the problem restricts to uppercase letters, so the array is cleaner and faster.
- Treating
kas the window size:kis the budget for replacements, not the window length. Read the spec twice. - Not understanding why the "loose
max_count" argument works: this is the most common source of doubt. The argument is: the answer is the length of some valid window. While processing that window, themax_countwill equal the true max of that window, so the answer is recorded correctly. After that window, themax_countmay overestimate the current window's true max, but that only delays shrinking, which cannot inflatebest.
Where this shows up in the enterprise
- Illumina / Pacific Biosciences / Oxford Nanopore: repeat-region detection with bounded miscall budgets.
- Spotify / YouTube / Netflix recommender features: longest "almost-uniform" session segment where the user mostly stuck to one category, allowing for a few exploratory clicks.
- AdTech (Google Ads / Meta Ads / TheTradeDesk): longest "near-uniform-creative" window in an impression stream, accepting a budget of off-target impressions.
- Telecom QoE monitoring: longest window with "mostly one error code" allowing for k stray codes.
- Trading microstructure (Citadel / DRW): longest tick window dominated by a single direction (up or down), allowing for k contrary ticks.
- GitHub Actions log triage: longest run of "passing" status with at most k transient flakes, used to declare a flake versus a real regression.
- A/B test bucket integrity: longest run of a single variant with at most k assignment errors.
The shape is "one dominant value tolerating noise". Real life is full of it.
Variants you'll see elsewhere
longest-substring-with-at-most-k-distinct: a dual; window with at most k distinct characters.max-consecutive-ones-iii: same problem restricted to a binary string withkallowed flips.permutation-in-string: fixed-size window, anagram check.subarray-with-at-most-k-distinct: count-based, not length-based.- Any "near-uniform with bounded edits" problem in stream analysis.
How parikshan turns this into a learning loop
This is the first sliding-window problem in the curriculum that uses a non-trivial invariant: "window_length - max_count <= k". The parikshan dynamic private tests will include:
k = 0(no replacements allowed, answer is longest run of the same letter).k = |s|(full budget, answer is|s|).sof length 1 (answer is1).- All same letter (answer is
|s|regardless ofk). - A pathological string where
max_countover-tracks for a long stretch, to test that you did not recompute it.
The integrity-scored AI sparring partner is most useful here to debate the "loose max_count" argument before exam mode. Many candidates either (a) recompute it and pass with a slower O(26n) solution, or (b) skip the recompute and worry they have a bug. Both are correct, but the second deserves an integrity bonus for trusting the invariant.
In the AI-integrated workspace
Agent-generated solutions almost always recompute max_count on shrink because the proof for not recomputing is non-obvious. Your code review checks:
- Does the agent use the "
window_length - max_count <= k" invariant, or does it brute-force the most-frequent-letter check? - Is the frequency structure a fixed 26-int array (fast) or a hash map (slower constant factor)?
- Does the agent record
bestafter the shrink loop, not inside it? - Does the function correctly handle
k = 0(answer is longest run of one letter)?
You do not need to ask the agent to "optimise"; you need to recognise the slow recompute and tell the agent to remove it with the loose-max argument. That recognition is what separates a senior engineer from a junior one who would accept the working-but-slow code. parikshan trains the recognition by making you write the tight version yourself in exam mode.