permutation-in-string

Difficulty: Medium · Topic: Sliding Window · Practice it on parikshan: /practice/permutation-in-string/start

The story

A trust-and-safety team at Discord runs a tokenless screening over message bodies. They have a list of trigger words for moderation review, but bad actors obfuscate by anagramming the letters: face becomes cafe, acef, feca, and so on. The team does not want to enumerate all permutations of every trigger; that is 7! = 5040 for a seven-letter word. Instead, they want a function that, given a trigger word s1 and a message s2, returns true iff s2 contains any anagram of s1 as a contiguous substring.

That is permutation-in-string. It powers anagram-aware searches in messaging, search engines, and document deduplication.

What's actually being asked

Given strings s1 and s2, return true iff some contiguous substring of s2 of length |s1| has the same letter multiset as s1. Equivalent: is some anagram of s1 a substring of s2?

The naive approach

For every length-|s1| substring of s2, sort it and compare to sorted s1. O((|s2| - |s1| + 1) * |s1| log |s1|). For |s2| = 10^5 and |s1| = 10^5, that is 10^5 * 10^5 * log(10^5) ~ 10^11. Times out catastrophically.

Slightly better: for every length-|s1| substring, build a 26-int frequency array and compare to s1's. O(|s2| * 26). Better but still has overhead the slide-by-one approach avoids.

The key insight

A fixed-size sliding window of length |s1| over s2. Maintain a 26-int frequency array for the current window. When the window slides one step right, add the entering character, remove the leaving character: two updates instead of recomputing 26 cells.

To compare against s1's frequency in O(1) rather than O(26), keep a single counter matches = "how many of the 26 letter buckets currently match s1's count". On each slide:

After incrementing the entering letter, if its count now equals s1's, matches += 1; if it was equal before the increment (and now is one too many), matches -= 1.
After decrementing the leaving letter, do the mirror update.
The window is an anagram of s1 iff matches == 26.

This drops the per-slide work from O(26) to O(1), giving O(|s2|) overall.

Step-by-step approach

If |s1| > |s2|, return false.
Build need[26] from s1. Build have[26] from the first |s1| characters of s2.
Compute initial matches = number of letters c where need[c] == have[c]. If matches == 26, return true.
For right from |s1| to |s2| - 1:
- Letter in_c = s2[right]. Increment have[in_c]. Update matches: if have[in_c] now equals need[in_c], matches += 1; else if have[in_c] was equal before (i.e., have[in_c] - 1 == need[in_c]), matches -= 1.
- Letter out_c = s2[right - |s1|]. Decrement have[out_c]. Mirror update.
- If matches == 26, return true.
Return false.

Worked example

s1 = "ab"
s2 = "eidbaooo"

need = [a:1, b:1, ...others 0]. Window size 2.

Initial window "ei": have = [e:1, i:1]. matches = 24 (all the zero-buckets match, plus e and i mismatch, a and b mismatch). Not equal.
Slide to "id": similar, no match.
Slide to "db": have = [b:1, d:1]. a mismatch (need 1, have 0). 25.
Slide to "ba": have = [a:1, b:1]. Now a matches, b matches, d and others are 0 and match. matches = 26. Return true.

Complexity

Time: O(|s1| + |s2|). Initial build is O(|s1| + 26); the slide loop is O(|s2|).
Space: O(1) (two fixed 26-int arrays).

Common pitfalls

Recomputing have from scratch every iteration: O(|s1| * |s2|). Fine for |s1| ~ 10, fatal for |s1| ~ 10^5.
Comparing the 26 cells every iteration: works correctly but adds a 26x factor. The matches counter removes it.
Off-by-one on the slide window: the leaving character is s2[right - |s1|], not s2[right - |s1| + 1] or s2[right - |s1| - 1]. Walk through indices once.
Sorting s1 and every substring: even with one sort of s1, the per-substring sort kills you.
Forgetting the |s1| > |s2| early-return: the window logic does nothing useful in that case and may index out of bounds.

Where this shows up in the enterprise

Discord / Slack / WhatsApp content moderation: anagram-aware trigger-word matching.
Cloudflare WAF / AWS WAF signature scanning: detecting obfuscated keywords in request bodies.
Google Search / Bing: anagram-tolerant query suggestion (less common, but used).
Plagiarism detection (Turnitin, Grammarly): n-gram fingerprints that are bag-of-letters at one stage.
Pharma drug-name screening: detecting confusable drug names where letters are rearranged.
DNA microsatellite detection: detecting repeats with a fixed-multiset signature.
Game design (e.g., Scrabble bots, Words With Friends): finding playable anagrams within a board scan.
Inverted-index compression (Elastic / OpenSearch / Solr): anagram canonicalisation for some token-normalisation passes.

Anywhere "same letters, any order" matters, you find this pattern.

Variants you'll see elsewhere

find-all-anagrams-in-a-string: return all starting indices, not just a boolean.
minimum-window-substring: variable window, multiset is a subset rather than equal.
valid-anagram: trivial case, two strings of equal length, no window needed.
group-anagrams: cluster by sorted letters; uses the same multiset invariant but as a hash key.
substring-with-concatenation-of-all-words: harder generalisation, the multiset is over words not letters.

How parikshan turns this into a learning loop

This problem is where the curriculum first asks "did you really learn the fixed-window pattern, or did you memorise the unique-character version?". The dynamic private tests on parikshan include:

|s1| == |s2| and they are equal (true).
|s1| == |s2| and they are different anagrams (true).
|s1| == |s2| and they are not anagrams (false).
|s1| > |s2| (false; tests the early return).
An s2 that contains many near-misses and one true match at the end (forces the full pass).

A correct solution passes them all in O(|s1| + |s2|). A recompute-every-slide solution times out on the last category. The integrity-scored AI sparring partner is the right place to ask "why does the matches counter give O(1) per slide?" before you write the code in exam mode.

In the AI-integrated workspace

An agent generating an "anagram-in-text" search routine often writes the O(|s1| * |s2|) recompute version because it is the obvious translation of the spec. Your review checks:

Is there a single frequency array that is incrementally updated per slide, not rebuilt?
Is the comparison O(1) via a matches counter, or O(26) via cell-by-cell equality?
Is the leaving-character index right - |s1|? Off-by-ones live here.
Does the function early-return when |s1| > |s2|?

The fix is not to rewrite; it is to point at the inner update and say "increment one, decrement one, do not rebuild". An engineer who can articulate that one line of feedback in code review is the engineer the agent needs.