permutation-in-string
Difficulty: Medium · Topic: Sliding Window · Practice it on parikshan: /practice/permutation-in-string/start
The story
A trust-and-safety team at Discord runs a tokenless screening over message bodies. They have a list of trigger words for moderation review, but bad actors obfuscate by anagramming the letters: face becomes cafe, acef, feca, and so on. The team does not want to enumerate all permutations of every trigger; that is 7! = 5040 for a seven-letter word. Instead, they want a function that, given a trigger word s1 and a message s2, returns true iff s2 contains any anagram of s1 as a contiguous substring.
That is permutation-in-string. It powers anagram-aware searches in messaging, search engines, and document deduplication.
What's actually being asked
Given strings s1 and s2, return true iff some contiguous substring of s2 of length |s1| has the same letter multiset as s1. Equivalent: is some anagram of s1 a substring of s2?
The naive approach
For every length-|s1| substring of s2, sort it and compare to sorted s1. O((|s2| - |s1| + 1) * |s1| log |s1|). For |s2| = 10^5 and |s1| = 10^5, that is 10^5 * 10^5 * log(10^5) ~ 10^11. Times out catastrophically.
Slightly better: for every length-|s1| substring, build a 26-int frequency array and compare to s1's. O(|s2| * 26). Better but still has overhead the slide-by-one approach avoids.
The key insight
A fixed-size sliding window of length |s1| over s2. Maintain a 26-int frequency array for the current window. When the window slides one step right, add the entering character, remove the leaving character: two updates instead of recomputing 26 cells.
To compare against s1's frequency in O(1) rather than O(26), keep a single counter matches = "how many of the 26 letter buckets currently match s1's count". On each slide:
- After incrementing the entering letter, if its count now equals
s1's,matches += 1; if it was equal before the increment (and now is one too many),matches -= 1. - After decrementing the leaving letter, do the mirror update.
- The window is an anagram of
s1iffmatches == 26.
This drops the per-slide work from O(26) to O(1), giving O(|s2|) overall.
Step-by-step approach
- If
|s1| > |s2|, return false. - Build
need[26]froms1. Buildhave[26]from the first|s1|characters ofs2. - Compute initial
matches= number of letterscwhereneed[c] == have[c]. Ifmatches == 26, return true. - For
rightfrom|s1|to|s2| - 1:- Letter
in_c = s2[right]. Incrementhave[in_c]. Updatematches: ifhave[in_c]now equalsneed[in_c],matches += 1; else ifhave[in_c]was equal before (i.e.,have[in_c] - 1 == need[in_c]),matches -= 1. - Letter
out_c = s2[right - |s1|]. Decrementhave[out_c]. Mirror update. - If
matches == 26, return true.
- Letter
- Return false.
Worked example
s1 = "ab"
s2 = "eidbaooo"
need = [a:1, b:1, ...others 0]. Window size 2.
- Initial window
"ei":have = [e:1, i:1].matches= 24 (all the zero-buckets match, pluseandimismatch,aandbmismatch). Not equal. - Slide to
"id": similar, no match. - Slide to
"db":have = [b:1, d:1].amismatch (need 1, have 0). 25. - Slide to
"ba":have = [a:1, b:1]. Nowamatches,bmatches,dand others are 0 and match.matches = 26. Return true.
Complexity
- Time: O(|s1| + |s2|). Initial build is
O(|s1| + 26); the slide loop isO(|s2|). - Space: O(1) (two fixed 26-int arrays).
Common pitfalls
- Recomputing
havefrom scratch every iteration:O(|s1| * |s2|). Fine for|s1| ~ 10, fatal for|s1| ~ 10^5. - Comparing the 26 cells every iteration: works correctly but adds a 26x factor. The
matchescounter removes it. - Off-by-one on the slide window: the leaving character is
s2[right - |s1|], nots2[right - |s1| + 1]ors2[right - |s1| - 1]. Walk through indices once. - Sorting
s1and every substring: even with one sort ofs1, the per-substring sort kills you. - Forgetting the
|s1| > |s2|early-return: the window logic does nothing useful in that case and may index out of bounds.
Where this shows up in the enterprise
- Discord / Slack / WhatsApp content moderation: anagram-aware trigger-word matching.
- Cloudflare WAF / AWS WAF signature scanning: detecting obfuscated keywords in request bodies.
- Google Search / Bing: anagram-tolerant query suggestion (less common, but used).
- Plagiarism detection (Turnitin, Grammarly): n-gram fingerprints that are bag-of-letters at one stage.
- Pharma drug-name screening: detecting confusable drug names where letters are rearranged.
- DNA microsatellite detection: detecting repeats with a fixed-multiset signature.
- Game design (e.g., Scrabble bots, Words With Friends): finding playable anagrams within a board scan.
- Inverted-index compression (Elastic / OpenSearch / Solr): anagram canonicalisation for some token-normalisation passes.
Anywhere "same letters, any order" matters, you find this pattern.
Variants you'll see elsewhere
find-all-anagrams-in-a-string: return all starting indices, not just a boolean.minimum-window-substring: variable window, multiset is a subset rather than equal.valid-anagram: trivial case, two strings of equal length, no window needed.group-anagrams: cluster by sorted letters; uses the same multiset invariant but as a hash key.substring-with-concatenation-of-all-words: harder generalisation, the multiset is over words not letters.
How parikshan turns this into a learning loop
This problem is where the curriculum first asks "did you really learn the fixed-window pattern, or did you memorise the unique-character version?". The dynamic private tests on parikshan include:
|s1| == |s2|and they are equal (true).|s1| == |s2|and they are different anagrams (true).|s1| == |s2|and they are not anagrams (false).|s1| > |s2|(false; tests the early return).- An
s2that contains many near-misses and one true match at the end (forces the full pass).
A correct solution passes them all in O(|s1| + |s2|). A recompute-every-slide solution times out on the last category. The integrity-scored AI sparring partner is the right place to ask "why does the matches counter give O(1) per slide?" before you write the code in exam mode.
In the AI-integrated workspace
An agent generating an "anagram-in-text" search routine often writes the O(|s1| * |s2|) recompute version because it is the obvious translation of the spec. Your review checks:
- Is there a single frequency array that is incrementally updated per slide, not rebuilt?
- Is the comparison
O(1)via amatchescounter, orO(26)via cell-by-cell equality? - Is the leaving-character index
right - |s1|? Off-by-ones live here. - Does the function early-return when
|s1| > |s2|?
The fix is not to rewrite; it is to point at the inner update and say "increment one, decrement one, do not rebuild". An engineer who can articulate that one line of feedback in code review is the engineer the agent needs.