group-anagrams
Difficulty: Medium · Topic: Arrays & Hashing · Practice it on parikshan: /practice/group-anagrams/start
The story
A data engineer at Netflix is cleaning the show-search index. Users type titles every which way: BoJack Horseman, bojack horseman, BOJACK HORSEMAN, and sometimes BoaJck Horseman from a fat-fingered Roku remote. The retrieval team wants a fingerprint that collapses all keyboard mangles of the same title into a single bucket, so the recommender can train on aggregate clicks rather than 50 noisy slices. The first version uses an anagram fingerprint as a cheap pre-clustering step: same letters in any order go into the same bucket. The deeper edit-distance pass runs only across buckets.
That is group-anagrams. Given a list of strings, bucket every set of mutual anagrams together. The pattern is the canonical "group by equivalence" task: define an equivalence relation, find a canonical representative for each class, hash by the representative, drop everything else into a list.
What's actually being asked
Given n strings (lowercase or empty), output the anagram groups. Within each group sort the words lexicographically; sort the groups by their first word lexicographically; print one group per line, words separated by single spaces. The sort rules make the output deterministic for grading. The underlying algorithm does not care about the order; the I/O contract does.
The naive approach
For each pair (i, j), run is_anagram(words[i], words[j]). Union them in a disjoint-set if they match. That is O(n^2) anagram checks, each O(k) where k is word length, so O(n^2 k) total. At n = 10^4 and k = 100, that is 10^10 character comparisons, far too slow.
A second naive variant: keep a list of "anagram representatives" seen so far, and for each new word run an anagram check against every representative. Cuts the work but stays O(n^2) in the worst case (every word in its own group).
The key insight
Two strings are anagrams iff they share the same canonical key. Pick any function f such that f(x) = f(y) exactly when x and y are anagrams. Then bucketing is one pass: hash each word by f(word) into a dict[str, list[str]]. After the pass, the dict values are your groups.
Two natural canonical keys:
- Sorted character string:
''.join(sorted(word)). Trivial to write. O(k log k) per word. - 26-tuple of counts:
tuple(counts)wherecounts[i] = (number of 'a' + i). O(k) per word. Slightly faster, slightly more code.
For lowercase-letter inputs with k <= 100, both are dominated by the I/O. Pick whichever is clearer.
Step-by-step approach
- Initialise
groups = defaultdict(list). - For each word
win input order:- Compute
key = ''.join(sorted(w)). - Append
wtogroups[key].
- Compute
- For each bucket, sort its words ascending.
- Sort the list of buckets by
bucket[0]ascending. - Print each bucket as space-joined words on its own line.
from collections import defaultdict
def group_anagrams(words):
groups = defaultdict(list)
for w in words:
key = ''.join(sorted(w))
groups[key].append(w)
sorted_groups = [sorted(g) for g in groups.values()]
sorted_groups.sort(key=lambda g: g[0])
return sorted_groups
Worked example
words = ["eat", "tea", "tan", "ate", "nat", "bat"]
Per-word canonical keys:
| word | sorted key |
|---|---|
| eat | aet |
| tea | aet |
| tan | ant |
| ate | aet |
| nat | ant |
| bat | abt |
Buckets after the pass:
| key | words |
|---|---|
| aet | [eat, tea, ate] |
| ant | [tan, nat] |
| abt | [bat] |
Sort within each bucket: [ate, eat, tea], [nat, tan], [bat]. Sort buckets by first word: [ate, eat, tea] (starts with ate), [bat] (starts with bat), [nat, tan] (starts with nat).
Output:
ate eat tea
bat
nat tan
Complexity
- Time: O(n * k log k) for the sort-based key, O(n * k) for the count-based key. The final sort of
nfirst words is O(n log n * k) for comparisons, dominated by the keying whenk log k >= log n. - Space: O(n * k) to hold the buckets.
Brute force was O(n^2 * k). The canonical-key trick collapses an n^2 join into an n hash-bucketing. It is one of the highest leverage one-line changes you can make.
Common pitfalls
- Mutating the input word by sorting it in place: in Python this is impossible (strings are immutable), in C++ or Java it is a footgun. Always operate on a copy.
- Forgetting that anagram groups can contain duplicates: if the input has
"ab"twice, both go in the same bucket and both appear in the output. The spec treats input as a multiset. - The empty word is a valid input: its key is the empty string. The bucket
[""]prints as a blank line. Skipping it because it looks weird is a wrong answer. - Stable but wrong group ordering: many candidates print groups in the order their keys were first inserted. That is non-deterministic across language hash seedings and fails the judge. Always sort.
- Using count-based key but forgetting the tuple/hashability conversion:
dictkeys must be hashable, sotuple(counts)works,list(counts)does not. - Calling
sorted(w)returns a list, not a string: you must''.join(...)to get a usable string key.
Where this shows up in the enterprise
- Netflix / Disney+ title fingerprinting: collapse keyboard-mangled queries into a single bucket before the recommender model.
- Spotify track normalisation:
"hello"and"Hello"and"HELLO"are the same song; canonical lowercase + character-count fingerprint is the first pass. - Shopify product catalog dedup: a merchant uploads
Red T-Shirt,red tshirt,T-shirt Red. Anagram-style multi-word reshuffling lands them all in one cluster for the merchant to review. - Twitter spam clustering: spammers reorder words to evade exact-string filters. Group-by-fingerprint catches the family.
- Salesforce duplicate-account detection: company names like
Acme Inc,Inc Acme,acme, inc.cluster by multiset fingerprint. - DocuSign signature normalisation: bucket
John Smith,Smith, John,JOHN SMITHto the same identity before the OCR similarity step. - Google Search semantic clustering: at much larger scale and higher dimensionality, the architecture is the same: hash by a canonical key, bucket together.
- Coca-Cola SAP item master cleanup: the supply-chain team uses canonical-key bucketing to merge near-duplicate SKU descriptions across regional warehouses.
Variants you'll see elsewhere
- valid-anagram: the pairwise version, returns boolean.
group-shifted-strings: same shape; the canonical key is the sequence of character-to-character offsets.partition-array-by-pattern: bucket integers by some property; samedefaultdict(list)idiom.isomorphic-strings: equivalence is character-mapping rather than character-multiset; canonical key is the "pattern" string.unique-email-addresses: canonicalise emails (strip dots, ignore plus-suffix) and count distinct keys.equivalent-strings: any "are these the same under operation X" problem reduces to "find a canonical X-form and hash by it".
How parikshan turns this into a learning loop
Read this article, click into practice. The editor runs your code against public tests plus per-session dynamic private tests that are regenerated every attempt so memorising a gist fails. Hints are graduated: free first hint, small penalty for the second, larger for the algorithmic skeleton. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, Optimise, each with a small integrity cost so you spend AI deliberately. In practice mode AI chat is free, the place to ask "why does the count-tuple key need to be a tuple, not a list?". After the verdict, the dispute flow sends a contested result to a human reviewer in one click. Practice mode first, exam mode after, no AI on the final pass. When you can write the four-line solution in two minutes from a blank screen, the pattern has internalised.
In the AI-integrated workspace
An AI agent will produce a perfectly correct defaultdict(list) solution most of the time. Your job in code review is:
- Check the canonical key is correct. If the spec is "anagrams including casing",
sorted(w.lower())is the right key, notsorted(w). Agents miss spec nuances. - Verify the output ordering matches the grader. Generated code often returns
list(groups.values())in insertion order, which is fine in Python 3.7+ but undefined in Go or Java. The deterministic-output requirement here forces an explicit sort that the agent will often forget. - Catch the silent O(k^2) substring of the algorithm when the agent invents a key like
' '.join(c for c in sorted(set(w)) for _ in range(w.count(c))). That's accidentally O(k^2). Steer back tosorted(w)or count tuples. - For Unicode or accented text, the agent will assume ASCII. Add the
unicodedata.normalize('NFKC', w)step before keying.
A senior engineer reads the spec and says in five seconds: "group-by-canonical-key, sort within buckets, sort buckets by first word, watch out for the empty-string bucket". The candidate who can do that directs AI agents profitably. The candidate who cannot ships code that "works on the sample" and fails on the empty-word test. parikshan is built to train the first kind.