parikshan

group-anagrams

Difficulty: Medium  ·  Topic: Arrays & Hashing  ·  Practice it on parikshan: /practice/group-anagrams/start

The story

A data engineer at Netflix is cleaning the show-search index. Users type titles every which way: BoJack Horseman, bojack horseman, BOJACK HORSEMAN, and sometimes BoaJck Horseman from a fat-fingered Roku remote. The retrieval team wants a fingerprint that collapses all keyboard mangles of the same title into a single bucket, so the recommender can train on aggregate clicks rather than 50 noisy slices. The first version uses an anagram fingerprint as a cheap pre-clustering step: same letters in any order go into the same bucket. The deeper edit-distance pass runs only across buckets.

That is group-anagrams. Given a list of strings, bucket every set of mutual anagrams together. The pattern is the canonical "group by equivalence" task: define an equivalence relation, find a canonical representative for each class, hash by the representative, drop everything else into a list.

What's actually being asked

Given n strings (lowercase or empty), output the anagram groups. Within each group sort the words lexicographically; sort the groups by their first word lexicographically; print one group per line, words separated by single spaces. The sort rules make the output deterministic for grading. The underlying algorithm does not care about the order; the I/O contract does.

The naive approach

For each pair (i, j), run is_anagram(words[i], words[j]). Union them in a disjoint-set if they match. That is O(n^2) anagram checks, each O(k) where k is word length, so O(n^2 k) total. At n = 10^4 and k = 100, that is 10^10 character comparisons, far too slow.

A second naive variant: keep a list of "anagram representatives" seen so far, and for each new word run an anagram check against every representative. Cuts the work but stays O(n^2) in the worst case (every word in its own group).

The key insight

Two strings are anagrams iff they share the same canonical key. Pick any function f such that f(x) = f(y) exactly when x and y are anagrams. Then bucketing is one pass: hash each word by f(word) into a dict[str, list[str]]. After the pass, the dict values are your groups.

Two natural canonical keys:

  • Sorted character string: ''.join(sorted(word)). Trivial to write. O(k log k) per word.
  • 26-tuple of counts: tuple(counts) where counts[i] = (number of 'a' + i). O(k) per word. Slightly faster, slightly more code.

For lowercase-letter inputs with k <= 100, both are dominated by the I/O. Pick whichever is clearer.

Step-by-step approach

  1. Initialise groups = defaultdict(list).
  2. For each word w in input order:
    • Compute key = ''.join(sorted(w)).
    • Append w to groups[key].
  3. For each bucket, sort its words ascending.
  4. Sort the list of buckets by bucket[0] ascending.
  5. Print each bucket as space-joined words on its own line.
from collections import defaultdict

def group_anagrams(words):
    groups = defaultdict(list)
    for w in words:
        key = ''.join(sorted(w))
        groups[key].append(w)
    sorted_groups = [sorted(g) for g in groups.values()]
    sorted_groups.sort(key=lambda g: g[0])
    return sorted_groups

Worked example

words = ["eat", "tea", "tan", "ate", "nat", "bat"]

Per-word canonical keys:

wordsorted key
eataet
teaaet
tanant
ateaet
natant
batabt

Buckets after the pass:

keywords
aet[eat, tea, ate]
ant[tan, nat]
abt[bat]

Sort within each bucket: [ate, eat, tea], [nat, tan], [bat]. Sort buckets by first word: [ate, eat, tea] (starts with ate), [bat] (starts with bat), [nat, tan] (starts with nat).

Output:

ate eat tea
bat
nat tan

Complexity

  • Time: O(n * k log k) for the sort-based key, O(n * k) for the count-based key. The final sort of n first words is O(n log n * k) for comparisons, dominated by the keying when k log k >= log n.
  • Space: O(n * k) to hold the buckets.

Brute force was O(n^2 * k). The canonical-key trick collapses an n^2 join into an n hash-bucketing. It is one of the highest leverage one-line changes you can make.

Common pitfalls

  • Mutating the input word by sorting it in place: in Python this is impossible (strings are immutable), in C++ or Java it is a footgun. Always operate on a copy.
  • Forgetting that anagram groups can contain duplicates: if the input has "ab" twice, both go in the same bucket and both appear in the output. The spec treats input as a multiset.
  • The empty word is a valid input: its key is the empty string. The bucket [""] prints as a blank line. Skipping it because it looks weird is a wrong answer.
  • Stable but wrong group ordering: many candidates print groups in the order their keys were first inserted. That is non-deterministic across language hash seedings and fails the judge. Always sort.
  • Using count-based key but forgetting the tuple/hashability conversion: dict keys must be hashable, so tuple(counts) works, list(counts) does not.
  • Calling sorted(w) returns a list, not a string: you must ''.join(...) to get a usable string key.

Where this shows up in the enterprise

  • Netflix / Disney+ title fingerprinting: collapse keyboard-mangled queries into a single bucket before the recommender model.
  • Spotify track normalisation: "hello" and "Hello" and "HELLO" are the same song; canonical lowercase + character-count fingerprint is the first pass.
  • Shopify product catalog dedup: a merchant uploads Red T-Shirt, red tshirt, T-shirt Red. Anagram-style multi-word reshuffling lands them all in one cluster for the merchant to review.
  • Twitter spam clustering: spammers reorder words to evade exact-string filters. Group-by-fingerprint catches the family.
  • Salesforce duplicate-account detection: company names like Acme Inc, Inc Acme, acme, inc. cluster by multiset fingerprint.
  • DocuSign signature normalisation: bucket John Smith, Smith, John, JOHN SMITH to the same identity before the OCR similarity step.
  • Google Search semantic clustering: at much larger scale and higher dimensionality, the architecture is the same: hash by a canonical key, bucket together.
  • Coca-Cola SAP item master cleanup: the supply-chain team uses canonical-key bucketing to merge near-duplicate SKU descriptions across regional warehouses.

Variants you'll see elsewhere

  • valid-anagram: the pairwise version, returns boolean.
  • group-shifted-strings: same shape; the canonical key is the sequence of character-to-character offsets.
  • partition-array-by-pattern: bucket integers by some property; same defaultdict(list) idiom.
  • isomorphic-strings: equivalence is character-mapping rather than character-multiset; canonical key is the "pattern" string.
  • unique-email-addresses: canonicalise emails (strip dots, ignore plus-suffix) and count distinct keys.
  • equivalent-strings: any "are these the same under operation X" problem reduces to "find a canonical X-form and hash by it".

How parikshan turns this into a learning loop

Read this article, click into practice. The editor runs your code against public tests plus per-session dynamic private tests that are regenerated every attempt so memorising a gist fails. Hints are graduated: free first hint, small penalty for the second, larger for the algorithmic skeleton. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, Optimise, each with a small integrity cost so you spend AI deliberately. In practice mode AI chat is free, the place to ask "why does the count-tuple key need to be a tuple, not a list?". After the verdict, the dispute flow sends a contested result to a human reviewer in one click. Practice mode first, exam mode after, no AI on the final pass. When you can write the four-line solution in two minutes from a blank screen, the pattern has internalised.

In the AI-integrated workspace

An AI agent will produce a perfectly correct defaultdict(list) solution most of the time. Your job in code review is:

  1. Check the canonical key is correct. If the spec is "anagrams including casing", sorted(w.lower()) is the right key, not sorted(w). Agents miss spec nuances.
  2. Verify the output ordering matches the grader. Generated code often returns list(groups.values()) in insertion order, which is fine in Python 3.7+ but undefined in Go or Java. The deterministic-output requirement here forces an explicit sort that the agent will often forget.
  3. Catch the silent O(k^2) substring of the algorithm when the agent invents a key like ' '.join(c for c in sorted(set(w)) for _ in range(w.count(c))). That's accidentally O(k^2). Steer back to sorted(w) or count tuples.
  4. For Unicode or accented text, the agent will assume ASCII. Add the unicodedata.normalize('NFKC', w) step before keying.

A senior engineer reads the spec and says in five seconds: "group-by-canonical-key, sort within buckets, sort buckets by first word, watch out for the empty-string bucket". The candidate who can do that directs AI agents profitably. The candidate who cannot ships code that "works on the sample" and fails on the empty-word test. parikshan is built to train the first kind.