parikshan

valid-anagram

Difficulty: Easy  ·  Topic: Arrays & Hashing  ·  Practice it on parikshan: /practice/valid-anagram/start

The story

A search-quality engineer at DuckDuckGo is shipping a "did you mean" feature for typos. A user types recieve; the search box should immediately offer receive. One of the cheapest, fastest signals the ranking pipeline uses is: are the two words anagrams of each other? Anagram pairs are very often typos (letter transpositions), so the pipeline scores them up before sending to the expensive Levenshtein-distance model. The check has a sub-millisecond budget because it runs on every keystroke for every user.

That is valid-anagram. Strip away the search ranking, and what you have is a tight, useful primitive: do two strings contain the same multiset of characters? It powers spell-check, palindromic-pair detection, plagiarism heuristics, and a dozen game mechanics.

What's actually being asked

Given two strings s and t, return true if t is an anagram of s, otherwise false. Two strings are anagrams when they contain exactly the same characters with the same multiplicities. Order does not matter; lengths must match (a stricter consequence of "same multiplicities"). Both strings are lowercase English letters or empty.

The naive approach

Sort both strings, compare them as strings. sorted("anagram") == sorted("nagaram") is True. That works, and at |s| = 5 * 10^4 it is fast enough. But it does O(n log n) work to answer a question that is fundamentally O(n). When valid-anagram is the inner loop of a larger algorithm (group anagrams across a million documents, for example), the log n factor matters.

A second naive attempt: for each character in s, find and remove its first occurrence in t; if at any point the character is absent, return false. That is O(n^2) due to repeated linear scans of t. Correct, but indefensibly slow.

The key insight

"Same multiset of characters" is a counting question. The right structure for counting is a hash map (or, since the alphabet is fixed at 26 lowercase letters, a 26-length array). Build the count for s, build the count for t, compare. If lengths differ you can return false immediately, but the count comparison handles that case naturally because two counters with different totals cannot be equal.

The slicker single-pass variant: walk both strings together, increment a single counter for each character in s and decrement for each in t. If every count is zero at the end, they are anagrams. Saves one pass and one map; same asymptotic cost.

Step-by-step approach

  1. If len(s) != len(t), return false. (Optional early reject.)
  2. Build count_s = Counter(s).
  3. Build count_t = Counter(t).
  4. Return count_s == count_t.
from collections import Counter

def is_anagram(s, t):
    return Counter(s) == Counter(t)

The fixed-alphabet version, often preferred in interviews:

def is_anagram(s, t):
    if len(s) != len(t):
        return False
    counts = [0] * 26
    for ch in s:
        counts[ord(ch) - ord('a')] += 1
    for ch in t:
        counts[ord(ch) - ord('a')] -= 1
    return all(c == 0 for c in counts)

Worked example

s = "anagram"
t = "nagaram"
changrm
s31111
t31111

Every count matches. Return true.

Contrast with s = "rat", t = "car":

chratc
s1110
t1101

t has count 0 and c has count 1. Counters differ. Return false.

Complexity

  • Time: O(n) where n = |s| (assuming |s| = |t| after the early reject).
  • Space: O(sigma) = O(26) = O(1) for a fixed alphabet. For Unicode, O(distinct characters).

Brute force was O(n^2). Sort-based is O(n log n). The counting approach is the asymptotically best you can do, because you must read each character at least once. Beyond that, the only gains are constant-factor: array vs hash map, one pass vs two.

Common pitfalls

  • Forgetting the length check when you implement the increment/decrement variant naively: s = "ab", t = "abc" could read three characters of t and crash or, worse, silently produce a wrong answer. Always check lengths first or iterate carefully.
  • Comparing sorted strings with == works, but be aware sorting allocates a new list each time. In a hot loop, build counts.
  • Unicode: real text has accents, emoji, combining characters. A 26-length array fails the moment you see é. Use a hash map or a normalised form (unicodedata.normalize) before counting.
  • Case sensitivity: "Tea" and "Eat" are anagrams to most humans, not to a naive byte counter. Lowercase first if the spec allows.
  • Whitespace: in real anagram puzzles ("dormitory", "dirty room"), strip spaces before counting. The problem here forbids that, but every interviewer follow-up will add it.
  • Mutating one string for the "remove and check" approach: in Python, strings are immutable, so you cannot. In other languages you can, and you will mutate caller-owned data.

Where this shows up in the enterprise

  • DuckDuckGo / Bing query rewriting: cheap pre-filter for typo candidates before running the expensive edit-distance model.
  • Grammarly suggestions: token-level anagram detection feeds the "did you mean to type X" heuristic.
  • Slack search: #general and #genaerl rank close in "did you mean" precisely because they are nearly-anagram pairs.
  • Salesforce data dedup: customer names with letters transposed (Smith vs Smtih) match as anagrams and get flagged for merge review.
  • Turnitin / Copyleaks plagiarism detection: sentence-level fingerprints often start with multiset-of-characters comparisons before deeper semantic checks.
  • AWS Comprehend custom entities: anagram fingerprints help detect deliberate obfuscation (e.g., users typing aplpe instead of apple to dodge filters).
  • Word-game backends (Spelling Bee at the New York Times, Words With Friends at Zynga): the entire game ranks player input against a precomputed anagram-bucket dictionary.
  • Bioinformatics at Illumina: k-mer counts in DNA sequences are anagrams of each other when permutations exist; counting is the workhorse.

Variants you'll see elsewhere

  • group-anagrams: bucket a list of strings into anagram groups. Same canonical key.
  • find-all-anagrams-in-a-string: sliding window of length |p| over s, anagram check at each step.
  • permutation-in-string: same sliding window, different framing.
  • minimum-window-substring: counting at a larger scale, with windows of variable size.
  • ransom-note: "can the characters of magazine form ransom?", a one-sided multiset containment check.
  • valid-sudoku: count occurrences across rows, columns, and boxes; reject any duplicate.

The skill ladder is: count, compare counts, slide a window over counts.

How parikshan turns this into a learning loop

Read this article, then click through to practice mode. The editor runs your code against public tests instantly and against per-session dynamic private tests generated fresh each attempt, so memorising a gist does not pass. Hints are graduated: hint 1 is free, hint 2 costs a few integrity points, hint 3 gives you the algorithmic skeleton at a larger cost. The AI training assistant in exam mode lets you Explain, Review, Find Bugs, Add Comments, or Optimise, each with a small integrity penalty, so you spend AI deliberately. In practice mode AI chat is free and unpenalised. After the verdict, if you believe the judge was wrong, one click sends the code to a human reviewer through the dispute flow. Attempt this in practice mode first, accept hints only when you need them, then redo in exam mode with no help. When you can solve it in exam mode under three minutes with no AI, the pattern is in your fingers.

In the AI-integrated workspace

An AI agent will, nine times out of ten, generate the sort-based solution. It is two lines and looks elegant. Your job is to:

  1. Read the surrounding context. If the function will be called millions of times (typo detection on every keystroke), the log n factor matters and you must steer the agent toward Counter or an array of counts.
  2. Check the agent's solution handles Unicode if the input is real text. Almost every generated answer assumes ASCII.
  3. Verify the early-return on length mismatch is present in production code paths, even though Counter would handle it. The cost difference on hot paths is real.
  4. Stress-test on the empty-string case and the all-same-character case. Agents skip these and they are the inputs that break.

The candidate who can say "this is valid-anagram with Unicode, use a normalised Counter and lowercase first" in five seconds is the one who ships the search box. The candidate who approves a sort-based answer without thinking ships a feature that costs Snowflake credits on every query. parikshan trains the first kind.