word-ladder

Difficulty: Hard · Topic: Graphs · Practice it on parikshan: /practice/word-ladder/start

The story

A language-learning startup builds a "step-by-step" feature that turns one word into another, one letter at a time, while always passing through real dictionary words. The marketing team thinks it is a cute warm-up exercise. The engineering team realises within a week that it is the single most-asked-about feature, because the user is delighted by a clean five-step transformation and bored by an eight-step one. The product owner wants the minimum ladder length, computed on a dictionary of up to 5,000 words, in well under a second.

This is word-ladder. The same algorithm runs spell-correction suggestion ranking, DNA mutation distance computation, anagram-distance pathfinding in word-game engines, route-finding over implicit state graphs in puzzle solvers, and graph-based recommendation refinement at Spotify and Netflix when the "graph" is implicit and built lazily. The keyword is implicit graph: the edges are not stored, they are generated on demand.

This is the only Hard problem in the graphs cluster. The reason it is hard is not the algorithm (it is BFS) but the engineering: a naive BFS times out, and the two optimisations that save it are conceptually the most interesting of the cluster.

What's actually being asked

You receive beginWord, endWord, and a dictionary of M same-length lowercase words. A "transformation sequence" is a chain s0, s1, ..., sk where s0 = beginWord, sk = endWord, every adjacent pair differs in exactly one letter, and every s1, ..., sk is in the dictionary. Return the number of words in the shortest sequence (k + 1), or 0 if none exists.

The naive approach

Build an explicit graph: for every pair of words, check whether they differ in exactly one letter. If so, add an edge. Then BFS from beginWord to endWord.

The adjacency build is O(M^2 * L): every pair of M words compared character by character. For M = 5000 and L = 10, that is 2.5 * 10^8 character comparisons just to build the graph. In Python, that is multi-second. The grader runs out of patience.

There is no way to speed up the BFS itself: it is already O(V + E). The bottleneck is the neighbour function.

The key insight

Two words are neighbours iff they differ in exactly one position. Equivalently, they share an L-1-character "pattern" with one position replaced by a wildcard *. Group words by patterns: for each word, generate its L wildcard patterns and put the word into each pattern's bucket. Two words are neighbours iff they sit in at least one shared bucket.

Building the bucket map is O(M * L^2): M words, L patterns per word, L-length string per pattern. For M = 5000, L = 10, that is 5 * 10^5 operations. Negligible.

The BFS now expands a word's neighbours by iterating its L patterns and walking the bucket for each pattern: O(L^2 + L * (bucket size)) per word. Across the whole BFS, total work is O(M * L^2) and the constant factor is small enough that Python comfortably finishes in 100ms on the largest input.

That single optimisation is enough to pass the grader on every test. But there is a second optimisation that drops the wall-clock by another factor of 2 to 10 and that is interview gold.

The key insight (2 of 2): bidirectional BFS

Single-end BFS expands a frontier from beginWord until it touches endWord. The branching factor b is roughly the average number of one-edit neighbours per word. The work is O(b^d) where d is the answer (the shortest ladder length). For b = 10 and d = 8, that is 10^8. Tight even with bucket lookups.

Bidirectional BFS runs two BFS frontiers, one from each end, expanding the smaller of the two at each step, stopping as soon as the two frontiers intersect. The work is O(b^(d/2) + b^(d/2)) = O(b^(d/2)). For b = 10 and d = 8, that is 2 * 10^4. Four orders of magnitude cheaper than single-end BFS.

The geometric intuition: a ball of radius d/2 from each end has total volume 2 * b^(d/2), which is much less than a single ball of radius d (volume b^d). Two small balls beat one big ball.

The implementation discipline: at each step, expand the smaller frontier. This is the half-life of the optimisation. Expanding the larger frontier doubles its size while the other waits; expanding the smaller keeps both frontiers roughly the same size and minimises wasted work.

Step-by-step approach

Single-end BFS with patterns

Read begin, end, dictionary list words.
If end not in set(words): return 0.
If begin == end: return 1.
Build pat -> [words] map: for each word w in words, for each i in 0..L-1, append w to bucket[w[:i] + '*' + w[i+1:]].
BFS: deque of (word, steps), visited set starting with begin. Pop, generate patterns, walk each pattern's bucket; if end in bucket, return steps + 1; otherwise for each unvisited word in the bucket, mark and enqueue with steps + 1. After expansion, clear the pattern's bucket (so it is not re-walked).
Return 0 if the queue empties.

Bidirectional BFS

Same pre-checks and bucket build.
Maintain two sets: front = {begin} and back = {end}. Maintain visited with both.
steps = 1.
While both front and back are non-empty:
- If len(front) > len(back): swap.
- Build next_front. For each w in front, for each pattern of w, for each neighbour in the bucket:
  - If neighbour in back: return steps + 1.
  - If neighbour not in visited: add to visited and next_front.
- front = next_front. steps += 1.
Return 0.

Worked example

begin = hit, end = cog
dict = [hot, dot, dog, lot, log, cog]

Buckets (excerpt): *ot -> [hot, dot, lot], h*t -> [hot], ho* -> [hot], *og -> [dog, log, cog], d*t -> [dot], do* -> [dot, dog], l*t -> [lot], lo* -> [lot, log], c*g -> [cog], co* -> [cog], ...

Single-end BFS:

(hit, 1) -> patterns *it, h*t, hi*. Bucket h*t yields hot. Enqueue (hot, 2). No others.
(hot, 2) -> patterns *ot, h*t, ho*. *ot yields dot, lot. Enqueue both with 3. h*t is empty after the previous step. ho* is empty.
(dot, 3) -> *ot empty. d*t yields nothing new (dot itself only). do* yields dog. Enqueue (dog, 4).
(lot, 3) -> lo* yields log. Enqueue (log, 4).
(dog, 4) -> *og yields log, cog. cog == end. Return 5.

Bidirectional BFS converges faster:

front = {hit}, back = {cog}. steps = 1.
Smaller is either; expand front. Neighbours {hot}. next_front = {hot}. steps = 2.
Now back = {cog} smaller. Expand back. Patterns of cog: *og -> dog, log; c*g -> none new; co* -> none new. next_back = {dog, log}. steps = 3.
Now both are size 2. Expand front = {hot}. Patterns: *ot -> dot, lot. Neither in back. next_front = {dot, lot}. steps = 4.
Expand back = {dog, log}. From dog: *og -> log, cog (cog already in back). d*g -> none new. do* -> dot. dot is in front. Return steps + 1 = 5.

Same answer, fewer cells touched. On larger graphs the gap is dramatic.

Complexity

Bucket build: O(M * L^2).
Single-end BFS: O(M * L^2 + M * L * (avg bucket size)). In the worst case O(M * L * average_branching), comfortable for M = 5000, L = 10.
Bidirectional BFS: same theoretical worst case, much smaller in practice.
Memory: O(M * L^2) for the bucket map plus O(M) for the visited set.

Common pitfalls

Brute-force adjacency build: O(M^2 * L) is the timeout. The wildcard-bucket transform is the saver.
Not pre-checking endWord membership: with end not in the dictionary, the BFS exhausts the entire reachable set before returning 0. The single-line if end not in dict: return 0 is a 100x speedup on adversarial inputs.
Forgetting beginWord == endWord: the spec says "number of words in the sequence", and the shortest sequence is [begin] with one word.
Off-by-one on the return: the answer is the number of words in the path, which equals edges + 1. BFS depth counts edges; you return depth + 1.
Failing to clear visited buckets in single-end BFS: each pattern bucket is walked once, then no longer useful; clearing it avoids re-walking and keeps the constant factor low.
Bidirectional BFS without "always expand the smaller frontier": degrades to single-end BFS in the worst case and may even be slower than single-end because of the double bookkeeping.
Treating beginWord as a member of the dictionary by default: the spec says "begin may or may not be in the dictionary". The BFS starts at begin regardless.

Where this shows up in the enterprise

Google search "Did you mean": an edit-distance graph over the corpus is the implicit graph here. BFS depth = edit distance.
Spotify recommendations: walking the song-similarity graph from a seed track to a target track and reporting the shortest connecting chain is the same algorithm on a vastly larger implicit graph.
Slack channel membership (analogy): "shortest chain of channels through which user A and user Z are connected" runs the same way.
Twitter/X feed graph: "fewest retweets connecting these two accounts" is BFS over the follow graph; bidirectional BFS is the production trick.
Google Maps routing (when edge weights are uniform): "fewest turns" rather than "shortest distance" reduces to BFS. Bidirectional BFS is used at intersections and inside the rendering shortcut.
DNA mutation distance: minimum number of point mutations between two genes; the graph is implicit (all single-letter substitutions), and the algorithm is bidirectional BFS over the bucket structure.
Compiler optimisation passes: the smallest sequence of program-rewrite rules that takes one expression to another is a BFS over an implicit graph of rewrites.
Bazel/Gradle "what is the smallest set of changes needed to flip my build from state A to state B": bidirectional BFS over a state graph.
Network routing in mesh networks (Cloudflare Workers): "fewest hops" routing is BFS; bidirectional BFS is the optimisation used in large planet-scale meshes.

The pattern is implicit graph + shortest path + bidirectional BFS for scale.

Variants you'll see elsewhere

word-ladder-ii: return all shortest transformation sequences. The BFS is the same; you also need to reconstruct paths, which adds a layer of bookkeeping.
minimum-genetic-mutation: same algorithm, fixed 8-character DNA strings.
open-the-lock: same algorithm on 4-digit lock combinations; the implicit graph has 10,000 nodes.
bus-routes: BFS over a graph where the nodes are stops and the edges are co-membership in a route.
shortest-path-in-binary-matrix: BFS over a 2-D grid with 8-connectivity.

The unifying pattern: the graph is too big or expensive to materialise, so generate neighbours on the fly with a function. Bidirectional BFS halves the search depth wherever the graph is large.

How parikshan turns this into a learning loop

This is the cluster's flagship Hard problem. The recommended flow takes a week, not a day:

Day 1, practice mode: brute-force adjacency + BFS. Submit. Watch it time out on the stress test. Internalise the bottleneck.
AI training: Explain the wildcard-bucket pattern. The agent should articulate "two words are one-edit neighbours iff they share an L-1-character pattern".
Day 2, practice mode: pattern-bucket BFS. Submit. It passes. Note the wall-clock.
AI training: Explain bidirectional BFS. The agent should articulate the b^d vs 2 * b^(d/2) argument with numbers.
Day 3, practice mode: bidirectional BFS. Submit. It passes, and the wall-clock is 3-10x smaller than single-end. The number is the reward for the optimisation.
Exam mode: redo from blank under 30 minutes. The integrity score on this problem is the cluster's gold standard.

The reason this problem is in the cluster is to force the lesson that the algorithm is not the bottleneck; the neighbour function is. That lesson recurs in every implicit-graph problem in your career.

In the AI-integrated workspace

When you direct an agent to write word-ladder, the failure modes range from "fine but slow" to "subtly wrong":

The agent builds explicit adjacency with O(M^2 * L): passes the public tests, times out on the stress test. Demand the wildcard-bucket optimisation by name.
The agent's bucket build is O(M * L) instead of O(M * L^2): it tries to deduplicate cleverly and ends up missing some neighbours. The correct build is straightforward: for each word, for each position, generate the pattern.
The agent's bidirectional BFS does not swap to the smaller frontier: degrades to single-end. Eyeball the inner loop.
The agent forgets that endWord may not be in the dictionary: it BFS-exhausts the reachable set and returns 0. Correct but slow. The pre-check is one line.
The agent's BFS depth counts edges and returns depth, not depth + 1: off by one on every input.
The agent treats beginWord as a dictionary member or treats it as having to be in the dictionary: the spec is explicit; the agent's interpretation is often wrong.

The audit checklist for this code, in 30 seconds:

Is the neighbour function O(L * bucket) per word, not O(M * L)?
Is bidirectional BFS swapping to the smaller frontier?
Are the endWord not in dict and begin == end short-circuits at the top?
Does the return value count words, not edges?

A reviewer who can run that audit in one breath is the reviewer who ships scalable implicit-graph search. This problem is hard not because the algorithm is hard but because the engineering taste required to choose the right optimisation is. parikshan's drill is to compress that taste into a 30-second reflex. When you can read a BFS over an implicit graph and immediately ask "where is the bucket and is it bidirectional?", you have arrived.