edit-distance
Difficulty: Hard · Topic: 2-D Dynamic Programming · Practice it on parikshan: /practice/edit-distance/start
The story
A customer types "recieve" into the Google search box. The autocomplete service has 100 milliseconds to suggest "receive" before the customer presses Enter. How does it pick that correction out of a dictionary of half a million words? It computes the edit distance between the typed query and each candidate, returns the one with the smallest distance, and ranks ties by frequency. The same algorithm corrects misspelled SKUs at Stripe's merchant dashboard, fuzzy-matches addresses at Salesforce, aligns DNA reads at Illumina, and powers the "did you mean ...?" line at the bottom of every modern search bar.
Edit distance, also called Levenshtein distance, is the boss of the 2-D DP cluster. The state space is the same rectangle you saw in unique-paths and longest-common-subsequence, but on a mismatch you have three transitions instead of two: insert, delete, replace. Every interesting string-comparison problem either is edit distance or reduces to a tiny variation of it.
What's actually being asked
Given two strings s and t, return the minimum number of single-character operations required to transform s into t. The three allowed operations are:
- Insert a character into
sat any position. - Delete a character from
sat any position. - Replace a character in
swith any other character.
Each operation costs 1. The minimum total cost is the edit distance.
Examples:
s = "horse", t = "ros"→ 3. One sequence: replacehwithr(horse → rorse), deleter(rorse → rose), deletee(rose → ros).s = "intention", t = "execution"→ 5.s = "abc", t = "abc"→ 0. Nothing to do.s = "", t = "abc"→ 3. Three insertions.s = "abc", t = ""→ 3. Three deletions.
The naive approach
Recurse on the last characters of each string.
def dist(i, j, s, t):
if i == 0: return j # insert j chars
if j == 0: return i # delete i chars
if s[i-1] == t[j-1]:
return dist(i-1, j-1, s, t)
return 1 + min(
dist(i-1, j-1, s, t), # replace
dist(i-1, j, s, t), # delete from s
dist(i, j-1, s, t), # insert into s
)
Correct, exponential. The recursion tree branches three ways at every mismatch. For two 12-character strings this already calls the function around 5 x 10^5 times, and most of those calls solve the same subproblem repeatedly. For two 100-character strings, you would wait years.
The key insight
The state is a pair of prefix lengths. Once you fix (i, j), the minimum cost to convert s[:i] into t[:j] depends only on (i, j). There are at most (n+1)(m+1) such pairs. Solve each one once.
The four DP questions:
- State:
dp[i][j]= minimum edits to transforms[:i]intot[:j]. - Base case:
dp[0][j] = j(insertjcharacters into the empty string).dp[i][0] = i(deleteicharacters froms[:i]). - Transition:
- If
s[i-1] == t[j-1]:dp[i][j] = dp[i-1][j-1]. No cost; the last characters already line up. - Else:
dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1]).
- If
- Answer:
dp[n][m].
The three transitions, in plain English
These three are the heart of the problem. Internalise each one as an operation on s:
- Replace,
dp[i-1][j-1] + 1. Turn the last character ofs[:i]into the last character oft[:j]. Both prefixes shrink by one. - Delete,
dp[i-1][j] + 1. Drop the last character ofs[:i]. Thesprefix shrinks; thetprefix is unchanged. - Insert,
dp[i][j-1] + 1. Append the last character oft[:j]to the end ofs[:i]. Thesprefix is unchanged; thetprefix shrinks because that target character has now been placed.
When the last characters already match, no operation is needed at that position; that is the "free" diagonal. This is the entire reason the recurrence has a separate match branch.
This is what the topic primer means when it calls edit distance the boss of the cluster. You inherit the rectangular state space from unique-paths, you inherit the match-or-mismatch shape from longest-common-subsequence, and you add a third transition on top.
Step-by-step approach
Iterative DP. Build a (n+1) x (m+1) table.
def edit_distance(s, t):
n, m = len(s), len(t)
dp = [[0] * (m + 1) for _ in range(n + 1)]
for i in range(n + 1): dp[i][0] = i
for j in range(m + 1): dp[0][j] = j
for i in range(1, n + 1):
for j in range(1, m + 1):
if s[i-1] == t[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
return dp[n][m]
Each row only needs the previous row and the current row's cell to the left, so collapse the table to two rolling rows of length m + 1:
def edit_distance(s, t):
n, m = len(s), len(t)
if n == 0: return m
if m == 0: return n
prev = list(range(m + 1))
curr = [0] * (m + 1)
for i in range(1, n + 1):
curr[0] = i
for j in range(1, m + 1):
if s[i-1] == t[j-1]:
curr[j] = prev[j-1]
else:
curr[j] = 1 + min(prev[j-1], prev[j], curr[j-1])
prev, curr = curr, prev
return prev[m]
O(n m) time, O(m) space.
Worked example
s = "horse", t = "ros". Fill the 6 by 4 table.
"" r o s
"" [ 0 1 2 3 ]
h [ 1 1 2 3 ]
ho [ 2 2 1 2 ]
hor[ 3 2 2 2 ]
hors[ 4 3 3 2 ]
horse[5 4 4 3 ]
Read the corners:
dp[1][1]:h != r. Min of (dp[0][0]=0replace,dp[0][1]=1delete,dp[1][0]=1insert) plus 1 = 1.dp[2][2]:o == o. Copy diagonal:dp[1][1] = 1.dp[3][1]:r == r. Copy diagonal:dp[2][0] = 2.dp[5][3]:e != s. Min of (dp[4][2]=3,dp[4][3]=2,dp[5][2]=4) plus 1 = 3.
Return dp[5][3] = 3. ✓
Trace one optimal edit sequence by walking back from (5, 3):
(5, 3) = 3came fromdp[4][3] + 1(deletee). Step to(4, 3).(4, 3) = 2came from match (s == s). Step to(3, 2).(3, 2) = 2came fromdp[2][2] + 1(deleter). Step to(2, 2).(2, 2) = 1came from match (o == o). Step to(1, 1).(1, 1) = 1came fromdp[0][0] + 1(replacehwithr). Done.
Sequence: replace h → r, keep o, delete r, keep s, delete e. Three edits, matching the example.
Complexity
- Time: O(n m). Every cell is filled once.
- Space: O(n m) for the full table, O(min(n, m)) with rolling rows.
For the constraint |s|, |t| ≤ 1000, that is 10^6 cells. In C this finishes in milliseconds; in Python the inner loop is the bottleneck. The reference solution encodes both strings to bytes for fast byte-integer equality and unrolls the min of three values into manual if comparisons, both of which materially shave time inside the 10^6-iteration loop.
Common pitfalls
- Off-by-one in the base case.
dp[0][j]must equalj, not0. Forgetting the base row produces correct-looking output ons == tand wrong output everywhere else. - Wrong direction for insert vs delete. The recurrence is symmetric (
edit(s, t) == edit(t, s)), but the operations onsare not.dp[i-1][j]is delete-from-s,dp[i][j-1]is insert-into-s. If you trace back an edit script, get this right or your operations will be backwards. - Naive recursion without memo. Exponential in the input size. Always add
@lru_cacheor write the iterative form. - Confusing edit distance with Hamming distance. Hamming distance is defined only for equal-length strings and counts position-wise mismatches. Edit distance handles unequal lengths and allows insertions and deletions.
- Forgetting that ties exist. When two of the three transitions yield the same minimum, both correspond to valid edit scripts. Don't assume a unique optimal sequence.
Where this shows up in the enterprise
- Google search spell-correction ("Did you mean ...?"): edit distance against the dictionary, ranked by frequency.
- Microsoft Word and Office 365 spell-check: same idea at desktop scale.
- Stripe and Adyen merchant SKU matching: when a merchant types a typo'd SKU into the dashboard, edit distance proposes the nearest valid one.
- Salesforce customer-record deduplication: matching
"John Smith, Acme Inc"against"Jon Smith, Acme"across CRM imports. - Illumina and Oxford Nanopore DNA aligners: pairwise sequence alignment is edit distance with biology-tuned costs (transitions cheaper than transversions, gap-opening cost separate from gap-extension).
- GitHub Copilot and Claude Code diff narration: explaining what changed line by line uses an LCS or edit-distance pass under the hood.
- Linux
diff,git diff,rdiff-backup: every Unix-derived diff engine in the world. - Amazon Alexa and Apple Siri intent matching: aligning recognised speech against the intent vocabulary.
- PostgreSQL extensions (
pg_trgmfor trigram similarity, related fuzzy-match extensions): production-grade fuzzy search at database level.
The pattern is universal: any time you need a "minimum operations to align these two sequences" answer, you want edit distance or a small variation on it.
Variants you'll see elsewhere
one-edit-distance: return whether the distance is exactly 1. Same recurrence, early-exit.delete-operation-for-two-strings: only delete operations allowed. Equal ton + m - 2 * LCS(s, t).minimum-ascii-delete-sum: weight each deletion by the character's ASCII value. Same shape, different costs.- Wagner-Fischer with custom costs: independent costs for insert, delete, replace. Bioinformatics aligners do exactly this.
damerau-levenshtein: also allow swapping adjacent characters. Add one more transition:dp[i-2][j-2] + 1whens[i-1] = t[j-2]ands[i-2] = t[j-1]. The autocomplete "teh" → "the" case.regular-expression-matching/wildcard-matching: same(i, j)table, transitions encode the pattern's*and?semantics.shortest-common-supersequence: produce the shortest string that contains bothsandtas subsequences. Length isn + m - LCS, construction walks the same DP table.
Once you internalise the three transitions, the entire string-alignment family becomes one technique with different weightings.
How parikshan turns this into a learning loop
Edit distance is hard, and the integrated loop is built to compensate for that.
- Read this article twice. Once for the recurrence, once for the three transitions in English. The English description matters more than the code.
- Solve in practice mode with the AI chat off. Before writing code, sketch the 6 by 4 table on paper for the
"horse" → "ros"example. Fill it in by hand. Trace back one optimal sequence. You should not start typing until you have done this. - Submit. The bank's dynamic private tests include both-empty, single-character-mismatch, the classic
"kitten" → "sitting"(distance 3), and a 1000 by 1000 stress case. If you missed the base case, the empty-string tests fail instantly. - AI training: Optimise the solution. The assistant should produce the two-row rolling version. In exam mode each request costs a slice of your integrity score. Spend deliberately; for a hard problem the cost is worth paying once you've made an honest attempt.
- AI training: Find Bugs on a deliberately broken version. Comment out the match-branch and submit. Watch the assistant identify that you are paying for a replace operation on every diagonal step.
The dispute flow has a meaningful role on edit distance. If a 1000 by 1000 stress test TLEs in your Python solution and you believe the reference solution does too, dispute it; a human reviewer can re-examine the time limit. Edit distance is the cluster's hardest problem, and the platform expects scrutiny on the harness here.
In the AI-integrated workspace
Edit distance is the problem where the AI agent's output most often looks right and is subtly wrong. The classic failure modes:
- Wrong transition assignment. The agent maps delete-from-
stodp[i][j-1]instead ofdp[i-1][j]. The answer comes out right (the recurrence is symmetric in(i, j)) but the trace-back produces an incorrect edit script. If you ever ask the agent to "print the operations", this bug surfaces. - Wrong base case direction. The agent writes
dp[0][j] = 0anddp[i][0] = 0. Output is0on every input where one string is empty. Caught by the public("", "abc")test only if the agent ran it. - Charging for the match case. The agent always adds
1and takes theminof four options including the diagonal, even on a match. Wrong by a factor of 2 or so on aligned inputs. - Recursive solution without memo. Looks elegant, times out on anything above 12 characters.
The 2027 engineer's discipline for edit distance:
- State the recurrence in English before the agent codes. Read out the three transitions slowly: replace = diagonal + 1, delete = up + 1, insert = left + 1, free match = diagonal copy.
- Tell the agent to implement that recurrence iteratively with the rolling-row optimisation if memory is a concern.
- Verify on
("horse", "ros")by hand. Fill the 6 by 4 table mentally; spot-check three cells. - Ask the agent to trace back one optimal sequence. This catches transition-direction bugs that pure length-tests miss.
- Stress on edge cases. Both empty, one empty, equal strings, reversed strings, identical-length-zero-overlap strings.
That five-step audit turns a probabilistic generation into a verified solution. parikshan's integrated loop is designed to drill this exact habit, because in your future job at Google's spell-check team or Illumina's variant-caller team, you will direct the AI agent rather than approve its first guess. Edit distance is the dojo where you learn that discipline.