parikshan

edit-distance

Difficulty: Hard  ·  Topic: 2-D Dynamic Programming  ·  Practice it on parikshan: /practice/edit-distance/start

The story

A customer types "recieve" into the Google search box. The autocomplete service has 100 milliseconds to suggest "receive" before the customer presses Enter. How does it pick that correction out of a dictionary of half a million words? It computes the edit distance between the typed query and each candidate, returns the one with the smallest distance, and ranks ties by frequency. The same algorithm corrects misspelled SKUs at Stripe's merchant dashboard, fuzzy-matches addresses at Salesforce, aligns DNA reads at Illumina, and powers the "did you mean ...?" line at the bottom of every modern search bar.

Edit distance, also called Levenshtein distance, is the boss of the 2-D DP cluster. The state space is the same rectangle you saw in unique-paths and longest-common-subsequence, but on a mismatch you have three transitions instead of two: insert, delete, replace. Every interesting string-comparison problem either is edit distance or reduces to a tiny variation of it.

What's actually being asked

Given two strings s and t, return the minimum number of single-character operations required to transform s into t. The three allowed operations are:

  1. Insert a character into s at any position.
  2. Delete a character from s at any position.
  3. Replace a character in s with any other character.

Each operation costs 1. The minimum total cost is the edit distance.

Examples:

  • s = "horse", t = "ros" → 3. One sequence: replace h with r (horse → rorse), delete r (rorse → rose), delete e (rose → ros).
  • s = "intention", t = "execution" → 5.
  • s = "abc", t = "abc" → 0. Nothing to do.
  • s = "", t = "abc" → 3. Three insertions.
  • s = "abc", t = "" → 3. Three deletions.

The naive approach

Recurse on the last characters of each string.

def dist(i, j, s, t):
    if i == 0: return j           # insert j chars
    if j == 0: return i           # delete i chars
    if s[i-1] == t[j-1]:
        return dist(i-1, j-1, s, t)
    return 1 + min(
        dist(i-1, j-1, s, t),     # replace
        dist(i-1, j, s, t),       # delete from s
        dist(i, j-1, s, t),       # insert into s
    )

Correct, exponential. The recursion tree branches three ways at every mismatch. For two 12-character strings this already calls the function around 5 x 10^5 times, and most of those calls solve the same subproblem repeatedly. For two 100-character strings, you would wait years.

The key insight

The state is a pair of prefix lengths. Once you fix (i, j), the minimum cost to convert s[:i] into t[:j] depends only on (i, j). There are at most (n+1)(m+1) such pairs. Solve each one once.

The four DP questions:

  1. State: dp[i][j] = minimum edits to transform s[:i] into t[:j].
  2. Base case: dp[0][j] = j (insert j characters into the empty string). dp[i][0] = i (delete i characters from s[:i]).
  3. Transition:
    • If s[i-1] == t[j-1]: dp[i][j] = dp[i-1][j-1]. No cost; the last characters already line up.
    • Else: dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1]).
  4. Answer: dp[n][m].

The three transitions, in plain English

These three are the heart of the problem. Internalise each one as an operation on s:

  • Replace, dp[i-1][j-1] + 1. Turn the last character of s[:i] into the last character of t[:j]. Both prefixes shrink by one.
  • Delete, dp[i-1][j] + 1. Drop the last character of s[:i]. The s prefix shrinks; the t prefix is unchanged.
  • Insert, dp[i][j-1] + 1. Append the last character of t[:j] to the end of s[:i]. The s prefix is unchanged; the t prefix shrinks because that target character has now been placed.

When the last characters already match, no operation is needed at that position; that is the "free" diagonal. This is the entire reason the recurrence has a separate match branch.

This is what the topic primer means when it calls edit distance the boss of the cluster. You inherit the rectangular state space from unique-paths, you inherit the match-or-mismatch shape from longest-common-subsequence, and you add a third transition on top.

Step-by-step approach

Iterative DP. Build a (n+1) x (m+1) table.

def edit_distance(s, t):
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i-1] == t[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
    return dp[n][m]

Each row only needs the previous row and the current row's cell to the left, so collapse the table to two rolling rows of length m + 1:

def edit_distance(s, t):
    n, m = len(s), len(t)
    if n == 0: return m
    if m == 0: return n
    prev = list(range(m + 1))
    curr = [0] * (m + 1)
    for i in range(1, n + 1):
        curr[0] = i
        for j in range(1, m + 1):
            if s[i-1] == t[j-1]:
                curr[j] = prev[j-1]
            else:
                curr[j] = 1 + min(prev[j-1], prev[j], curr[j-1])
        prev, curr = curr, prev
    return prev[m]

O(n m) time, O(m) space.

Worked example

s = "horse", t = "ros". Fill the 6 by 4 table.

       ""  r   o   s
   "" [  0   1   2   3  ]
   h  [  1   1   2   3  ]
   ho [  2   2   1   2  ]
   hor[  3   2   2   2  ]
   hors[ 4   3   3   2  ]
   horse[5   4   4   3  ]

Read the corners:

  • dp[1][1]: h != r. Min of (dp[0][0]=0 replace, dp[0][1]=1 delete, dp[1][0]=1 insert) plus 1 = 1.
  • dp[2][2]: o == o. Copy diagonal: dp[1][1] = 1.
  • dp[3][1]: r == r. Copy diagonal: dp[2][0] = 2.
  • dp[5][3]: e != s. Min of (dp[4][2]=3, dp[4][3]=2, dp[5][2]=4) plus 1 = 3.

Return dp[5][3] = 3. ✓

Trace one optimal edit sequence by walking back from (5, 3):

  • (5, 3) = 3 came from dp[4][3] + 1 (delete e). Step to (4, 3).
  • (4, 3) = 2 came from match (s == s). Step to (3, 2).
  • (3, 2) = 2 came from dp[2][2] + 1 (delete r). Step to (2, 2).
  • (2, 2) = 1 came from match (o == o). Step to (1, 1).
  • (1, 1) = 1 came from dp[0][0] + 1 (replace h with r). Done.

Sequence: replace h → r, keep o, delete r, keep s, delete e. Three edits, matching the example.

Complexity

  • Time: O(n m). Every cell is filled once.
  • Space: O(n m) for the full table, O(min(n, m)) with rolling rows.

For the constraint |s|, |t| ≤ 1000, that is 10^6 cells. In C this finishes in milliseconds; in Python the inner loop is the bottleneck. The reference solution encodes both strings to bytes for fast byte-integer equality and unrolls the min of three values into manual if comparisons, both of which materially shave time inside the 10^6-iteration loop.

Common pitfalls

  • Off-by-one in the base case. dp[0][j] must equal j, not 0. Forgetting the base row produces correct-looking output on s == t and wrong output everywhere else.
  • Wrong direction for insert vs delete. The recurrence is symmetric (edit(s, t) == edit(t, s)), but the operations on s are not. dp[i-1][j] is delete-from-s, dp[i][j-1] is insert-into-s. If you trace back an edit script, get this right or your operations will be backwards.
  • Naive recursion without memo. Exponential in the input size. Always add @lru_cache or write the iterative form.
  • Confusing edit distance with Hamming distance. Hamming distance is defined only for equal-length strings and counts position-wise mismatches. Edit distance handles unequal lengths and allows insertions and deletions.
  • Forgetting that ties exist. When two of the three transitions yield the same minimum, both correspond to valid edit scripts. Don't assume a unique optimal sequence.

Where this shows up in the enterprise

  • Google search spell-correction ("Did you mean ...?"): edit distance against the dictionary, ranked by frequency.
  • Microsoft Word and Office 365 spell-check: same idea at desktop scale.
  • Stripe and Adyen merchant SKU matching: when a merchant types a typo'd SKU into the dashboard, edit distance proposes the nearest valid one.
  • Salesforce customer-record deduplication: matching "John Smith, Acme Inc" against "Jon Smith, Acme" across CRM imports.
  • Illumina and Oxford Nanopore DNA aligners: pairwise sequence alignment is edit distance with biology-tuned costs (transitions cheaper than transversions, gap-opening cost separate from gap-extension).
  • GitHub Copilot and Claude Code diff narration: explaining what changed line by line uses an LCS or edit-distance pass under the hood.
  • Linux diff, git diff, rdiff-backup: every Unix-derived diff engine in the world.
  • Amazon Alexa and Apple Siri intent matching: aligning recognised speech against the intent vocabulary.
  • PostgreSQL extensions (pg_trgm for trigram similarity, related fuzzy-match extensions): production-grade fuzzy search at database level.

The pattern is universal: any time you need a "minimum operations to align these two sequences" answer, you want edit distance or a small variation on it.

Variants you'll see elsewhere

  • one-edit-distance: return whether the distance is exactly 1. Same recurrence, early-exit.
  • delete-operation-for-two-strings: only delete operations allowed. Equal to n + m - 2 * LCS(s, t).
  • minimum-ascii-delete-sum: weight each deletion by the character's ASCII value. Same shape, different costs.
  • Wagner-Fischer with custom costs: independent costs for insert, delete, replace. Bioinformatics aligners do exactly this.
  • damerau-levenshtein: also allow swapping adjacent characters. Add one more transition: dp[i-2][j-2] + 1 when s[i-1] = t[j-2] and s[i-2] = t[j-1]. The autocomplete "teh" → "the" case.
  • regular-expression-matching / wildcard-matching: same (i, j) table, transitions encode the pattern's * and ? semantics.
  • shortest-common-supersequence: produce the shortest string that contains both s and t as subsequences. Length is n + m - LCS, construction walks the same DP table.

Once you internalise the three transitions, the entire string-alignment family becomes one technique with different weightings.

How parikshan turns this into a learning loop

Edit distance is hard, and the integrated loop is built to compensate for that.

  1. Read this article twice. Once for the recurrence, once for the three transitions in English. The English description matters more than the code.
  2. Solve in practice mode with the AI chat off. Before writing code, sketch the 6 by 4 table on paper for the "horse" → "ros" example. Fill it in by hand. Trace back one optimal sequence. You should not start typing until you have done this.
  3. Submit. The bank's dynamic private tests include both-empty, single-character-mismatch, the classic "kitten" → "sitting" (distance 3), and a 1000 by 1000 stress case. If you missed the base case, the empty-string tests fail instantly.
  4. AI training: Optimise the solution. The assistant should produce the two-row rolling version. In exam mode each request costs a slice of your integrity score. Spend deliberately; for a hard problem the cost is worth paying once you've made an honest attempt.
  5. AI training: Find Bugs on a deliberately broken version. Comment out the match-branch and submit. Watch the assistant identify that you are paying for a replace operation on every diagonal step.

The dispute flow has a meaningful role on edit distance. If a 1000 by 1000 stress test TLEs in your Python solution and you believe the reference solution does too, dispute it; a human reviewer can re-examine the time limit. Edit distance is the cluster's hardest problem, and the platform expects scrutiny on the harness here.

In the AI-integrated workspace

Edit distance is the problem where the AI agent's output most often looks right and is subtly wrong. The classic failure modes:

  1. Wrong transition assignment. The agent maps delete-from-s to dp[i][j-1] instead of dp[i-1][j]. The answer comes out right (the recurrence is symmetric in (i, j)) but the trace-back produces an incorrect edit script. If you ever ask the agent to "print the operations", this bug surfaces.
  2. Wrong base case direction. The agent writes dp[0][j] = 0 and dp[i][0] = 0. Output is 0 on every input where one string is empty. Caught by the public ("", "abc") test only if the agent ran it.
  3. Charging for the match case. The agent always adds 1 and takes the min of four options including the diagonal, even on a match. Wrong by a factor of 2 or so on aligned inputs.
  4. Recursive solution without memo. Looks elegant, times out on anything above 12 characters.

The 2027 engineer's discipline for edit distance:

  1. State the recurrence in English before the agent codes. Read out the three transitions slowly: replace = diagonal + 1, delete = up + 1, insert = left + 1, free match = diagonal copy.
  2. Tell the agent to implement that recurrence iteratively with the rolling-row optimisation if memory is a concern.
  3. Verify on ("horse", "ros") by hand. Fill the 6 by 4 table mentally; spot-check three cells.
  4. Ask the agent to trace back one optimal sequence. This catches transition-direction bugs that pure length-tests miss.
  5. Stress on edge cases. Both empty, one empty, equal strings, reversed strings, identical-length-zero-overlap strings.

That five-step audit turns a probabilistic generation into a verified solution. parikshan's integrated loop is designed to drill this exact habit, because in your future job at Google's spell-check team or Illumina's variant-caller team, you will direct the AI agent rather than approve its first guess. Edit distance is the dojo where you learn that discipline.

edit-distance