parikshan

longest-common-subsequence

Difficulty: Medium  ·  Topic: 2-D Dynamic Programming  ·  Practice it on parikshan: /practice/longest-common-subsequence/start

The story

The platform team at GitHub is debugging a flaky three-way merge. When two engineers edit the same file on two branches, the merge engine has to decide, line by line, which lines are common to both edits, which were inserted by one side, and which were deleted. The mathematical core of that decision is one question: given two sequences of lines, what is the longest sequence of lines that appears in both, in the same relative order? That is the longest common subsequence, or LCS.

The same DP runs inside Microsoft's spell-check, inside diff on every Linux machine you have ever touched, inside DNA aligners at Illumina, inside the change-detection engine at Google Docs. Once you see this problem in one place, you will see it everywhere two ordered sequences have to be compared.

What's actually being asked

Given two strings s and t, return the length of their longest common subsequence. A subsequence is obtained by deleting zero or more characters from a string without changing the order of the rest. It is not required to be contiguous.

Examples:

  • s = "abcde", t = "ace" → 3. The LCS is "ace".
  • s = "abc", t = "abc" → 3. The LCS is the whole string.
  • s = "abc", t = "def" → 0. No character in common.
  • s = "AGGTAB", t = "GXTXAYB" → 4. The LCS is "GTAB".

The naive approach

Recurse on the last characters. If they match, the LCS includes that character plus the LCS of the prefixes. If they do not, the LCS is whichever is larger of "drop the last character of s" or "drop the last character of t":

def lcs(i, j, s, t):
    if i == 0 or j == 0: return 0
    if s[i-1] == t[j-1]: return 1 + lcs(i-1, j-1, s, t)
    return max(lcs(i-1, j, s, t), lcs(i, j-1, s, t))

Correct, exponential. At every mismatch the recursion branches in two, and the same (i, j) pair is solved many times. For two 20-character strings this already takes seconds. For two 100-character strings it is hopeless.

The key insight

State the recurrence in English, then translate to code.

Examine the last character of each prefix.

  • Match. If s[i-1] == t[j-1], that character can sit at the tail of the LCS. The rest of the LCS is whatever was best for the prefixes s[:i-1] and t[:j-1]. So dp[i][j] = dp[i-1][j-1] + 1.
  • Mismatch. At least one of the two ending characters is not in the LCS. Either drop s[i-1] or drop t[j-1], take whichever yields more: dp[i][j] = max(dp[i-1][j], dp[i][j-1]).

That recurrence, plus dp[0][*] = dp[*][0] = 0 (empty prefix matches nothing), is the entire problem.

Why is the match case greedy? When the ending characters agree, there is always an optimal LCS that uses that matching pair at the end. (If some optimal LCS did not use the matched pair, you could swap it in at the end without making the LCS shorter; the formal argument is one paragraph of inductive bookkeeping.) So we never need a "match but skip" transition: three cases would be redundant.

Step-by-step approach

Bottom-up DP. Build a (n+1) x (m+1) table where n = |s|, m = |t|.

def lcs(s, t):
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i-1] == t[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[n][m]

Each row only depends on the previous row and the current row's cell to the left, so the table collapses to two 1-D rows:

def lcs(s, t):
    n, m = len(s), len(t)
    prev = [0] * (m + 1)
    curr = [0] * (m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i-1] == t[j-1]:
                curr[j] = prev[j-1] + 1
            else:
                curr[j] = max(prev[j], curr[j-1])
        prev, curr = curr, prev
    return prev[m]

O(n m) time, O(m) space.

Worked example

s = "abcde", t = "ace". Fill the 6 by 4 table (rows indexed by s prefix length, cols by t prefix length).

        ""  a   c   e
   ""[   0   0   0   0 ]
   a [   0   1   1   1 ]
   ab[   0   1   1   1 ]
   abc[  0   1   2   2 ]
   abcd[ 0   1   2   2 ]
   abcde[0   1   2   3 ]
  • dp[1][1]: s[0]='a' == t[0]='a', so dp[0][0] + 1 = 1.
  • dp[3][2]: s[2]='c' == t[1]='c', so dp[2][1] + 1 = 2.
  • dp[5][3]: s[4]='e' == t[2]='e', so dp[4][2] + 1 = 3.

Return dp[5][3] = 3. ✓

Complexity

  • Time: O(n m). Every cell of the table is filled once.
  • Space: O(n m) for the full table, O(min(n, m)) with rolling rows.

For the constraint |s|, |t| ≤ 1000, that is 10^6 cells. Trivial in C; in Python the inner loop is the bottleneck. The reference solution encodes both strings to bytes and compares integer byte values inside the inner loop, which roughly halves the runtime by skipping Python-string allocation in the hot path.

Common pitfalls

  • Off-by-one between prefix length and index. dp[i][j] is the LCS of the first i characters of s and the first j of t. So when looking up characters you compare s[i-1] and t[j-1], not s[i] and t[j]. This single off-by-one is the most frequent LCS bug.
  • Wrong base case for empty strings. Both dp[0][j] and dp[i][0] must be zero. A subsequence of an empty string is empty.
  • Returning the LCS string instead of its length. Read the spec. To recover the string itself you must keep the full table and walk it back from (n, m).
  • Confusing subsequence with substring. "Substring" means contiguous; "subsequence" does not. "ace" is a subsequence of "abcde" but not a substring of it. The DP changes if you require contiguity.

Where this shows up in the enterprise

  • Git merge at every git host (GitHub, GitLab, Bitbucket): the engine behind every three-way merge is an LCS DP on lines.
  • Microsoft Word "track changes" and Google Docs revision history: diffing two versions of a document line by line uses LCS.
  • DNA sequence alignment at Illumina, Oxford Nanopore, 23andMe: global alignment is LCS with weighted match/mismatch costs.
  • GNU diff and git diff: the workhorse algorithm behind code review on the entire open-source world.
  • Plagiarism detection at Turnitin and Moss: LCS on tokenised programs detects copied code even after renaming.
  • Speech recognition at Amazon Alexa and Google Assistant: aligning hypothesised text against a reference uses an LCS-like DP.
  • Compiler error recovery at LLVM and GCC: aligning the parser's actual token stream against the expected grammar uses an LCS-like step.

The pattern is universal: any time two ordered sequences need to be compared and you want a "longest common ordered match" answer, you want LCS.

Variants you'll see elsewhere

  • edit-distance: same (i, j) state, three transitions on a mismatch (insert, delete, replace) instead of two.
  • longest-common-substring: require contiguity. Recurrence becomes dp[i][j] = dp[i-1][j-1] + 1 on match and 0 on mismatch.
  • shortest-common-supersequence: same table, answer is n + m - lcs(s, t).
  • delete-operation-for-two-strings: minimum deletions to make s == t, equal to n + m - 2 * lcs(s, t).
  • longest-palindromic-subsequence: LCS of s with its reverse. Same DP, no new code.

Once you internalise the (i, j) table, this entire family becomes one technique with different transition rules.

How parikshan turns this into a learning loop

LCS is where the integrated loop pays its highest dividends. The pattern from this point on:

  1. Read this article. Read the recurrence twice; the English description is more important than the code.
  2. Solve in practice mode with the AI chat off. Before writing code, sketch a 6 by 4 table on paper for the "abcde" / "ace" example. Fill it in by hand. You should not start typing until you have done this.
  3. Submit. The bank's dynamic private tests include empty strings, equal strings, and a 1000 by 1000 stress case. Off-by-one errors on the base case fail the empty-string tests immediately.
  4. AI training: Optimise the solution. The assistant should produce the two-row rolling version. Each AI request costs a slice of your integrity score, so spend deliberately.
  5. Move to edit-distance in exam mode. The recurrence shape is identical; only the transition rules change. Solving it cold confirms you have generalised, not memorised.

If you genuinely believe your O(n m) algorithm should not TLE on a stress test, dispute it. A human reviewer can recalibrate the time limit if the bank's reference is off; this is the safety valve.

In the AI-integrated workspace

LCS is the canonical 2-D DP, and it is exactly where AI agents most often produce plausible-looking-but-wrong code. The classic failure mode: the agent writes a recursion that branches in three on a mismatch (replace, insert, delete-from-s, delete-from-t) when only two are needed, and gets the right answer by accident on small inputs while inflating the constant.

The 2027 engineer's discipline:

  1. State the recurrence in English before the agent codes. "dp[i][j] is the LCS length of s[:i] and t[:j]. On match: dp[i-1][j-1] + 1. On mismatch: max(dp[i-1][j], dp[i][j-1])."
  2. Tell the agent to implement that recurrence iteratively. Not "find the longest common subsequence"; spell out the table.
  3. Verify on the 6 by 4 example by hand. Always. Cross-check at least three cells, including a match cell, a mismatch cell, and a boundary cell.
  4. Ask the agent to collapse memory. "Reduce to two rolling rows of length m + 1."
  5. Stress on edge cases. Empty strings on either side. Identical strings. Strings with no common character. Strings where one is a subsequence of the other.

That five-step audit turns a probabilistic answer into a deterministic, verified one. parikshan's loop is engineered to train exactly that habit, so when you walk into a code review at GitHub or a sequencing pipeline at Illumina, you will direct the AI agent rather than approve its first guess.