parikshan

implement-strstr

Difficulty: Easy  ·  Topic: Math, Matrix & Heaps  ·  Practice it on parikshan: /practice/implement-strstr/start

The story

A security engineer at Cloudflare scans incoming HTTP request bodies for known-bad signatures: SQL injection patterns, exploit payloads, leaked-credential markers. The scanner takes one request body and one signature and asks the simplest possible question: "does this substring appear in this string, and if so, where?". A million times per second, across every edge node. The answer powers WAF rules, malware detectors at Akamai, regex-free routing in nginx, and the Ctrl+F you press in VS Code.

That is implement-strstr. It is the entry point to string matching, an area of algorithms that goes very deep, but the entry-level version is short enough to fit in ten lines.

What's actually being asked

Given haystack and needle, return the index of the first occurrence of needle in haystack, or -1 if needle is not present. By convention, an empty needle returns 0.

The naive approach

For each position i in haystack where the needle could still fit (that is, i from 0 to len(haystack) - len(needle)), compare the substring haystack[i:i+m] against needle. Return the first matching i, or -1 if you walk off the end. Worst case O((n - m + 1) * m), which for the problem's 10^5 bound is around 10^10 in the most adversarial input, theoretically too slow.

In practice it is comfortably fast because the vast majority of starts mismatch on the first character. The average case is close to O(n) for typical text. The judge's time limit is permissive enough that the naive solution passes. The right move in production code is to call the built-in haystack.find(needle), which uses a fast C implementation; the right move in an interview is to write the naive version explicitly and then mention that KMP gives a hard O(n + m) guarantee.

The key insight

Substring search has two layers. The first is "where could it start?", which is the loop over i. The second is "does it start here?", which is the inner comparison. Naive code does the inner comparison from scratch every time; smarter algorithms remember what they learned from the previous mismatch to skip impossible starts.

For the problem at this scale, the naive answer is the answer. The cleaner solution in Python is one line.

Step-by-step approach

Brute-force version:

  1. If needle is empty, return 0.
  2. Let n = len(haystack), m = len(needle).
  3. For i from 0 to n - m (inclusive):
    • If haystack[i:i+m] == needle, return i.
  4. Return -1.

Built-in version (the one you should ship):

return haystack.find(needle)

str.find returns the index of the first occurrence or -1. By design, it returns 0 for an empty needle, matching the spec.

Worked example

haystack = "sadbutsad", needle = "sad"
  • i=0, "sad" == "sad", return 0.
haystack = "leetcode", needle = "leeto"
  • i=0, "leetc" != "leeto" (mismatch at index 4).
  • i=1, "eetco" != "leeto".
  • i=2, "etcod" != "leeto".
  • i=3, "tcode" != "leeto".
  • Loop exits (i would be 4 but n - m = 8 - 5 = 3). Return -1.

Even the brute-force inner comparison short-circuits at the first character mismatch, which is why it runs fast on real text where most starts disagree immediately.

Complexity

  • Time: O((n - m + 1) * m) worst case, O(n + m) average case for typical inputs.
  • Space: O(1) for the index loop, O(m) if you build the slice object on every iteration. In Python, slicing is fast enough that the slice cost is dominated by the comparison.

For the problem bounds (n, m <= 10^5), the worst case is theoretical: it would require an input like "aaaaa...aab" searched for "aaaab", which is constructible but rare. KMP gives a hard O(n + m) guarantee for those cases.

A word on KMP

The Knuth-Morris-Pratt algorithm precomputes, for each prefix of needle, the longest proper prefix that is also a suffix. Using this failure function, the main loop never reconsiders a character of haystack: a mismatch tells you exactly how far you can shift needle without missing an occurrence. The result is O(n + m).

KMP is worth knowing as a fact: when an interviewer asks about the worst case, mention KMP. Do not derive it on a whiteboard unless asked. The entry point to string matching is the brute-force loop; KMP, Boyer-Moore, Rabin-Karp, and Aho-Corasick are the optimisations.

Common pitfalls

  • Missing the empty-needle case: the contract says return 0 for an empty needle. str.find already does this, but a hand-rolled loop will return -1 if n - m + 1 is computed as n + 1 and the loop iterates without ever finding a match. Test it.
  • Off-by-one on the upper bound: the loop runs i from 0 to n - m inclusive, which is n - m + 1 iterations. Writing range(n) and slicing past the end of haystack silently returns shorter slices in Python, which then never equal needle. That hides the bug instead of crashing on it.
  • Comparing character-by-character in a hand loop: in Python, haystack[i:i+m] == needle is implemented in C and is faster than a hand-written Python loop, even though it allocates a slice. Use the built-in comparison.
  • Returning a boolean: read the spec. The output is an index or -1.

Where this shows up in the enterprise

  • Cloudflare WAF and AWS WAF: signature matching against request bodies, headers, and URLs. Multi-pattern variants use Aho-Corasick (parallel needle search) but the single-pattern call is strstr.
  • Splunk, Datadog, Loki: log search. The first thing the user types is a substring; the index returns matching lines.
  • Antivirus engines (ClamAV, Windows Defender): malware signature scanning at file ingest is multi-pattern substring search.
  • VS Code, Vim, IntelliJ "Find in File": Ctrl+F and Ctrl+Shift+F. The non-regex case is strstr; the regex case is an NFA simulation.
  • DNA sequence search at the Broad Institute: aligning short reads against a reference genome starts with exact substring search; mismatches are then layered on with dynamic programming.
  • Email spam filters: detecting known phishing phrases ("update your account password") in message bodies.

The pattern is universal: any time the question is "does X appear in Y", strstr is the kernel underneath.

Variants you'll see elsewhere

  • find-all-occurrences: return every starting index, not just the first. Same loop, append to a list instead of returning.
  • repeated-substring-pattern: detect whether a string is built by repeating a shorter pattern. Solvable with one strstr call against (s + s)[1:-1].
  • rotate-string: check whether s2 is a rotation of s1. One strstr call against s1 + s1.
  • regex-matching: jump from substring to pattern matching with . and *. Dynamic programming.
  • aho-corasick: search for many needles simultaneously in one pass. The foundation of modern WAFs and antivirus engines.

How parikshan turns this into a learning loop

After you read this article and click through to the practice link, the parikshan flow integrates four layers:

  1. The editor evaluates against public tests and a suite of per-session dynamic private tests. For strstr those include an empty needle, a needle equal to the haystack, a needle longer than the haystack (returns -1), and matches at the very end of the haystack.
  2. Hints are graduated: the first nudges you toward the upper bound on starting indices; the third spells out the brute-force loop; the fourth mentions KMP. Each hint costs integrity points.
  3. The AI assistant in exam mode can Explain, Review, Find Bugs, Add Comments, or Optimise. Each is a small integrity hit. In practice mode it is free; use it to ask why str.find is faster than a Python loop and what KMP buys you.
  4. The dispute flow routes contested verdicts to a human reviewer.

If you can ship the one-line str.find solution and articulate when it would break (an adversarial input that forces the brute-force inner loop to run to its worst case), the pattern is yours.

In the AI-integrated workspace

In a real codebase, an agent will write strstr calls or regex calls. Your job is to:

  1. Recognise when the problem is exact substring search (the cheap case) versus regex or fuzzy matching (the expensive case). Tell the agent which one applies.
  2. Push back when the agent reaches for regex on a fixed-pattern search. Regex is slower than strstr and has unbounded worst case on pathological patterns. Cloudflare's 2019 outage was caused by a regex with quadratic backtracking; the lesson is to use strstr when you can.
  3. Verify that the empty-needle case is handled. AI agents often forget the convention strstr("", "") == 0.
  4. Insist on a multi-pattern algorithm (Aho-Corasick or a Bloom filter pre-filter) when the workload is many needles and one haystack. The agent will default to a loop of strstr calls; that is correct but quadratic in the number of needles.

Engineers who can pick the right string-matching primitive for the workload save orders of magnitude of cost at scale. Engineers who reach for regex for every search end up writing the post-mortem.

implement-strstr