parikshan

partition-labels

Difficulty: Medium  ·  Topic: Greedy  ·  Practice it on parikshan: /practice/partition-labels/start

The story

You work on the chat archive feature at Slack. A long thread is being split across several monthly digest emails for compliance retention. Marketing wants each digest to be self-contained: no participant who posted in one digest may appear in another. The constraint is unusual but firm, auditors must be able to assess each digest as a closed unit. You are handed an array of participant IDs ordered by post time, and asked to find the most digests you can produce.

That is partition-labels. The same shape recurs in Datadog log-stream sharding (no trace ID spans two shards), in Stripe payment batching (no merchant straddles two settlement runs), and in YouTube ad-break placement (no advertiser bookends two ad pods). The problem definition is striking in its simplicity: cut a sequence into as many pieces as possible so that no element appears in more than one piece.

What's actually being asked

Given a string s of lowercase letters, partition it into as many parts as possible so that each letter appears in at most one part. Return the size of each part in order. The partition that maximises the number of parts is unique.

The naive approach

You could enumerate every cut position and check the "no shared letters" constraint per cut. With n positions and a quadratic verification, that is O(n^3). For n = 10^5 that is 10^15, hopeless. Even a smarter version with prefix sets is O(n^2). Naive: correct, useless.

The key insight

A part that starts at index start must contain the entire territory of every letter it includes. That is, for any letter c inside the part, the part must extend at least to last[c], where last[c] is the last index in the whole string at which c appears. Otherwise c would also occur in a later part, violating the rule.

So precompute last[c] in one pass. Then in a second pass, walk with two pointers start and end. For each index i, expand end = max(end, last[s[i]]). When i catches up to end, every letter inside [start, end] has had its territory absorbed; no future letter equal to any inside the window can appear. Emit end - start + 1 as the size of the closed part, advance start to i + 1, repeat.

Why this is optimal (maximum number of parts): cutting earlier than i == end would split some letter's territory, which the rules forbid. Cutting later would unnecessarily merge two valid parts and reduce the count. Greedy here is provably correct by structural argument: the closing point is forced by the territory constraint, and the algorithm closes at the earliest legal moment every time.

The greedy proof: closing at the earliest forced point

Formally, suppose an optimal partition has k parts and the greedy partition has k'. We show k' >= k. Walk both partitions left to right. At each step, the greedy partition closes its current part as soon as legally possible (when i == end). The optimal partition can close no earlier (it would violate the territory rule) and may close later. So the greedy partition's part boundaries are pointwise at or before the optimal's. By induction, greedy ends with at least as many parts as optimal. Since optimal is, by definition, maximal, k' = k. QED.

This is a textbook exchange argument applied to closing points: shifting any boundary of the optimal partition to match the greedy partition never reduces the count.

Step-by-step approach

  1. First pass: build last = {c: i} for the last index each character appears. In Python, last = {c: i for i, c in enumerate(s)} overwrites earlier indices, leaving the maximum, exactly what we want.
  2. Second pass: initialise start = 0, end = 0, result = [].
  3. For each i, c in enumerate(s):
    • end = max(end, last[c]).
    • If i == end: result.append(end - start + 1); start = i + 1.
  4. Print the result joined by single spaces.

Two passes. O(n) time. O(1) extra memory because the alphabet has 26 letters.

Worked example

s = "ababcbacadefegdehijhklij"

First pass, last occurrences:

  • a: index 8
  • b: index 5
  • c: index 7
  • d: index 14
  • e: index 15
  • f: index 11
  • g: index 13
  • h: index 19
  • i: index 22
  • j: index 23
  • k: index 20
  • l: index 21

Second pass:

  • i=0 (a): end = max(0, 8) = 8.
  • i=1 (b): end = max(8, 5) = 8.
  • i=2 (a): end = 8.
  • i=3 (b): end = 8.
  • i=4 (c): end = max(8, 7) = 8.
  • i=5 (b): end = 8.
  • i=6 (a): end = 8.
  • i=7 (c): end = 8.
  • i=8 (a): end = 8. i == end → emit 8 - 0 + 1 = 9. start = 9.
  • i=9 (d): end = max(8, 14) = 14.
  • i=10 (e): end = max(14, 15) = 15.
  • i=11 (f): end = max(15, 11) = 15.
  • i=12 (e): end = 15.
  • i=13 (g): end = max(15, 13) = 15.
  • i=14 (d): end = 15.
  • i=15 (e): end = 15. i == end → emit 15 - 9 + 1 = 7. start = 16.
  • i=16 (h): end = max(15, 19) = 19.
  • ...
  • i=23 (j): end = 23. i == end → emit 23 - 16 + 1 = 8. start = 24.

Output: 9 7 8. Matches the spec.

Complexity

  • Time: O(n), two passes.
  • Space: O(1), the last map holds at most 26 entries.

Optimal: any algorithm must read every character at least once.

Common pitfalls

  • Closing on i == last[s[i]] instead of i == end. The former closes too early, it ignores letters seen earlier whose territory extends further.
  • Off-by-one on emitted size: it is end - start + 1, not end - start. Inclusive endpoints.
  • Forgetting to reset start: emitting the right size but advancing start to i rather than i + 1 produces overlapping parts. Standard off-by-one.
  • Joining with the wrong separator: spec asks for single spaces, no trailing space. ' '.join(map(str, result)).

Where this shows up in the enterprise

  • Slack / Microsoft Teams: chat archive sharding for compliance, where no user spans two retention buckets.
  • Datadog / New Relic: log-stream sharding where no trace ID may cross a shard boundary.
  • Stripe / Adyen: settlement run boundaries where no merchant straddles two payouts.
  • YouTube / Hulu: ad-pod boundaries where no advertiser bookends two consecutive breaks (anti-collision rule).
  • Genome assembly: scaffold boundaries where no gene fragment spans two scaffolds.
  • AWS Athena partitioning: query result chunking where no row-key family crosses chunks.
  • CDN cache rotation: warm-cache batches where no asset prefix straddles batches.

The recurring shape is "anti-collision partitioning": you want as many groups as possible, but each group must fully contain certain elements. Once you see that abstraction, the algorithm transfers directly.

Variants you'll see elsewhere

  • minimum-window-substring: find the smallest window containing all required letters. Same last-occurrence intuition but flipped.
  • merge-intervals: greedy merging where intervals are explicit ranges; same kind of "open until you must close" sweep.
  • meeting-rooms: how many rooms are needed concurrently. Same two-pointer shape on sorted start/end events.
  • partition-array-for-maximum-sum: DP variant; greedy fails because the cost function is not monotone.

The first three are siblings; the last is the warning shot that greedy is not universal, when the cost depends on aggregate properties of the partition, you need DP.

How parikshan turns this into a learning loop

The dynamic private tests vary string length and alphabet diversity to catch shortcuts: all-same-character (one part), all-distinct (n parts of size 1), worst-case interleaved characters (one massive part). The four-tier hint ladder is unusually generous, hint 1 is essentially free and names the last-occurrence idea. That is intentional: the algorithm is elegant once you see it, and the platform wants you to see it, not just memorise it. Practice with AI free-form chat in practice mode; switch to exam mode and aim for the eight-line solution from memory in under three minutes.

In the AI-integrated workspace

Two failure modes in generated code:

  1. Closing too early. The agent writes if i == last[c] instead of if i == end. Public tests pass because the first part is often closed correctly by accident; private tests with interleaved letters expose the bug.
  2. Missing proof. As with all greedy algorithms, the agent will produce the right code with no justification for why closing at i == end gives the maximum number of parts (rather than just a valid partition). Always ask for the exchange argument or the structural proof. Without it, you have a function that works on the training distribution and might fail elsewhere.

The senior-engineer move on greedy problems is always the same: demand the proof in the docstring. If the agent cannot produce the exchange argument or the structural argument, the code is unverified, and you stress-test on a 5-character worst case before merging. On partition-labels the worst case is something like "abacb", confirm the algorithm produces exactly one part of size 5, not two parts.