parikshan

encode-decode-strings

Difficulty: Medium  ·  Topic: Arrays & Hashing  ·  Practice it on parikshan: /practice/encode-decode-strings/start

The story

A network engineer at Cloudflare is designing a compact framing format for a new gRPC-like internal protocol. Each request carries a list of header strings; they can contain colons, newlines, commas, quotes, anything a hostile user can type. The wire format must be self-describing, parseable from a forward-only byte stream, and survive arbitrary content. A clever-seeming "comma-separated with backslash escaping" version went into a prototype and then fell over the first time a payload contained a comma; the redesign committee chose length-prefixed framing the next day. That is the protocol that ships.

That is encode-decode-strings. It is the smallest exam-ready example of "designing a serialisation that survives adversarial content". The discipline you learn here is the same one behind Netstrings, BSON, every binary RPC, and most file formats.

What's actually being asked

Design two functions:

  • encode(arr): take an array of strings and return a single string.
  • decode(blob): take that single string and return the original array, byte for byte.

The strings can contain spaces, tabs, digits, letters, punctuation including ':', but never literal newlines. Use the length-prefix scheme: each string is encoded as <len>:<bytes>, concatenated with no separators between consecutive encodings.

The naive approach

','.join(arr). Fails the moment a string contains a comma. Try '|'.join(arr): fails on '|'. Try '###'.join(arr): fails on '###'. Any single fixed delimiter has the same disease.

The next naive variant: escape the delimiter. Replace ',' inside strings with '\\,', replace '\\' with '\\\\'. Works, but requires a stateful parser that examines every character one at a time, tracks escape state, and undoes the escapes during decoding. Possible, fragile, and the bugs that escape review tend to be very hard to find because they only surface on byte-exact adversarial inputs.

The key insight

Stop trying to use delimiters at all. If the decoder knows in advance how many bytes to read for the next string, the contents of those bytes can be anything, including the would-be delimiter. The decoder is no longer parsing the content; it is just copying a known number of bytes.

So emit <len>:<bytes> for each string. The first ':' after the current position marks the end of the length token; everything from there to + len bytes is the next string's payload; advance past that and repeat. The colons inside payload bytes are opaque; the decoder never inspects them, because it already knows where the payload ends.

This is the same trick that Netstrings, Pascal strings, BSON, ASN.1 BER, and most binary protocols use: length-prefix the content. It survives every adversarial content scenario without escaping.

Step-by-step approach

Encoding:

def encode(arr):
    return ''.join(f'{len(s)}:{s}' for s in arr)

Decoding:

def decode(blob):
    out = []
    i = 0
    n = len(blob)
    while i < n:
        j = blob.index(':', i)
        L = int(blob[i:j])
        out.append(blob[j + 1: j + 1 + L])
        i = j + 1 + L
    return out

The decoder maintains a single position i. At each iteration: find the next colon starting from i, parse the digits between i and the colon as the length L, take the next L characters verbatim, advance i. Repeat until i reaches the end.

Worked example

arr = ["hello", "world", "ai"]

Encoded: '5:hello5:world2:ai'.

Decoding trace:

stepinext ':'Lpayloadi after
1015"hello"7
2785"world"14
314152"ai"18

i = 18 = len(blob), loop ends. Decoded array: ["hello", "world", "ai"]. Identical to input.

Now the adversarial case: arr = ["a:b"]. Encoded: '3:a:b'. Decoder finds the first colon at index 1, parses L = 3, takes the next 3 bytes "a:b". The colon inside "a:b" is part of the payload and never confuses the decoder, because by then the decoder is no longer searching for delimiters.

The empty-string case: arr = [""]. Encoded: '0:'. Decoder finds colon at index 1, L = 0, takes 0 bytes, advances i = 2 = len(blob). Done.

Complexity

  • Time: O(total_chars) for both encode and decode. The decode blob.index(':', i) advances monotonically, so total work over all iterations is linear in the blob.
  • Space: O(total_chars) for the encoded blob and the decoded list.

The naive approach of "comma-join then escape-decode" can be made to run in linear time, but the implementation is harder to keep correct, and adversarial payloads can blow up the state machine. The length-prefix variant is strictly simpler and equally fast.

Common pitfalls

  • Searching for the colon naively instead of using index(':', i): if you scan the whole blob from 0 each time, the decoder is O(total^2). Always pass the starting index.
  • Slicing past the end: when the blob is malformed (length says 10, but only 7 bytes follow), Python will silently truncate the slice and the decoder will keep going. Validate j + 1 + L <= len(blob) if the spec demands robust failure.
  • Mistaking the format for "decimal digits only": payload can contain digits. The decoder must read up to the first colon for the length, not "while characters are digits". Otherwise a payload that starts with a digit (after a previous block) gets re-consumed as a length.
  • Off-by-one in i = j + 1 + L: forgetting the +1 for the colon itself leaves the decoder re-reading the colon. Forgetting the + L re-reads the payload.
  • Encoding the length in any non-base-10 format: the spec says decimal. Hex or any other base will mismatch the grader.
  • Treating bytes vs characters carelessly: in languages where multi-byte UTF-8 characters live, "length" must be defined as bytes (or characters) consistently across encoder and decoder. Choose one and stick to it.
  • Adding a separator between encodings: the spec says no separator. Inserting one breaks decode(encode(arr)) == arr.

Where this shows up in the enterprise

  • gRPC / Protocol Buffers length-prefix framing: every Protobuf message on the wire is length-prefixed.
  • Apache Kafka record batches: a record's key and value are each length-prefixed; readers know exactly how many bytes to consume.
  • AWS Kinesis stream records: same shape.
  • Redis RESP protocol: bulk strings are encoded as $<len>\r\n<bytes>\r\n, the same length-prefix idea with a CRLF terminator.
  • HTTP/2 frame format: each frame has a length, type, and payload; the parser is a length-driven copy, not an escape-driven scan.
  • AMQP / RabbitMQ message frames: length-prefixed body.
  • Apache Avro / Parquet column chunks: each column chunk has a header with a size; readers seek and read precisely that many bytes.
  • Git pack files: each object inside a pack starts with a header that includes the size, then the deflated content; the index lets git jump to any object in O(1).
  • TCP itself: the TCP/IP header includes a length field; the receiver consumes exactly that many bytes for the payload. The pattern is everywhere because it works.

Variants you'll see elsewhere

  • serialize-and-deserialize-binary-tree: encode a tree structure as a string; uses recursive length-prefix or sentinel-based framing.
  • serialize-and-deserialize-bst: same shape, optimised for the BST invariant.
  • csv-parsing with quoted commas: real CSV uses quoting plus quote-doubling; harder than length-prefix but solves the same family of problems for a fixed-format protocol.
  • tokenise-and-parse a custom expression language: same parse-position discipline applied at the token level.
  • varint encoding (Protocol Buffers): length itself is encoded variable-length, shaving bytes for small payloads.
  • network-packet-fragmentation reassembly: length-prefixed chunks plus a reassembly buffer; same primitive.

How parikshan turns this into a learning loop

Read this article, click into practice. The editor runs your code against public tests plus per-session dynamic private tests generated fresh each attempt, so memorising a gist fails. Hints are graduated: free first hint, small penalty for the second, larger for the third. The AI training assistant in exam mode offers Explain, Review, Find Bugs, Add Comments, Optimise, each with a small integrity cost so you spend AI deliberately. In practice mode AI chat is free, the right place to argue with yourself about why delimiters cannot survive adversarial payloads. After the verdict, the dispute flow sends contested results to a human reviewer in one click. Attempt in practice mode first, then redo in exam mode untouched. When you can write encode and decode in five minutes and explain why escaping is the wrong path, the pattern is internalised.

In the AI-integrated workspace

An AI agent will almost certainly reach for a delimiter-and-escape solution first, because it looks like the "Pythonic" answer and several Stack Overflow answers go that way. Your job is to:

  1. Reject delimiter-based solutions on principle. The agent does not know that the production protocol committee made the same decision in the story above; you do. Push for length-prefix.
  2. Verify the encode and decode roundtrip on adversarial inputs. Generated unit tests will exercise letters and digits, not ':', not '\n', not empty strings. Demand all of those.
  3. Catch the 'while character is digit' length-parser, which superficially works but breaks when a payload happens to start with a digit. Insist on blob.index(':', i).
  4. Catch the O(n^2) decoder where blob.index(':') is called without the starting index. The unit tests will pass; the production benchmark will fall off a cliff at 100 KB blobs.
  5. For variants (serialise binary tree, etc.), the agent will sometimes invent a custom delimiter and re-introduce the escape-state-machine bug. Steer to length-prefix every time.

A senior engineer reads "design a serialisation for arbitrary strings" and immediately reaches for length-prefix framing. The candidate who can do that productively designs robust protocols on the first pass. The candidate who cannot ships a comma-escape parser, watches it break the first day a customer's data contains a comma, and rewrites it under deadline pressure. parikshan trains the first kind.

encode-decode-strings