clone-graph

Difficulty: Medium · Topic: Graphs · Practice it on parikshan: /practice/clone-graph/start

The story

You join a platform team that maintains an internal service-dependency tracker. Every microservice is a node, every call relationship is an edge. The platform team wants to give product teams a sandboxed copy of the graph so they can simulate changes ("if I remove service X, what breaks?") without touching the production map. The sandbox must be a deep copy: no shared references, no leakage, no possibility of a sandbox mutation surfacing in the live model.

This is clone-graph. The same algorithm clones Datadog service maps, AWS CloudFormation dependency trees, npm/pip dependency snapshots before a rollback, and Slack channel membership graphs handed to a billing simulator. Wherever a graph leaves one process and enters another, this code runs.

What's actually being asked

You are given a directed graph as n adjacency lines. Build a deep copy: a brand-new set of Node objects with brand-new edge lists pointing only at the new nodes. Then serialise the clone in sorted adjacency form.

The clone is "deep" in the strict sense: not a single Node reference from the original may appear anywhere in the output graph.

The naive approach

Build the new graph by walking the original recursively. Every time you reach an unvisited original, allocate a clone and recurse on its neighbours. The temptation is to allocate fresh on every encounter. That is the bug: a single original Node reached twice (very common in a cycle or a shared neighbour) is then represented by two separate clone objects in the output, the edges become inconsistent, and the deep-copy invariant is broken.

The fix is one data structure. The memo IS the cloning.

The key insight

Maintain a map original -> clone. The first time you encounter an original u, allocate its clone and record it in the map. On every subsequent encounter (from any neighbour walking inward), look up the map; do not allocate. For every edge u -> v in the original, append cloned[v] to cloned[u].neighbors. The map is the bookkeeping that turns a tree-recursion-style "walk into a fresh allocation" into a graph traversal with shared identity.

The map plays three roles at once:

De-duplication: cycles and shared neighbours collapse to a single clone.
Visited set: the standard graph-traversal "have I seen this node?" check.
Lookup table: when copying edges, cloned[v] IS the answer to "what is v's clone?".

That triple role is why "the memo IS the cloning" is the right slogan. Without the map you would need a separate visited set, a separate clone list, and a separate original-id -> clone mapping. With the map, you have one structure that does all three.

Step-by-step approach

Parse input: build n Node objects as the originals; populate each one's neighbors from its adjacency line.
Build the clone:
- Initialise cloned: dict[Node, Node] = {}.
- For each original start in index order: if start is already in cloned, skip. Otherwise allocate cloned[start] = Node(start.val) and BFS from start.
- Inside the BFS: pop u, fetch cu = cloned[u]. For each neighbour v of u: if v is not in cloned, allocate cloned[v] = Node(v.val) and enqueue v. Then unconditionally append cloned[v] to cu.neighbors. The unconditional append is what preserves parallel edges and self-loops.
Serialise: for each label i, look up its clone, sort the clone's neighbour values, print space-joined.

Worked example

Input (a triangle 0 -> 1 -> 2 -> 0):

Allocate originals O0, O1, O2. Add edges from the adjacency lines.
BFS from O0: cloned[O0] = C0. Pop O0. Neighbour O1: allocate C1, enqueue, append C1 to C0.neighbors. Pop O1. Neighbour O2: allocate C2, enqueue, append C2 to C1.neighbors. Pop O2. Neighbour O0: already in cloned, do not enqueue, but DO append C0 to C2.neighbors. BFS empties.
Output: 3\n1\n2\n0\n.

Note the third step: O0 is already cloned but we still appended C0 to C2.neighbors. That preserves the cycle. Skip that append and the clone has 2 edges instead of 3.

Complexity

Time: O(n + E). Each original is allocated once, each edge is walked once.
Space: O(n + E) for the clone, plus O(n) for the memo and BFS queue.
Sorting each clone's adjacency at output time: O(E log E) worst case.

Common pitfalls

Allocating clones inside the per-neighbour loop without a memo: produces duplicate clones for shared neighbours. The map is non-negotiable.
Gating the append on the seen-check: only the allocation should be gated. The edge append must fire on every neighbour visit, otherwise parallel edges and back-edges in cycles are lost.
Single BFS from node 0 only: misses disconnected components. Wrap the BFS in an outer scan over all n labels.
Self-loops via cloned[u].neighbors.append(cloned[u]): works automatically as long as cloned[u] is allocated before you iterate u's neighbours. Allocate-then-walk, never walk-then-allocate.
Printing the input verbatim: the spec requires sorted output. Input may be unsorted on purpose. A "cat stdin" cheese fails on the unsorted-input test.

Where this shows up in the enterprise

Datadog service maps: capturing a current snapshot of inter-service calls for an incident post-mortem requires a deep copy so subsequent traffic does not mutate the snapshot.
AWS CloudFormation dependency graphs: rollback simulations deep-clone the stack DAG, mutate the clone, render a diff for an operator to approve.
Bazel and Gradle build DAGs: incremental-build planners clone the current target graph, project a hypothetical change onto the clone, compute the affected set.
npm and pip resolvers: snapshot the lockfile graph before attempting a dependency upgrade; restore the snapshot on rollback.
kubernetes Pod scheduling: a scheduler simulator clones the placement graph before running each "what if" experiment.
Slack channel membership: billing simulators deep-clone the channel-member graph before applying retention policies.
Twitter/X feed graph: A/B test orchestrators clone the follow graph, mutate a fraction of edges, and re-run the recommendation algorithm on the clone.

Anywhere a graph traverses a process boundary, leaves a serialisation layer, or feeds a simulator, this algorithm runs. It is the workhorse of "give me a sandboxed copy of the live system".

Variants you'll see elsewhere

copy-list-with-random-pointer: the linked-list cousin; the same memo trick.
serialize-and-deserialize-binary-tree: write the graph to a string and parse it back; deep-copy as a side effect.
copy-random-binary-tree: the binary tree generalisation of the linked-list problem.
clone-n-ary-tree: a strict subset of clone-graph (no cycles), and the same code works unchanged.

The memo-keyed-on-original is the unifying pattern.

How parikshan turns this into a learning loop

clone-graph is the first problem in this cluster where the AI training assistant adds more than it costs in integrity points. Recommended flow:

Practice mode, no AI: write a first cut. You will almost certainly produce duplicate clones for shared neighbours on your first try.
Run the unsorted-input test: it will fail if you printed the input verbatim. That failure teaches the deep-copy vs. round-trip distinction.
AI training: Find Bugs on your buggy version. The agent's response should mention the missing memo or the missing sort.
Exam mode: redo from scratch under the clock. The integrity score will tell you whether you internalised the memo pattern or just memorised your fix.

The memo trick will transfer directly to the next problem in your career that involves "deep copy with cycles", which appears more often than you would expect.

In the AI-integrated workspace

In production, you will direct an agent to deep-clone a graph many times. The agent's failure modes are predictable:

It will use Python's copy.deepcopy: which is correct, slow, and opaque. For typed Node objects with controlled invariants, an explicit BFS with a memo is more auditable, faster, and easier to extend (e.g., "also re-key edges by a new ID scheme during the copy").
It will skip the outer scan over disconnected components: a single BFS from node 0 misses the rest. Ask the agent explicitly: "what if the graph is disconnected?".
It will gate the edge append on the seen-check: silently dropping parallel and back edges. Inspect the inner loop.
It will sort the input lines instead of sorting the clone's adjacency: technically passes the public test, then fails the unsorted-input private test.

A reviewer who can articulate "memo is the cloning, allocation is gated, edge append is unconditional, sort the clone's adjacency at print time, outer scan over all labels" in one breath will catch every one of these failures in 30 seconds. That paragraph is the skill. parikshan exists to compress it into your fingers.