parikshan

System: File Sync

Reading time: ~22 minutes  ·  Prerequisites: Arrays & Hashing, Graphs  ·  Capstone: file-sync design problem (parikshan systems track). Build a client that handles three devices, two offline, rejoining without losing edits.

A file-sync service is a deceptively simple promise to the user: "the bits on your laptop, your phone, and the cloud are the same, eventually, without you thinking about it." Every word in that promise hides a hard distributed systems problem. "The same" means conflict resolution. "Eventually" means a clock and a causality model. "Without you thinking about it" means the system has to be correct under network partitions, half-written files, and devices that disappear for weeks.

This primer walks through how Dropbox, Google Drive, and Microsoft OneDrive actually solve this, and where the patterns from the algorithm bank earn their keep.

The product behind the system

Three reference systems anchor the discussion. Numbers are public as of 2024-2025.

Dropbox stores hundreds of petabytes of data across billions of files for around 700 million registered users, on a custom exabyte-class storage system called Magic Pocket. Every file is split into 4 MB blocks, hashed with SHA-256, and deduplicated globally. If you and your colleague both upload the same PDF, the bytes are stored exactly once.

Google Drive holds the metadata for billions of files inside the same Spanner-backed metadata fabric that powers Gmail and Docs. Drive uses Reed-Solomon erasure coding to bring storage overhead down to roughly 1.5x the raw bytes, well below the 3x of plain triple replication.
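The 1.5x figure falls out of simple arithmetic. A minimal sketch, assuming a Reed-Solomon layout of 6 data shards plus 3 parity shards; the (6, 3) split is an illustrative assumption that happens to give 1.5x, not a documented Drive parameter.

```python
def storage_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of user data for a k+m redundancy scheme."""
    return (data_shards + parity_shards) / data_shards

# Assumed layout: 6 data shards + 3 parity shards (illustrative only).
rs_overhead = storage_overhead(6, 3)

# Triple replication is the degenerate case: 1 data copy + 2 extra copies.
replication_overhead = storage_overhead(1, 2)

print(rs_overhead, replication_overhead)  # 1.5 3.0
```

Under this assumed layout the erasure-coded object survives the loss of any 3 shards, while triple replication survives the loss of 2 copies and costs twice the disk.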

Microsoft OneDrive sits on top of Azure Blob Storage and uses the same chunking trick in its sync client (resumable chunked transfer on Windows traces back to BITS, the Background Intelligent Transfer Service). It serves over 250 million commercial users via Microsoft 365 and is the substrate behind SharePoint Online and Teams file storage.

All three solve the same problem at slightly different scales, and the architectural shape of all three is recognisably the same.

What the requirements actually are

Functional requirements that any serious file-sync service must hit:

  1. Upload, download, rename, move, and delete files of any size, with resume-on-failure for partial uploads.
  2. Sync state across multiple devices per user; a new device must "catch up" to the current cloud state.
  3. Detect and resolve conflicts when two devices edit the same file while offline.
  4. Maintain version history; let the user roll back a file or restore a deleted one.
  5. Share files and folders with other users and external links, with permission scoping.
  6. Show progress and "this file is in the cloud" UI affordances (the Dropbox green check, the OneDrive cloud icon).

Non-functional requirements that get harder as you scale:

  1. Bandwidth efficiency: never re-upload a byte that already exists on the server.
  2. Storage efficiency: deduplicate across users (one viral PDF must not cost 10 million copies).
  3. Latency: a change on one device should appear on another in under 5 seconds on a fast network.
  4. Durability: 11 nines (the de facto S3 promise) for stored bytes.
  5. End-to-end encryption: optional, but expected for premium tiers (Proton Drive made it mandatory).
  6. Cross-platform: Windows, macOS, Linux, Android, iOS, web, all sharing the same protocol.

The architect's framing

Six components, every one of them earning its slot:

  1. Client (sync engine): chunks files, computes hashes, watches the filesystem, applies remote changes, resolves conflicts. This is the hardest piece of code in the system, harder than the cloud side.
  2. Block service: a content-addressed key-value store. The key is the SHA-256 of a 4 MB chunk; the value is the (compressed, encrypted) chunk itself.
  3. Metadata service: the authoritative database of "what files exist, at what version, owned by whom, made of which block hashes". Sharded heavily; the bottleneck of the whole system.
  4. Notification service: a long-poll or WebSocket fan-out that tells every connected device "your account changed, fetch the new metadata". Dropbox calls this the "Notification Server"; OneDrive uses Azure SignalR.
  5. Edge / CDN: blocks are served from a CDN, often Cloudflare or the provider's own edge, so a download of a popular file does not hit origin storage.
  6. Conflict resolver: usually lives in the client, but the metadata service arbitrates. Strategy is almost always "last-writer-wins with a conflict copy preserved" rather than true merge.

Dataflow for a single-file edit:

+------------+         +-----------------+         +----------------+
|  Client A  | --(1)-->|   Block Service | <------>| Magic Pocket / |
|  (laptop)  |         |  (chunk upload) |         |  S3 / Azure    |
+------------+         +-----------------+         +----------------+
      |                                                     ^
      |  (2) metadata commit                                |
      v                                                     |
+-------------------+      +-----------------+              |
|  Metadata Service |<---->|  Sharded MySQL/ |              |
|   (file -> blocks)|      |  Spanner cluster|              |
+-------------------+      +-----------------+              |
      |                                                     |
      |  (3) notify other devices                           |
      v                                                     |
+-------------------+     +----------+      +----------+    |
| Notification Hub  |---->| Client B |----->|  CDN /   |----+
| (long-poll / WS)  |     | (phone)  |<-----| edge     |
+-------------------+     +----------+      +----------+
                              (4) fetch metadata delta, then
                                  fetch only the changed blocks

The asymmetry to notice: writes are heavy and synchronous (chunk, hash, upload, commit, fan out), but reads on the receiving device are deltas and almost free. That asymmetry is what makes the whole architecture viable at billions-of-files scale.

The trade-offs we will name

Block size: fixed 4 MB vs variable-length chunks. Fixed 4 MB blocks (Dropbox's choice) are simple and make hashing parallelisable. They break down on shifted content: prepend one byte to a large file and every block hash changes, so the whole file re-uploads. Variable-length content-defined chunking (restic, Borg backup) uses a rolling hash to find natural chunk boundaries that survive shifts. rsync is the canonical reference for the rolling-hash idea, although it slides a checksum across the file to match fixed-size blocks at any offset rather than cutting variable-length chunks; restic and Borg took content-defined chunking to production. Dropbox stayed fixed because the use case is "user saves a Word doc", not "user prepends one byte to a 10 GB log". Pick the chunker to match the dominant edit pattern.
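The shift problem is easy to reproduce. A minimal sketch, assuming a toy 16-byte block size so the numbers stay readable (the text's real block size is 4 MB); fixed_block_hashes is a hypothetical helper written for this demo.

```python
from hashlib import sha256

def fixed_block_hashes(data: bytes, block_size: int) -> list[str]:
    """Hash consecutive fixed-size blocks, as a fixed-size chunker would."""
    return [sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

BLOCK = 16                       # toy block size for the demo
original = bytes(range(256))     # 256 distinct bytes -> 16 blocks
appended = original + b"!"       # edit at the end of the file
prepended = b"!" + original      # edit at the start of the file

known = set(fixed_block_hashes(original, BLOCK))
reused_after_append = sum(h in known for h in fixed_block_hashes(appended, BLOCK))
reused_after_prepend = sum(h in known for h in fixed_block_hashes(prepended, BLOCK))

print(reused_after_append, reused_after_prepend)  # 16 0
```

Appending leaves all 16 existing blocks reusable and uploads one new block; prepending a single byte shifts every boundary, so nothing is reused.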

Conflict resolution: last-writer-wins vs operational transforms vs CRDTs. Dropbox and OneDrive do last-writer-wins and preserve the losing version as a separate file named along the lines of filename (conflicted copy YYYY-MM-DD).docx. Google Docs runs Operational Transformation server-side so two users can type the same sentence and see real merges. CRDTs (used by Figma, Linear, the Yjs ecosystem) are the modern alternative: every edit commutes, no central arbiter needed. The trade-off is that LWW is simple and works on opaque file bytes; OT and CRDTs require the system to understand the file format. You only build OT/CRDT when the file format is a structured document you control.

Strong vs eventual consistency on metadata. Google Drive uses Spanner, which gives external consistency at the cost of cross-region commit latency. Dropbox sharded MySQL and accepted eventual consistency within a small window in exchange for higher throughput and lower cost. The bet was right: users do not notice 200 ms of metadata lag, and Dropbox stayed cheaper to run than Drive per byte for years.

Push vs pull notifications. Long-polling (Dropbox's original choice) is cheap and works through every firewall, but burns one connection per device per account. WebSockets reduce overhead but break in some corporate networks; Apple Push and FCM solve mobile but not desktop. Real systems use a mix: push where it works, fall back to long-poll where it does not.

End-to-end encryption vs deduplication. If every user encrypts their blocks with their own key, two users uploading the same PDF produce different ciphertexts and dedup fails. Solutions are awkward (convergent encryption, which has known attacks against low-entropy files like "all-zeros.bin"). E2EE for files is hard precisely because dedup pays for the whole business model. Apple's iCloud Advanced Data Protection chose E2EE and sacrificed dedup; Proton Drive made the same call.
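The tension can be shown in a few lines. This is a sketch only: toy_encrypt is a hypothetical XOR-keystream cipher built from SHA-256 purely so the example is self-contained, and it must not be used for real encryption. It shows per-user keys defeating dedup, convergent keys restoring it, and the guess-and-confirm weakness that comes with convergent keys.

```python
import os
from hashlib import sha256

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """XOR with a SHA-256 counter keystream. Toy cipher, illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))

pdf = b"%PDF-1.7 the same viral document"

# Per-user random keys: identical plaintext, different ciphertext, no dedup.
alice_ct = toy_encrypt(os.urandom(32), pdf)
bob_ct = toy_encrypt(os.urandom(32), pdf)

# Convergent encryption: the key is derived from the content itself,
# so both users produce the same ciphertext and the server can dedup it.
alice_conv = toy_encrypt(sha256(pdf).digest(), pdf)
bob_conv = toy_encrypt(sha256(pdf).digest(), pdf)

# The weakness: anyone who can guess the plaintext can confirm the guess.
guess = b"%PDF-1.7 the same viral document"
attacker_confirms = toy_encrypt(sha256(guess).digest(), guess) == alice_conv

print(alice_ct == bob_ct, alice_conv == bob_conv, attacker_confirms)
```

The same property that lets the server deduplicate (equal plaintext gives equal ciphertext) is what lets an attacker test candidate files, which is why low-entropy content is the weak spot.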

Where the algorithms from this bank actually appear

Hashing (Arrays & Hashing primer). Every block is keyed by SHA-256. Every metadata lookup is a hash-map probe. The "remember what you've seen" pattern is literally the foundation of dedup: the block service is just a hash map keyed by content hash.
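A minimal in-memory sketch of that idea. BlockStore and its method names are hypothetical, chosen to mirror the which_missing / put_block vocabulary used later in this primer; a real block service adds persistence, compression, and encryption.

```python
from hashlib import sha256

class BlockStore:
    """In-memory content-addressed store: key = SHA-256 of the block bytes."""

    def __init__(self):
        self._blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = sha256(data).hexdigest()
        self._blocks.setdefault(key, data)   # a repeat upload stores nothing new
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Integrity check comes for free: re-hash and compare with the key.
        if sha256(data).hexdigest() != key:
            raise ValueError("block corrupted")
        return data

    def which_missing(self, keys) -> set[str]:
        """Subset of keys the store does not hold yet."""
        return {k for k in keys if k not in self._blocks}

    def __len__(self) -> int:
        return len(self._blocks)

store = BlockStore()
k1 = store.put(b"same PDF bytes")   # uploaded by one user
k2 = store.put(b"same PDF bytes")   # uploaded again by a colleague
print(k1 == k2, len(store))         # True 1
```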

Graphs (the Graphs primer). A user's filesystem is a tree, but the global view (one file shared with many users, many devices syncing the same workspace) is a DAG. Permission propagation through a shared folder is a graph traversal; "show me everyone who can read this file" is reachability. Google Drive's sharing model is explicitly graph-shaped.
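The reachability query can be sketched as a breadth-first walk up the containment graph. The data model here (a parents map and a grants map) is a hypothetical simplification; real sharing systems also handle groups, link sharing, and deny rules.

```python
from collections import deque

def readers(node, parents, grants):
    """Everyone who can read `node`: BFS up through every containing folder."""
    seen, queue, users = {node}, deque([node]), set()
    while queue:
        current = queue.popleft()
        users |= grants.get(current, set())       # grants inherited from here
        for parent in parents.get(current, []):
            if parent not in seen:                # a DAG can reach a node twice
                seen.add(parent)
                queue.append(parent)
    return users

# Hypothetical sharing graph: report.pdf sits in two folders (a DAG, not a tree).
parents = {"report.pdf": ["team/", "alice-private/"], "team/": ["company/"]}
grants = {
    "company/": {"carol"},
    "team/": {"alice", "bob"},
    "alice-private/": {"alice"},
}

who = readers("report.pdf", parents, grants)
print(sorted(who))  # ['alice', 'bob', 'carol']
```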

Merkle trees (not in the bank, but worth naming). A Merkle tree of block hashes lets a client and server compare two versions of a large file by walking the tree top-down, descending only into subtrees whose root hashes differ. Locating a single changed block among n costs O(log n) hash comparisons, and k changed blocks cost O(k log n), instead of comparing all n block hashes. Git, Dropbox, IPFS, and BitTorrent all use hash trees of this kind. The pattern is the same as "binary search but on a hash tree."
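A minimal sketch of the top-down comparison, assuming both versions have the same number of blocks. build_levels and diff_leaves are hypothetical helper names; the visited list exists only to count how many nodes the walk touches.

```python
from hashlib import sha256

def build_levels(leaf_hashes):
    """Return tree levels bottom-up; levels[-1][0] is the root hash."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Pair neighbours; an unpaired last node is hashed on its own.
        levels.append([sha256("".join(prev[i:i + 2]).encode()).hexdigest()
                       for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(a, b, level=None, index=0, visited=None):
    """Indices of differing leaves, descending only into mismatched subtrees."""
    if level is None:
        level = len(a) - 1                       # start at the root
    if visited is not None:
        visited.append((level, index))
    if a[level][index] == b[level][index]:
        return []                                # identical subtree: prune
    if level == 0:
        return [index]                           # a differing leaf
    out = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[level - 1]):
            out += diff_leaves(a, b, level - 1, child, visited)
    return out

old_blocks = [f"block-{i}".encode() for i in range(1024)]
new_blocks = list(old_blocks)
new_blocks[700] = b"edited"

old_tree = build_levels([sha256(b).hexdigest() for b in old_blocks])
new_tree = build_levels([sha256(b).hexdigest() for b in new_blocks])

visited = []
changed = diff_leaves(old_tree, new_tree, visited=visited)
print(changed, len(visited))  # [700] 21
```

With 1024 blocks the walk inspects 21 nodes (the root plus two per level for ten levels) rather than all 1024 leaf hashes.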

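Rolling hash (sliding window pattern, primer 03). Content-defined chunking uses a Rabin-Karp-style rolling hash to detect chunk boundaries. The algorithm walks the file byte by byte, maintaining a hash of a small trailing window (64 bytes, say), and cuts a chunk every time the hash matches a target pattern, such as its low bits all being zero. Because each boundary depends only on the bytes inside the window, inserting data early in the file leaves later boundaries where they were. rsync's rolling checksum is the ancestor of this trick, and the whole thing is a direct application of the sliding-window pattern.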
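A minimal sketch of content-defined chunking with a polynomial rolling hash. The window size, base, modulus, and "hash divisible by 256" cut rule are illustrative assumptions; production chunkers also enforce minimum and maximum chunk sizes, which this sketch omits.

```python
import random

def cdc_chunks(data: bytes, window: int = 64, divisor: int = 256,
               base: int = 257, mod: int = 1_000_000_007) -> list[bytes]:
    """Cut wherever the rolling hash of the last `window` bytes is 0 mod `divisor`."""
    chunks, start, h = [], 0, 0
    top = pow(base, window - 1, mod)               # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # slide: drop the oldest byte
        h = (h * base + byte) % mod                 # slide: add the newest byte
        if i + 1 >= window and h % divisor == 0:    # boundary depends on window content only
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing remainder
    return chunks

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"X" + original                           # prepend a single byte

before = cdc_chunks(original)
after = cdc_chunks(shifted)
known = set(before)
reused = sum(chunk in known for chunk in after)
print(len(before), len(after), reused)
```

Only the chunk containing the prepended byte changes; every later chunk is byte-identical and would deduplicate against the blocks already stored.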

Intervals (primer 13). Version history is a sequence of (timestamp, snapshot) intervals; "restore the file as of last Tuesday" is an interval lookup. Dropbox's Time Machine-style restore is an interval query against the version log.
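The "as of" lookup reduces to a binary search over sorted timestamps. A minimal sketch, assuming the version log is a list of (timestamp, snapshot_id) pairs sorted by timestamp; version_as_of is a hypothetical helper name.

```python
from bisect import bisect_right

def version_as_of(history, when):
    """Snapshot that was current at time `when`, or None if the file did not exist yet."""
    timestamps = [ts for ts, _ in history]
    pos = bisect_right(timestamps, when)    # count of versions with ts <= when
    return history[pos - 1][1] if pos else None

# Each version is valid from its own timestamp until the next one.
history = [(100, "v1"), (250, "v2"), (900, "v3")]

print(version_as_of(history, 300))   # v2
print(version_as_of(history, 99))    # None
```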

Sketch implementation

A minimal client-side sync engine, ~50 lines of pseudo-code:

from hashlib import sha256

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB

# Stand-ins this sketch does not define: server, filesystem_watcher,
# notification_channel, local_block_cache (a dict of hash -> bytes), now_ms,
# encrypt/compress/decrypt/decompress, load_local_metadata,
# save_local_metadata, rename_to_conflict_copy, and reassemble.

def chunk_and_hash(file_path):
    """Split file into 4 MB blocks, return list of (hash, bytes)."""
    blocks = []
    with open(file_path, 'rb') as f:
        while chunk := f.read(CHUNK_SIZE):
            h = sha256(chunk).hexdigest()
            blocks.append((h, chunk))
    return blocks

def upload_file(file_path, server):
    blocks = chunk_and_hash(file_path)
    file_hashes = [h for h, _ in blocks]

    # Ask server which blocks it does not have yet; a set gives O(1) lookups.
    missing = set(server.which_missing(file_hashes))

    # Upload only the missing ones, and each distinct block at most once.
    for h, data in blocks:
        if h in missing:
            server.put_block(h, encrypt(compress(data)))
            missing.discard(h)

    # Commit metadata: filename -> ordered list of block hashes.
    # now_ms() is a wall-clock placeholder; see the clock-skew caveat
    # under "What breaks at scale".
    server.commit_metadata(file_path, file_hashes, version=now_ms())

def apply_remote_change(local_path, remote_meta):
    """Server says this file is now at version v with these block hashes."""
    local_meta = load_local_metadata(local_path)
    if local_meta.version > remote_meta.version:
        # Local has newer edits; keep them as a conflict copy before overwriting.
        rename_to_conflict_copy(local_path)

    # Fetch only the blocks that are not already cached locally.
    for h in set(remote_meta.block_hashes):
        if h not in local_block_cache:
            local_block_cache[h] = decompress(decrypt(server.get_block(h)))

    # Reassemble from the cache so every hash, cached or fresh, has its bytes.
    blocks = [local_block_cache[h] for h in remote_meta.block_hashes]
    reassemble(local_path, remote_meta.block_hashes, blocks)
    save_local_metadata(local_path, remote_meta)

def watch_loop():
    while True:
        events = filesystem_watcher.poll()  # inotify / FSEvents / ReadDirectoryChangesW
        for ev in events:
            if ev.kind == 'modified':
                upload_file(ev.path, server)
        # Blocks for up to 30 s waiting for the server to report changes.
        remote = notification_channel.poll(timeout=30)
        for change in remote:
            apply_remote_change(change.path, change.meta)

What the sketch elides: retry with backoff on every network call, partial-upload resumption (Dropbox's public API uses an upload session ID plus a byte offset), permission checks, the conflict UI, encryption key management, the file-system watcher's reliability (Windows is famously flaky here), suppressing the watcher events caused by the engine's own writes, and the cold-start full-scan that runs on first install. Each of those is a multi-engineer quarter of work.
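The first elided item, retry with backoff, can be sketched as a small wrapper. This is one common shape (exponential backoff with full jitter), not a description of any particular vendor's client; with_backoff and flaky_put_block are hypothetical names, and the sleep function is injectable so the demo does not actually wait.

```python
import random
import time

def with_backoff(call, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry `call` on ConnectionError with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                                   # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))             # jitter avoids synchronized retries

# Simulated network call that drops twice, then succeeds.
calls = {"n": 0}

def flaky_put_block():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated drop")
    return "stored"

waits = []                                              # record delays instead of sleeping
result = with_backoff(flaky_put_block, sleep=waits.append)
print(result, calls["n"], len(waits))                   # stored 3 2
```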

What breaks at scale

Tier 0 (single user, single VM): nothing breaks. A bash script that rsyncs every minute is a complete solution. The whole architecture above is overkill.

Tier 1 (thousands of users): the metadata DB becomes the hot spot. One table called files with a primary key of (user_id, path) is fine until two users in the same shared folder hammer it. First mitigation: read replicas. Second mitigation: cache layer (memcached / Redis) in front of metadata.

Tier 2 (millions of users): shard the metadata DB by user_id. Dropbox famously did this with their "Edgestore" abstraction. Shared folders now cross shards, which is where the system gets ugly: a shared folder's ACL must be consistent across all participants. Solution is usually a separate "sharing" service with eventual consistency.

Tier 3 (hundreds of millions of users): the block storage is the bottleneck. Object stores like S3 have rate limits per prefix; you have to spread block hashes across prefixes so traffic does not concentrate. This is exactly when Dropbox built Magic Pocket: S3 was costing them more than building their own.

Tier 4 (billions of files, global): the notification fan-out becomes the bottleneck. A celebrity Dropbox account with 10 million viewers cannot wake up 10 million clients on every change without a thundering herd. Solution is to coalesce notifications (only send "something changed in folder X" not "file Y at byte 1234 changed") and have clients pull deltas on a backoff. WhatsApp's multi-device sync solves the same problem with the same coalescing trick.
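Coalescing can be sketched as collapsing a burst of per-file events into one hint per folder. The event shape (account, path, cursor) and the coalesce helper are assumptions made for illustration; the cursor stands for whatever monotonically increasing change marker the metadata service hands out.

```python
import posixpath

def coalesce(events):
    """Collapse per-file events into one 'folder changed' hint per (account, folder)."""
    latest = {}
    for account, path, cursor in events:
        key = (account, posixpath.dirname(path))
        # Keep only the newest cursor; clients pull the full delta up to it.
        latest[key] = max(cursor, latest.get(key, cursor))
    return [(account, folder, cursor)
            for (account, folder), cursor in sorted(latest.items())]

burst = [
    ("acct1", "/photos/a.jpg", 101),
    ("acct1", "/photos/b.jpg", 102),
    ("acct1", "/photos/a.jpg", 103),
    ("acct1", "/docs/x.txt", 104),
]

hints = coalesce(burst)
print(hints)  # [('acct1', '/docs', 104), ('acct1', '/photos', 103)]
```

Four events become two hints, and the notification payload no longer grows with the number of files touched.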

Permanent failure mode at every tier: clock skew between devices. If laptop A is 30 seconds ahead of laptop B and both edit the same file offline, the server's "last writer wins by timestamp" picks the wrong winner. Real systems use a hybrid logical clock or a version vector instead of wall-clock time. Spanner uses TrueTime; Dropbox uses an internal logical clock per file.
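A version vector makes the "who wrote last" question independent of wall clocks. A minimal sketch, assuming each device increments its own counter on every local edit; compare and merge are hypothetical helper names.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two version vectors: 'equal', 'before', 'after', or 'concurrent'."""
    devices = a.keys() | b.keys()
    a_le_b = all(a.get(d, 0) <= b.get(d, 0) for d in devices)
    b_le_a = all(b.get(d, 0) <= a.get(d, 0) for d in devices)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"        # b has seen everything a has, and more
    if b_le_a:
        return "after"
    return "concurrent"        # neither has seen the other's edit: a real conflict

def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum: the vector after both histories are known."""
    return {d: max(a.get(d, 0), b.get(d, 0)) for d in a.keys() | b.keys()}

base = {"laptop": 1}
laptop_edit = {"laptop": 2}                 # laptop edits offline
phone_edit = {"laptop": 1, "phone": 1}      # phone edits the same base offline

print(compare(base, laptop_edit))           # before
print(compare(laptop_edit, phone_edit))     # concurrent
```

A 30-second clock error on either device cannot change these answers, because no timestamp is consulted; the "concurrent" case is what triggers the conflict-copy path.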

What an interview / staff-engineer review will ask

Q1: How do you handle a 50 GB file? A: Resumable chunked upload. Client computes block hashes in parallel, server tells client which blocks it does not have (deduplication kicks in if any are global duplicates), client uploads the missing ones inside a resumable session (a session token plus a byte or block offset) so a dropped connection resumes from the last successful block. The metadata commit happens once at the end, atomically.
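The resume logic can be sketched against an in-memory stand-in for the server. UploadSession, resumable_upload, and flaky_transport are hypothetical names for this demo; a real client would also back off between retries and skip blocks the server already holds.

```python
from hashlib import sha256

class UploadSession:
    """Stand-in for server-side session state: blocks accepted so far, in order."""

    def __init__(self, session_id, expected_hashes):
        self.session_id = session_id
        self.expected = expected_hashes
        self.stored = []

    def append(self, index, data):
        if index != len(self.stored):
            raise ValueError("out-of-order block")
        if sha256(data).hexdigest() != self.expected[index]:
            raise ValueError("hash mismatch")
        self.stored.append(data)

    def commit(self):
        """Single atomic commit once every block has landed."""
        if len(self.stored) != len(self.expected):
            raise ValueError("upload incomplete")
        return b"".join(self.stored)

def resumable_upload(blocks, session, transport):
    """Send blocks in order, resuming from the server's offset after a drop."""
    sent = 0
    while len(session.stored) < len(blocks):
        index = len(session.stored)          # ask the session where to resume
        try:
            transport(session, index, blocks[index])
            sent += 1
        except ConnectionError:
            continue                         # retry the same block, not the whole file
    return sent

blocks = [f"block-{i}".encode() * 4 for i in range(5)]
session = UploadSession("sess-1", [sha256(b).hexdigest() for b in blocks])

drops = {2}                                  # simulate one dropped connection at block 2

def flaky_transport(sess, index, data):
    if index in drops:
        drops.discard(index)
        raise ConnectionError("simulated drop")
    sess.append(index, data)

sent = resumable_upload(blocks, session, flaky_transport)
assembled = session.commit()
print(sent, assembled == b"".join(blocks))   # 5 True
```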

Q2: Two users edit the same file offline; how do you resolve? A: Detect the conflict with a version vector, not wall-clock time, then apply last-writer-wins with a deterministic tie-break. The "loser" version is preserved as a sibling file named Document (Alice's conflicted copy 2026-03-12).docx. The application layer (Word, Google Docs, Figma) can implement real merge on top if the format permits. Do not pretend you can merge opaque binary blobs.

Q3: A folder is shared with 100,000 users. How do you propagate a delete? A: You do not fan out 100,000 notifications synchronously. The delete commits to the metadata service, and each user's client pulls deltas on its next poll (push notifies only that "something changed", not what). For very large shared folders the system marks them as "polled" rather than "pushed" and accepts a 30-second lag.

Q4: How do you prevent a malicious user from filling your storage with random 4 MB blocks? A: Rate-limit uploads per user, charge by quota not by blocks, require an authenticated metadata commit before a block is permanently kept (orphan blocks are GC'd after a few hours), and detect bots by upload entropy.

Q5: Why content-addressed storage (hash as key) instead of UUID per block? A: Three reasons. (1) Automatic deduplication: same content = same key, store once. (2) Integrity: downloading a block and hashing it tells you it is correct, no separate checksum needed. (3) Cacheability: a CDN can cache a content-addressed block forever because the content can never change under that key. The cost is that you cannot "update" a block; you have to write a new one and update metadata.

In the AI-integrated workspace

File sync is the foundation under every "your AI agent can see your files" product. Cursor, Claude Code, GitHub Copilot Workspace, and Continue all assume that the file you are editing is the file the agent sees. When sync is slow or wrong, the agent reads stale bytes and confidently edits the wrong version. The class of bug is hard to detect because the agent's output looks plausible; the user has to notice that the line numbers do not match.

Two AI-era patterns the original Dropbox protocol did not anticipate:

First, structured edits from agents arrive as patches, not full files. The next generation of sync engines will treat agent patches as first-class operations, applying them server-side before fan-out. This is closer to OT than to LWW. Microsoft's "Loop" and Notion's AI block editing already work this way.

Second, AI assistants need read-only consistent snapshots. An agent that calls "list every Python file in this project" should see a coherent snapshot, not a half-synced state. Sync engines are starting to expose "give me the version of this folder as of timestamp T" APIs explicitly to satisfy this.

For agent-generated patches, the integrity question is the same one parikshan asks of student solutions: did this code come from a human, an AI, or a sync collision? File sync metadata, if logged thoroughly, is forensic gold for answering it later.

Variants and adjacent systems

Backup services (Backblaze, Arq, Borg): same chunking, same dedup, weaker latency requirements, much longer retention. Backblaze stores 5+ exabytes on the same content-addressed pattern.

Source control (Git, Mercurial, Perforce): Git is a sync engine with explicit version control on top. Every commit is a Merkle tree of blob hashes. Same primitives, different UX.

CDN file distribution (Cloudflare R2, Fastly Compute): one-way sync from origin to edge, no conflict resolution needed. Simpler problem; same hashing and chunking still apply.

Mobile photo sync (iCloud Photos, Google Photos): media-specific optimisations like perceptual hashing for dedup of near-duplicate images, plus aggressive client-side compression. Same architecture, plus an ML layer.

Distributed filesystems (NFS, GlusterFS, Ceph, JuiceFS): a POSIX-compatible interface on top of networked or object storage. They sacrifice the offline-first model for filesystem semantics. Different trade-off; same building blocks.

End-to-end encrypted variants (Proton Drive, Tresorit, Sync.com): kill dedup across users in exchange for the server never seeing plaintext. Smaller user base, premium pricing pays for the lost storage savings.

The shared insight across the variants: file sync is the most concrete instance of "distributed state with conflict resolution" most engineers will ever build. Once you can defend the design above, the patterns transfer to every other replicated-state system you will touch.