parikshan

Case: YouTube Built Vitess to Shard MySQL Without Throwing MySQL Away

Era: 2010 (creation), 2018 (CNCF graduation timeline begins), 2019 (CNCF graduated)  ·  Author / source: Vitess project documentation (vitess.io), CNCF announcements, Wikipedia "Vitess"  ·  Read alongside: sharding, MySQL operational realities, database proxies, online resharding

The situation

YouTube in 2008-2010 was the canonical "MySQL won't scale to this" workload. Video uploads, view counts, comments, ratings, channels, subscriptions, all classic OLTP data, growing at a rate that no single MySQL primary could absorb. Per Vitess's documentation, the team faced two specific scaling walls:

  1. Write capacity. "Write traffic became too high for the master database to handle, requiring YouTube to shard data."
  2. Read capacity. Even after spreading reads across replicas, "read-only traffic was still high enough to overload the replica database."

The pragmatic choice in 2010 was to shard. The hard part was: shard how, and how do you not break the rest of the application while you do it?

Most teams that sharded MySQL in that era did it inside the application. The application code knew which shard to talk to for which key. This works, but it concentrates database knowledge in every service that touches the data, and resharding becomes a coordinated multi-service deployment. YouTube's initial approach, per the documentation, was that "the application layer was modified so that before executing any database operation, the code could identify the right database shard to receive that particular query." Workable, expensive over time.

The deeper problem: MySQL's connection model assumed a small number of long-lived connections from a small number of application hosts. YouTube needed thousands of application hosts talking to hundreds of MySQL shards, and naive connection multiplication crashed MySQL well before the disk did.

The options on the table

  1. Application-level sharding (status quo). Each service knows its shards. Cheap to start, expensive to evolve.
  2. Move to a different database (Cassandra, HBase, Spanner-style). Throws away MySQL skills and integration, introduces a new operational discipline at peak growth.
  3. A proxy in front of MySQL that hides sharding from the application. The application talks to the proxy as if it were a single MySQL; the proxy routes queries, manages connections, handles resharding.
  4. Application-level sharding plus a connection pooler. Solves the connection storm but not the application-coupling problem.
  5. Custom forked MySQL with built-in sharding. Maximum control, maximum maintenance burden.

What they chose, and why

Option 3. Vitess. The first version began at YouTube in 2010. Per the Vitess project's own history, the proxy approach "introduces a proxy between the application and the database to route and manage database interactions." From the application's perspective, Vitess looks like a MySQL endpoint; from MySQL's perspective, Vitess looks like a well-behaved client with bounded connection counts.

The reasoning for choosing this design over the alternatives:

  • Preserve MySQL. The database underneath is still real MySQL. All the operational knowledge, tools, backups, replication, and edge-case bug fixes that MySQL has accumulated over decades remain usable.
  • Hide sharding from the application. Applications send queries that look like single-database SQL. Vitess routes them based on the vindex (sharding function) configured per table. When you reshard, you change the Vitess configuration; you do not redeploy the application.
  • Connection pooling. Vitess multiplexes many application connections onto a bounded set of MySQL connections, which is what made it tractable to put thousands of application hosts behind hundreds of MySQL primaries.
  • Online resharding. Per the project documentation, Vitess "enables live resharding with minimal read-only downtime" and "data migrations into Vitess, and resharding operations within Vitess are all accomplished with near-zero downtime." That is the operational property that makes the design survive growth.
  • Query rewriting and caching. Per project documentation, Vitess uses "query rewriting, caching and connection pooling" so concurrent connections "scale significantly higher than traditional MySQL implementations."

YouTube ran Vitess in production through their growth and continued to use it after the Google acquisition. Vitess was open-sourced and donated to the Cloud Native Computing Foundation, where it became "a Cloud Native Computing Foundation graduated project," the foundation's highest maturity tier.

What they gave up

  • Direct MySQL semantics in all cases. Some SQL constructs that work on a single MySQL do not work, or work differently, on a sharded one. Cross-shard joins, cross-shard transactions, and global secondary indexes are either restricted, slower, or operationally heavier in Vitess than in a single MySQL.
  • A simpler operational topology. Vitess introduces its own control-plane components (VTGate, VTTablet, vtctld, the topology service). The team must learn Vitess as well as MySQL.
  • A guarantee that the proxy is never the bottleneck. A bug or saturation in the proxy layer can become a system-wide outage. Treating Vitess as critical infrastructure is a real shift.
  • Schema simplicity. Sharding-friendly schema design (good vindex choice, low-cardinality sharding keys avoided) is a discipline. A team that ignores it ends up with hot shards.

How it played out

Vitess became one of the most prominent open-source database scaling tools in the post-2015 era. Per the project site, adopters include Slack, GitHub, Uber, YouTube, Shopify, Square, Pinterest, and Etsy. Slack publicly migrated its MySQL fleet to Vitess in the late 2010s for the same scaling reasons YouTube originally hit. GitHub uses Vitess for parts of its MySQL infrastructure.

Vitess achieved CNCF graduation status, joining a small group of mature open-source infrastructure projects governed by neutral foundations. The project continues to evolve. A 2024 changelog entry notes "Vitess Now Supports Recursive CTEs: A Step Closer to Full MySQL Compatibility," which captures the project's ongoing posture: stay as compatible with MySQL as possible so the proxy stays invisible.

The architectural pattern Vitess crystallized, "a SQL-compatible proxy that hides sharding from the application," became a template. Later databases like CockroachDB and YugabyteDB pursue the same end (sharded SQL behind a standard interface) by different means (native distributed SQL rather than a proxy over many MySQL primaries).

Where it ties to this bank's patterns

  • [[sharding-strategies]]: the broader topic, including range, hash, and directory-based sharding.
  • [[database-proxies]]: the architectural class Vitess defines.
  • [[online-schema-change]]: the property that makes the proxy approach survivable at scale.
  • [[connection-pooling]]: a sub-component but a load-bearing one. MySQL fails first on connection count for many workloads.
  • Problem links: any system-design problem at "single primary RDBMS isn't enough" scale, including social graphs, e-commerce order systems, and metrics stores.

What a candidate should take away

  1. Sharding-as-proxy is the dominant pattern for evolving from single-MySQL to sharded MySQL. It preserves application code and operational knowledge. Pure rewrites are rare for a reason.
  2. Connection multiplication is the first wall, not disk I/O. A single MySQL crashes on connection counts long before it crashes on bytes. Connection pooling at the proxy fixes the first wall.
  3. Resharding is the long-term test. A sharding design that works at the initial shard count and cannot be redistributed without downtime is a sharding design that will fail.
  4. Hide the sharding decision behind an interface that looks like SQL. Application teams should not need to know what shard a row lives on.
  5. A proxy is critical infrastructure. It does not relax operational rigor; it concentrates it.

What an AI agent would not have got right

  • An AI asked to "scale MySQL" will propose either application-level sharding or "move to a NoSQL database," because those are the two dominant blog patterns. The proxy approach is harder to discover without explicit prompting.
  • It will not anticipate the connection-count failure mode. The first sketch will assume connection counts are free, and the system will fail in a way that does not look like a database problem.
  • It will treat resharding as a one-time event rather than a continuous capability. Vitess's value compounds because resharding is operationally cheap; designs that treat it as a migration project misallocate effort.
  • It will probably suggest cross-shard transactions as a first-class feature. Vitess supports them grudgingly, and they remain operationally expensive. AI advice tends to assume distributed transactions are cheap.
  • It will under-cost the discipline required around vindex choice. The first sharding key sketch will be "primary key hash," which often produces hot shards on real workloads (where some users are 10000x more active than the median).

Sources