Case: Instagram's Early Architecture, PostgreSQL Plus Pragmatic Caching
Era: 2010 to 2012 (the years documented by the early Instagram engineering writeups) · Author / source: Instagram Engineering Blog (instagram-engineering.com), "What Powers Instagram: Hundreds of Instances, Dozens of Technologies" (2012); High Scalability summary, "Instagram Architecture: 14 Million Users, Terabytes of Photos" · Read alongside: feed delivery, PostgreSQL at scale, Redis-backed lists, async work queues
The situation
Instagram launched in October 2010 with a tiny team (three engineers at the time of the architecture writeup) and reached 14 million users "in just over a year," with "150 million photos" by August 2011, "amassing several terabytes" of media data. By any historical measure of consumer product growth, that is steep, and the system had to be built and re-built in flight by a team smaller than a typical hackathon group.
The technical realities that shaped the design:
- The data was photo-heavy, but the queries were timeline-heavy. Photos are large and static; timelines are small per-row but enormous in number and rapidly mutating.
- The team had real Django and PostgreSQL experience, not the systems experience of a Twitter or Facebook infrastructure team. Every architectural choice had to be something one or two engineers could operate.
- Cost mattered. Instagram ran on AWS from the start and could not afford the level of infrastructure spend that Twitter had built up.
- Discovery and recommendation were not yet the product; chronological feeds from followed accounts were. That simplifies the read path significantly.
The architecture, as Instagram engineering documented it in 2012:
- Web tier: "25+ Django application servers" running Gunicorn.
- Primary data store: "PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances," and "Twelve PostgreSQL replicas run in a different availability zone," in a "master-replica setup using Streaming Replication."
- Caches: Redis and memcached, "6 instances" of memcached at the time of writing.
- Search: "Apache Solr powers the geo-search API."
- Async work: "Gearman" with "200 Python workers" handling tasks including feed fanout.
- Object storage and delivery: "S3" for photos, "Amazon CloudFront as the CDN."
The options on the table
For the feed, the question was push vs. pull:
- Pull-based ("fetch on read," also called fan-out on read). When a user opens the app, query for the most recent posts from everyone they follow and merge them in time order. Cheap on write, expensive on read (especially for users following many accounts), bursty under thundering-herd conditions.
- Push-based (fan-out on write). When user A posts, write a feed entry into each follower's feed list. Expensive on write (one write per follower), cheap on read (read your own list).
- Hybrid (push for most, pull for celebrities). Push the feed for users with normal follower counts; for users with very large followings (so-called "celebrity" accounts), do a pull-style query and merge at read time. This is the approach Twitter described later.
- Pure denormalized timeline in a NoSQL store. Cassandra-style wide-row timelines. Possible, but at Instagram's team size, not the right operational bet.
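The push, pull, and hybrid options above can be compared in a few lines. The following is a minimal in-memory sketch, not Instagram's code: `FeedService`, `FEED_LIMIT`, and `CELEBRITY_THRESHOLD` are hypothetical names, plain Python containers stand in for the real stores, and the threshold is set absurdly low so the demo stays readable.

```python
import heapq
from collections import defaultdict, deque
from dataclasses import dataclass

FEED_LIMIT = 500          # assumed cap on each materialized feed
CELEBRITY_THRESHOLD = 3   # assumed cutoff, kept tiny so the demo is readable


@dataclass(frozen=True)
class Post:
    post_id: int   # increases with time, so a larger id means a newer post
    author: str


class FeedService:
    def __init__(self):
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.posts_by_author = defaultdict(list)  # author -> posts, oldest first
        self.feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))  # newest first
        self._next_id = 1

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def publish(self, author):
        post = Post(self._next_id, author)
        self._next_id += 1
        self.posts_by_author[author].append(post)
        if not self.is_celebrity(author):
            # Push path: one feed write per follower.
            for user in self.followers[author]:
                self.feeds[user].appendleft(post)
        return post

    def read_feed(self, user, limit=10):
        # Push path: the user's own materialized list, already newest first.
        pushed = list(self.feeds[user])
        # Pull path: only for followed authors above the threshold.
        pulled = [reversed(self.posts_by_author[a])
                  for a in self.following[user] if self.is_celebrity(a)]
        merged = heapq.merge(pushed, *pulled,
                             key=lambda p: p.post_id, reverse=True)
        result, seen = [], set()
        for post in merged:
            if post.post_id in seen:
                continue  # an author may have crossed the threshold mid-history
            seen.add(post.post_id)
            result.append(post)
            if len(result) == limit:
                break
        return result


svc = FeedService()
svc.follow("alice", "bob")                 # bob has one follower: push
for fan in ("alice", "u1", "u2"):
    svc.follow(fan, "star")                # star reaches the threshold: pull
svc.publish("bob")
svc.publish("star")
svc.publish("bob")
feed_ids = [p.post_id for p in svc.read_feed("alice")]
```

Setting the threshold to zero turns this into a pure pull model, and setting it above any possible follower count turns it into pure push. The hybrid is the same code with the cutoff somewhere in between.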
For storage:
- PostgreSQL, sharded. Sticks with tooling the team already knew.
- NoSQL primary. Faster on some access patterns, but the team did not have the operational depth.
- MySQL with sharding (a la YouTube). Comparable but a different ecosystem.
What they chose, and why
Sharded PostgreSQL as the primary store, with Redis-backed feed lists and Gearman as the async queue for fanout. The system grew up around what the small team could operate confidently.
The pattern is documented explicitly in the Instagram engineering writeup and the High Scalability summary:
- PostgreSQL is the source of truth. Twelve quadruple-extra-large memory instances at the time of writing, with replication for read scale and availability.
- Sharding strategy. Instagram developed their own sharding scheme. The companion writeup, "Sharding & IDs at Instagram," describes how they encode shard, time, and sequence into 64-bit IDs so the ID itself locates the row.
- Feed delivery via async fanout. Gearman with 200 Python workers handles "feed fan-out," writing into the recipient's feed structure when a new post is created.
- Redis and memcached for hot reads. Memcached fronts general data; Redis holds list structures (feeds, followers).
- S3 plus CloudFront for media. No attempt to home-grow image storage at that team size. Pay the AWS bill, get the durability and the CDN.
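The shard-aware ID in the sharding bullet above can be sketched as plain bit packing. This is a sketch under stated assumptions, not the production implementation: it assumes a layout of 41 bits of millisecond timestamp, 13 bits of logical shard ID, and 10 bits of per-shard sequence, and the epoch constant and function names (`make_id`, `split_id`) are chosen here for illustration. The real scheme runs inside the database; this version only shows the arithmetic.

```python
import time

# Assumed layout: 41 bits time | 13 bits shard | 10 bits sequence = 64 bits.
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10
CUSTOM_EPOCH_MS = 1_314_220_021_721  # illustrative custom epoch, in milliseconds


def make_id(shard_id, seq, now_ms=None):
    """Pack time, shard, and sequence into one sortable 64-bit integer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    elapsed = now_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << TIME_BITS):
        raise ValueError("timestamp does not fit in the time field")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id does not fit in the shard field")
    seq_field = seq % (1 << SEQ_BITS)  # the sequence wraps within a millisecond
    return (elapsed << (SHARD_BITS + SEQ_BITS)) | (shard_id << SEQ_BITS) | seq_field


def split_id(id_):
    """Recover (timestamp_ms, shard_id, seq) from an ID without a lookup."""
    seq = id_ & ((1 << SEQ_BITS) - 1)
    shard_id = (id_ >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = id_ >> (SHARD_BITS + SEQ_BITS)
    return elapsed + CUSTOM_EPOCH_MS, shard_id, seq


t0 = CUSTOM_EPOCH_MS + 1_387_263_000
photo_id = make_id(shard_id=1341, seq=5001, now_ms=t0)
later_id = make_id(shard_id=7, seq=0, now_ms=t0 + 1)
decoded = split_id(photo_id)
```

Because the timestamp occupies the high bits, sorting by ID sorts by creation time across shards, and the shard field lets the application route a lookup from the ID alone.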
The reasoning behind the choices, distilled from the writeups:
- Use what you can operate. "Optimize for minimal operational burden but also for reliability."
- PostgreSQL is well understood, predictable, and the team had the experience. The cost of moving to a less-familiar system would have eaten the engineering budget.
- Async fanout is the right shape for a chronological feed where most users have modest follower counts. The write amplification is real but bounded.
- Caches are the load shock absorber. Memcached takes the read pressure; PostgreSQL only sees misses.
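The "PostgreSQL only sees misses" claim is the cache-aside pattern. A minimal sketch, assuming a dict stands in for memcached and a counting wrapper stands in for PostgreSQL (`PrimaryStore` and `CacheAside` are hypothetical names, not part of any library):

```python
class PrimaryStore:
    """Stand-in for the primary database; counts how often it is read."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value


class CacheAside:
    """Read through the cache; on a write, update the store and invalidate."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # stand-in for memcached

    def read(self, key):
        if key in self.cache:
            return self.cache[key]       # hit: the primary store is not touched
        value = self.store.get(key)      # miss: one read against the primary
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.store.put(key, value)
        self.cache.pop(key, None)        # invalidate so the next read refills


store = PrimaryStore({"photo:1": {"caption": "sunset"}})
photos = CacheAside(store)
for _ in range(1000):
    photos.read("photo:1")
reads_after_hot_loop = store.reads

photos.write("photo:1", {"caption": "sunrise"})
fresh = photos.read("photo:1")
```

In this toy run a thousand reads of a hot key cost the primary store a single query. A real deployment also has to handle expiry, eviction, and races between concurrent writers, which this sketch ignores.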
What they gave up
- Clean handling of celebrity accounts. With pure fanout-on-write, a post from a user with a million followers means a million feed writes. At the team's documented scale this was tolerable; it required mitigation as accounts grew larger.
- A unified query model. Multiple stores (Postgres, Redis, memcached, Solr, S3) mean multiple consistency stories. A photo write that updates Postgres has to invalidate or update memcached, and the feed write is on a separate path.
- The ability to retroactively re-rank. A chronological feed in Redis lists is hard to re-rank later; the data structure itself encodes "newest first." Instagram eventually moved to a ranked feed, and that transition was a substantial re-architecture, not a config change.
- Vendor independence. S3, CloudFront, and EC2 are AWS-deep. The team accepted lock-in because the alternative (running their own image storage and CDN) was infeasible at the scale-vs-team ratio.
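The fanout cost and the separate feed-write path can be made concrete with a small worker-queue sketch. It assumes Python's in-process `queue.Queue` and threads as a stand-in for Gearman and its workers, and bounded deques as a stand-in for Redis lists. The function names, batch size, and feed cap are hypothetical, and none of this is Gearman's API.

```python
import queue
import threading
from collections import defaultdict, deque

FEED_LIMIT = 500                      # assumed cap per feed list
feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))
feeds_lock = threading.Lock()
jobs = queue.Queue()                  # stand-in for the async job server


def enqueue_fanout(post_id, follower_ids, batch_size=100):
    """Split one post's fanout into batch jobs; return the number of jobs."""
    followers = list(follower_ids)
    count = 0
    for start in range(0, len(followers), batch_size):
        jobs.put((post_id, followers[start:start + batch_size]))
        count += 1
    return count


def worker():
    while True:
        job = jobs.get()
        if job is None:               # sentinel: shut this worker down
            jobs.task_done()
            return
        post_id, batch = job
        with feeds_lock:
            for user_id in batch:
                feeds[user_id].appendleft(post_id)   # one write per follower
        jobs.task_done()


def run_workers(n):
    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    jobs.join()                       # block until every queued batch is applied
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()


batches = enqueue_fanout(post_id=42, follower_ids=range(1000))
run_workers(4)
feed_writes = sum(len(feed) for feed in feeds.values())
```

The demo uses 1,000 followers to stay fast, but the write count scales linearly with followers: the same post from an account with a million followers would enqueue 10,000 batches and perform a million list writes. The post itself is committed before any of this runs, so a follower's feed can briefly lag the primary store.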
How it played out
Instagram was acquired by Facebook in April 2012 for approximately one billion US dollars. At the time of the acquisition, the system still ran on essentially the architecture described above. Post-acquisition, the data tier eventually migrated onto Meta infrastructure, but the early-Instagram blueprint, "Django plus sharded Postgres plus Redis plus Gearman plus S3 plus CloudFront," became a widely studied template for small-team consumer apps.
Two specific Instagram engineering posts became canon: the architecture overview ("What Powers Instagram") and "Sharding & IDs at Instagram," the latter describing the time-encoded shard-aware 64-bit ID. The ID-encoding technique was adopted by many later systems. The general lesson, "use tools you know, and pay attention to where the operational burden compounds," remained relevant well into the 2020s.
The eventual move to ranked feeds, recommendations, and Reels demanded different infrastructure (real-time ML inference, far more storage for engagement signals, different consistency tradeoffs). That is a different case study; the early chronological-feed architecture remains the textbook small-team consumer-app design.
Where it ties to this bank's patterns
- [[feed-delivery-models]]: pull vs. push vs. hybrid.
- [[fanout-on-write]]: the pattern Instagram used in the early architecture.
- [[id-design-for-sharded-systems]]: time-encoded, shard-aware 64-bit IDs.
- [[postgres-at-scale]]: how far a well-operated, replicated, sharded Postgres can carry a consumer app.
- [[cache-aside]]: the memcached pattern around the primary store.
- Problem links: any system-design problem on news feeds, follower graphs, or social timelines.
What a candidate should take away
- Operational competence is part of the architecture. A team that knows Postgres deeply will build a better system on Postgres than on a "better" database they do not know.
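- Fanout-on-write is the right default for chronological feeds. Pull-on-read becomes the right answer when reads are rarer than writes, which is unusual, or for the few accounts whose follower counts make per-post fanout too expensive, which is the hybrid case.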
- Encode information in IDs. A time-and-shard-aware ID gives you locality, ordering, and routing for free.
- Caches are not optional. A primary store that serves every read is a primary store that has not yet seen real load.
- Pay AWS for the boring parts. S3 and a CDN at the start let a three-person team ship a global product. That is not lock-in; that is leverage.
What an AI agent would not have got right
- An AI asked to "design Instagram" will reach for a NoSQL database (Cassandra, DynamoDB) by default because the long-form text corpus equates "social feed" with "wide-column store." The actual Instagram design was sharded Postgres.
- It will suggest pull-based feeds because "fanout is expensive" is a familiar phrase. Instagram's choice was the opposite: fanout-on-write, async via Gearman.
- It will not propose time-encoded shard-aware IDs. The default suggestion is "use UUIDs," which throws away locality.
- It will probably not mention memcached, treating it as legacy. In the actual system, memcached was the difference between a survivable read load and a crippled one.
- It will under-cost the celebrity-account problem and not warn that fanout-on-write breaks at very large follower counts. AI advice tends to assume uniform follower distributions, which is wrong for every real social product.
Sources
- Instagram Engineering, "What Powers Instagram: Hundreds of Instances, Dozens of Technologies": https://instagram-engineering.com/what-powers-instagram-hundreds-of-instances-dozens-of-technologies-adf2e22da2ad
- High Scalability summary of early Instagram architecture: https://highscalability.com/instagram-architecture-14-million-users-terabytes-of-photos/
- Instagram Engineering, "Sharding & IDs at Instagram": https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c