Top Social Media Sites and the Databases That Power Them
Social media platforms operate at a scale that fundamentally breaks the assumptions of conventional database design. A single Twitter post can be seen by 100 million people. Facebook stores hundreds of petabytes of photos. WhatsApp delivers 100 billion messages per day. TikTok's recommendation engine processes billions of signals in real time. The database architectures behind these platforms represent some of the most sophisticated engineering in the world — built through necessity, hard-won lessons, and occasional catastrophic failures.
This guide examines the database choices made by the world's top social media platforms: what they use, why they chose it, the scale they operate at, and what developers can learn from their experiences.
Facebook: From MySQL to a Custom Graph Database
The MySQL Foundation
Facebook started with MySQL — the same database powering most LAMP-stack applications of the mid-2000s. As Facebook grew, MySQL became increasingly difficult to scale. Facebook did not abandon MySQL; instead, they invested heavily in making it work at scale. They contributed to MySQL's InnoDB storage engine, built the MySQL Router, and developed extensive tooling around sharding, replication, and online schema changes.
TAO: The Social Graph Store
The most important database innovation at Facebook is TAO (The Associations and Objects), a distributed data store purpose-built for the social graph. TAO stores two types of data: objects (users, posts, photos — identified by a 64-bit ID) and associations (friendships, likes, shares — directed edges between objects). TAO provides a simple API: assoc_get(), assoc_count(), obj_get().
TAO stores its data in MySQL and uses a two-tier cache (memcached-based) in front of it. Read replicas handle the vast majority of traffic. Facebook has published that TAO handles billions of requests per second across thousands of servers in multiple data centers.
RocksDB
Facebook open-sourced RocksDB in 2013, a high-performance embedded key-value store built on Google's LevelDB. RocksDB is optimized for SSDs and high write throughput using a Log-Structured Merge (LSM) tree. Facebook uses RocksDB extensively as the storage engine for multiple internal systems. It has since been adopted by dozens of other databases including TiKV (the storage engine behind TiDB), CockroachDB, and Apache Flink.
Cassandra
Facebook's Inbox used Apache Cassandra, which was actually created at Facebook by Avinash Lakshman (one of the Dynamo paper authors) and Prashant Malik, before being open-sourced in 2008. Cassandra's masterless architecture and tunable consistency made it suitable for Facebook's globally distributed messaging use cases.
Instagram: PostgreSQL and Cassandra at Scale
PostgreSQL as the Primary Database
Instagram — now part of Meta — was built on PostgreSQL and famously scaled it much further than most engineers believed possible. When Instagram was acquired by Facebook in 2012 for $1 billion, they had 13 employees and were serving tens of millions of users from a handful of PostgreSQL instances. Their approach: aggressive use of sharding (partitioning data across many PostgreSQL instances) combined with a Python-based sharding library called django-sharding.
Instagram's engineering team published extensively about their PostgreSQL experience, noting that PostgreSQL's reliability, ACID compliance, and rich feature set made it the right foundation even at significant scale, as long as sharding was implemented correctly.
Cassandra for Time-Series Data
For activity feeds, Instagram adopted Cassandra. The feed is inherently a time-series use case: you want all posts from the accounts you follow, ordered by time. Cassandra's ability to efficiently write and read time-ordered data across many users made it a natural fit for this access pattern.
Django ORM and Data Access
Instagram ran on Django — one of the world's largest Django deployments — and used Django's ORM to abstract database access. This is notable because it demonstrates that high-level ORMs can work at extraordinary scale when the underlying database design and sharding strategy are sound.
Twitter/X: MySQL, Manhattan, and FlockDB
MySQL at the Core
Twitter began with MySQL and maintained it as a core storage layer for years. Early Twitter was infamous for its whale error page — the service went down constantly under load, partly due to database scaling challenges. Twitter engineers eventually implemented aggressive read replica patterns and sharding to scale MySQL.
Gizzard: The Sharding Framework
Twitter built Gizzard, an open-source sharding framework for MySQL, to partition their data. Gizzard sits in front of MySQL instances and routes writes to the correct shard based on a partitioning strategy. This allowed Twitter to scale horizontally without replacing MySQL.
Manhattan: Twitter's Internal Key-Value Store
Manhattan is Twitter's distributed key-value store, built to replace Cassandra for their most demanding use cases. Manhattan provides strong consistency, multi-data-center replication, and efficient compaction. It powers Twitter's home timeline, which must deliver billions of tweets per day to millions of concurrent users. Manhattan stores tweets in a form that can be assembled into timelines quickly, using a fan-out write model where new tweets are written to the timelines of all followers at write time.
FlockDB: The Graph Database
Twitter open-sourced FlockDB in 2010 — a distributed graph database storing social relationships (who follows whom). FlockDB is optimized for the specific query patterns of social graphs: "give me everyone who follows user X" or "give me everyone user X follows." It uses MySQL as its storage backend but adds a layer that makes graph traversals efficient.
Blob Store
For media (images, videos), Twitter built a distributed blob storage system on top of commodity hardware. As the platform evolved to emphasize video, this system was significantly expanded, eventually integrating with cloud storage.
LinkedIn: Espresso, Voldemort, and Oracle
Oracle for Core Data
LinkedIn's early infrastructure was built on Oracle — a common choice for enterprise applications requiring strong ACID guarantees and reliability. Oracle powered LinkedIn's core member profile data for years, and the company invested significant resources in Oracle performance tuning and data modeling.
Voldemort: The Key-Value Store
Voldemort (named after the Harry Potter villain) is LinkedIn's open-source distributed key-value store, inspired by Amazon's Dynamo paper. LinkedIn built Voldemort to handle high-throughput reads for member profile data that needed to be served with low latency across multiple data centers. Voldemort uses consistent hashing and configurable replication to achieve fault tolerance.
Espresso: LinkedIn's Internal NoSQL Database
Espresso is LinkedIn's internally-developed distributed NoSQL database. Unlike Voldemort (which is schema-less), Espresso is document-oriented and schema-aware. It uses MySQL as the storage layer but adds LinkedIn-specific features: secondary indexes, replication to Kafka for change data capture, and integration with LinkedIn's data infrastructure.
Espresso powers many of LinkedIn's most critical features including member profiles, connections, and messaging. LinkedIn's engineering blog has published extensively about Espresso's design, noting that using MySQL as the underlying storage engine provided proven durability while the Espresso layer added the distributed, document-oriented capabilities they needed.
Kafka: The Nervous System
Apache Kafka was created at LinkedIn in 2011 by Jay Kreps, Neha Narkhede, and Jun Rao, and has since become the de facto standard for distributed event streaming. LinkedIn uses Kafka extensively for activity streams, change data capture from their databases, and real-time data pipelines. Kafka's creation is one of the most significant contributions any company has made to the open-source infrastructure ecosystem.
YouTube: Vitess/MySQL and Bigtable
MySQL and the Vitess Problem
YouTube was built on MySQL and faced scaling challenges that led to the creation of Vitess, an open-source database clustering system. Vitess sits in front of MySQL and provides sharding, connection pooling, query routing, and schema management. Rather than replacing MySQL, Vitess turns MySQL into a horizontally scalable system capable of handling YouTube's enormous traffic.
Vitess is now a CNCF (Cloud Native Computing Foundation) graduated project and is used by companies including Slack, PlanetScale, and many others who need horizontal MySQL scalability.
Bigtable for Metadata
YouTube uses Google Bigtable — Google's proprietary wide-column store — for video metadata, view counts, and other high-write-rate data. Bigtable can handle millions of reads and writes per second, making it suitable for the extreme traffic surges when popular videos are shared virally. View count updates for viral videos represent one of the most demanding real-time update workloads in existence.
TikTok: MySQL, Redis, and Distributed Systems at Viral Scale
The Scale Challenge
TikTok's challenge is distinctive: it must serve personalized video feeds to 1+ billion monthly active users, with a recommendation engine that updates in near-real-time based on engagement signals. A video can go from 1,000 views to 100 million views within 24 hours, demanding infrastructure that can scale elastically.
MySQL for Structured Data
TikTok uses MySQL for core structured data — user accounts, video metadata, and social graph data — sharded across many instances. For their ByteDance parent company's scale in China, they have developed extensive internal infrastructure around MySQL sharding.
Redis for Caching and Real-Time Features
Redis is central to TikTok's real-time features: caching hot video metadata, tracking real-time engagement signals (likes, shares, comments as they happen), implementing rate limiting, and powering ephemeral social features like live stream interactions. Redis's sub-millisecond read latency is critical for feed generation performance.
Recommendation Engine Infrastructure
TikTok's recommendation engine — widely considered the best in the social media industry — relies on a purpose-built feature store and real-time ML infrastructure. User engagement signals are processed through distributed stream processing systems and fed into ranking models that run inference at millisecond latency. The specifics of this infrastructure are closely guarded, but TikTok runs on a foundation of Kafka-like streaming, distributed feature computation, and GPU clusters for model serving.
Reddit: PostgreSQL and Cassandra
From MySQL to PostgreSQL
Reddit began with MySQL but migrated to PostgreSQL early in its history (one of the earlier major MySQL-to-PostgreSQL migrations in the industry). Reddit's engineering has published about preferring PostgreSQL for its standards compliance, feature richness, and the reliability they experienced at scale.
Cassandra for Activity Data
Reddit uses Cassandra for high-volume activity data including vote tracking, comment activity, and front-page ranking computations. Reddit's voting system must process millions of upvotes and downvotes per day, with vote counts used in real time to rank posts. Cassandra's high write throughput and efficient time-series queries make it suitable for this workload.
WhatsApp: Mnesia and Erlang
A Radically Different Stack
WhatsApp is architecturally unlike every other platform on this list. It is built on Erlang — a programming language and runtime designed by Ericsson for telecom systems requiring extreme reliability and concurrency. At the time of Facebook's acquisition in 2014, WhatsApp was handling 50 billion messages per day with just 32 engineers — an extraordinary ratio of scale to team size.
Mnesia: Erlang's Built-in Database
WhatsApp uses Mnesia, Erlang's built-in distributed database, for session state, message routing tables, and user connection data. Mnesia is a soft real-time database with support for transactions and distributed operation, built into the Erlang runtime. For the specific access patterns of a messaging system — looking up which server a user's connection is on, routing messages to the right process — Mnesia's performance characteristics are excellent.
Long-term message storage (for cloud backup and history) uses additional storage backends, but the core routing and delivery infrastructure runs on Mnesia and Erlang's actor model, which allows millions of concurrent connections per server.
Pinterest: MySQL, HBase, and Redis
MySQL with Aggressive Sharding
Pinterest stores its core data (pins, boards, users, follows) in MySQL with a sharding approach similar to Instagram's. Pinterest has published about sharding 8 MySQL databases into thousands of shards to handle their scale — one of the more detailed public writeups of MySQL sharding in production.
HBase for Time-Series and Activity Data
Apache HBase — a Hadoop-based wide-column store modeled on Google Bigtable — handles Pinterest's time-ordered activity feeds and large-scale data that doesn't fit cleanly into MySQL's model. HBase's ability to store billions of rows efficiently makes it suitable for user activity logs and aggregated engagement metrics.
Redis at Every Layer
Redis is extensively used at Pinterest for caching frequently accessed data (pin metadata, user data, trending content), implementing rate limiting, and powering real-time features. Pinterest has contributed to Redis open source and has published detailed performance analyses of their Redis deployments.
Snapchat: Google Cloud Spanner and Bigtable
Going All-In on Google Cloud
Snapchat is notable for building their infrastructure almost entirely on Google Cloud — making them one of the most prominent examples of a major consumer application running on a single public cloud provider. Their database choices reflect Google Cloud's offerings.
Google Cloud Spanner
Google Cloud Spanner is a globally distributed, externally consistent relational database. It provides the ACID guarantees of a relational database with the horizontal scalability of a distributed system — a combination that was previously considered theoretically impossible (the CAP theorem) until Google demonstrated it was achievable with TrueTime GPS-based clock synchronization. Snapchat uses Spanner for data that requires strong consistency at global scale.
Google Cloud Bigtable
Snapchat uses Bigtable for high-throughput time-series data — stories, snap delivery metadata, and activity logs. Bigtable's multi-dimensional sparse data model handles the access patterns of ephemeral messaging (looking up whether a snap has been delivered, opened, or expired) efficiently.
Key Lessons for Developers
1. There Is No Single Right Database
Every major social media platform uses multiple databases — a relational database for structured data, a wide-column store for time-series and activity data, a key-value store for caching, and specialized systems for graphs, search, and analytics. The lesson is to choose the right tool for each access pattern rather than forcing one database to solve all problems.
2. MySQL and PostgreSQL Scale Further Than You Think
Instagram, Pinterest, Reddit, and Twitter all scaled relational databases to extraordinary levels through sharding, read replicas, caching, and good data modeling. Don't abandon relational databases prematurely — with the right architecture, they handle far more than most applications will ever need.
3. Caching Is Not Optional at Scale
Every platform in this list uses Redis or memcached (or both) as a caching layer. The ratio of reads to writes in social media is extremely high — the same popular post is read millions of times but written once. Caching hot data is not premature optimization; it is the architecture that makes high-read workloads sustainable.
4. Build vs. Buy
Several platforms (Facebook/TAO, LinkedIn/Espresso, Twitter/Manhattan) built custom databases when existing solutions couldn't meet their needs. But most organizations — even large ones — are better served by investing in understanding and correctly deploying existing databases. Custom database engineering is extraordinarily expensive and complex.
5. Consistency vs. Availability Trade-offs Are Explicit Design Decisions
Social media platforms routinely sacrifice strong consistency for availability. It's acceptable if a like counter shows 4,999 instead of 5,000 for a fraction of a second. This eventual consistency trade-off allows these systems to remain available and performant at global scale. Design your consistency requirements explicitly rather than assuming you need strong consistency everywhere.
Conclusion
The database architectures powering social media platforms are engineering achievements that push the boundaries of what distributed systems can do. From Facebook's TAO and RocksDB to LinkedIn's Kafka and Espresso, from WhatsApp's remarkable Erlang/Mnesia stack to Snapchat's all-in Google Cloud Spanner approach — each platform reflects years of hard-won experience, specific access pattern requirements, and pragmatic engineering decisions made under pressure.
For developers building data-intensive applications, these architectures provide invaluable blueprints. The patterns they demonstrate — polyglot persistence, aggressive caching, eventual consistency, sharding, and purpose-built storage for specific access patterns — are the vocabulary of modern large-scale system design.
Olibr Editorial
The world's biggest social media platforms serve billions of users daily, requiring database architectures that push the limits of what is technically possible. Here is how Facebook, Instagram, Twitter, LinkedIn, YouTube, TikTok, and others store and serve data at unprecedented scale.