Glossary Data Systems
Database Sharding
Horizontal partitioning of a database across multiple machines to distribute load beyond a single server's capacity.
What Is Sharding?
Sharding splits a large dataset across multiple database instances (shards). Each shard holds a subset of rows. Together they form the complete dataset.
Sharding Strategies
Range-Based
Shard by key range (e.g., user IDs 1–1M on shard 1, 1M–2M on shard 2). Simple, but creates hotspots if data is skewed.
Hash-Based
Hash the shard key and assign to shard by hash modulo N. Even distribution but range queries span all shards.
Directory-Based
A lookup service maps keys to shards. Flexible but adds a single point of failure.
The Hard Problems
- Cross-shard joins: Avoid at the schema level. Denormalize or use application-level joins.
- Cross-shard transactions: Distributed transactions are expensive. Design to avoid or use sagas.
- Rebalancing: Adding shards requires moving data. Consistent hashing minimizes this.
When Sharding Is Wrong
Start with vertical scaling and read replicas. Sharding is operationally expensive. Only introduce it when you have a genuine single-node bottleneck.