Glossary: Data Systems

LSM Tree

Log-Structured Merge Tree is a storage structure that converts random writes to sequential writes by buffering mutations in memory before flushing sorted files to disk.

The B-Tree Write Problem

B-trees are the dominant storage structure for read-heavy databases (PostgreSQL, MySQL InnoDB, SQLite). They support O(log n) reads and writes by maintaining a balanced tree on disk. The problem is the write pattern: updating a B-tree node requires a random disk write to wherever that node lives. Random I/O on spinning disks runs at roughly 100-200 IOPS; SSDs improve this to 10,000-100,000 IOPS, but random page writes still trail sequential throughput (500MB/s or more) by an order of magnitude or worse. Write-heavy workloads (time-series data, log ingestion, key-value stores) hit this random I/O ceiling quickly.
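The gap is easy to make concrete with a back-of-envelope calculation. The page size, IOPS, and bandwidth figures below are illustrative assumptions in line with the numbers above, not measured benchmarks:

```python
# Back-of-envelope: random page writes vs. sequential writes.
# All figures are illustrative assumptions, not measured benchmarks.
PAGE_SIZE = 16 * 1024   # a 16 KB B-tree page (the InnoDB default)
RANDOM_IOPS = 200       # random I/O on a spinning disk
SEQ_MBPS = 500          # sequential throughput of the same class of device

random_mbps = RANDOM_IOPS * PAGE_SIZE / 1e6
print(f"random: {random_mbps:.1f} MB/s, sequential: {SEQ_MBPS} MB/s, "
      f"gap: {SEQ_MBPS / random_mbps:.0f}x")
# → random: 3.3 MB/s, sequential: 500 MB/s, gap: 153x
```

Even granting SSDs two orders of magnitude more IOPS, the sequential path stays comfortably ahead, which is the opening the LSM tree exploits.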

How the LSM Tree Works

The LSM tree eliminates random writes by making all writes sequential. The write path has two stages:

Memtable: incoming writes go to an in-memory sorted structure (commonly a red-black tree or skip list). The memtable absorbs writes at memory speed. Each write is also appended to a write-ahead log (WAL) on disk for durability: if the process crashes, replaying the WAL reconstructs the memtable.

SSTable: when the memtable reaches a size threshold (typically 64-256MB), its already-sorted contents are flushed to disk as an immutable SSTable (Sorted String Table) file. This flush is a single sequential write. No random I/O.

All subsequent writes go to a new memtable. Flushed SSTables accumulate on disk.
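The two-stage write path can be sketched in a few lines. The class name, JSON log format, and tiny flush threshold are assumptions of this sketch; real engines use a sorted tree or skip list for the memtable and 64-256MB thresholds:

```python
import json

# Minimal sketch of the LSM write path: append to the WAL, apply to the
# memtable, flush the memtable as a sorted immutable SSTable at a threshold.
class LSMWritePath:
    def __init__(self, wal_path, flush_threshold=4):
        self.memtable = {}                      # a dict stands in for a sorted tree
        self.flush_threshold = flush_threshold  # real engines: 64-256 MB
        self.wal = open(wal_path, "a")
        self.sstables = []                      # flushed files, oldest first

    def put(self, key, value):
        # 1. Append to the WAL first; a crash replays this to rebuild the memtable.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2. Apply the write in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write: emit the memtable contents in sorted key order.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}  # fresh memtable; the old WAL segment can be dropped
```

Note that the WAL append and the SSTable flush are both sequential, which is the whole point of the design.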

The Read Trade-off

Reads are more expensive than in a B-tree. To find a key, the engine checks the memtable (fast, in-memory), then must check each SSTable from newest to oldest until it finds the key or exhausts all files. Each SSTable check requires reading a Bloom filter (cheap) and then possibly the SSTable index and data (disk I/O). With many SSTables, read amplification grows. This is the fundamental write-read trade-off of LSM trees.
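The read path above can be sketched as follows. For simplicity the Bloom filters are simulated with exact Python sets, which never give false positives; real filters accept a small false-positive rate to stay tiny in memory. The function name and data layout are assumptions of this sketch, not a real engine's API:

```python
# Sketch of the LSM read path: memtable first, then SSTables newest to oldest.
def lsm_get(key, memtable, sstables):
    """sstables: list of (key_filter, sorted_entries) pairs, oldest first."""
    if key in memtable:                  # cheapest check: in-memory
        return memtable[key]
    for key_filter, entries in reversed(sstables):   # newest first
        if key not in key_filter:        # "Bloom filter" says absent: skip file
            continue
        for k, v in entries:             # real engines binary-search an index
            if k == key:
                return v
    return None                          # key not found anywhere
```

Because newer SSTables are consulted first, an overwritten key returns its most recent value without the engine ever modifying older files.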

Compaction

Without compaction, the number of SSTables grows without bound and reads degrade proportionally. Compaction merges multiple SSTables into fewer, larger ones, removing deleted keys and duplicate versions. It is a background process that runs continuously.

Two dominant compaction strategies:

Size-tiered compaction: groups SSTables of similar sizes and merges them. Produces fewer, larger files. Efficient for write-heavy workloads because compaction is infrequent. Read amplification is higher because more SSTables exist at any moment.

Leveled compaction: organizes SSTables into levels, each level 10x larger than the previous. Each level contains non-overlapping key ranges. A key exists in at most one SSTable per level. Read amplification is lower (check one SSTable per level). Write amplification is higher because data is rewritten more frequently during compaction. LevelDB and RocksDB default to leveled compaction. Cassandra defaults to size-tiered but supports leveled.
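Both strategies share the same core merge step: several sorted SSTables go in, one sorted SSTable comes out, and only the newest version of each key survives. The dict-based merge and the `TOMBSTONE` marker below are simplifications for this sketch; real engines stream-merge with a heap and encode deletions in the file format:

```python
# Sketch of a compaction merge: duplicates collapse to the newest version,
# tombstones (deletion markers) are dropped, output is one sorted SSTable.
TOMBSTONE = object()   # deletion marker, an assumption of this sketch

def compact(sstables):
    """sstables: list of sorted (key, value) lists, oldest first."""
    merged = {}
    for table in sstables:           # oldest first, so newer entries overwrite
        for key, value in table:
            merged[key] = value
    # Emit one sorted, duplicate-free SSTable with deleted keys reclaimed.
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
```

Size-tiered and leveled compaction differ only in *which* SSTables get fed into this merge and how often, which is where the amplification trade-offs come from.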

Write and Read Amplification

Write amplification: the ratio of bytes written to disk versus bytes written by the application. LSM trees with leveled compaction have write amplification of 10-30x because data is rewritten as it migrates down the levels. B-trees rewrite an entire page even for a small change, so their write amplification depends on the ratio of page size to record size, and they additionally suffer from page fragmentation.

Read amplification: the number of disk reads per logical read. B-trees have low read amplification (O(log n)). LSM trees have higher read amplification proportional to the number of SSTables, mitigated by Bloom filters and compaction.
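A common back-of-envelope model for leveled write amplification, and an approximation rather than any engine's actual accounting: each byte passes through every level once, and merging it into a level rewrites roughly `fanout` bytes of overlapping data, so total amplification is about fanout times the number of levels. The memtable size and fanout defaults below are assumptions:

```python
import math

# Rough leveled-compaction write amplification: ~fanout rewrites per level,
# times the number of levels needed to hold the data. The formula and the
# default parameters are simplifying assumptions for illustration.
def leveled_write_amp(total_gb, memtable_gb=0.25, fanout=10):
    levels = max(1, math.ceil(math.log(total_gb / memtable_gb, fanout)))
    return fanout * levels

print(leveled_write_amp(10))    # 2 levels -> ~20x
print(leveled_write_amp(100))   # 3 levels -> ~30x
```

The model lands inside the 10-30x range quoted above and shows why amplification grows with dataset size: more data means more levels, and every level adds roughly one fanout's worth of rewrites.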

When B-tree Beats LSM

If the workload is read-heavy (more reads than writes), a B-tree provides lower read latency with no compaction overhead. PostgreSQL and MySQL are correct choices for OLTP workloads with frequent point reads. LSM trees are the correct choice when write throughput is the primary constraint: time-series ingestion, event logging, key-value storage with high write rates.

Interview Tip

Frame the answer around amplification factors, not "fast writes." The interviewer wants to see three things:

Write amplification: sequential flushes are cheap per byte, but data is rewritten during compaction, typically 10-30x for leveled compaction.

Read amplification: multiple SSTables must be checked on each read, partially mitigated by Bloom filters reducing disk reads by 50-90% for absent keys.

Space amplification: deleted keys and overwritten values persist in older SSTables until compaction reclaims them, temporarily consuming 1.1-2x the logical data size.

Name RocksDB or LevelDB and their use in Cassandra, HBase, and InfluxDB. The L6-level addition: explain that the choice between size-tiered and leveled compaction is a production tuning decision based on whether the workload is write-dominated or read-write balanced, and that Cassandra allows the strategy to be configured per table, so a write-heavy table and a read-balanced table in the same cluster can use different strategies.