Storage Engines: B+ Trees vs LSM Trees
Two fundamental data structures that power almost every database. This is a foundational overview — we'll dig deeper as we encounter them in later topics.
B+ Tree (Postgres, MySQL, Oracle)
Structure
An N-ary tree (hundreds of keys per node, NOT binary) where:
- Internal nodes = signposts for navigation only (no data)
- Leaf nodes = actual data + linked list connecting leaves
[ 50 | 100 ] ← internal: routing only
/ | \
[10,20,30] → [60,70,80] → [110,120,130] ← leaves: data + linked
Each node = one disk page (~8KB). Hundreds of keys fit per page.
Why databases love it
- A 3-level tree with 500 keys/node indexes 125 million rows
- Point lookup: 3 disk reads (one per level) → O(log n)
- Range scan (
WHERE price BETWEEN 50 AND 100): find start leaf → follow linked list → sequential I/O - Reads are fast — always a short tree traversal
The write problem
Writes require random I/O:
- Find the correct leaf page on disk
- Read the page into memory
- Insert the key (may trigger a page split if full)
- Write the modified page(s) back
- Page splits can cascade upward through the tree
At high write volumes, all this random seeking becomes the bottleneck.
LSM Tree (Cassandra, RocksDB, LevelDB, HBase)
Core idea: "Never update in place. Just append."
B+ Tree write: find page → read → modify → write back (random I/O)
LSM Tree write: append to memory buffer → done (sequential I/O)
Write path
Write "key=42, val=hello"
│
▼
┌─────────────────┐
│ Memtable │ ← in-memory sorted structure (Red-Black tree / Skip List)
│ (fast writes) │
└────────┬────────┘
│ when full, flush to disk
▼
┌─────────────────┐
│ SSTable (L0) │ ← Sorted String Table — immutable file on disk
├─────────────────┤
│ SSTable (L1) │ ← older, merged/compacted SSTables
├─────────────────┤
│ SSTable (L2) │ ← even older, larger
└─────────────────┘
- Write goes to memtable (in-memory, sorted) — instant
- When memtable is full → flush to disk as an immutable SSTable (one sequential write)
- SSTables accumulate at different levels
Read path
Reads are slower — must check multiple places:
- Check memtable (memory)
- Check SSTables newest → oldest (disk)
- Bloom filters help skip SSTables that definitely don't have the key
Compaction
Background process that merges SSTables:
- Removes duplicate/deleted keys
- Reduces number of files to check on reads
- Reclaims disk space
Head-to-Head Comparison
| B+ Tree (Postgres) | LSM Tree (Cassandra) | |
|---|---|---|
| Write | Random I/O, page splits | Sequential I/O, append-only |
| Read | Fast — one tree traversal | Slower — may check multiple SSTables |
| Range scan | Excellent — follow leaf linked list | Good — SSTables are sorted |
| Write amplification | Moderate (page splits) | High (compaction rewrites data) |
| Space amplification | Low (in-place updates) | Higher (old versions until compacted) |
| Best for | Read-heavy, point lookups, OLTP | Write-heavy, append workloads |
| Used by | Postgres, MySQL, Oracle, SQLite | Cassandra, RocksDB, LevelDB, HBase |
graph LR
subgraph "B+ Tree — Read Optimized"
W1["Write: find page<br/>random I/O ❌"] --> T["Balanced Tree<br/>on disk"]
T --> R1["Read: traverse tree<br/>O(log n) ✅"]
end
subgraph "LSM Tree — Write Optimized"
W2["Write: append to<br/>memtable ✅"] --> M["Memtable<br/>(memory)"]
M -->|"flush"| S["SSTables<br/>(disk)"]
S --> R2["Read: check memtable<br/>+ SSTables ❌"]
end
style W1 fill:#4a1a1a,stroke:#ff4444
style R1 fill:#1a4a1a,stroke:#44ff44
style W2 fill:#1a4a1a,stroke:#44ff44
style R2 fill:#4a3a1a,stroke:#ffaa44
Choosing Between Them
| Your workload | Choose |
|---|---|
| Read-heavy, complex queries, transactions | B+ Tree (Postgres) |
| Write-heavy, append-only, time-series | LSM Tree (Cassandra, RocksDB) |
| Mixed workload | Start with B+ Tree, optimize later |
Topics to Revisit Later
- B+ Tree internals: page layout, fill factor, vacuum/reindex in Postgres
- LSM compaction strategies: size-tiered vs leveled
- Bloom filters: how they work and why they help LSM reads
- Write-Ahead Log (WAL): crash recovery for both approaches
- RocksDB: the most popular embeddable LSM engine (used inside many systems)