How Database Indexes Work β Inside B+ Trees and LSM-Trees
How database indexes speed up queries. The mechanics of B+ tree insertion, search, and splits, LSM-Tree write optimization strategies, and how real databases use these structures.
Why Indexes Matter
In a database, a full table scan costs . Searching for a single row in a million-row table could require a million disk reads.
An index solves this problem. With a proper index, lookups complete in . With a B+ tree index, thanks to its high fan-out, even a million-row table needs only about 3β4 disk accesses to find the target row.
This article covers the two core data structures behind database indexing: B+ trees and LSM-Trees.
B+ Tree Fundamentals
From B-Tree to B+ Tree
The B-Tree was invented in 1972 by Rudolf Bayer and Edward McCreight as a self-balancing search tree. Each node holds multiple keys, making it well-suited for disk-based storage.
The B+ tree is a refinement of the B-tree with two key changes:
- Data is stored only in leaf nodes β internal nodes hold only keys (routing information)
- Leaf nodes are linked together β enabling efficient range scans
B+ Tree Properties
A B+ tree of order has these properties:
- Each internal node has at most children
- Each internal node has at least children (except the root)
- An internal node with keys has children
- Leaf nodes hold at most keys
- Leaf nodes hold at least keys
- All leaves are at the same depth (perfectly balanced)
Search Algorithm
Searching in a B+ tree descends from root to leaf:
Search(node, key):
if node is leaf:
return binary_search(node.keys, key)
// Internal node: pick the right child
i = 0
while i < len(node.keys) and key >= node.keys[i]:
i += 1
return Search(node.children[i], key)Keys within each node are sorted, so binary search identifies the correct child in . The tree height is , giving an overall complexity of .
In practice, is typically in the hundreds (node size = disk page size = 4KBβ16KB), so even a million-row table only needs a tree of height 3β4.
Insertion and Splits
Insertion into a B+ tree follows these steps:
- Use the search algorithm to find the appropriate leaf node
- Insert the key into the leaf (maintaining sorted order)
- If the leaf overflows (keys > ), a split occurs:
- The leaf is divided into two
- The median key is promoted to the parent node
- If the parent also overflows, the split propagates upward recursively
- When the root splits, the tree height increases by one
Step through this process in the interactive demo below:
B+ Tree Insert & Range Search
Step through insertions into an order-4 B+ tree, then see a range scan.
Here is the split pseudocode:
Insert(tree, key, value):
leaf = find_leaf(tree.root, key)
insert_into_leaf(leaf, key, value)
if leaf.num_keys > MAX_KEYS:
split_leaf(leaf)
SplitLeaf(leaf):
mid = ceil(leaf.num_keys / 2)
new_leaf = new LeafNode()
// Move upper half to new leaf
new_leaf.keys = leaf.keys[mid:]
leaf.keys = leaf.keys[:mid]
// Update linked list
new_leaf.next = leaf.next
leaf.next = new_leaf
// Promote median to parent
promote_key = new_leaf.keys[0]
insert_into_parent(leaf.parent, promote_key, new_leaf)Range Scan Efficiency
The B+ tree's greatest strength is range scanning. Because leaf nodes are linked, finding the start point and then following the linked list is all it takes:
RangeScan(tree, low, high):
leaf = find_leaf(tree.root, low)
results = []
while leaf != null:
for key in leaf.keys:
if key > high:
return results
if key >= low:
results.append(leaf.data[key])
leaf = leaf.next
return resultsThis is also highly efficient from a disk I/O perspective. Leaf nodes are often laid out contiguously on disk, enabling sequential reads.
Deletion and Merging
When a deletion causes a node's key count to drop below the minimum (), redistribution or merging is needed:
- Redistribution: borrow a key from an adjacent sibling
- Merge: combine with a sibling and remove the separator key from the parent
Merges can propagate upward to the root, potentially reducing the tree height by one.
Clustered vs Non-Clustered Indexes
Clustered Index
In a clustered index, the actual row data is stored in the B+ tree's leaf nodes. The index order equals the physical storage order of the data.
MySQL's InnoDB uses this approach. The primary key becomes the clustered index, and the entire table is organized as this B+ tree:
-- In InnoDB, the primary key becomes the clustered index
CREATE TABLE users (
id INT PRIMARY KEY, -- B+ tree key
name VARCHAR(100), -- Stored in leaf nodes
email VARCHAR(200) -- Stored in leaf nodes
);Secondary indexes (non-primary key indexes) store primary key values in their leaves. A secondary index lookup requires two steps:
- Search the secondary index B+ tree to get the primary key value
- Search the clustered index (primary key B+ tree) to get the actual row data
Non-Clustered Index
PostgreSQL stores table data in a heap (unordered sequence of pages), and index leaf nodes contain row pointers (TID: tuple ID) into the heap.
PostgreSQL index structure:
B+ tree leaf: [key β TID (block_number, offset)]
Heap page: [row data stored within pages]With this design, whether an Index-Only Scan (answering from the index alone) is possible significantly affects performance. When heap access is required, random I/O occurs.
Disk I/O and Page Structure
Why Node Size = Page Size
B+ tree node sizes are designed to match the disk page size (typically 4KBβ16KB) because:
- The minimum unit of disk I/O is a page (block)
- One I/O reads all keys in one node
- Matching the page size eliminates wasted reads
Page size = 16KB, key size = 8B, pointer size = 8B:
Keys per node β 16384 / (8 + 8) β 1000
β Fan-out β 1000
A height-3 B+ tree can index:
1000 Γ 1000 Γ 1000 = 1 billion rows
β 1 billion rows searchable in just 3 disk accessesBuffer Pool
In practice, frequently accessed pages are cached in an in-memory buffer pool. The root node and upper internal nodes are almost always cached, so actual disk accesses are typically just 1β2 reads to reach the leaf.
LSM-Tree β Write-Optimized Approach
The B+ Tree Write Cost Problem
B+ trees excel at reads but have write challenges:
- Even a single key insertion requires updating a page at a random disk location
- Page updates require a Read-Modify-Write cycle (one read + one write)
- Splits require updating multiple pages
- Write amplification is significant
LSM-Tree Design Philosophy
The LSM-Tree (Log-Structured Merge-Tree), proposed by Patrick O'Neil et al. in 1996, converts all writes to sequential writes, dramatically improving write throughput.
Write Path
- Append the write to the WAL (Write-Ahead Log) for durability
- Insert the key into the Memtable (an in-memory sorted structure, typically a skip list or red-black tree)
- When the Memtable reaches a threshold size, flush it to disk as an SSTable (Sorted String Table)
- SSTables are immutable (never modified) β sorted key-value pair files
Read Path
Reads must search multiple levels:
Read(key):
// 1. Search Memtable first (most recent data)
result = memtable.get(key)
if result != null: return result
// 2. Search L0 SSTables newest-first
for sstable in L0_tables (newest first):
result = sstable.get(key) // Bloom filter + binary search
if result != null: return result
// 3. Search L1 and beyond
for level in [L1, L2, ...]:
result = level.get(key) // Each level is sorted, so binary search
if result != null: return result
return NOT_FOUNDEach SSTable has a Bloom filter that probabilistically determines whether a key does not exist in the file (false positive rate typically set below 1%).
Compaction
As SSTables accumulate, read performance degrades, so periodic compaction is performed:
Compaction(level_n, level_n1):
// Merge level_n SSTables with overlapping range in level_n+1
for sstable in level_n.tables:
overlapping = find_overlapping(level_n1, sstable.key_range)
merged = merge_sort(sstable, overlapping)
// Write as new SSTables to level_n+1
write_new_sstables(level_n1, merged)
// Delete old SSTables
delete(sstable, overlapping)There are two main compaction strategies:
| Strategy | Characteristics | Used by |
|---|---|---|
| Size-Tiered | Merges similarly-sized SSTables. Faster writes | Cassandra (default) |
| Leveled | Fixed size ratio between levels (typically 10Γ). Faster reads | RocksDB, LevelDB |
Write Amplification vs Read Amplification
LSM-Trees incur write amplification through compaction. With leveled compaction, a single key gets rewritten at each level (L0βL1βL2), so write amplification is where is the number of levels and is the size ratio. With a size ratio of 10 and 4 levels, effective write amplification can reach about 40Γ.
B+ tree write amplification comes from partial page updates, generally page size / change size (e.g., 16KB / 100B = 160Γ), though buffer pool batching mitigates this.
B+ Tree vs LSM-Tree β When to Use Which
| Aspect | B+ Tree | LSM-Tree |
|---|---|---|
| Point read | , fast | Must search multiple levels, slower |
| Range scan | Leaf linked list, extremely fast | Requires merge reads, slower |
| Write | Random I/O, split cost | Sequential I/O, fast |
| Space efficiency | Page fill ~50β70% | Temporary space needed during compaction |
| Use case | OLTP, read-heavy | Logs, time series, write-heavy |
Selection Guidelines
- Read-heavy with frequent range scans β B+ tree (MySQL, PostgreSQL)
- Write-heavy with logs and event data β LSM-Tree (RocksDB, Cassandra)
- Mixed workloads β Consider tuning or hybrid configurations
Real-World Database Adoption
MySQL (InnoDB)
InnoDB uses B+ treeβbased clustered indexing. The primary key's B+ tree is the table itself, and secondary index leaves store primary key values.
-- Check the query plan with EXPLAIN
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
-- type: ref, key: idx_email β using secondary index
-- Extra: NULL β secondary lookup to clustered index neededPostgreSQL
PostgreSQL uses a heap table + B+ tree index architecture. The HOT (Heap-Only Tuples) optimization skips index updates when indexed columns aren't modified.
PostgreSQL also offers rich index types beyond B+ trees: GiST (Generalized Search Tree), GIN (Generalized Inverted Index), and BRIN (Block Range Index).
RocksDB
An LSM-Treeβbased storage engine developed by Meta (Facebook), forked from LevelDB and heavily optimized for production use.
- Leveled Compaction by default
- Column Families for logical data separation
- Bloom Filters to optimize read paths
It serves as the internal storage engine for CockroachDB, TiKV (TiDB), MyRocks, and many others.
SQLite
SQLite internally uses two types of B-trees. Table B-trees, used for row storage, store all data in leaf nodes β a B+ treeβlike structure. Index B-trees, used for indexes, hold keys in both interior and leaf nodes β closer to a traditional B-tree. WITHOUT ROWID tables use index B-trees, providing a structure closer to a clustered index. Its simplicity makes it ideal as an embedded database.
Advanced Topics
Write-Optimized B+ Trees (WB-Tree, BΞ΅-Tree)
The BΞ΅-Tree (B-epsilon Tree) sits between B+ trees and LSM-Trees. Each internal node has a buffer for batching writes, improving B+ tree write performance while maintaining read performance. It was used by TokuDB (formerly available in Percona Server for MySQL, now deprecated).
Learned Index
Learned Indexes, proposed by Tim Kraska et al. in 2018, replace B+ tree internal node routing with machine learning models (linear regression, neural networks). When data distribution is predictable, this can achieve faster and more memory-efficient indexing than B+ trees. However, practical adoption still faces challenges and it remains largely a research area.
ALEX (Adaptive Learned Index)
ALEX improves on learned indexes by supporting dynamic inserts and deletes while adaptively updating models as data distribution changes.
Summary
The core insight behind database indexing is designing data structures to match disk I/O characteristics:
- B+ trees minimize random reads through high fan-out and accelerate range scans with linked leaves. Best suited for read-heavy OLTP workloads
- LSM-Trees convert all writes to sequential I/O, maximizing write throughput. Best suited for log and time-series data ingestion
- There are clear trade-offs between the two, and the choice depends on workload characteristics
Next time you look at an EXPLAIN query plan, you'll be able to imagine the B+ tree determining the search path behind the scenes.