A deep-dive into how distributed databases like Cassandra and HBase organize, route, and reclaim data at massive scale.
Imagine you are the office manager of a law firm with ten thousand clients. On day one, you have a single filing cabinet. Every new client folder goes in alphabetically. Life is simple. Reads are fast. Writes are instant.
Then the firm explodes in size. You have a hundred thousand clients and a hundred thousand folders. One cabinet is now a warehouse. Worse, lawyers are constantly updating folders, adding new pages, and occasionally shredding old ones. The single filing cabinet becomes a bottleneck — every paralegal is fighting to access it simultaneously.
So you make two design decisions. First, you partition the warehouse: floor A houses clients A–M, floor B houses clients N–Z. Each floor has its own staff. Now paralegal teams work in parallel without blocking each other. Second, every time a new page arrives, instead of hunting down the original folder to insert it, your staff drops it in a fast-intake tray at the front. At night, a maintenance crew merges all the intake tray papers into the right folders, discarding duplicates and shredded content. This night-time merging is compaction.
This is the intuition behind wide-column stores. Systems like Apache Cassandra, Google Bigtable, and Apache HBase face this exact problem — millions of rows, billions of columns, continuous writes — and they solve it with the same two-pronged strategy. We will explore each strategy in depth.
Before we talk about partitioning and compaction, we need to establish the data model we are working with, because it shapes every design decision.
In a relational database, a table has a fixed schema. Every row has the same columns, and empty values waste space or require nulls. Wide-column stores take a different approach: rows share a row key as their identifier, but each row can carry a completely different set of columns. Columns are grouped into column families — logical buckets that are stored together on disk.
┌──────────────────────────────────────────────────────────────────────────┐
│ TABLE: user_activity │
│ │
│ Row Key │ CF: profile │ CF: events │
│───────────────────┼───────────────────────┼──────────────────────────────│
│ user:1001 │ name: "Alice" │ login:2024-01-01: true │
│ │ age: 29 │ purchase:2024-01-03: "book" │
│───────────────────┼───────────────────────┼──────────────────────────────│
│ user:1002 │ name: "Bob" │ login:2024-01-02: true │
│ │ (no age stored) │ (no purchase recorded) │
└──────────────────────────────────────────────────────────────────────────┘
In this figure, notice that user:1002 has no age and no purchase column. Nothing is stored for absent data — there are no NULLs taking up space. This sparsity is a first-class feature.
The row key is the only built-in index. This is not a limitation — it is a deliberate design choice. Because all data for a row key lives together on disk, a single-key lookup is blazing fast. The trade-off is that arbitrary cross-column queries require table scans. Wide-column stores are query-first systems: you design your schema around the queries, not the entities.
A single filing cabinet, no matter how well organized, has a ceiling. Wide-column stores are built to serve hundreds of terabytes across clusters of dozens or hundreds of machines. We need a strategy to split data across nodes, route writes to the correct node, and find data on reads — all without a central coordinator that becomes a bottleneck.
The simplest idea is to split data by key ranges: node 1 handles rows A–F, node 2 handles G–L, and so on. This is called range partitioning.
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ A - F │ │ G - L │ │ M - Z │
└────────────┘ └────────────┘ └────────────┘
Range partitioning has one great property: range scans are fast. All rows starting with user:2024-01 live on the same node. But it has a brutal failure mode: hotspots. If most of your users have names starting with “S” — or if your row keys are timestamps — all writes pile onto one node. That node sweats while its neighbours sip coffee.
To solve the hotspot problem, we hash the partition key and use the hash value to assign data to nodes. The elegant structure for this is the consistent hash ring. We map the entire hash space — all 64-bit integers — as a ring. Each node is assigned a position on this ring. A partition key is hashed (Cassandra uses the Murmur3 algorithm), and the data is owned by the first node clockwise from that hash value.
┌──────────────────────────────────┐
│ HASH RING │
│ │
│ Node A │
│ (token: 0) │
│ ┌──────┐ │
│ ┌───┤ ├───┐ │
│ │ └──────┘ │ │
│ Node D Node B │
│ (token: (token: │
│ -4.6B) 2.3B) │
│ │ │ │
│ └───┐ ┌───┘ │
│ └──────┘ │
│ Node C │
│ (token: -2.3B) │
└──────────────────────────────────┘
Key "alice" ──► hash = 1,680,000,000 ──► falls between Node B and Node C
──► assigned to Node C (first node clockwise)
In the diagram above, we see four nodes placed at token positions on the ring. A partition key is hashed to a numeric value, placed on the ring, and ownership goes to the next clockwise node. No central lookup table is needed — any node can independently compute who owns any key.
The critical property of consistent hashing is that when a node joins or leaves, only the keys in the affected arc need to be remapped. Add Node E between Node B and Node C, and only Node C’s data in that arc migrates. In a naive modulo-based approach, adding a node reshuffles nearly all data.
Early Cassandra assigned each physical node a single position on the ring, causing two real-world headaches:
jim | -2,245,462,676 | Node D |
| carol | 7,723,358,927 | Node A |
| johnny | -6,723,372,854 | Node D |
| suzy | 1,168,604,627 | Node B |Notice jim and johnny both fall on Node D — with a single token per node, skew is common.
Cassandra’s answer, introduced in version 1.2, is virtual nodes (VNodes). Instead of placing each physical node at one position on the ring, Cassandra assigns each node many small, non-contiguous token ranges — 256 by default.
Physical Ring with VNodes:
── Node A ──┬── Node C ──┬── Node B ──┬── Node A ──┬── Node C ──► ...
│ │ │ │
(VNode 1) (VNode 1) (VNode 1) (VNode 2) (VNode 2)
In this figure, each physical node (A, B, C) owns multiple scattered arc segments. The total arc size per node converges to roughly equal shares — like a pie divided into many thin slices distributed around the ring.
VNodes deliver three concrete benefits:
The hash ring solves cluster-level distribution. But the developer still controls what goes into the partition key. Poor partition key design is the most common cause of production performance problems.
Rule 1: Maximize cardinality. A partition key with only 3 possible values can only spread data across 3 effective shards. user_id (millions of values) is far better than country_code (200 values).
Rule 2: Avoid monotonically increasing keys. Raw timestamps as partition keys create write hotspots even in hash-partitioned systems when used as the sole dimension.
Rule 3: Avoid unbounded partitions. A single partition lives on one node. If one partition key maps to 50 GB of data, that node becomes a bottleneck. Use composite partition keys to bucket data:
# BAD: sensor_id alone leads to unbounded partition growth
# All readings for one sensor end up on one node forever
#
# CREATE TABLE sensor_readings (
# sensor_id UUID,
# ts TIMESTAMP,
# value FLOAT,
# PRIMARY KEY (sensor_id, ts)
# );
# GOOD: bucket by month to cap partition size
# Partition key = (sensor_id, year_month)
#
# CREATE TABLE sensor_readings_bucketed (
# sensor_id UUID,
# year_month TEXT, -- e.g., "2024-01"
# ts TIMESTAMP,
# value FLOAT,
# PRIMARY KEY ((sensor_id, year_month), ts)
# );
# Python simulation of partition key design
def make_partition_key(sensor_id: str, timestamp: str) -> str:
"""
Create a bucketed partition key from sensor_id + year_month.
Caps partition size to one month of readings per sensor.
"""
year_month = timestamp[:7] # e.g., "2024-01" from "2024-01-15T12:00:00"
return f"{sensor_id}:{year_month}"
# Example writes
readings = [
("sensor_42", "2024-01-01T10:00:00", 23.5),
("sensor_42", "2024-01-15T14:00:00", 24.1),
("sensor_42", "2024-02-01T09:00:00", 22.8),
]
for sensor_id, ts, value in readings:
pk = make_partition_key(sensor_id, ts)
print(f" Row → partition_key={pk!r}, clustering_key={ts!r}, value={value}")
# Output:
# Row → partition_key='sensor_42:2024-01', clustering_key='2024-01-01T10:00:00', value=23.5
# Row → partition_key='sensor_42:2024-01', clustering_key='2024-01-15T14:00:00', value=24.1
# Row → partition_key='sensor_42:2024-02', clustering_key='2024-02-01T09:00:00', value=22.8
The (sensor_id, year_month) composite partition key caps partition size to one month of readings per sensor. Queries within a month remain a single-partition read. Queries spanning months need multiple partition reads — a deliberate and predictable trade-off.
Now that we know where data goes, we need to understand how it gets written and stored. This is where the Log-Structured Merge Tree (LSM Tree) comes in — the storage engine at the heart of Cassandra, HBase, RocksDB, and most modern wide-column systems.
Traditional relational databases use B-Trees as their on-disk structure. B-Trees are brilliant for read-heavy workloads: they keep data sorted, support range queries, and are cache-friendly. But they have one weakness at write-intensive scale: random writes. When you update a value in a B-Tree, the database must find the existing page on disk, load it into memory, modify it, and write it back. At millions of writes per second, this random I/O creates a storm of disk head movements.
The LSM Tree’s key insight is: never modify data on disk. Instead, buffer all writes in memory first.
When a write arrives, it goes to two places simultaneously:
No random I/O. No page lookup. No read-modify-write.
*In the above flow, the WAL write is sequential (fast on both HDDs and SSDs), and the MemTable write is a pure memory operation. We have turned random I/O into sequential I/O — the core performance win of the LSM design*.
The MemTable absorbs writes until it reaches a size threshold (e.g., 128 MB). At that point, it is **frozen** (made immutable) and a new MemTable is opened for fresh writes. The frozen MemTable is **flushed to disk** as a new SSTable file.
### SSTables: Immutable Sorted Files
A **Sorted String Table (SSTable)** is an immutable, sorted, compressed file on disk. Because the MemTable was sorted in memory, the resulting SSTable is already sorted by key — no re-sorting step needed during flush.
MemTable (in memory) SSTable (on disk) ┌────────────────────┐ ┌──────────────────────────────┐ │ alice → 29 │ │ Block 1: alice:29, bob:35 │ │ bob → 35 │ flush │ Block 2: carol:22, dave:41 │ │ carol → 22 │ ─────► │ Index: {alice:0, carol:512} │ │ dave → 41 │ │ Bloom Filter: [0,1,0,1,1…] │ └────────────────────┘ └──────────────────────────────┘
*In this diagram, the MemTable (left) is a sorted in-memory structure. On flush, it becomes an SSTable (right) with three components: the data blocks, an index for fast lookup, and a Bloom filter for probabilistic membership testing. We will revisit these during the read path.*
Each SSTable has an **index block** mapping keys to byte offsets, enabling binary search within the file. It also carries a **Bloom filter** — a compact probabilistic structure that can answer "is key X definitely NOT in this file?" with zero false negatives, dramatically reducing unnecessary disk reads.
### The Growing Problem: Too Many SSTables
Here is where the system starts to strain. Each MemTable flush creates a new SSTable. After a few hours of heavy writes, you might have hundreds of SSTables on disk, creating three problems:
**Problem 1 — Read Amplification.** When we read key `alice`, we must check every SSTable that *might* contain her. The Bloom filter helps, but if she has been updated 10 times, her key exists in 10 SSTables. At N SSTables, that is up to N disk reads per query in the worst case.
**Problem 2 — Space Amplification.** Each update to `alice` leaves the old version in an older SSTable. Neither is deleted until someone merges them. Disk space bloats with obsolete versions.
**Problem 3 — Tombstone Accumulation.** Deletes in LSM systems do not erase data. Instead, they write a special marker called a **tombstone**. The tombstone hides the old value during reads, but the old value still occupies disk space. If tombstones pile up, they slow both reads and compaction.
The fix for all three problems is **compaction**.
***
## Part III: Compaction
### The Night-Time Janitor
Compaction is the background process that periodically merges SSTables together. More precisely, compaction:
1. Selects a set of SSTables as candidates.
2. Merges them using a **k-way merge sort** (like merging K sorted arrays).
3. Drops obsolete versions (keeps the newest for each key).
4. Drops tombstones (along with their associated deleted data, once old enough).
5. Writes a new, larger SSTable.
6. Deletes the old SSTables, reclaiming disk space.
BEFORE COMPACTION:
SSTable 1: [alice:v1, bob:v2, dave:v1] SSTable 2: [alice:v2, carol:v1] SSTable 3: [bob:v3, carol:TOMBSTONE]
AFTER COMPACTION (k-way merge, keep latest, drop tombstoned):
SSTable merged: [alice:v2, bob:v3, dave:v1] (carol wiped: carol:v1 deleted by tombstone, tombstone itself dropped)
*In the example above, compaction resolves three SSTables into one. `alice:v1` is superseded by `alice:v2`. `bob:v2` is superseded by `bob:v3`. `carol:v1` is deleted by the tombstone. The output is clean and compact*.[^10]
### The Fundamental Trade-off Triangle
Compaction is not free. Every byte it merges is a byte written to disk *again* — called **write amplification**. We are always balancing three competing metrics:
Write Amplification
▲
│
│ ┌──── Leveled ─────┐
│ │ low space amp, │
│ │ high write amp │
│ └──────────────────┘
│
│ ┌── Size-Tiered ───┐
│ │ low write amp, │
│ │ high space amp │
│ └──────────────────┘
│ ────────────────┼──────────────────────► Space Amplification
│
Read Amplification (the third axis of the trade-off) ```
This triangle illustrates that no strategy eliminates all three problems simultaneously. All compaction strategies are moving along these axes, not off them.
The default in Cassandra and ScyllaDB.
The Idea: Group SSTables by size. When you have enough SSTables of similar size (typically 4), merge them into one larger SSTable. The result grows into the next size tier, and the process repeats. Data cascades downward through size tiers like a waterfall.
TIER 1 (small ~10MB): [A] [B] [C] [D] ──► merge ──► [ABCD ~40MB]
│
TIER 2 (medium ~40MB): [W] [X] [Y] [ABCD] ─► merge ──► [WXYZ-ABCD ~160MB]
│
TIER 3 (large ~160MB): [P] [Q] [WXYZ...] ── ... ──► [huge sealed file]
In this cascade, small SSTables are born from MemTable flushes. Four similar-sized files trigger a merge, yielding a file 4× larger. This larger file eventually meets others its size, triggering another merge. The intuition is that data “falls” down tiers as it ages.
from collections import defaultdict
def select_stcs_candidates(
sstables: list[dict],
min_threshold: int = 4,
bucket_low: float = 0.5,
bucket_high: float = 1.5
) -> list[dict]:
"""
Group SSTables by size into buckets.
Select the first bucket that meets the min_threshold for compaction.
Each SSTable dict: {'name': str, 'size_mb': float}
"""
if not sstables:
return []
sorted_tables = sorted(sstables, key=lambda s: s["size_mb"])
buckets = defaultdict(list)
for table in sorted_tables:
placed = False
for bucket_key in list(buckets.keys()):
avg = sum(t["size_mb"] for t in buckets[bucket_key]) / len(buckets[bucket_key])
if bucket_low * avg <= table["size_mb"] <= bucket_high * avg:
buckets[bucket_key].append(table)
placed = True
break
if not placed:
buckets[table["size_mb"]].append(table) # start a new bucket
# Return the first bucket with enough members
for bucket in buckets.values():
if len(bucket) >= min_threshold:
return bucket[:min_threshold]
return []
# Example
sstables = [
{"name": "ss1", "size_mb": 10},
{"name": "ss2", "size_mb": 11},
{"name": "ss3", "size_mb": 12},
{"name": "ss4", "size_mb": 10},
{"name": "ss5", "size_mb": 85},
{"name": "ss6", "size_mb": 90},
]
candidates = select_stcs_candidates(sstables)
print("Compaction candidates:", [c["name"] for c in candidates])
# Output: Compaction candidates: ['ss1', 'ss4', 'ss2', 'ss3']
# (The four ~10MB files are selected; the 85-90MB files wait for more peers)
Pros of STCS:
Cons of STCS:
TombstoneOverwhelmingException in Cassandra under delete-heavy workloads.The default in RocksDB, LevelDB, and optionally in Cassandra.
LCS was designed to fix the read amplification and space amplification problems of STCS, at the cost of higher write amplification.
The Core Invariant: At every level except L0, no two SSTables have overlapping key ranges. A single read can be answered by checking at most one SSTable per level. This is the guarantee that makes reads predictable.
L0: [A-Z] [A-Z] [A-Z] [A-Z] ← newly flushed, may overlap freely
L1: [A-E] [F-J] [K-O] [P-Z] ← non-overlapping, ≤10MB each
L2: [A-B] [C-D] [E-F] ... [Y-Z]← non-overlapping, ≤100MB each
L3: [A] [B] [C] ... [Z] ← non-overlapping, ≤1000MB each
In this diagram, L0 has four SSTables that all span the full key range A–Z (freshly flushed from MemTables). L1 onwards are partitioned: each SSTable owns a disjoint key range. A read for key “G” checks at most one file per level — the one whose range includes G.
When L0 accumulates too many files (usually 4), a compaction is triggered: all L0 files are merged with all overlapping L1 files, producing new non-overlapping L1 files. If L1 then exceeds its size budget, the cascade continues to L2, and so on.
def find_overlapping_sstables(
sstables_level: list[dict],
input_ranges: list[tuple]
) -> list[dict]:
"""
Given SSTables in a level (each with min_key, max_key),
return those whose key range overlaps any of the input_ranges.
Two ranges [a,b] and [c,d] overlap iff: a <= d AND c <= b
"""
overlapping = []
for sstable in sstables_level:
for (range_min, range_max) in input_ranges:
if sstable["min_key"] <= range_max and range_min <= sstable["max_key"]:
overlapping.append(sstable)
break
return overlapping
# L0 SSTables to compact (all span full range)
l0_ranges = [("A", "Z"), ("A", "M"), ("B", "F")]
# Current L1 SSTables with non-overlapping ranges
l1_tables = [
{"name": "l1_1", "min_key": "A", "max_key": "E"},
{"name": "l1_2", "min_key": "F", "max_key": "J"},
{"name": "l1_3", "min_key": "K", "max_key": "O"},
{"name": "l1_4", "min_key": "P", "max_key": "Z"},
]
overlaps = find_overlapping_sstables(l1_tables, l0_ranges)
print("L1 files to merge with L0:", [s["name"] for s in overlaps])
# Output: L1 files to merge with L0: ['l1_1', 'l1_2', 'l1_3', 'l1_4']
# (All L1 files participate because L0 spans A-Z)
Pros of LCS:
Cons of LCS:
Designed specifically for time-series workloads in Cassandra.
Time-series data has a beautiful property: old data is never updated. Yesterday’s sensor readings will not be modified today. TWCS exploits this.
The Idea: Partition SSTables by the time window of the data they contain. Compact within a window using STCS, but never merge across windows. A window’s SSTables are eventually compacted into a single sealed SSTable and left untouched forever.
WINDOW: Jan 1-7 → compact all Jan 1-7 SSTables → one sealed SSTable ✓
WINDOW: Jan 8-14 → compact all Jan 8-14 SSTables → one sealed SSTable ✓
WINDOW: Jan 15-21 → [actively compacting]
WINDOW: Jan 22+ → [current, receiving writes]
When data ages beyond its TTL (time-to-live), entire SSTable files are dropped — no key-by-key compaction needed. Deletion is essentially free. TWCS achieves remarkably low write and space amplification for time-series, but breaks entirely if you write out-of-order data or update old records. Never use TWCS unless your workload is truly append-only and time-ordered.
| Strategy | Write Amp | Read Amp | Space Amp | Best For |
|---|---|---|---|---|
| STCS | Low | High | High | Write-heavy: IoT, logs, events |
| LCS | High | Low | Low | Read-heavy: user profiles, config |
| TWCS | Very Low | Low* | Very Low | Time-series, append-only data |
*TWCS read amplification is low because data is naturally partitioned by time, reducing the SSTable search space per query.
Now we can trace the full lifecycle of a read request and see how every component works together:
CLIENT REQUEST: GET alice
│
▼
┌──────────────────────────────────────────────────────┐
│ Step 1: Route via Consistent Hashing │
│ hash("alice") = 1,234,567,890 │
│ Token → Node B │
└────────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Step 2: Check MemTable (in-memory) │
│ Is alice in the current MemTable? → No │
└────────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Step 3: Consult Bloom Filters for each SSTable │
│ SSTable 1 Bloom: alice? → Definitely NOT ✗ skip │
│ SSTable 2 Bloom: alice? → Possibly YES ✓ check │
│ SSTable 3 Bloom: alice? → Definitely NOT ✗ skip │
└────────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Step 4: Binary search SSTable 2 index │
│ Index: {alice: byte_offset=2048} │
│ Read data block at offset 2048 │
│ Found: alice → v2 │
└────────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Step 5: Return latest version │
│ Only SSTable 2 matched → return alice:v2 │
└──────────────────────────────────────────────────────┘
In this five-step trace, partitioning (step 1) routes us to the right node, the MemTable (step 2) catches the freshest data, Bloom filters (step 3) eliminate most SSTables without disk reads, the index (step 4) pinpoints the exact byte offset, and version reconciliation (step 5) returns the correct answer. Compaction makes step 3 faster by reducing the total number of SSTables — fewer files means fewer filter checks.
These questions appear regularly in senior and staff-level system design interviews at companies using wide-column stores at scale.
Because they create write hotspots in range-partitioned systems (HBase) and unbalanced access patterns in hash-partitioned systems when combined with sequential clustering keys. Use a high-cardinality prefix — userID:timestamp is far safer than a raw timestamp as a row key.
This is called compaction debt. Symptoms: increasing SSTable count, degrading read latency, and TombstoneOverwhelmingException in Cassandra. Solutions:
At Netflix scale, compaction lag is monitored as a key SLA metric, and separate compaction pools are used for different column families to prevent resource contention.
Bloom filters have no false negatives — if it says “definitely not here,” trust it completely. They can have false positives — “possibly here” — meaning we might check an SSTable that does not contain the key. The false positive rate is tunable: lower FPR means a larger filter (more memory). A 1% FPR is common; pushing to 0.1% roughly doubles the filter’s memory footprint.
HBase favors strong consistency (master-based architecture) and excellent sequential range scans (range-partitioned). Cassandra favors availability and partition tolerance (leaderless) and even write distribution (hash-partitioned). Use HBase when you need ACID-like row-level operations or scan-heavy analytics; use Cassandra when you need multi-region availability and raw write throughput.
Wide-column stores are not magic — they are a carefully negotiated set of trade-offs, each made to serve a specific problem. Partitioning by consistent hashing distributes write load at the cost of natural ordering. The LSM Tree absorbs write bursts at the cost of background compaction work. Compaction strategies each trade one form of amplification for another.
As engineers, our job is to understand which trade-offs are acceptable for our workload:
The filing cabinet analogy holds all the way down: the goal is never a perfect system. It is a system whose imperfections land exactly where you can afford them.