“The problem was never SQL itself. The problem was that the architecture underneath SQL was designed for a world with one server, one city, and one timezone.”
Imagine a bank in the 1980s. It has a single branch, a single vault, and a single teller window. When you walk in, the teller checks the ledger, verifies your balance, and processes your transaction. The system is perfectly consistent. If you withdraw $100, the ledger updates immediately. No one else can see your old balance. This is ACID compliance in its purest, most intuitive form.
Now imagine the bank explodes in popularity. Within years, millions of customers want to transact simultaneously, from New York to Tokyo. The bank has two choices:
Option A: Build a bigger, faster branch. Install a stronger vault, hire more tellers, upgrade the pneumatic tube system. This is vertical scaling — just make the one machine more powerful.
Option B: Open hundreds of smaller branches spread across the globe. Each branch handles local customers. This is horizontal scaling — spread the load across many machines.
Option A has a hard ceiling. There’s only so big a single vault can be. Option B sounds great, but now we have a new problem: What if a customer withdraws from the New York branch at the exact moment someone else is wiring money into their account from Tokyo? Both branches have their own ledgers. How do they stay synchronized? Who arbitrates the truth?
This, in one parable, is the central problem of distributed databases — and the reason NewSQL was born.
In the 1970s, Edgar Codd introduced the relational model, and for the next three decades, relational databases like Oracle, DB2, and MySQL were the undisputed kings of data management. They brought us something precious: ACID guarantees.
# A classic ACID transaction in Python using psycopg2 (PostgreSQL)
import psycopg2
conn = psycopg2.connect("dbname=bank user=admin")
conn.autocommit = False # We control the transaction boundary
cursor = conn.cursor()
try:
cursor.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
cursor.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
conn.commit() # <-- Atomicity: Both succeed or neither does
print("Transfer committed.")
except Exception as e:
conn.rollback() # <-- All-or-nothing guarantee
print(f"Transaction failed, rolled back: {e}")
finally:
cursor.close()
conn.close()
This code works beautifully — on a single machine. But in the 2000s, the internet happened. Companies like Google, Amazon, and Facebook were suddenly managing billions of rows, petabytes of data, and millions of concurrent users. The single server hit its physical ceiling.
Researchers at MIT and Carnegie Mellon (including Michael Stonebraker, who later built VoltDB) studied where traditional databases spent their time. The results were startling. On a typical OLTP workload:
Where Traditional SQL Spends Its Time (OLTP Workload)
────────────────────────────────────────────────────
Buffer Pool Management ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 34%
Locking & Latching ▓▓▓▓▓▓▓▓▓▓▓ 25%
Write-Ahead Logging ▓▓▓▓▓▓▓▓ 19%
Actual Useful Work ▓▓▓▓▓▓▓▓ 12%
Other Overhead ▓▓▓▓▓ 10%
────────────────────────────────────────────────────
Only 12% of the database’s effort was on actual work. The rest was managing the machinery of traditional architecture. This was the “broken state” that made a revolution inevitable.
When the scaling crisis hit, engineers at Google (BigTable, 2006) and Amazon (Dynamo, 2007) published landmark papers describing a new approach. The community called it NoSQL. The philosophy was bold: relax the consistency guarantees, and scaling becomes easy.
NoSQL systems adopted the BASE model:
This is like our bank branches operating independently. The New York branch processes withdrawals without calling Tokyo. Fast? Absolutely. But now a clever customer could withdraw their full balance from both branches simultaneously before the ledger syncs. The bank would call that fraud. The database calls it a consistency violation.
Eric Brewer’s CAP Theorem (formalized in 2002) stated that in a distributed system, you can only guarantee two of three properties:
Consistency
/\
/ \
/ \
/ \
/ C + A \ ← Google Spanner aims here
/____________\
/ \
/ \
Availability ─────────── Partition Tolerance
↑ ↑
Cassandra HBase, Zookeeper
DynamoDB
(AP systems) (CP systems)
In the diagram above, we are deliberately placing systems at the vertices to show their primary tradeoff. In practice, all distributed systems must tolerate partitions — so the real choice is between consistency (CP) and availability (AP) when a network partition occurs.
NoSQL systems largely chose AP (Availability + Partition Tolerance) — they stay up and respond, but data can temporarily be inconsistent. This was acceptable for product recommendations or social media “likes.” It was catastrophic for financial transactions, healthcare records, and inventory management.
The industry had traded one problem for another. Millions of lines of application code now had to compensate for missing ACID guarantees, adding complex retry logic, conflict resolution handlers, and “read-your-writes” hacks. There had to be a better way.
The term NewSQL was coined by analyst Matthew Aslett in 2011. The insight was deceptively simple: the problem wasn’t SQL or the relational model — the problem was the single-node architecture underneath it. What if we redesigned the engine from scratch, keeping SQL and ACID, but building it natively for a distributed, multi-node world?
The NewSQL promise:
SQL expressiveness + ACID guarantees + NoSQL-style horizontal scalability
It wasn’t about compromising. It was about rebuilding the bank — not with a single giant branch, but with a network of branches that share a single, globally synchronized ledger.
NewSQL systems fall into three broad architectural families, grouped by how they achieve distributed ACID:
| Family | Core Mechanism | Examples | Best For |
|---|---|---|---|
| New Architecture | Built from scratch with distributed-first design | Google Spanner, CockroachDB | Global, geo-distributed OLTP |
| SQL Engines over NoSQL | SQL layer on top of distributed KV stores | TiDB (over TiKV) | MySQL migration + horizontal scale |
| In-Memory NewSQL | All data in RAM, partitioned, serial execution | VoltDB, MemSQL | Ultra-low-latency OLTP |
We’ll explore each family in depth after we understand the foundational architecture that makes them all work.
The first design principle of all NewSQL systems is shared-nothing architecture. In a shared-nothing cluster, every node owns its own CPU, memory, and disk. There are no shared resources. This is radically different from traditional SQL, where all nodes had to share the same storage.
Traditional SQL (Shared-Everything) NewSQL (Shared-Nothing)
════════════════════════════════ ════════════════════════════════
┌────────┐ ┌────────┐ ┌─────────────────────────────┐
│ Node 1 │ │ Node 2 │ │ CockroachDB Cluster │
└────┬───┘ └───┬────┘ │ │
│ │ │ ┌───────┐ ┌───────┐ │
└────┬────┘ │ │ Node 1│ │ Node 2│ │
┌────▼────┐ │ │CPU+RAM│ │CPU+RAM│ │
│ SHARED │ │ │+Disk │ │+Disk │ │
│ DISK │ ← bottleneck │ └───────┘ └───────┘ │
└─────────┘ │ ┌───────────┐ │
│ │ Node 3 │ │
│ │ CPU+RAM │ │
│ │ +Disk │ │
│ └───────────┘ │
│ No shared storage! │
└─────────────────────────────┘
In the left diagram, both nodes compete for the same disk. A bottleneck at any layer is a bottleneck for the whole system. In the right diagram, each node is fully self-sufficient. Adding a new node to the cluster is as easy as plugging in a new machine.
NewSQL systems don’t store data as relational tables internally. Under the hood, they convert all data into a sorted, distributed key-value store. Each row in a SQL table gets mapped to a key in the KV store, and the global keyspace is divided into contiguous ranges (also called shards or tablets).
# Conceptual illustration: How a SQL table maps to a KV store
# (This is a simplification of CockroachDB's internal encoding)
# SQL Table: accounts
# +----+-------+---------+
# | id | name | balance |
# +----+-------+---------+
# | 1 | Alice | 1000 |
# | 2 | Bob | 500 |
# | 3 | Carol | 2500 |
# Internal KV Representation
kv_store = {
"/table/accounts/1/id": 1,
"/table/accounts/1/name": "Alice",
"/table/accounts/1/balance": 1000,
"/table/accounts/2/id": 2,
"/table/accounts/2/name": "Bob",
"/table/accounts/2/balance": 500,
"/table/accounts/3/id": 3,
"/table/accounts/3/name": "Carol",
"/table/accounts/3/balance": 2500,
}
# These key-value pairs are sorted and split into RANGES
range_1 = {k: v for k, v in kv_store.items() if k <= "/table/accounts/2/"}
range_2 = {k: v for k, v in kv_store.items() if k > "/table/accounts/2/"}
# Each range is then replicated across 3+ nodes for fault tolerance
print(f"Range 1 contains {len(range_1)} entries -> replicated to nodes 1,2,3")
print(f"Range 2 contains {len(range_2)} entries -> replicated to nodes 2,3,4")
This mapping is powerful. The SQL layer at the top translates your SELECT
query into KV operations. The storage layer handles distribution and replication.
The two concerns are completely decoupled.
One of the clearest illustrations of NewSQL’s layered design comes from CockroachDB’s published architecture. Each layer has a single, focused responsibility:
CockroachDB Architecture: Five Clean Layers
══════════════════════════════════════════════
┌──────────────────────────────────────────┐
│ Layer 1: SQL │
│ - Parses and plans SQL queries │
│ - Translates SQL → KV operations │
│ - Supports standard PostgreSQL dialect │
├──────────────────────────────────────────┤
│ Layer 2: Transactional │
│ - Ensures ACID for multi-key changes │
│ - Manages write intents & timestamps │
│ - Handles conflict detection │
├──────────────────────────────────────────┤
│ Layer 3: Distribution │
│ - Routes requests to correct range │
│ - Manages range splits and merges │
│ - Presents cluster as single entity │
├──────────────────────────────────────────┤
│ Layer 4: Replication (Raft) │
│ - Replicates ranges across nodes │
│ - Leader election per range │
│ - Ensures consensus before commit │
├──────────────────────────────────────────┤
│ Layer 5: Storage (Pebble / RocksDB) │
│ - Reads and writes KV data to disk │
│ - Uses LSM Trees for efficient writes │
│ - Supports MVCC timestamps │
└──────────────────────────────────────────┘
Each layer communicates only with its immediate neighbor. This separation of concerns is why NewSQL systems can evolve their storage engine independently from their SQL parser — a design philosophy borrowed from modern operating systems.
This is where the real magic happens. To maintain consistency across distributed nodes, NewSQL systems rely on consensus protocols — algorithms that allow a cluster of machines to agree on a single value, even in the presence of node failures.
Imagine three replicas of the same range:
Node A (Leader) Node B (Follower) Node C (Follower)
balance=1000 balance=1000 balance=1000
│
▼
Client writes: SET balance=900
│
Node A updates ✓
│
[Network partition — Node B and C don't receive the update]
│
Another client reads from Node B
▼
Returns 1000 ← STALE DATA!
Without consensus, we have a split-brain scenario. Two clients see two different truths. This is exactly what NewSQL must prevent.
Raft (developed by Diego Ongaro at Stanford, 2014) was designed explicitly to be easier to understand than its predecessor, Paxos. It is used by CockroachDB, TiDB (via TiKV), and YugabyteDB. Every range in a NewSQL cluster has exactly one Raft group.
Every node in a Raft group is in one of three states:
┌─────────────┐
│ Follower │◄───── All nodes start here
└──────┬──────┘
│ Election timeout expires
│ (no heartbeat from leader)
▼
┌─────────────┐
│ Candidate │ ──── Votes for itself, requests votes
└──────┬──────┘
│ Receives majority votes
▼
┌─────────────┐
│ Leader │ ──── Handles all writes for this range
└─────────────┘
│ Sends periodic heartbeats to followers
▼
┌─────────────┐
│ Follower │ ◄─── Steps down if it sees a higher term
└─────────────┘
Notice in this diagram that the state machine is intentionally simple: three states, clear transitions. This is Raft’s core design philosophy — reduce the state space so failures are easier to reason about.
The write flow in Raft is equally clean:
# Conceptual simulation of a Raft write operation
class RaftCluster:
def __init__(self, nodes: list, quorum_size: int):
self.nodes = nodes
self.leader = nodes
self.quorum_size = quorum_size # typically (n/2) + 1
self.committed_log = []
def write(self, entry: dict) -> bool:
# Step 1: Leader appends to its own log
self.leader["log"].append(entry)
# Step 2: Leader sends AppendEntries RPC to all followers
acknowledgements = 1 # Leader counts itself
for follower in self.nodes[1:]:
if self._send_append_entries(follower, entry):
acknowledgements += 1
# Step 3: Commit only when a majority (quorum) acknowledges
if acknowledgements >= self.quorum_size:
self.committed_log.append(entry)
print(f"[COMMITTED] Entry '{entry}' — {acknowledgements}/{len(self.nodes)} nodes agree")
return True
else:
print(f"[FAILED] Only {acknowledgements} nodes acknowledged — quorum not reached")
return False
def _send_append_entries(self, follower: dict, entry: dict) -> bool:
# In real Raft, this is an RPC call. We simulate latency/failure here.
import random
if follower["alive"] and random.random() > 0.1: # 90% success rate
follower["log"].append(entry)
return True
return False
# Example: 5-node cluster, quorum = 3
nodes = [
{"id": i, "log": [], "alive": True} for i in range(5)
]
# Simulate one node being offline
nodes["alive"] = False
nodes["alive"] = False[^2]
cluster = RaftCluster(nodes, quorum_size=3)
cluster.write({"table": "accounts", "key": "user:1:balance", "value": 900})
The key insight: a write is only committed when a majority (quorum) of nodes acknowledge it. With 5 nodes, we need 3 confirmations. This means the system can tolerate 2 simultaneous node failures and still make progress.
┌─────────────────┬──────────────────────┬──────────────────────┐
│ Property │ Multi-Paxos │ Raft │
├─────────────────┼──────────────────────┼──────────────────────┤
│ Used by │ Google Spanner │ CockroachDB, TiDB │
│ Leader election │ Complex, multi-phase │ Simple timer-based │
│ Log gaps │ Allowed (complex) │ Not allowed (simpler)│
│ Understandability│ Very hard │ Designed to be clear │
│ Performance │ Comparable │ Comparable │
└─────────────────┴──────────────────────┴──────────────────────┘
Google chose Multi-Paxos for Spanner partly for historical reasons and partly because their internal expertise predated Raft. Most modern NewSQL systems default to Raft for its clarity and simpler implementation.
In traditional SQL, when a transaction reads data, it acquires a shared lock. When another transaction writes, it acquires an exclusive lock. These locks conflict — readers block writers, and writers block readers. At scale, under thousands of concurrent transactions, the system grinds to a halt.
NewSQL systems solve this with MVCC. Instead of locking a row during a read, the system keeps multiple versions of the row, each tagged with a timestamp. Readers always read from the version that was committed before their transaction started. They never block writers, and writers never block readers.
MVCC: A Timeline of Row Versions
════════════════════════════════════════════════════════
Row: accounts/user:1/balance
Time ──────────────────────────────────────────▶
T=10 │ version@T10: {balance: 1000} ← Write by Tx1
│
T=20 │ version@T20: {balance: 900} ← Write by Tx2
│
T=30 │ version@T30: {balance: 850} ← Write by Tx3
│
T=35 │ Tx4 begins: "READ AS OF T25"
│ → Returns version@T20 (900) ✓
│ → Does NOT block ongoing writes!
════════════════════════════════════════════════════════
Notice that in this diagram, Tx4 reads a consistent snapshot of the data as it existed at T=25, without interfering with any concurrent write operations. This is the “snapshot isolation” guarantee that makes NewSQL reads non-blocking.
# Simplified MVCC version store simulation in Python
from dataclasses import dataclass, field
from typing import Optional, List
@dataclass
class Version:
timestamp: int
value: any
is_committed: bool = False
class MVCCStore:
def __init__(self):
self.versions: dict[str, List[Version]] = {}
def write(self, key: str, value: any, timestamp: int):
"""Write a new version of the key with a given timestamp."""
if key not in self.versions:
self.versions[key] = []
v = Version(timestamp=timestamp, value=value, is_committed=False)
self.versions[key].append(v)
self.versions[key].sort(key=lambda x: x.timestamp)
return v
def commit(self, key: str, timestamp: int):
"""Commit the version at the given timestamp."""
for v in self.versions.get(key, []):
if v.timestamp == timestamp:
v.is_committed = True
return True
return False
def read(self, key: str, read_timestamp: int) -> Optional[any]:
"""
Return the most recent COMMITTED version
that was written at or before read_timestamp.
"""
best = None
for v in self.versions.get(key, []):
if v.is_committed and v.timestamp <= read_timestamp:
best = v # Keep the latest one that qualifies
return best.value if best else None
# Demo
store = MVCCStore()
# Two concurrent writes at different timestamps
v1 = store.write("user:1:balance", 1000, timestamp=10)
v2 = store.write("user:1:balance", 900, timestamp=20)
v3 = store.write("user:1:balance", 850, timestamp=30)
# Commit T=10 and T=20, but NOT T=30 (in-flight transaction)
store.commit("user:1:balance", 10)
store.commit("user:1:balance", 20)
# T=30 is intentionally NOT committed yet
# A reader starting at T=25 sees the world as of T=20
balance = store.read("user:1:balance", read_timestamp=25)
print(f"Balance as of T=25: {balance}") # Output: 900
# A reader starting at T=35 still sees T=20 (T=30 is uncommitted)
balance = store.read("user:1:balance", read_timestamp=35)
print(f"Balance as of T=35: {balance}") # Output: 900, NOT 850
MVCC requires that every node can assign globally meaningful timestamps. But in a distributed system, no two clocks are perfectly synchronized. A naive implementation where Node A assigns T=100 and Node B assigns T=99 could silently reverse causal order.
CockroachDB uses Hybrid Logical Clocks (HLC) — a combination of physical wall clock time and a logical counter. The HLC ensures:
Google took timestamp management a step further with TrueTime, the secret
weapon embedded inside Google Spanner. Rather than a single point in time,
TrueTime returns an interval: [earliest, latest] — a bounded range of
uncertainty about the current moment.
TrueTime API
════════════════════════════════════════
Call: TT.now()
Returns: [T_earliest, T_latest]
Example: [16:04:23.001, 16:04:23.007]
───────────────────────────
└──────┬──────────────┘
ε = 6ms uncertainty window
(ensured by GPS + atomic clocks
in Google data centers)
════════════════════════════════════════
If TT.after(t) → True means t has definitely passed
If TT.before(t) → True means t has definitely not passed
Commit Wait:
Before committing Tx at timestamp Ts,
Spanner WAITS until TT.after(Ts) is True.
This guarantees no future transaction
can get a timestamp ≤ Ts.
This is the key insight: by using atomic clocks and GPS receivers in data centers, Google bounds clock uncertainty to ~7ms. When Spanner wants to commit a transaction at timestamp T, it simply waits out the uncertainty window before releasing the commit. This “commit wait” is the price of external consistency — and it’s only milliseconds.
This gives Spanner something extraordinary: external consistency (the strongest form of consistency in distributed systems — stronger even than linearizability). If transaction T1 commits before T2 starts in real-world time, Spanner guarantees that T1’s timestamp is strictly less than T2’s.
Spanner (2012 paper, publicly available via Google Cloud as Cloud Spanner) was the first system to offer globally-distributed, externally-consistent SQL transactions. It is the system NewSQL is measured against.
Google Spanner Architecture
┌─────────────────────────────────────────────────────┐
│ Cloud Spanner │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Zone: US-E │ │ Zone: EU-W │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Spanserver│ │ │ │ Spanserver│ │ │
│ │ │ Paxos │ │ │ │ Paxos │ │ │
│ │ │ Group 1 │ │ │ │ Group 1 │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │
│ │ │ │ │ │
│ │ TrueTime │ │ TrueTime │ │
│ │ (GPS+Atomic)│ │ (GPS+Atomic)│ │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Colossus Distributed File System │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Key Characteristics of Spanner:
CockroachDB (open-source, named for its survival instincts) is the closest public analog to Spanner. It takes Spanner’s ideas and makes them deployable on any cloud, or even bare metal.
Key Characteristics of CockroachDB:
# Connecting to CockroachDB is identical to PostgreSQL
import psycopg2
# CockroachDB speaks the PostgreSQL wire protocol
conn = psycopg2.connect(
host="localhost",
port=26257,
database="bank",
user="root",
sslmode="disable"
)
cursor = conn.cursor()
# Create a table — pure standard SQL
cursor.execute("""
CREATE TABLE IF NOT EXISTS accounts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name STRING NOT NULL,
balance DECIMAL NOT NULL CHECK (balance >= 0)
)
""")
# CockroachDB handles distribution and replication transparently
cursor.execute("INSERT INTO accounts (name, balance) VALUES (%s, %s)", ("Alice", 1000))
cursor.execute("INSERT INTO accounts (name, balance) VALUES (%s, %s)", ("Bob", 500))
conn.commit()
# This transfer is ACID-compliant across the entire distributed cluster
def transfer_funds(conn, from_user, to_user, amount):
with conn:
cursor = conn.cursor()
cursor.execute(
"UPDATE accounts SET balance = balance - %s WHERE name = %s",
(amount, from_user)
)
cursor.execute(
"UPDATE accounts SET balance = balance + %s WHERE name = %s",
(amount, to_user)
)
print(f"Transferred ${amount} from {from_user} to {to_user}")
transfer_funds(conn, "Alice", "Bob", 100)
TiDB (developed by PingCAP) takes a different architectural approach. Rather than a monolithic system, it separates concerns into distinct components:
TiDB Architecture: Separation of Compute and Storage
┌─────────────────────────────────────────────────────────┐
│ TiDB Layer │
│ (Stateless SQL Compute Nodes) │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ TiDB 1 │ │ TiDB 2 │ │ TiDB 3 │ ← Scale compute │
│ └────┬───┘ └───┬────┘ └───┬────┘ independently │
└───────┼───────────┼────────────┼───────────────────────┘
│ │ │
└───────────┴────────────┘
│
┌────────▼────────────────────────┐
│ PD (Placement Driver) │
│ Metadata + Scheduling + TSO │
│ (Timestamp Oracle — central │
│ logical clock generator) │
└────────┬────────────────────────┘
│
┌──────────────┴──────────────┐
│ │
┌────▼──────────────┐ ┌──────────▼──────────────┐
│ TiKV │ │ TiFlash │
│ (Row Storage) │ │ (Column Storage) │
│ Raft consensus │ │ MPP for analytics │
│ OLTP workloads │ │ OLAP workloads │
└────────────────────┘ └─────────────────────────┘
The key innovation in TiDB is the presence of both TiKV (row-oriented) and TiFlash (column-oriented) storage layers. Notice that TiDB can route a query to either layer depending on whether it’s transactional (TiKV) or analytical (TiFlash). This is the HTAP capability — Hybrid Transactional and Analytical Processing.
What makes TiDB unique:
VoltDB takes a completely different approach to scaling. Rather than distributing a disk-based system, it keeps everything in RAM and uses aggressive partitioning to eliminate the need for locks entirely.
VoltDB: Single-Threaded Partition Model
Partition 1 Partition 2 Partition 3
(Node 1 / RAM) (Node 2 / RAM) (Node 3 / RAM)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Single │ │ Single │ │ Single │
│ Execution │ │ Execution │ │ Execution │
│ Thread │ │ Thread │ │ Thread │
│ │ │ │ │ │
│ Tx1 → done │ │ Tx3 → done │ │ Tx5 → done │
│ Tx2 → done │ │ Tx4 → done │ │ Tx6 → done │
│ │ │ │ │ │
│ NO LOCKS! │ │ NO LOCKS! │ │ NO LOCKS! │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└───────────────────┴───────────────────┘
Coordinator for cross-partition
transactions (rare case)
Because each partition has only ONE thread that runs transactions to completion serially, VoltDB eliminates the need for locking entirely. There’s no contention. This sounds limiting, but in practice, if your transactions are partitioned correctly, cross-partition coordination is rare.
VoltDB’s overhead eliminations:
The trade-off: your entire dataset must fit in memory across the cluster. It’s ideal for financial trading platforms, digital advertising bid systems, and real-time gaming — all use cases requiring sub-millisecond latency.
NewSQL systems are predominantly CP systems (Consistent + Partition Tolerant). When a network partition occurs, they favor consistency over availability — they will return an error rather than serve stale data.
C (Consistency)
/\
/ \
/ CP \
Spanner ● /──────\ ← NewSQL territory
CockroachDB ● / \
TiDB ● / \
/ \
────────────────────────────────────────
P (Partition Tolerance) A (Availability)
\ /
\ AP /
Cassandra ● \────────/ ← NoSQL territory
DynamoDB ● \ /
Couchbase ● \ /
\ /
\/
The nuance: modern systems like Google Spanner and CockroachDB do everything possible to maximize availability within the CP camp. They use replication factors of 3 or 5 to tolerate node failures while still serving consistent reads.
# A decision framework for choosing your database architecture
def choose_database(
needs_acid: bool,
scale_beyond_single_node: bool,
read_heavy: bool,
write_throughput: str, # "low", "medium", "high", "extreme"
latency_requirement: str # "relaxed", "strict", "sub-millisecond"
) -> str:
if not needs_acid and not scale_beyond_single_node:
return "Traditional SQL (PostgreSQL / MySQL)"
if not needs_acid and scale_beyond_single_node:
return "NoSQL (Cassandra / DynamoDB / MongoDB)"
if needs_acid and scale_beyond_single_node:
if latency_requirement == "sub-millisecond":
return "In-Memory NewSQL (VoltDB)"
if write_throughput == "extreme" and read_heavy:
return "HTAP NewSQL (TiDB) — separate OLTP/OLAP engines"
if latency_requirement == "strict":
# Need true external consistency
return "Google Cloud Spanner (TrueTime guarantee)"
# General distributed OLTP
return "CockroachDB (open-source, multi-cloud, PostgreSQL-compatible)"
return "Evaluate your specific constraints further"
# Test the framework
print(choose_database(
needs_acid=True,
scale_beyond_single_node=True,
read_heavy=False,
write_throughput="high",
latency_requirement="strict"
)) # → Google Cloud Spanner
print(choose_database(
needs_acid=True,
scale_beyond_single_node=True,
read_heavy=True,
write_throughput="high",
latency_requirement="relaxed"
)) # → TiDB (HTAP for mixed workloads)
The following concepts are the most frequently tested in system design and database interviews. We group them by theme rather than just listing them:
Theme 1: Consistency Guarantees
Theme 2: Replication and Fault Tolerance
(n/2) + 1 acknowledgments before commitTheme 3: Scalability Mechanisms
Theme 4: The Trade-offs You Must Articulate
| Trade-off | NewSQL position | Rationale |
|---|---|---|
| Consistency vs. Availability | Prefers consistency | ACID is non-negotiable |
| Latency vs. Durability | Adds ~2-7ms per commit | Quorum write cost |
| Flexibility vs. Schema | Fixed schema | SQL requires structure |
| Operational Simplicity | More complex than single-node | Distributed coordination |
| Cost vs. Scalability | More expensive than NoSQL | But cheaper than outages |
For decades, we ran two separate database ecosystems side-by-side:
Data was periodically extracted from OLTP systems and loaded into OLAP data warehouses via ETL (Extract, Transform, Load) pipelines — introducing hours of latency between a transaction and its availability in analytics.
TiDB’s HTAP architecture eliminates this separation. The same data is available in both row format (TiKV) for transactions and column format (TiFlash) for analytics, with real-time replication between the two:
HTAP Data Flow in TiDB
════════════════════════════════════════════════════════════════
┌─────────────┐
Application │ TiDB SQL │ Client Query Layer
Writes ────────────────►│ Server │
└──────┬──────┘
│
┌────────────▼─────────────┐
│ Raft Replication Log │
└────────────┬─────────────┘
┌─────────────────┴──────────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌───────────────────┐
│ TiKV │ │ TiFlash │
│ (Row-oriented) │ │ (Column-oriented) │
│ │ │ │
│ Users Orders │ │ Orders Revenue │
│ ─────────────── │ │ ─────────────── │
│ Alice #1001 │◄─ Async ──────►│ #1001 $99.99 │
│ Bob #1002 │ replication │ #1002 $149.99 │
│ │ │ │
│ Fast INSERTS │ │ Fast column scan │
│ Point reads │ │ Aggregations │
└──────────────────┘ └───────────────────┘
OLTP OLAP
(Transactions) (Analytics)
Both stores are consistent — no ETL pipeline needed!
════════════════════════════════════════════════════════════════
In Figure above, notice that both TiKV and TiFlash receive data from the same Raft replication log — no ETL pipeline, no batch jobs, no hours of delay. An analytical query issued moments after a transaction can see the latest transactional data. This is the HTAP breakthrough.
Let’s ground everything with a concrete interview scenario: a global fintech application processing payments.
Requirements:
The NewSQL Solution:
Global Fintech Architecture with NewSQL
══════════════════════════════════════════════════════════════
US-East EU-West APAC-Singapore
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ App Tier │ │ App Tier │ │ App Tier │
│ (Stateless) │ │ (Stateless) │ │ (Stateless) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌──────▼──────────────────────▼─────────────────────▼──────┐
│ CockroachDB / Cloud Spanner │
│ (Global NewSQL Cluster) │
│ │
│ Range: /accounts/US/ Range: /accounts/EU/ ... │
│ Leader: US-East Leader: EU-West │
│ Replicas: EU, APAC Replicas: US, APAC │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Raft / Paxos Consensus │ │
│ │ (per-range, across all regions) │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
│
┌──────▼────────────────────────┐
│ Real-time Fraud Analytics │
│ (TiFlash / BigQuery Sink) │
│ Live ML scoring, no ETL lag │
└───────────────────────────────┘
══════════════════════════════════════════════════════════════
Why NewSQL here instead of NoSQL?
What you’re trading:
NewSQL is not a silver bullet. It is a carefully engineered response to a specific category of problems: applications that need both the relational expressiveness of SQL and the elastic scale of distributed systems, without sacrificing the ACID guarantees that make data trustworthy.
Understanding NewSQL deeply means understanding:
In a senior engineering interview, when the interviewer asks “what database would you choose for a globally distributed payment system?”, the answer is never “just use Postgres” and never “just use Cassandra.” The answer is a demonstrated understanding of trade-offs — and NewSQL’s elegant reframing of the scalability-consistency tension is exactly the kind of depth that separates strong candidates from exceptional ones.