system-design.io

The Four Kingdoms of NoSQL: Document, Key-Value, Wide-Column, and Graph Databases

Part 1: The Filing Cabinet Problem

Before we write a single line of query syntax, let’s visit a scene most of us know well: a busy government office in the 1970s.

Picture rows upon rows of metal filing cabinets. Every citizen has a file. Every file lives in a specific drawer, in a specific cabinet, labeled by last name, then first name, then year of birth. The system is beautiful — it is orderly, predictable, and perfectly consistent. Any clerk can walk up, follow the rules, and retrieve exactly the right file within seconds.

This is, at its core, a relational database.

But now imagine the city grows. The population triples. Different departments start collecting different kinds of data. The health department wants to attach X-rays. The DMV wants to attach vehicle photos. The tax office wants to attach spreadsheets, some with 3 columns, some with 300. The marriage registry wants to link files between citizens: “This person is related to that person, who is also related to that person.”

The filing cabinet breaks. Not because it was poorly designed — it was perfect for what it was designed to do. It breaks because the nature of the data changed. It became: varied, massive, interconnected, and fast-moving.

This is the origin story of NoSQL.

“Not Only SQL” is not a single database system. It is a family of database philosophies, each built to solve a specific failure mode of the relational model. By the end of this article, we will understand all four members of this family, when to reach for each one, and how to reason about them in a system design interview with confidence.

Part 2: Why the Relational Model Has Edges

To appreciate the NoSQL family, we need to be honest about where SQL databases struggle. We are not dismissing SQL — it remains the gold standard for many applications. But we are identifying the pressure points.

The Three Pain Points

Pain Point 1: Schema Rigidity

In a relational database, the schema — the shape of your tables, the columns, their data types — must be defined before data is inserted. This is called schema-on-write. If your product evolves and you need to add a new field, you run a migration. For large tables with billions of rows, that migration can take hours or days and often requires downtime.

Pain Point 2: Horizontal Scaling Is Expensive

Relational databases were designed to run on a single, powerful machine. Scaling them horizontally (adding more machines) is possible but complex. Features like foreign keys, transactions, and JOINs assume that all the data a query needs lives in the same place. Once you split data across machines (called sharding), maintaining those guarantees becomes a distributed systems nightmare.

Pain Point 3: Not All Data is Tabular

A social network does not naturally look like a table. A user’s activity stream does not fit neatly into normalized rows. A product catalog where each product has wildly different attributes is painful to model in SQL. Forcing non-tabular data into rows and columns introduces impedance mismatch — the friction between how data lives in the real world and how it is stored in the database.

These three pain points are the precise problems that each NoSQL category was designed to address.

Part 3: The Big Picture — Four Kingdoms

Before we dive deep into each type, here is the landscape. Think of these four database types as four different specialists. Each is the best in the world at one thing.

+----------------------------------------------------------+
|                    NoSQL Universe                        |
|                                                          |
|  +---------------+          +------------------+        |
|  | Key-Value     |          | Document         |        |
|  | (Redis)       |          | (MongoDB)        |        |
|  |               |          |                  |        |
|  | key --> value |          | key --> {JSON}   |        |
|  |               |          |                  |        |
|  | SPEED         |          | FLEXIBILITY      |        |
|  +---------------+          +------------------+        |
|                                                          |
|  +---------------+          +------------------+        |
|  | Wide-Column   |          | Graph            |        |
|  | (Cassandra)   |          | (Neo4j)          |        |
|  |               |          |                  |        |
|  | rows x cols   |          | nodes --edges--> |        |
|  | (sparse)      |          | nodes            |        |
|  |               |          |                  |        |
|  | SCALE         |          | RELATIONSHIPS    |        |
|  +---------------+          +------------------+        |
+----------------------------------------------------------+

In this diagram, we can see that each NoSQL type occupies a different “specialty zone.” Key-value databases optimize for raw speed. Document databases optimize for schema flexibility. Wide-column databases optimize for massive-scale writes. Graph databases optimize for relationship traversal. Keep this mental map in mind as we go deeper.

Part 4: Key-Value Databases — The Speed Demon

The Real-World Analogy: A Coat Check

You walk into a restaurant. You hand the attendant your coat. They give you a numbered token: #42. When you leave, you hand back #42 and you get your coat. The attendant doesn’t know or care what is inside your coat. They only know: token maps to item.

This is a key-value database. Blindingly simple. Blindingly fast.

The Atomic Unit

A key-value store has only one concept: a key that maps to a value.

# Conceptual key-value model

store = {}

# Write
store["user:session:abc123"] = {
    "user_id": 99,
    "logged_in_at": "2025-03-21T08:00:00Z",
    "cart_items": 3
}

# Read (O(1) lookup — no scan, no join)
session = store["user:session:abc123"]
print(session)
# {'user_id': 99, 'logged_in_at': '2025-03-21T08:00:00Z', 'cart_items': 3}

The value can be anything: a string, a number, a binary blob, a JSON blob. The database does not inspect or index the value — it just stores it and retrieves it. This constraint is what makes key-value stores so fast. Lookup is O(1) — constant time, regardless of how many records are in the database.

Build-Up: Where This Gets Powerful

Now let’s add a real-world constraint. You are building a web application with 10 million users. Every page load requires checking whether the user is authenticated. If you hit your SQL database for every single request, the latency mounts and your database becomes a bottleneck. This is the session management problem.

The fix: store active sessions in a key-value database like Redis, and only consult the SQL database for deep operations (reading user profiles, writing orders, etc.).

import redis

# Connect to Redis
r = redis.Redis(host='localhost', port=6379, db=0)

# Store a session with a 30-minute TTL (Time-To-Live)
r.setex(
    name="session:abc123",
    time=1800,           # seconds
    value="user_id:99"
)

# Retrieve session on each request
session = r.get("session:abc123")

if session:
    print(f"Authenticated: {session.decode('utf-8')}")
else:
    print("Session expired. Please log in again.")

Notice the setex call — this sets a key with an automatic expiration. Most key-value stores support TTL natively. This makes them perfect for caching, ephemeral data, and rate limiting.

Key-Value Data Structure: The Internal Model

+----------------------------------+
| Key-Value Store (Hash Table)     |
|                                  |
|  Key              Value          |
|  +-----------+  +-------------+ |
|  |session:a1 |->| {user:99}   | |
|  +-----------+  +-------------+ |
|  |cache:home |->| "<html>..." | |
|  +-----------+  +-------------+ |
|  |rate:ip:x  |->| 47          | |
|  +-----------+  +-------------+ |
|  |lock:order1|->| "LOCKED"    | |
|  +-----------+  +-------------+ |
+----------------------------------+
       ^
       | O(1) hash lookup
       | (no table scan, no index walk)

In this diagram, notice that the store is essentially a hash table. Given a key, the database hashes it to find the bucket in memory where the value lives. There is no “WHERE” clause, no filtering — just a direct address lookup. This is why Redis can process millions of operations per second.

When to Reach For It

Use Case	Why Key-Value Works
Session storage	Fast R/W, natural TTL support
Caching (HTML, queries)	Offloads DB, sub-millisecond response
Rate limiting	Atomic increment operations
Leaderboards	Redis sorted sets provide ordered scores
Feature flags	Simple boolean values per user/environment

The Limitation

The simplicity is a double-edged sword. You cannot query by value. You cannot say “give me all sessions created in the last 30 minutes” unless you designed specific keys for that. The key is your only access point. If your access patterns are complex — if you need to filter, sort, or aggregate — a key-value store alone is insufficient.

Representative Databases

Redis — In-memory, supports rich data structures (lists, sets, sorted sets, streams)
Amazon DynamoDB — Managed cloud key-value + document hybrid, built for AWS-scale
Memcached — Pure caching, minimal features, maximum simplicity
Riak — Distributed, highly available, AP-focused (CAP theorem)

Part 5: Document Databases — The Flexible Librarian

The Real-World Analogy: A Filing System for Writers

Imagine a library where every book is unique. Some books have 3 chapters. Some have 30. Some include illustrations, footnotes, and multi-language appendices. Some are tiny pamphlets; some are multi-volume encyclopedias.

A relational database would force all of these into a single, rigid table structure. A document database says: “Store each book as it is. Let each one have its own shape.”

The Atomic Unit: The Document

A document database stores self-contained records called documents. Each document is typically a JSON (or BSON) object — a nested, flexible data structure.

# A MongoDB-style document for a product catalog
product = {
    "_id": "prod_001",
    "name": "Wireless Headphones Pro",
    "brand": "SoundWave",
    "price": 149.99,
    "tags": ["audio", "wireless", "noise-cancelling"],
    "specs": {
        "battery_life_hours": 30,
        "driver_size_mm": 40,
        "bluetooth_version": "5.2"
    },
    "reviews": [
        {"user": "alice", "rating": 5, "text": "Best purchase ever"},
        {"user": "bob",   "rating": 4, "text": "Great, but heavy"}
    ]
}

Notice what just happened. This single document holds a nested object (specs), an array of primitive values (tags), and an array of embedded sub-documents (reviews). In SQL, you would need at least three separate tables — products, product_specs, product_reviews — and JOIN them at query time.

The Schema-on-Read Superpower

In a document database, there is no enforced schema at the database level (unless you choose to add validation). Documents in the same collection can have different shapes. This is called schema-on-read: the application interprets the structure of the data when it reads it, not when it writes it.

# Two documents in the same "users" collection — different shapes are fine
user_basic = {
    "_id": "u001",
    "name": "Alice",
    "email": "alice@example.com"
}

user_full = {
    "_id": "u002",
    "name": "Bob",
    "email": "bob@example.com",
    "address": {
        "street": "123 Oak Ave",
        "city": "Milwaukee",
        "zip": "53202"
    },
    "preferences": {
        "theme": "dark",
        "notifications": True
    },
    "subscription_tier": "pro"
}

# Both documents live in the same "users" collection without conflict

This is a massive win for fast-moving product teams. Adding a new field to your user model no longer requires a schema migration — you just start writing the new field to new documents.

The Internal Model: Collections of Documents

  Database: "ecommerce"
  |
  +-- Collection: "products"
  |     |
  |     +-- Document: { _id: "p001", name: "...", specs: {...} }
  |     +-- Document: { _id: "p002", name: "...", tags: [...] }
  |     +-- Document: { _id: "p003", name: "...", variants: [...] }
  |
  +-- Collection: "orders"
  |     |
  |     +-- Document: { _id: "o001", user_id: "u001", items: [...] }
  |     +-- Document: { _id: "o002", user_id: "u002", items: [...] }
  |
  +-- Collection: "users"
        |
        +-- Document: { _id: "u001", name: "Alice", email: "..." }
        +-- Document: { _id: "u002", name: "Bob",   prefs: {...} }

In this diagram, we can see the hierarchy: one database contains multiple collections, and each collection contains multiple documents. Unlike SQL tables, collections do not enforce a shared schema across their documents. Each document is a self-contained, independently structured unit.

Querying Documents: Rich and Expressive

Document databases support rich queries — far beyond the simple key lookups of a key-value store.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["ecommerce"]
products = db["products"]

# Insert a document
products.insert_one({
    "_id": "p004",
    "name": "Studio Monitor Speakers",
    "price": 299.99,
    "tags": ["audio", "studio", "professional"],
    "specs": {"watts": 50, "frequency_response": "40Hz-20kHz"}
})

# Query: find all audio products under $300
results = products.find({
    "tags": "audio",
    "price": {"$lt": 300}
})

for doc in results:
    print(doc["name"], doc["price"])

# Query: find products and project only specific fields
results = products.find(
    {"specs.watts": {"$gte": 40}},
    {"name": 1, "price": 1, "_id": 0}    # projection
)

We can filter on nested fields (specs.watts), arrays (tags), apply range operators ($lt, $gte), and project only needed fields. This is far more expressive than a key-value store.

Embed vs. Reference: The Core Design Decision

When modeling data in a document database, every engineer faces the embedding vs. referencing dilemma. This is a favorite interview question.

Embed when: sub-data is always read with the parent, it’s small, and it doesn’t need to be queried independently.

Reference when: sub-data is large, frequently updated independently, or shared across multiple parent documents.

# EMBEDDING (good for comments always loaded with a post)
post = {
    "_id": "post_01",
    "title": "Intro to NoSQL",
    "comments": [                        # <-- embedded
        {"author": "alice", "text": "Great read!"},
        {"author": "bob",   "text": "Very helpful"}
    ]
}

# REFERENCING (good for authors used across many posts)
post = {
    "_id": "post_01",
    "title": "Intro to NoSQL",
    "author_id": "author_42"             # <-- reference (like a foreign key)
}

When to Reach For It

Use Case	Why Document Works
Product catalogs	Products have wildly different attributes
CMS / blog platforms	Posts have variable structures
User profiles	Users accumulate different optional data
Real-time apps	Flexible schema supports rapid iteration
Mobile backends	JSON maps naturally to mobile data models

Representative Databases

MongoDB — The market leader; rich query language, aggregation pipelines, Atlas cloud
Firestore (Firebase) — Google’s managed document DB; real-time sync for mobile
CouchDB — HTTP-based, strong offline-first support, built-in sync
Amazon DocumentDB — MongoDB-compatible, AWS-managed

Part 6: Wide-Column Databases — The Scaling Giant

The Real-World Analogy: An Infinite Spreadsheet for Time

Imagine a spreadsheet used by a weather agency to record sensor readings from 10,000 weather stations, every second of every day. The spreadsheet has millions of rows (one per station, per timestamp) and potentially thousands of columns (temperature, humidity, pressure, wind speed, visibility, UV index, etc.). Most cells are empty — not every station measures every metric.

This is a wide-column database: a massive, sparse, multi-dimensional table optimized for enormous datasets and heavy write throughput.

The Name Is Slightly Misleading

Many engineers confuse wide-column stores with columnar databases (like Apache Parquet or Redshift). They are different.

A columnar database physically stores each column on disk as a separate file — great for analytical queries (OLAP).
A wide-column database is row-based but allows each row to have a different, dynamic set of columns — great for operational workloads (OLTP at scale).

Wide-column = wide flexibility, not “columns are the primary storage unit.”

The Atomic Unit: The Row with a Dynamic Column Family

  Table: "sensor_data"
  
  Row Key         | Column Family: "readings"
  ----------------+------------------------------------------------
  "station_001"   | ts:1711001600 -> 72.3F  | ts:1711001601 -> 72.4F
  "station_002"   | ts:1711001600 -> 68.1F  | humidity:1711001600 -> 55%
  "station_003"   | ts:1711001601 -> 101.0F | pressure:1711001601 -> 29.9
  ----------------+------------------------------------------------
  
  Each row can have DIFFERENT columns.
  Missing columns take up NO storage space (sparse).

In this diagram, notice that station_001, station_002, and station_003 each have different columns. There is no “NULL” penalty for missing columns — they simply do not exist in that row. This is the sparse matrix model, and it is what enables wide-column databases to scale to petabytes without wasted storage.

Cassandra: The Wide-Column Champion

Apache Cassandra was created at Facebook to handle the inbox search problem — storing billions of messages across millions of users, with write speeds that SQL databases could not sustain. It is now one of the most widely deployed NoSQL databases in the world, used by Netflix, Apple, and Instagram.

Cassandra models data around partition keys and clustering columns.

# Using the Cassandra Python driver (cassandra-driver)
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Create keyspace (like a database)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iot_platform
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace('iot_platform')

# Create table optimized for time-series sensor data
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        station_id  TEXT,
        recorded_at TIMESTAMP,
        temperature FLOAT,
        humidity    FLOAT,
        pressure    FLOAT,
        PRIMARY KEY (station_id, recorded_at)
    ) WITH CLUSTERING ORDER BY (recorded_at DESC)
""")

# Write — Cassandra is optimized for fast writes
session.execute("""
    INSERT INTO sensor_readings 
    (station_id, recorded_at, temperature, humidity)
    VALUES ('station_001', toTimestamp(now()), 72.4, 55.3)
""")

# Read the latest 100 readings for a station
rows = session.execute("""
    SELECT recorded_at, temperature, humidity
    FROM sensor_readings
    WHERE station_id = 'station_001'
    LIMIT 100
""")

for row in rows:
    print(row.recorded_at, row.temperature)

The Partition Key: The Most Important Concept

When designing for Cassandra (a common interview topic), the partition key is the most critical decision. All rows with the same partition key are stored together on the same node. Queries that include the partition key are fast. Queries that don’t are slow (full cluster scans).

  Cassandra Cluster (3 Nodes)

  +----------+      +----------+      +----------+
  | Node A   |      | Node B   |      | Node C   |
  |          |      |          |      |          |
  | station_ |      | station_ |      | station_ |
  | 001 data |      | 002 data |      | 003 data |
  | (all     |      | (all     |      | (all     |
  | timestamps)     | timestamps)     | timestamps)
  +----------+      +----------+      +----------+
  
  Query: "Get all readings for station_001"
         --> Goes ONLY to Node A.  (FAST)
  
  Query: "Get all readings above 90°F across all stations"
         --> Must query ALL nodes.  (SLOW - avoid this)

In this diagram, we can see how Cassandra routes queries. When we query by partition key, the request goes directly to the node(s) holding that partition. Cross-partition queries require querying every node in the cluster, which is why Cassandra’s data model must be designed around your read patterns — not the other way around.

Design for Access Patterns, Not for Normalization

This is a critical mindset shift. In SQL, we normalize data to eliminate redundancy. In Cassandra, we denormalize — we duplicate data so that each query pattern has its own optimally structured table.

# Scenario: "Get all orders for a user" AND "Get all orders by date"
# In Cassandra, we create TWO tables — one for each access pattern

# Table 1: Optimized for "orders by user"
CREATE TABLE orders_by_user (
    user_id    UUID,
    created_at TIMESTAMP,
    order_id   UUID,
    total      DECIMAL,
    PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

# Table 2: Optimized for "orders by date"  
CREATE TABLE orders_by_date (
    order_date DATE,
    created_at TIMESTAMP,
    order_id   UUID,
    user_id    UUID,
    PRIMARY KEY (order_date, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

Yes, we store order data twice. Yes, this uses more disk space. But Cassandra is designed to run on commodity hardware in clusters of hundreds of nodes — disk space is cheap, cross-partition query performance is not.

When to Reach For It

Use Case	Why Wide-Column Works
IoT sensor streams	Billions of writes/day, sparse columns
Time-series data	Natural partition by entity + clustering by time
Activity logs / audit trails	Append-heavy, massive scale
Message inboxes (Facebook scale)	High write throughput, by-user queries
Real-time analytics	Fast reads by known partition keys

Representative Databases

Apache Cassandra — Gold standard for high-availability, tunable consistency
Apache HBase — Hadoop-native, strong consistency, large blob support
Google Bigtable — The original wide-column store; powers Google Search, Maps
ScyllaDB — Cassandra-compatible, written in C++ for lower latency

Part 7: Graph Databases — The Relationship Expert

The Real-World Analogy: Six Degrees of Kevin Bacon

There is a famous theory that any two people on Earth are connected by no more than six social relationships. Kevin Bacon, the actor, became the center of a game: how many movies do you need to trace before you can connect any actor to Kevin Bacon?

This problem — “what is the shortest path between node A and node B through a network of relationships?” — is trivially natural in a graph database and brutally painful in a relational database.

The Pain Point That Creates the Need

Let’s make the pain concrete. Suppose we have a SQL database for a social network. We want to find all friends-of-friends for user Alice.

-- Level 1: Alice's direct friends
SELECT friend_id FROM friendships WHERE user_id = 'alice';

-- Level 2: Friends of Alice's friends
SELECT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
WHERE f1.user_id = 'alice';

-- Level 3: Friends of friends of friends
SELECT f3.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
JOIN friendships f3 ON f2.friend_id = f3.user_id
WHERE f1.user_id = 'alice';

Each additional “hop” requires another JOIN. A 6-hop query requires 6 JOINs over potentially billions of rows. Performance collapses exponentially. This is called the JOIN explosion problem, and it is the precise reason graph databases were invented.

The Atomic Units: Nodes and Edges

A graph database stores data as:

Nodes — entities (a Person, a Movie, a Product, a Server)
Edges — relationships between nodes (KNOWS, ACTED_IN, BOUGHT, CONNECTS_TO)
Properties — key-value attributes on both nodes and edges

  (Alice)-[:KNOWS {since: 2020}]->(Bob)
  (Alice)-[:KNOWS {since: 2019}]->(Carol)
  (Bob)  -[:KNOWS {since: 2021}]->(Dave)
  (Carol)-[:KNOWS {since: 2022}]->(Dave)
  
  Visualized:
  
  [Alice] ---KNOWS---> [Bob] ---KNOWS---> [Dave]
     |                                      ^
     +----KNOWS---> [Carol] ---KNOWS--------+
     
  Find shortest path from Alice to Dave: 
     Alice -> Bob -> Dave  (2 hops)  ✓

In this diagram, notice that we can visually see the path between Alice and Dave. A graph database traverses this structure natively — starting at Alice’s node, following edges labeled KNOWS, and arriving at Dave in exactly 2 hops. No JOIN is required; the relationship itself is a first-class citizen in the database.

Neo4j and the Cypher Query Language

Neo4j is the most widely deployed graph database. It uses a query language called Cypher, designed to be visually intuitive — queries look like the graph they describe.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", 
                              auth=("neo4j", "password"))

with driver.session() as session:
    
    # Create nodes and relationships
    session.run("""
        CREATE (alice:Person {name: 'Alice', city: 'Milwaukee'})
        CREATE (bob:Person   {name: 'Bob',   city: 'Chicago'})
        CREATE (carol:Person {name: 'Carol', city: 'Denver'})
        CREATE (dave:Person  {name: 'Dave',  city: 'Milwaukee'})
        CREATE (alice)-[:KNOWS {since: 2020}]->(bob)
        CREATE (alice)-[:KNOWS {since: 2019}]->(carol)
        CREATE (bob)-[:KNOWS   {since: 2021}]->(dave)
        CREATE (carol)-[:KNOWS {since: 2022}]->(dave)
    """)
    
    # Find all friends-of-friends of Alice (2 hops) — compare to 2 SQL JOINs
    results = session.run("""
        MATCH (alice:Person {name: 'Alice'})-[:KNOWS*2]->(friend_of_friend)
        RETURN friend_of_friend.name AS name
    """)
    
    for record in results:
        print(record["name"])   # --> Dave
    
    # Find shortest path between Alice and Dave
    result = session.run("""
        MATCH path = shortestPath(
            (alice:Person {name: 'Alice'})-[:KNOWS*]-(dave:Person {name: 'Dave'})
        )
        RETURN [node IN nodes(path) | node.name] AS path_names,
               length(path) AS hops
    """)
    
    for record in result:
        print(record["path_names"])   # --> ['Alice', 'Bob', 'Dave']
        print(record["hops"])         # --> 2

The Cypher pattern (alice)-[:KNOWS*2]->(friend) reads exactly like a diagram: “match a path from Alice, following KNOWS edges exactly 2 hops, to a friend.” The * means “any number of hops,” and shortestPath() is built in. In SQL, both of these require significant engineering.

Graph Traversal vs. SQL JOIN: A Performance Story

Here is the core performance insight, beloved in system design interviews.

In a relational database, a JOIN operation must:

Load both tables (or relevant indexes) into memory
Compare every eligible row from one table to every eligible row from another
Scale as O(n × m) in the worst case

In a graph database, a traversal operation:

Starts at a known node (direct pointer lookup)
Follows edges to adjacent nodes (pointer dereference, not table scan)
Each hop is O(degree) — proportional to the number of edges on that node

The difference becomes dramatic at scale. Traversing a 6-hop path in a 15-million-node social graph takes milliseconds in Neo4j and can time out in SQL.

Property Graph vs. RDF: Two Graph Flavors

There are two dominant graph data models. For most interviews and applications, the property graph (used by Neo4j, Amazon Neptune) is the relevant model. For semantic web, knowledge graphs, and AI ontologies, the RDF (Resource Description Framework) model is used. We focus on property graphs here.

When to Reach For It

Use Case	Why Graph Works
Social networks	Friend recommendations, mutual connections
Fraud detection	Pattern matching across transaction networks
Recommendation engines	“Users who bought X also bought Y”
Knowledge graphs (AI/RAG)	Entity relationships for LLM reasoning
Network \& IT topology	Map server/dependency relationships
Access control (RBAC/ABAC)	Role chains and permission inheritance

Representative Databases

Neo4j — The dominant property graph DB; Cypher language; used by LinkedIn, eBay
Amazon Neptune — Managed graph service on AWS; supports Gremlin + SPARQL
ArangoDB — Multi-model: document + graph in one database
TigerGraph — Optimized for real-time deep-link analytics
Dgraph — GraphQL-native graph database

Part 8: The Side-By-Side — Choosing Your Weapon

This is the section that matters most in a system design interview. Knowing what each database is is table stakes. Knowing when to reach for which one is what separates a junior engineer from a senior engineer.

The Decision Matrix

Dimension	Key-Value	Document	Wide-Column	Graph
Data model	key → value	key → JSON doc	row × dynamic cols	nodes + edges
Query power	Key only	Rich (filters)	Partition + range	Traversal
Write throughput	Very high	High	Extremely high	Moderate
Scaling model	Horizontal	Horizontal	Linear horizontal	Vertical / Shard
Schema	None	Optional	Column families	Node/edge labels
Sweet spot	Caching, sessions	Content, APIs	IoT, logs, time-series	Social, fraud
Weakness	No value queries	No deep joins	No ad-hoc queries	Complex writes
Top example	Redis	MongoDB	Cassandra	Neo4j

A Hybrid Architecture: The Real World

In production systems at scale, these databases are rarely used in isolation. A senior engineer thinks in terms of polyglot persistence: choosing the best storage engine for each specific access pattern.

  User Request
      |
      v
  [API Layer (Next.js / Node)]
      |
      +---> [Redis]        - Session check, rate limiting, cache layer
      |
      +---> [MongoDB]      - Fetch user profile, product details
      |
      +---> [Cassandra]    - Write user activity event (1 of billions/day)
      |
      +---> [Neo4j]        - Generate "you might also know" recommendations
      |
      +---> [PostgreSQL]   - Process payment transaction (ACID required)

In this architecture diagram, we see five different databases serving five different concerns within a single application. This is not over-engineering — at scale, each of these databases is running on dedicated infrastructure, handling traffic that would bring a single relational database to its knees. The API layer is the orchestrator that routes each operation to its ideal data store.

Part 9: Interview Playbook

The Question You Will Always Get

“When would you choose NoSQL over SQL, and which NoSQL type?”

Here is a framework for answering it with precision:

Step 1: Identify the data shape. Is the data tabular and relational? → SQL. Is it document-like with variable attributes? → Document DB. Is it a network of relationships? → Graph DB. Is it a massive stream of timestamped events? → Wide-Column. Is it ephemeral access data? → Key-Value.

Step 2: Identify the scale requirement. Millions of rows? SQL handles it fine. Billions of rows with high write throughput? Cassandra. Millions of graph hops per second? Neo4j. Millions of cache hits per second? Redis.

Step 3: Identify the consistency requirement. Financial transactions, inventory management? → ACID → SQL (or NewSQL). Eventually consistent activity feeds, logs? → NoSQL AP model. Session data that can be regenerated? → Key-Value with TTL.

Step 4: Identify the access pattern. In SQL, design around the data. In NoSQL, design around the query. Ask: “What are the top 3 queries this service must answer quickly?” Then choose and model accordingly.

Common Gotchas in Interviews

“NoSQL doesn’t support ACID” — This is outdated. MongoDB supports multi-document ACID transactions since v4.0. Cassandra has lightweight transactions (LWT). Redis supports transactions with MULTI/EXEC.
“NoSQL is schema-less” — More accurately, schema is enforced at the application level rather than the database level. In practice, production systems always enforce a schema via validation libraries or ODMs (Object Document Mappers).
“You can’t do joins in NoSQL” — True in most cases. The answer is that NoSQL trades join capability for scaling capability. You handle “joins” at the application layer or through data modeling (embedding).
Cassandra hot partition — A common Cassandra mistake is choosing a partition key that concentrates traffic on one partition (e.g., using date as a partition key for a high-volume system, so all today’s writes hit one node). The fix: composite partition keys or bucketing.

Part 10: The Mental Model to Take Into the Interview Room

Let’s close with the most important takeaway. NoSQL is not a replacement for SQL. It is a toolkit of specialized instruments, each evolved to solve a problem that the relational model cannot solve elegantly at scale.

When an interviewer asks you to design a system, you should be thinking:

“This service needs speed above all else. → Redis.”
“This service has flexible, hierarchical data. → MongoDB.”
“This service will write 50,000 events per second. → Cassandra.”
“This service needs to traverse 6 degrees of social connections. → Neo4j.”

And you should be comfortable saying: “In this design, I would use PostgreSQL for the financial core, MongoDB for the user profile service, Redis as the caching layer, and Cassandra for the activity event stream. This is polyglot persistence — each tool solving the problem it was built for.”

That sentence, spoken confidently with reasoning attached, is a senior-engineer answer.

Quick Reference Cheat Sheet

KEY-VALUE   : Redis      | cache, session, rate-limit    | O(1) by key
DOCUMENT    : MongoDB    | profiles, catalog, CMS        | Rich JSON queries  
WIDE-COLUMN : Cassandra  | IoT, logs, time-series        | Partition + range
GRAPH       : Neo4j      | social, fraud, recommender    | Relationship traversal