system-design.io

Graph Databases and Query Models

A Curriculum for Software Engineering Interviews

Part 1 — The Library That Forgot Where Things Are

Before we write a single line of query syntax, let’s think about a library.

Imagine a massive library with millions of books. The classic librarian system — a relational one — organizes books into tables: one shelf for authors, one for titles, one for publishers. To find “every book written by an author who was influenced by Kafka and later influenced Haruki Murakami,” the librarian has to consult three different ledgers, flip back and forth, cross-reference IDs, and staple the results together. This process is called a JOIN — and in a digital database, the cost of that stapling grows painfully as the number of connections multiplies.

Now imagine a different kind of library. In this one, each book has a physical string tied to every related book — by influence, by genre, by shared publishers. To answer the same question, you simply grab the Kafka book and follow the strings. No ledgers. No cross-referencing. Just traversal.

That is the fundamental promise of a graph database.

Graph databases are purpose-built for data where the relationships between things are as important as the things themselves. They don’t store relationships as a side effect of foreign keys — they store them as first-class citizens with their own identity, direction, and attributes.

This distinction seems subtle until you encounter data that’s deeply connected: social networks, fraud rings, recommendation engines, supply chains, knowledge graphs. In those domains, the relational model collapses under its own JOIN overhead, and graph databases begin to shine.

Part 2 — Building the Mental Model, Atom by Atom

The Three Primitives

Let’s construct a graph database from scratch by identifying its atomic units. There are exactly three.

Nodes are the entities — the things in your domain. A person. A product. A city. A bank account. In the diagram below, circles represent nodes.

Edges (also called relationships or arcs) are the connections between nodes. Unlike foreign keys in SQL, edges in a graph database are stored as physical pointers. An edge always has:

A direction (from node A to node B)
A type (a semantic label like FOLLOWS, PURCHASED, LIVES_IN)
Optional properties (key-value pairs, just like a node)

Properties are the key-value metadata attached to either a node or an edge. A Person node might have {name: "Alice", age: 29}. A FRIENDS_WITH edge might have {since: "2018-04-11"}.

         name: "Alice"                name: "Bob"
         age: 29                      age: 31
         .-----------.                .-----------.
         |  Person   |                |  Person   |
         |  (Alice)  |                |   (Bob)   |
         '-----------'                '-----------'
               |                            ^
               | FOLLOWS                    |
               | {since: "2021-03-01"}      |
               '----------------------------'

In Figure 1, notice that the FOLLOWS relationship has a direction (Alice follows Bob) and carries its own metadata as a property. The nodes and the relationship are peers — each stored with equal importance.

Together, these three primitives form what is formally called a Labeled Property Graph (LPG) — the data model used by Neo4j, Amazon Neptune, and the emerging GQL ISO standard.

The Atomic Limitation: The Relational Approach to Relationships

Before we go further, we need to understand exactly why the relational model struggles with highly connected data. This is the “broken state” that graph databases exist to fix.

Consider a social network. In a relational database, you store users in a users table and friendships in a friendships table with two foreign keys: user_id_1 and user_id_2. Querying “friends of friends” requires a self-JOIN:

# Pseudocode representing a SQL-style approach in Python
# "Friends of Alice's friends"

query = """
    SELECT DISTINCT u3.name
    FROM users u1
    JOIN friendships f1 ON u1.id = f1.user_id_1
    JOIN users u2 ON u2.id = f1.user_id_2
    JOIN friendships f2 ON u2.id = f2.user_id_1
    JOIN users u3 ON u3.id = f2.user_id_2
    WHERE u1.name = 'Alice'
"""

Now query “friends of friends of friends.” You need three JOINs. Four degrees of separation? Four JOINs. Each additional hop multiplies the query complexity. For a network with millions of nodes, this quickly becomes intractable. The relational database was never designed with traversal as a first-class operation.

Part 3 — Index-Free Adjacency: The Secret Weapon

This is the architectural breakthrough that separates a “real” graph database from a relational database pretending to be one.

In a relational database, when you want to find all friends of Alice, the engine must consult a global index (typically a B-Tree) on the friendships table. That lookup costs O(log N) where N is the total number of friendships in the system. Every hop you take, you pay that cost again.

In a native graph database like Neo4j, each node stores direct physical pointers to its neighboring nodes and edges — no global index needed. Finding Alice’s friends means starting at Alice’s node and following the pointers that are literally stored next to her. Each hop costs O(1) — constant time, regardless of how large the overall graph becomes.

   RELATIONAL MODEL                      GRAPH MODEL (Index-Free Adjacency)
   ─────────────────                     ──────────────────────────────────

   friendships table                     Alice node
   ┌──────────────────────┐              ┌──────────────────────────────┐
   │ user_id_1 │ user_id_2│              │  name: "Alice"               │
   │  1        │   2      │              │  → ptr to Bob node           │
   │  1        │   3      │              │  → ptr to Carol node         │
   │  ...      │  ...     │              │  → ptr to Dave node          │
   └──────────────────────┘              └──────────────────────────────┘
          ↓                                         ↓
   B-Tree global index lookup            Direct pointer dereference
   O(log N) per hop                      O(1) per hop

In Figure 2, the relational model must scan or index into a central table at every hop. The graph model stores relationship pointers directly on the node, making traversal a matter of following a physical address.

This design choice is why graph databases can answer “Who are Alice’s friends-of-friends?” on a billion-node social graph in milliseconds while a relational database is still waiting for its JOINs to complete.

There is one caveat worth noting: to find the starting node (Alice’s node), the graph engine does use an index — typically a B-Tree on node labels and properties. Once you have that starting node, however, all subsequent traversal hops are O(1).

Part 4 — The Two Graph Data Models

So far we’ve been operating inside the Labeled Property Graph (LPG) model. But there’s a second model you need to know for interviews, particularly when knowledge graphs and semantic web technologies come up: RDF (Resource Description Framework).

Model 1: Labeled Property Graph (LPG)

This is the dominant model in production systems. The core philosophy is pragmatic: model your domain as nodes and edges with rich property bags.

Feature	Description
Nodes	Entities with labels and properties
Edges	Directed, typed relationships with properties
Schema	Schema-optional (flexible by default)
Query Language	Cypher (Neo4j), Gremlin (Apache TinkerPop), GQL
Databases	Neo4j, Amazon Neptune, Memgraph, FalkorDB

The LPG model’s greatest strength is its flexibility. You can add a new relationship type between two nodes without migrating a schema or touching existing data. This makes it ideal for domains that evolve quickly.

Model 2: Resource Description Framework (RDF)

RDF comes from the Semantic Web tradition. Instead of nodes and edges, everything is expressed as a triple: Subject → Predicate → Object.

<Alice> <follows> <Bob>
<Alice> <livesIn> <Milwaukee>
<Bob>   <worksAt> <Acme Corp>

Every entity and relationship is a URI — a globally unique identifier. This makes RDF perfect for interoperability: two completely separate organizations can link their data using shared URIs without coordination.

Feature	Description
Unit of storage	Subject-Predicate-Object triple
Identifiers	URIs (globally unique, web-compatible)
Schema	Enforced via OWL/RDFS ontologies
Query Language	SPARQL
Databases	Stardog, GraphDB (Ontotext), Amazon Neptune (RDF)

The RDF model enables reasoning — using ontology rules to infer facts not explicitly stated. If we assert that Alice is a Person and every Person is a Mammal, an RDF reasoner can infer that Alice is a Mammal without you writing that triple explicitly.

Which Model Should You Choose?

Think of it this way: if you’re building an application-facing graph (social network, recommendation engine, fraud detection), choose LPG — it’s more performant and flexible. If you’re building a knowledge integration system (enterprise knowledge graph, biomedical data fusion, semantic search), choose RDF — its formalism and interoperability pay dividends.

Part 5 — Query Languages, One at a Time

Now we get to the heart of what you’ll be tested on in interviews. Let’s walk through the three dominant graph query languages, building complexity as we go.

Query Language 1: Cypher (The Whiteboard Language)

Cypher was created by Neo4j and later open-sourced as OpenCypher, which has been adopted by multiple vendors. Its defining feature is its ASCII-art syntax — Cypher queries literally look like a diagram drawn on a whiteboard.

The grammar rule to remember:

Nodes: (variable:Label {property: value})
Relationships: -[variable:TYPE {property: value}]->
A pattern: (a)-[:KNOWS]->(b)

Let’s build Cypher queries from simple to complex.

Level 1 — Finding a Node:

# Equivalent Cypher (run via Neo4j Python driver)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def find_person(tx, name):
    result = tx.run(
        "MATCH (p:Person {name: $name}) RETURN p",
        name=name
    )
    return [record["p"] for record in result]

with driver.session() as session:
    persons = session.read_transaction(find_person, "Alice")

The MATCH keyword is Cypher’s SELECT FROM equivalent. The pattern (p:Person {name: $name}) means “find any node labeled Person whose name property equals the given parameter.”

Level 2 — Traversing a Relationship:

def find_friends(tx, name):
    result = tx.run(
        """
        MATCH (p:Person {name: $name})-[:FRIENDS_WITH]->(friend:Person)
        RETURN friend.name AS friend_name
        """,
        name=name
    )
    return [record["friend_name"] for record in result]

Notice how the relationship [:FRIENDS_WITH] is written as an arrow in the middle of the pattern. Reading Cypher is like reading a whiteboard diagram: “Find a Person named Alice who has a FRIENDS_WITH relationship pointing to another Person — return those persons.”

Level 3 — Variable-Length Traversal (Friends of Friends):

def find_friends_of_friends(tx, name, depth=2):
    result = tx.run(
        """
        MATCH (p:Person {name: $name})-[:FRIENDS_WITH*1..2]->(fof:Person)
        WHERE fof.name <> $name
        RETURN DISTINCT fof.name AS name
        """,
        name=name
    )
    return [record["name"] for record in result]

The syntax *1..2 is the variable-length path operator. It says: “traverse this relationship type between 1 and 2 hops.” Change it to *1..6 and you’re querying up to 6 degrees of separation — the kind of query that brings a relational database to its knees but runs naturally in Cypher.

Level 4 — Shortest Path:

def find_shortest_path(tx, source, target):
    result = tx.run(
        """
        MATCH path = shortestPath(
          (a:Person {name: $source})-[:KNOWS*]-(b:Person {name: $target})
        )
        RETURN [node IN nodes(path) | node.name] AS path_names,
               length(path) AS hops
        """,
        source=source, target=target
    )
    return result.single()

shortestPath() is a Cypher built-in that executes a BFS traversal internally. The result gives both the path nodes and the hop count — exactly what an interviewer might ask you to implement for a “LinkedIn degrees of separation” problem.

Query Language 2: Gremlin (The Traversal Machine)

Gremlin is part of the Apache TinkerPop framework and takes a fundamentally different approach: it is procedural and imperative rather than declarative. Instead of describing a pattern you want to match, you write a series of steps that walk the graph.

Think of Gremlin as a cursor navigating the graph. You “stand” at a node, take a step, filter, take another step, filter, and emit results.

# Using the Gremlin Python client (gremlinpython)
from gremlin_python.driver import client, serializer

gremlin_client = client.Client(
    'ws://localhost:8182/gremlin',
    'g',
    message_serializer=serializer.GraphSONSerializersV2d0()
)

# Find friends of Alice
query = "g.V().has('Person', 'name', 'Alice').out('FRIENDS_WITH').values('name')"
result = gremlin_client.submit(query).all().result()
print(result)  # ['Bob', 'Carol', 'Dave']

Let’s break down the Gremlin vocabulary:

Step	Meaning
`g.V()`	Start from all vertices (nodes) in the graph
`.has(label, key, value)`	Filter to nodes with matching property
`.out(type)`	Traverse outgoing edges of the given type
`.in(type)`	Traverse incoming edges
`.both(type)`	Traverse edges in either direction
`.values(key)`	Extract a property value
`.repeat(step).times(n)`	Repeat a traversal step N times

Variable-depth traversal in Gremlin:

# Friends of friends of Alice (2 hops)
query = """
    g.V().has('Person', 'name', 'Alice')
     .repeat(out('FRIENDS_WITH')).times(2)
     .dedup()
     .values('name')
"""
result = gremlin_client.submit(query).all().result()

The .repeat().times() pattern in Gremlin is the equivalent of Cypher’s *1..N variable-length paths. The .dedup() step removes duplicates — an important filter when traversal paths can intersect.

The key philosophical difference: Cypher tells the database what you want; Gremlin tells the database how to get it. This makes Gremlin more verbose but also more controllable for advanced traversal optimization.

Query Language 3: SPARQL (The Semantic Query Language)

SPARQL (pronounced “sparkle”) is the W3C standard query language for RDF data. Since RDF is built on triples, SPARQL queries are essentially requests to match triple patterns against the data.

Where Cypher uses ASCII-art patterns and Gremlin uses traversal steps, SPARQL uses variable binding over triple patterns.

# Using rdflib for SPARQL queries in Python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF

# Build an in-memory RDF graph
g = Graph()
EX = Namespace("http://example.org/")

g.add((EX.Alice, RDF.type, FOAF.Person))
g.add((EX.Alice, FOAF.name, Literal("Alice")))
g.add((EX.Alice, EX.follows, EX.Bob))
g.add((EX.Bob,   FOAF.name, Literal("Bob")))
g.add((EX.Bob,   EX.follows, EX.Carol))

# SPARQL: Find people that Alice follows
sparql_query = """
    PREFIX ex: <http://example.org/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?followedName
    WHERE {
        ex:Alice ex:follows ?followed .
        ?followed foaf:name ?followedName .
    }
"""
results = g.query(sparql_query)
for row in results:
    print(row.followedName)  # "Bob"

In SPARQL, ?followed is a variable — think of it as a SQL column alias for a pattern binding. The two triple patterns joined in the WHERE clause effectively traverse from Alice to Bob and then retrieve Bob’s name.

SPARQL’s superpower — CONSTRUCT and inference:

# SPARQL CONSTRUCT creates new triples from existing ones
infer_query = """
    PREFIX ex: <http://example.org/>

    CONSTRUCT {
        ?person ex:indirectlyFollows ?indirect .
    }
    WHERE {
        ?person ex:follows ?intermediate .
        ?intermediate ex:follows ?indirect .
    }
"""
new_graph = g.query(infer_query)

The CONSTRUCT form doesn’t just query — it *generates new triples based on patterns. This is the foundation of semantic reasoning: deriving implicit knowledge from explicit facts.*

The Three Languages Side by Side

 TASK: "Find products bought by Alice's friends"

 ┌──────────────────────────────────────────────────────────────┐
 │ CYPHER (Declarative Pattern)                                 │
 │                                                              │
 │ MATCH (a:Person {name:'Alice'})-[:FRIENDS_WITH]->(f:Person)  │
 │       -[:PURCHASED]->(p:Product)                             │
 │ RETURN p.name                                                │
 └──────────────────────────────────────────────────────────────┘

 ┌──────────────────────────────────────────────────────────────┐
 │ GREMLIN (Traversal Steps)                                    │
 │                                                              │
 │ g.V().has('Person','name','Alice')                           │
 │  .out('FRIENDS_WITH')                                        │
 │  .out('PURCHASED')                                           │
 │  .values('name')                                             │
 └──────────────────────────────────────────────────────────────┘

 ┌──────────────────────────────────────────────────────────────┐
 │ SPARQL (Triple Pattern Matching)                             │
 │                                                              │
 │ SELECT ?productName WHERE {                                  │
 │   ex:Alice ex:friendsWith ?friend .                          │
 │   ?friend  ex:purchased   ?product .                         │
 │   ?product ex:name        ?productName .                     │
 │ }                                                            │
 └──────────────────────────────────────────────────────────────┘

In Figure 3, notice that all three languages express the same semantic intent — traversing two hops from Alice — but with entirely different syntactic philosophies. Cypher reads like a diagram. Gremlin reads like a pipeline. SPARQL reads like a logic formula.

Part 6 — The New Standard: GQL

In April 2024, the database world reached a milestone that hadn’t happened since 1987: a new ISO database language standard was ratified. [web:27] GQL (Graph Query Language) — ISO/IEC 39075 — is now the official international standard for property graph querying. [web:22]

This is significant because it does for graph databases what SQL did for relational ones: it gives the ecosystem a common language that transcends individual vendors.

GQL builds on the foundations of Cypher, PGQL, GSQL, and G-CORE, and is developed by the same ISO working group that oversees SQL. [web:25] This means if you know SQL, GQL’s DDL and DML constructs will feel immediately familiar.

# GQL syntax example (conceptual, supported in Microsoft Fabric, Neo4j, etc.)

# Create a graph
# CREATE GRAPH SocialNetwork

# Insert data (DML)
# INSERT (:Person {name: 'Alice', age: 29}),
#        (:Person {name: 'Bob', age: 31}),
#        (:Person {name: 'Alice'})-[:FOLLOWS]->(:Person {name: 'Bob'})

# Query (pattern matching, familiar to Cypher users)
# MATCH (a:Person {name: 'Alice'})-[:FOLLOWS]->(b:Person)
# RETURN b.name

For interview purposes: GQL’s emergence signals that the graph database ecosystem is maturing into enterprise-grade standardization. Mentioning GQL in a senior-level system design discussion demonstrates awareness of the current state of the industry.

Part 7 — Graph Traversal Algorithms and Their Costs

Understanding query models isn’t just about syntax — it’s also about knowing what happens underneath. When a graph database executes a traversal, it’s running well-known graph algorithms that you’ve likely studied in DSA.

BFS and DFS Under the Hood

      BFS (shortest path, level-by-level)
      ─────────────────────────────────
               [Alice]
              /        \
          [Bob]        [Carol]        ← Level 1
          /   \            \
      [Dave] [Eve]        [Frank]     ← Level 2

      Visit order: Alice → Bob → Carol → Dave → Eve → Frank


      DFS (deep traversal, path exploration)
      ──────────────────────────────────────
               [Alice]
              /
          [Bob]
          /
      [Dave]
          \
         [Eve]       ← Explores one branch fully before backtracking

In Figure 4, BFS explores level-by-level and is used for shortest-path queries. DFS explores depth-first and is used for detecting cycles or exploring all paths. Cypher’s shortestPath() uses BFS internally, while Gremlin’s .repeat() traversal defaults to DFS unless configured otherwise.

Time Complexity of Graph Operations

Operation	Graph DB (LPG, IFA)	Relational DB (with index)
Single node lookup	O(log N) — index	O(log N) — B-Tree index
Traverse 1 hop	O(1) per hop	O(log N) per JOIN
Traverse k hops	O(k · avg_degree)	O(k · log N)
Shortest path (BFS)	O(V + E)	Impractical beyond 3-4 hops
Connected components	O(V + E)	Requires recursive CTEs

Here V = number of vertices, E = number of edges, N = total rows in the relevant table. The column “O(1) per hop” is what index-free adjacency buys you — that constant factor difference compounds dramatically over k hops.

Part 8 — Modeling Patterns You’ll See in Interviews

Graph modeling is as much craft as science. Let’s walk through three canonical patterns.

The classic. Nodes are Person, edges are FRIENDS_WITH (undirected conceptually, but stored as two directed edges or one with traversal in both directions).

# Neo4j Cypher: Create social graph
create_query = """
MERGE (alice:Person {name: 'Alice', city: 'Milwaukee'})
MERGE (bob:Person {name: 'Bob', city: 'Chicago'})
MERGE (carol:Person {name: 'Carol', city: 'Milwaukee'})
MERGE (alice)-[:FRIENDS_WITH {since: '2020'}]->(bob)
MERGE (bob)-[:FRIENDS_WITH {since: '2021'}]->(carol)
MERGE (alice)-[:FRIENDS_WITH {since: '2019'}]->(carol)
"""

# Query: Mutual friends of Alice and Bob
mutual_friends_query = """
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH]->(mutual)
      <-[:FRIENDS_WITH]-(bob:Person {name: 'Bob'})
RETURN mutual.name AS mutual_friend
"""

Pattern 2: Fraud Detection (Shared Identity Ring)

Fraud rings work by sharing identifiers — the same phone number, email, or address across multiple fake accounts. Graph databases can detect these rings efficiently using link analysis.

# Cypher: Detect accounts sharing a phone number (potential fraud ring)
fraud_query = """
MATCH (a1:Account)-[:HAS_PHONE]->(phone:PhoneNumber)
      <-[:HAS_PHONE]-(a2:Account)
WHERE a1 <> a2
WITH phone, collect(DISTINCT a1) + collect(DISTINCT a2) AS suspects
WHERE size(suspects) >= 3
RETURN phone.number AS shared_phone,
       [s IN suspects | s.account_id] AS suspect_accounts,
       size(suspects) AS ring_size
ORDER BY ring_size DESC
"""

  Fraud Ring — Three accounts share one phone number

        [Acc #1001]          [Acc #1002]
             \                   /
              \                 /
          HAS_PHONE         HAS_PHONE
                  \         /
                [+1-414-555-0101]   ← Shared PhoneNumber node
                  /
          HAS_PHONE
              /
        [Acc #1003]

In Figure 5, three Account nodes converge on a single PhoneNumber node. In a relational database, this pattern requires a GROUP BY with HAVING COUNT > 2 on a JOINed table. In a graph, it’s visible as a structural pattern — a “star” topology around a shared node — and a single Cypher MATCH finds it naturally.

Pattern 3: Recommendation Engine (Collaborative Filtering)

The “customers who bought X also bought Y” pattern is a classic graph traversal: find items purchased by people with similar purchase history.

# Cypher: Content-based product recommendation
recommendation_query = """
MATCH (user:User {id: $user_id})-[:PURCHASED]->(item:Product)
      <-[:PURCHASED]-(similar_user:User)
      -[:PURCHASED]->(recommended:Product)
WHERE NOT (user)-[:PURCHASED]->(recommended)
  AND user <> similar_user
WITH recommended, count(similar_user) AS score
RETURN recommended.name AS product,
       score
ORDER BY score DESC
LIMIT 10
"""

This query is a two-hop traversal: from User → Products (via PURCHASED) → Similar Users (via the same products) → Other Products. The count(similar_user) aggregation acts as the collaborative filtering signal — products purchased by more similar users rank higher.

Part 9 — When to Use a Graph Database (And When Not To)

This is a decision you will be asked to defend in system design interviews. Here is a clear decision framework.

Use a Graph Database When:

Queries traverse 3+ levels of relationships (friends of friends, supply chain paths, organizational hierarchies)
Relationships are dynamic and unpredictable (you can’t predefine all edge types at schema design time)
Relationship semantics matter (not just “connected” but how and why they’re connected)
You need shortest-path or reachability queries in real time

Common domains: social networks, fraud detection, knowledge graphs, recommendation systems, network/IT topology, identity resolution.

Avoid a Graph Database When:

You need bulk aggregations (SUM, AVG, GROUP BY over millions of rows) — graph databases are poor at this; use a columnar warehouse instead
Data is naturally tabular (financial ledgers, time-series metrics, sensor data) — relational or time-series databases are a better fit
Queries rarely traverse more than 1-2 hops — the overhead of a graph engine isn’t justified; a relational DB with good indexes performs equally well
Your team is unfamiliar with graph modeling — the data modeling discipline for graphs is non-trivial; a poorly designed graph schema is worse than a well-tuned relational schema

The Hybrid Architecture Pattern

In practice, many production systems combine both. Transactional and reporting data lives in PostgreSQL or a similar RDBMS. The relationship graph — social connections, recommendation signals, fraud patterns — lives in Neo4j or Neptune. An application layer queries both and combines the results.

   Application Layer
          │
   ┌──────┴──────────────┐
   │                     │
   ▼                     ▼
[PostgreSQL]         [Neo4j / Neptune]
Users, Orders,       Social graph,
Products, Payments   Recommendations,
(Transactional)      Fraud patterns
                     (Relationship traversal)

In Figure 6, the two databases serve fundamentally different query patterns. PostgreSQL handles ACID transactions and aggregation-heavy reporting. Neo4j handles multi-hop traversal and pattern matching. The application code decides which engine to query based on the type of question being asked.

Part 10 — Graph Databases in System Design Interviews

Let’s anchor everything with the kind of design problem you’ll face.

Problem: “Design the backend for a LinkedIn-style ‘People You May Know’ feature.”

A strong answer using graph databases might unfold like this:

Step 1 — Model the Domain:

User nodes with properties {userId, name, headline}
CONNECTED_TO edges with {connected_since}
WORKS_AT edges to Company nodes
ATTENDED edges to School nodes

Step 2 — Choose the Engine: Neo4j or Amazon Neptune (LPG, Cypher/Gremlin). RDF is unnecessary here — we don’t need semantic reasoning, just fast traversal.

Step 3 — Write the Core Query:

# Cypher: People You May Know
pymk_query = """
MATCH (me:User {userId: $user_id})-[:CONNECTED_TO]->(friend)
      -[:CONNECTED_TO]->(suggestion:User)
WHERE NOT (me)-[:CONNECTED_TO]->(suggestion)
  AND me <> suggestion
WITH suggestion, count(friend) AS mutual_connections
WHERE mutual_connections >= 2
RETURN suggestion.userId,
       suggestion.name,
       mutual_connections
ORDER BY mutual_connections DESC
LIMIT 20
"""

Step 4 — Address Scale:

Index on User.userId for O(log N) entry point lookup
Index-free adjacency ensures O(1) per traversal hop thereafter
Read replicas handle query volume (graph DBs scale reads horizontally)
Pre-computation for users with very high degree (celebrity nodes) — cache their suggestion lists, recompute on schedule

Step 5 — Discuss Trade-offs: At truly massive scale (billions of users, Facebook-level), even native graph DBs hit partitioning challenges. Cross-partition hops add network latency. This is why Facebook historically used custom graph engines (TAO) rather than off-the-shelf graph databases. For a typical enterprise-scale system with tens to hundreds of millions of users, Neo4j or Neptune handles this comfortably.

Part 11 — The Evolving Landscape

We’ve come a long way since the early days of graph databases. A few context-setting notes for interviews:

2000-2007: Graph databases existed but were niche academic tools. The term “NoSQL” hadn’t been coined yet. Relationships were handled poorly by relational databases, and developers lived with the JOINs.

2007: Neo4j is founded, introducing the first production-grade native graph database with index-free adjacency. The property graph model becomes the dominant LPG standard.

2012-2019: Graph database adoption grows 605% as social networks, fraud detection, and recommendation systems become mainstream engineering challenges. [web:6] Apache TinkerPop and Gremlin standardize the traversal API across multiple vendors.

2024: ISO ratifies GQL (ISO/IEC 39075) — the first new ISO database language since SQL in 1987. [web:27] This marks graph databases’ formal graduation into the enterprise standards landscape. Amazon Neptune, Neo4j, and Microsoft Fabric are among the first adopters. [web:22]

2025-2026: The emergence of AI knowledge graphs and retrieval-augmented generation (RAG) systems creates a new wave of demand for graph databases. Connecting LLMs to structured relationship graphs enables richer, more accurate knowledge retrieval than vector stores alone.

Part 12 — Key Interview Flashcards

Before we close, let’s crystallize the highest-signal concepts for interview contexts.

1. What is index-free adjacency? Each node stores direct physical pointers to its adjacent nodes and relationships. Traversal cost is O(1) per hop, compared to O(log N) for a B-Tree index lookup in a relational JOIN.

2. What is the difference between LPG and RDF? LPG (Labeled Property Graph) stores entities as nodes and relationships as typed, directed edges with properties — flexible and pragmatic. RDF stores everything as subject-predicate-object triples with URI identifiers — formal, interoperable, and supports logical inference.

3. What is Cypher’s shortestPath() doing under the hood? It executes BFS (Breadth-First Search) starting from the source node, traversing the graph level by level until it reaches the target. BFS guarantees the shortest path in an unweighted graph.

4. When would you NOT use a graph database? When queries are primarily aggregation-heavy (SUM/GROUP BY), when data is naturally tabular, or when relationships are shallow (1-hop). In those cases, relational databases or columnar stores are more efficient.

5. What is GQL? ISO/IEC 39075, published April 2024, is the first ISO-standard query language for property graphs. It draws from Cypher, PGQL, GSQL, and G-CORE and is designed to do for graph databases what SQL did for relational ones. [web:27]

6. What are the three query language paradigms?

Cypher: Declarative, pattern-matching, ASCII-art syntax
Gremlin: Imperative, step-based traversal pipeline
SPARQL: Declarative, triple-pattern matching for RDF

Putting It All Together

We started with a library full of strings — a mental model of traversal. We built the graph model atom-by-atom: nodes, edges, properties. We uncovered index-free adjacency as the architectural mechanism that makes graph traversal genuinely O(1) per hop. We distinguished LPG from RDF and traced the philosophical DNA of Cypher, Gremlin, and SPARQL. We applied these concepts to canonical interview problems — fraud rings, social recommendations, and system design.

The central insight to carry out of this article is this: graph databases don’t just store data differently — they think about relationships differently. In a relational database, a relationship is a side effect of matching foreign keys. In a graph database, a relationship is a first-class object with its own identity, direction, and meaning. That shift in philosophy is what unlocks the traversal performance, the expressive query patterns, and the modeling flexibility that makes graph databases indispensable for connected data problems.

When you walk into that interview and the whiteboard problem involves “degrees of separation,” “fraud ring detection,” or “people you may know,” you’ll know exactly which tool to reach for — and why.

```

Here’s a summary of the key sources informing the technical accuracy of this article:

Index-free adjacency mechanics and O(1) hop complexity are documented in native graph engine implementations like Neo4j
GQL as ISO/IEC 39075 was officially ratified in April 2024 — the first new ISO database language standard since SQL in 1987
Cypher, Gremlin, and SPARQL are the three dominant query language paradigms across LPG and RDF graph models
Fraud detection via link analysis is a well-established graph database use case in financial services
The 605% growth in graph database adoption from 2012–2019 reflects the rise of social network and recommendation system engineering