system-design.io

Vector Databases and ANN Indexes

A Complete Guide for the Modern Software Engineer

“Finding a needle in a haystack is easy. Finding the needle most similar to your needle — in a haystack of a billion needles — is the problem we are here to solve.”

Part 1: The Bridge — A Library That Doesn’t Use Alphabets

Picture a library. Not a normal library with books sorted alphabetically or by the Dewey Decimal System — but a library organized entirely by meaning.

In this library, books about heartbreak sit near books about loneliness. Books about quantum physics are shelved next to books about philosophy of reality. The novel 1984 is close to a book about modern surveillance capitalism. There’s no alphabet here, no numbering system. The organizing principle is pure, conceptual similarity.

Now imagine someone walks in and asks: “I want something like Hemingway’s The Old Man and the Sea, but more modern.”

In a traditional library, a human librarian must read thousands of books and make a judgment call. In our magical meaning-library, a user simply walks to where that Hemingway book lives, looks at its neighbors on the shelf, and picks the nearest one. The answer is spatial. The answer is a nearest neighbor.

This is exactly what vector databases do. They transform data — text, images, audio, user behavior, DNA sequences — into points in a high-dimensional coordinate space, so that “similar things” literally live close to each other. Once your data is a point in space, the question “find me something similar” becomes the purely geometric question: “find me the nearest neighbors to this point.”

The challenge? Your library doesn’t have 3 dimensions or even 300. It has 1,536 dimensions (OpenAI embeddings), or 768 (BERT), or 4,096 (larger language models). Finding exact neighbors at that scale gets brutally expensive. That’s where Approximate Nearest Neighbor (ANN) indexes come in — and that is the story we are about to tell.

Part 2: Turning the World Into Vectors

Before we can understand how to search, we need to understand what we’re searching through.

The Embedding: Meaning as Coordinates

An embedding is a function that converts raw data into a dense list of floating-point numbers — a vector. The magic is that this function is learned by a machine learning model to preserve semantic relationships. Things that are conceptually similar produce vectors that are numerically close.

Consider this: a model trained on text learns that “king” and “queen” are both royalty, so their vectors are close together. “Dog” and “puppy” land near each other. “Paris” and “France” cluster together in a way that “Paris” and “banana” do not.

# Conceptual illustration: sentence → embedding
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",     # semantically similar
    "The stock market crashed today",  # semantically distant
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384) — 3 vectors, each 384 dimensions

In the output above, embeddings[^0] (cat on mat) and embeddings (feline on rug) will be very close together in 384-dimensional space. embeddings[^2] (stock market) will be far away.

The Vector Store’s Job

Once we have millions of these vectors, we need a system that can:

Store them efficiently
Index them so searches are fast
Retrieve the k most similar vectors to any given query vector

This is the job of a vector database. But before we can appreciate the clever solutions, we need to feel the pain of the naive approach.

Part 3: The Naive Approach — Brute Force (Flat Index)

The most honest, straightforward approach to finding the nearest neighbor is to check every single vector. We call this a Flat Index or exhaustive search.

import numpy as np
import faiss

dimension = 128
num_vectors = 1_000_000  # 1 million vectors

# Simulate our vector database
database = np.random.random((num_vectors, dimension)).astype('float32')

# The query: "find me vectors similar to this one"
query = np.random.random((1, dimension)).astype('float32')

# Build a flat (brute-force) index using L2 (Euclidean) distance
index = faiss.IndexFlatL2(dimension)
index.add(database)

# Search for top 5 nearest neighbors
distances, indices = index.search(query, k=5)
print(f"Top 5 neighbor indices: {indices}")
print(f"Distances: {distances}")

The flat index is a genius of simplicity. Every query vector is compared against every stored vector, and the distances are ranked. The top k are returned.

The Performance Wall

Here is the problem. Let’s think about the complexity:

N vectors in the database
D dimensions per vector
One query computes N distance calculations, each of D multiplications and additions
Total cost per query: O(N × D)

For N = 10,000 (ten thousand), this is perfectly manageable. But production AI systems don’t have ten thousand vectors. Spotify’s recommendation system has hundreds of millions of song vectors. Pinecone indexes have billions of entries.

Flat Index Search Time (approximate):

  N = 10K      vectors → ~0.5ms    ✅ Fast enough
  N = 1M       vectors → ~18ms     ⚠️ Borderline
  N = 100M     vectors → ~1,800ms  ❌ Way too slow
  N = 1B       vectors → ~18,000ms ❌ Completely unusable

At a billion vectors, a single query takes 18 seconds. A real-time search system needs results in under 100 milliseconds. We are off by a factor of 180x.

This is not a hardware problem. You cannot buy your way out of O(N × D). You need a fundamentally different algorithm. And that is exactly what Approximate Nearest Neighbor (ANN) search provides.

The key insight of ANN: we don’t need the exact nearest neighbor. We need one that is close enough, found fast enough. The art is in controlling that trade-off.

Part 4: The ANN Philosophy — Acceptable Approximation

ANN algorithms accept a simple contract:

“I will find you a neighbor that is very close to the true nearest neighbor, and I’ll do it 100x to 1000x faster than brute force.”

The quality of this approximation is measured by Recall — the fraction of true nearest neighbors that appear in the ANN results:

Recall@10 = |{true top-10}  ∩  {ANN top-10}|  /  10

A recall of 0.95 means 95% of the results are genuine top-10 neighbors. For most applications (recommendation systems, semantic search, RAG pipelines), a recall of 90–95% is indistinguishable from perfect in user experience — and is achieved at a fraction of the cost.

We now have three major strategies for achieving this:

Hashing — group similar vectors into buckets (LSH)
Clustering + Partitioning — divide space into regions (IVF)
Graph Navigation — build a navigable shortcut graph (HNSW)

Let us meet each one.

Part 5: Locality Sensitive Hashing (LSH) — Buckets of Similarity

The Analogy: Sorting Colored Marbles

Imagine you have a million colorful marbles. You want to find all marbles “similar” in color to a bright blue one. Instead of comparing every marble, you build a sorting machine: marbles are dropped through it, and it routes similar colors into the same bin. Blues go into bin A, reds into bin B, and so on.

Locality Sensitive Hashing (LSH) is exactly this idea applied to vectors. It uses a specially designed hash function that intentionally causes similar vectors to collide into the same hash bucket — the opposite of a normal hash table that tries to avoid collisions.

How It Works

A standard hash function is designed to spread keys uniformly. If two keys differ by a single character, their hashes are completely different. That’s good for a dictionary; it’s terrible for similarity search.

An LSH hash function is designed so that:

If two vectors are close, they hash to the same bucket with high probability
If two vectors are far, they hash to different buckets with high probability

Normal Hash Function:       LSH Hash Function:

  "dog" → bucket 7           v1 (close to v2) → bucket A
  "cat" → bucket 2           v2 (close to v1) → bucket A  ← same!
  "puppy" → bucket 9         v3 (far from v1) → bucket D  ← different

  (No useful structure)       (Similarity preserved)

One common approach for real-valued vectors: project your vector onto a random hyperplane and take the sign (+1 or -1) as the hash bit. Do this for multiple hyperplanes to get a multi-bit hash code.

import numpy as np
import faiss

dimension = 64
num_vectors = 100_000

data = np.random.random((num_vectors, dimension)).astype('float32')
query = np.random.random((1, dimension)).astype('float32')

# LSH index: nbits controls the "resolution" of the hash
nbits = dimension * 4  # higher nbits = better recall, more memory
index = faiss.IndexLSH(dimension, nbits)
index.add(data)

distances, indices = index.search(query, k=10)
print(f"LSH results: {indices}")

The Limitation: The Curse of Dimensionality

Here is where LSH begins to break down. As the dimensionality D of your vectors grows, you need exponentially more hash functions (and thus more bits) to maintain good recall. For D = 128, achieving 90% recall requires nbits ≈ 512 — four times the original size. The memory and computation costs explode.

LSH Recall vs. Dimensionality (rule of thumb):

  D = 32    → manageable, LSH works well
  D = 64    → still reasonable with tuning
  D = 128   → recall drops sharply unless nbits is very high
  D = 384+  → LSH becomes impractical

This is the curse of dimensionality in action: the more dimensions you add, the more every point becomes “equally far” from every other, and hash functions lose their discriminating power. For the 384-D or 1536-D embeddings used in modern AI, LSH is largely impractical.

When to use LSH: Small datasets with low-dimensional vectors (D < 64), or sparse binary/text vectors where it remains competitive.

Part 6: Inverted File Index (IVF) — The Voronoi Partition

The Analogy: Postal Zones

A postal system does not deliver packages by checking every address on every street in the country. It first routes to the correct region (zip code), then to the correct district, then to the street. This dramatically narrows the search before any detailed comparison happens.

IVF (Inverted File Index) applies the same layered narrowing to vector space. It divides the entire vector space into Voronoi cells (also known as a Dirichlet tessellation) — regions of space where each region “belongs” to one centroid.

Building the Voronoi Partition

Here is how IVF works, step by step:

Step 1 — Training (k-means clustering): We run k-means on the dataset to learn nlist cluster centroids. Each centroid becomes the representative of a Voronoi cell.

Step 2 — Assignment: Every vector in the database is assigned to its nearest centroid’s cell.

Step 3 — Indexing: We maintain an inverted list: for each centroid, a list of all vectors in that cell.

svgbob
+--------------------------------------------------+
|                 Vector Space (2D projection)      |
|                                                   |
|   *        *    C1        *          *            |
|       *              *         *                  |
|            . . . . . . . . . . .                  |
|   C3 *  . /  Voronoi       \ . *  C2             |
|      *  ./   Cell 1         \. *                  |
|        .+-------------------+.                    |
|       .*|       *      *    |*.                   |
|      . *|   * [xq]  *      |* .                  |
|   C3 . *|  *  query * *    |* . C2               |
|      . *|      *           |*.                   |
|       .*+-------------------+*                    |
|        .                    .                     |
|   * * . . . . . . . . . . . . * *                |
|              C4       *            *              |
+--------------------------------------------------+

  xq = query vector (lands inside Cell 1)
  C1, C2, C3, C4 = Voronoi centroids
  nprobe=1: only search vectors in Cell 1
  nprobe=2: search Cell 1 + its nearest neighbor cell

In the diagram above, our query xq lands inside Cell 1. With nprobe=1, we only compare xq against vectors belonging to Cell 1. Instead of scanning 1 million vectors, we scan only ~1000 (if we have 1000 cells of equal size). That’s a 1000x reduction in comparisons.

Step 4 — Search: For a query vector, find the nprobe nearest centroids, then do an exhaustive search within only those cells.

import numpy as np
import faiss

dimension = 128
num_vectors = 1_000_000
nlist = 256  # number of Voronoi cells

database = np.random.random((num_vectors, dimension)).astype('float32')
query = np.random.random((1, dimension)).astype('float32')

# The quantizer decides how centroid distances are measured
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# IVF requires a training step to learn cluster centroids
print("Training IVF index (learning cluster centroids)...")
index.train(database)

# Add vectors to the trained index
index.add(database)

# nprobe = how many Voronoi cells to search
# Higher nprobe → better recall, slower search
index.nprobe = 16  # search 16 out of 256 cells

distances, indices = index.search(query, k=10)
print(f"Top 10 results: {indices}")

The Edge Problem and nprobe

There is a subtle but critical failure mode in IVF: the edge problem. A query vector that lands right on the boundary between two Voronoi cells may have its true nearest neighbor in the adjacent cell. With nprobe=1, we would miss it entirely.

The fix is clean: increase nprobe. By searching the nprobe nearest cells instead of just one, we expand our net.

nprobe vs. Recall Trade-off (on Sift1M, nlist=256):

  nprobe = 1   → Recall ≈ 0.55   | Query time ≈ 0.5ms
  nprobe = 8   → Recall ≈ 0.82   | Query time ≈ 1.2ms
  nprobe = 32  → Recall ≈ 0.94   | Query time ≈ 3.5ms
  nprobe = 128 → Recall ≈ 0.99   | Query time ≈ 9ms
  nprobe = 256 → Recall = 1.00   | Query time ≈ 18ms (≡ Flat Index)

nprobe = nlist is exactly equivalent to a flat exhaustive search. This gives us a beautiful dial: turn nprobe up for accuracy, turn it down for speed.

When to use IVF: Large datasets (1M–100M), when memory is a constraint, and when you need a simple, well-understood trade-off dial.

Part 7: Product Quantization (PQ) — Compressing the Vectors

The Memory Problem

IVF solves the speed problem, but what about memory? Storing 1 billion 128-dimensional float32 vectors requires:

1,000,000,000 × 128 × 4 bytes = 512 GB of RAM

That’s half a terabyte — just for the raw vectors. We need compression.

Product Quantization (PQ) is a technique that compresses each high-dimensional vector into a tiny code — typically just 8 to 64 bytes — with a controlled loss of precision.

How PQ Works

Instead of storing the full vector, PQ:

Splits each D-dimensional vector into M sub-vectors of size D/M
Trains a separate k-means codebook of K centroids for each sub-vector space
Encodes each sub-vector by its nearest codebook centroid ID (a single integer)
Stores only the M centroid IDs — a tiny code representing the full vector ``` svgbob Original Vector (128 dimensions, 512 bytes with float32):

[f1, f2, …, f32 | f33, …, f64 | f65, …, f96 | f97, …, f128] Sub-vec 1 Sub-vec 2 Sub-vec 3 Sub-vec 4 | | | | Find nearest Find nearest Find nearest Find nearest centroid in centroid in centroid in centroid in codebook 1 codebook 2 codebook 3 codebook 4 | | | | ID=47 ID=12 ID=203 ID=88 | | | | PQ Code: [ 47 | 12 | 203 | 88 ] → just 4 bytes!

Compression: 512 bytes → 4 bytes = 128x reduction

In Figure above, we split a 128-D float32 vector into 4 sub-vectors of 32 dimensions each. Each sub-vector is replaced by a single integer ID pointing to its nearest centroid in a pre-trained codebook. The original 512-byte vector becomes a 4-byte code.

```python
import numpy as np
import faiss

dimension = 128
num_vectors = 1_000_000
M = 16       # number of sub-quantizers (sub-vectors)
nbits = 8    # bits per code = 2^8 = 256 centroids per codebook

database = np.random.random((num_vectors, dimension)).astype('float32')

# IVF combined with Product Quantization — the production workhorse
nlist = 256
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, M, nbits)

# Training learns both the Voronoi centroids AND the PQ codebooks
index.train(database)
index.add(database)

# Memory comparison
original_mb = (num_vectors * dimension * 4) / (1024 ** 2)
compressed_mb = (num_vectors * M) / (1024 ** 2)

print(f"Original:   {original_mb:.1f} MB")   # ~488 MB
print(f"Compressed: {compressed_mb:.1f} MB")  # ~15 MB
print(f"Reduction:  {original_mb/compressed_mb:.0f}x")

index.nprobe = 32
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, k=10)

IVF-PQ: The Production Workhorse

The combination of IVF + PQ is the dominant approach for billion-scale vector search. IVF handles the “which region to search” problem, while PQ handles the “how to store and compare vectors cheaply” problem.

IVF-PQ Pipeline:

 Query vector
      │
      ▼
 [Compare to nlist centroids]     ← IVF: find nearest cells (O(nlist))
      │
      ▼
 [Select nprobe cells]
      │
      ▼
 [Compare to PQ-compressed vectors in those cells]  ← PQ: fast approximate distance
      │
      ▼
 [Return top-k results]

The trade-off: PQ introduces quantization error — the distance computed is approximate because the full vector has been lossy-compressed. You gain enormous memory savings (often 32–64x compression) at the cost of a small recall penalty that is usually acceptable.

Part 8: HNSW — The Graph That Changed Everything

The Analogy: Six Degrees of Separation

The social networking experiment that made “six degrees of separation” famous demonstrated something profound about networks: in a well-connected graph, you can reach any person from any other person in just a few hops. The secret is a combination of local dense connections (your close friends) and long-range “shortcut” connections (your cousin who knows everyone in Paris).

Hierarchical Navigable Small World (HNSW) applies this insight directly to vector search. It is a graph-based index where each vector is a node, nodes are connected to their approximate nearest neighbors, and the search navigates toward the query vector by traversing edges — always moving to the neighbor that is closer to the query.

Building the NSW Graph

Start with a basic NSW (Navigable Small World) graph: each node is connected to its M nearest neighbors. When inserting a new vector, we connect it to its M nearest existing nodes.

svgbob
    NSW Graph (2D projection):

    A ────── B ────── C
    │  \   /   \   / │
    │   \ /     \ /  │
    D    E ───── F   G
    │   / \     / \  │
    │  /   \   /   \ │
    H ────── I ────── J

  Each node connects to its ~M nearest neighbors.
  To find neighbors of query Q:
    - Start at random entry node
    - Greedily hop to the neighbor closest to Q
    - Repeat until no neighbor is closer than current node

The greedy search is fast, but there’s a problem: local minima. The greedy walk can get stuck in a part of the graph that is locally optimal but globally wrong — missing the true nearest neighbor entirely.

Adding the Hierarchy (the “H” in HNSW)

HNSW’s breakthrough, introduced by Malkov and Yashunin in 2018, is to build multiple layers of this graph:

Top layers: Sparse, long-range connections only — the “highway system”
Bottom layer: Dense, short-range connections — the “local streets”

Search enters at the top layer, uses long-range jumps to get close to the target region quickly, then descends to finer and finer layers, and terminates with a precise local search at the bottom layer.

svgbob
  HNSW: Multi-Layer Graph Structure

  Layer 2 (coarsest): A ─────────────────── J
  (few nodes,                  │
  long jumps)                  │
                               │
  Layer 1 (medium): A ──── E ──┤── H ──── J
                               │
                               │
  Layer 0 (finest): A─B─C─D─E─F─G─H─I─J
  (all nodes,
  dense connections)

  Search path: enter Layer 2 → hop quickly → descend to Layer 1
               → refine position → descend to Layer 0 → precise search

In the diagram above, the search enters at Layer 2 with very few nodes and makes large jumps across the vector space to arrive in the correct region. As we descend through layers, the navigation becomes finer and finer. By the time we reach Layer 0, we are already positioned near the correct answer and only need a small local search to finish.

import numpy as np
import faiss

dimension = 128
num_vectors = 1_000_000

database = np.random.random((num_vectors, dimension)).astype('float32')
query = np.random.random((1, dimension)).astype('float32')

# Key HNSW parameters
M = 32               # connections per node (higher = better recall, more RAM)
ef_construction = 200  # search depth during BUILD (higher = better index quality)
ef_search = 64         # search depth during QUERY (higher = better recall)

index = faiss.IndexHNSWFlat(dimension, M)
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = ef_search

# No training step needed — HNSW builds the graph incrementally during add()
print("Building HNSW graph...")
index.add(database)

distances, indices = index.search(query, k=10)
print(f"Top 10: {indices}")

# Experiment: efSearch controls recall vs. speed at query time
for ef in :
    index.hnsw.efSearch = ef
    distances, indices = index.search(query, k=10)
    print(f"efSearch={ef}: {indices[:5]}...")

HNSW’s Three Dials

Parameter	What it Controls	Trade-off
`M`	Connections per node	Higher M → better recall, much more RAM
`efConstruction`	Build-time search depth	Higher → better index quality, slower build
`efSearch`	Query-time search depth	Higher → better recall, slower query

The beauty of HNSW is that efSearch can be tuned at query time without rebuilding the index. You can offer different tiers: fast mode (efSearch=32) for real-time use, and quality mode (efSearch=128) for offline batch search.

The Cost: Memory

HNSW’s major downside is memory. Each node stores M (or 2M at layer 0) edge pointers. For 1M vectors with M=32, the HNSW graph alone can consume 1.2–1.6 GB of RAM — roughly 3x the memory of an equivalent IVF-Flat index.

For systems that can afford the RAM, HNSW consistently achieves state-of-the-art recall vs. query-speed performance among all ANN indexes.

Part 9: Choosing Your Distance Metric

Before we can compare vectors, we need to define what “close” means. This is not a trivial choice.

The Three Main Metrics

Euclidean Distance (L2): The straight-line distance between two points in space. Sensitive to the magnitude of vectors. Best for embeddings where the absolute values matter (e.g., dense feature vectors from image models).

\[d(a, b) = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2}\]

Cosine Similarity: Measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction are perfectly similar regardless of length. Best for text embeddings where only direction matters, not magnitude.

\[\text{sim}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}\]

Inner Product (Dot Product): Measures both direction and magnitude. Used in recommendation systems (e.g., matrix factorization) where the magnitude of the embedding encodes “relevance” or “popularity”.

\[\text{IP}(a, b) = \sum_{i=1}^{D} a_i \cdot b_i\]

A Practical Rule of Thumb

Data Type                         → Recommended Metric
──────────────────────────────────────────────────────
Text embeddings (BERT, GPT)       → Cosine Similarity
Image feature vectors (CNN)       → Euclidean (L2)
Recommendation (matrix factoring) → Inner Product
Normalized embeddings (unit norm) → Cosine = Inner Product (equivalent)

Interview tip: If embeddings are L2-normalized (all vectors have length 1), cosine similarity and inner product become mathematically equivalent. Many production systems normalize embeddings before insertion precisely to unlock this equivalence and use the faster inner-product computation.

Part 10: Vector Databases as Full Systems

So far we have talked about index algorithms — standalone libraries like FAISS that run in a single process. A vector database is an entire production system built around these indexes, adding the infrastructure a real application needs.

What a Vector Database Adds

Think of FAISS as the engine. A vector database is the full car — including the chassis, brakes, dashboard, and navigation system.

svgbob
  Vector Database Architecture

  ┌─────────────────────────────────────────────────┐
  │               CLIENT APPLICATION                 │
  └───────────────────────┬─────────────────────────┘
                          │  REST / gRPC API
  ┌───────────────────────▼─────────────────────────┐
  │                   API LAYER                      │
  │        (Auth, Rate Limiting, Routing)            │
  └───────────────────────┬─────────────────────────┘
                          │
  ┌───────────────────────▼─────────────────────────┐
  │              QUERY PROCESSING                    │
  │  ┌────────────┐  ┌──────────────┐  ┌─────────┐  │
  │  │  Metadata  │  │  ANN Index   │  │  CRUD   │  │
  │  │  Filtering │  │  (HNSW/IVF)  │  │ Engine  │  │
  │  └────────────┘  └──────────────┘  └─────────┘  │
  └───────────────────────┬─────────────────────────┘
                          │
  ┌───────────────────────▼─────────────────────────┐
  │                 STORAGE LAYER                    │
  │         Vector Store + Metadata Store            │
  │         (Persistent, Sharded, Replicated)        │
  └─────────────────────────────────────────────────┘

The key additions over a bare-bones index library:

Metadata Filtering: Pre- or post-filter results by structured attributes. Example: “find semantically similar products, but only in the Electronics category under $200.” This combines ANN search with SQL-like predicates.
Persistence: Data survives restarts. Indexes are serialized to disk and reloaded.
CRUD Operations: You can insert, update, and delete vectors without rebuilding the entire index from scratch.
Sharding \& Replication: For billion-scale data, the index is distributed across multiple machines with replicas for fault tolerance.
Tenancy \& Namespacing: Multiple applications sharing one system with logical isolation (namespaces in Pinecone, collections in Qdrant, tenants in Weaviate).

The Major Players (2025–2026)

System	Type	Index Algorithms	Best For
FAISS	Library (Meta)	Flat, IVF, HNSW, PQ	Research, custom systems
Pinecone	Managed cloud DB	HNSW, IVF-PQ	Production SaaS, RAG pipelines
Weaviate	Open-source DB	HNSW	Hybrid search, knowledge graphs
Qdrant	Open-source DB	HNSW + quantization	Rust-performance, on-prem
Milvus	Open-source DB	IVF, HNSW, DiskANN	Billion-scale deployments
pgvector	Postgres extension	HNSW, IVF-Flat	SQL-first, existing Postgres stacks
Chroma	Embedded DB	HNSW (via hnswlib)	Prototyping, local RAG apps

A Complete RAG Example

Here is how a production Retrieval-Augmented Generation (RAG) pipeline connects all the pieces:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Step 1: Build the vector store from documents
model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "HNSW is a graph-based ANN algorithm published in 2018.",
    "IVF partitions vectors into Voronoi cells for fast retrieval.",
    "Product quantization compresses vectors to reduce memory usage.",
    "Cosine similarity measures the angle between two vectors.",
    "FAISS is a library developed by Meta for similarity search.",
]

# Embed all documents
doc_embeddings = model.encode(documents, normalize_embeddings=True)
doc_embeddings = doc_embeddings.astype('float32')

dimension = doc_embeddings.shape  # 384

# Build HNSW index
index = faiss.IndexHNSWFlat(dimension, 32)  # M=32
index.hnsw.efConstruction = 200
index.add(doc_embeddings)

# Step 2: Query time
def semantic_search(query_text: str, k: int = 3):
    query_embedding = model.encode(
        [query_text], normalize_embeddings=True
    ).astype('float32')

    # Inner product on normalized vectors = cosine similarity
    distances, indices = index.search(query_embedding, k)

    results = []
    for dist, idx in zip(distances, indices):
        results.append({
            "document": documents[idx],
            "similarity": float(dist)  # cosine similarity
        })
    return results

results = semantic_search("How do graph-based indexes work?")
for r in results:
    print(f"[{r['similarity']:.3f}] {r['document']}")

Part 11: The Index Decision Framework

We now have enough context to build a principled decision framework. When you walk into a system design interview and the question involves similarity search, here is how to reason about index selection:

svgbob
  ANN Index Selection Decision Tree

  Start
    │
    ├─── Dataset size < 50K? ──────► Flat Index (exact search)
    │                                 (simplicity wins at small scale)
    │
    ├─── Need exact recall = 100%? ─► Flat Index (brute force)
    │
    ├─── Memory severely constrained? ─► IVF-PQ
    │     (billion scale, GPU cluster)    (32-64x compression)
    │
    ├─── D (dimensions) < 64? ────────► LSH (still viable)
    │
    ├─── Best recall/speed trade-off? ─► HNSW
    │     (RAM available, < 100M vecs)    (state of the art)
    │
    └─── Billion+ vectors, good RAM? ─► IVF-HNSW hybrid
                                        or DiskANN (disk-based)

The Fundamental Trade-offs

Metric	Flat	LSH	IVF	HNSW	IVF-PQ
Recall	100%	40–85%	70–99%	85–99%	60–95%
Query Speed	Slow	Fast	Fast	Very Fast	Very Fast
Memory	High	Medium	Medium	Very High	Very Low
Build Time	Instant	Fast	Medium	Slow	Medium
Supports Updates	Yes	Yes	Partial	Yes	Partial
Scales to 1B+	❌	❌	✅	⚠️ (RAM)	✅

Part 12: Interview Essentials — What You Must Know

Vector databases appear regularly in system design rounds at companies like Google, Meta, Nvidia, and any company building AI features. Here is the distilled knowledge you need to articulate clearly.

Conceptual Questions

“What is the curse of dimensionality and why does it matter for ANN?”

As dimensions increase, all points become roughly equidistant from each other. Tree-based structures (like KD-trees) lose their ability to prune branches, because nearly every branch becomes “possibly relevant.” This is why specialized ANN algorithms were developed: they leverage geometric structure in ways that survive high dimensions.

“Why can’t we just use a B-tree or B+ tree for vector search?”

B-trees are designed for exact lookups and range queries on totally ordered data (integers, strings). Vectors have no total ordering — there’s no meaningful way to say vector A is “between” vector B and vector C in a 384-dimensional space. Similarity search requires entirely different data structures based on geometric proximity, not ordering.

“What is the recall-latency trade-off in ANN systems?”

Every ANN algorithm has a parameter that controls the trade-off:

HNSW: efSearch (higher = better recall, slower query)

IVF: nprobe (higher = more cells searched, better recall, slower)

IVF-PQ: nprobe + quantization quality

The correct setting depends on the SLA: a real-time recommendation system might accept 90% recall for sub-10ms latency; an offline document indexer might demand 99% recall and tolerate 100ms.

System Design Pattern: Semantic Search Service

If asked to design a semantic search system for a corpus of 500M documents:

1. Embedding: Documents → 768-D vectors via BERT-family model
   (offline batch job, update incrementally)

2. Indexing: IVF-PQ index (memory-efficient for 500M vectors)
   - nlist ≈ sqrt(500M) ≈ 22,000 Voronoi cells
   - PQ with M=64, nbits=8 for aggressive compression

3. Query Pipeline:
   User query → same embedding model → query vector
   → IVF-PQ search (nprobe=64) → candidate top-100
   → Optional re-ranking with cross-encoder model
   → Return top-10

4. Infrastructure:
   - Milvus or Qdrant cluster with sharding
   - Read replicas for query load
   - Embedding service scaled separately (GPU inference)
   - Metadata filter: apply structured pre-filters before ANN search

The Vocabulary to Drop in Interviews

k-NN vs. ANN: Exact vs. approximate; the former is O(N·D), the latter is sublinear
Recall@k: Standard metric; |{true top-k} ∩ {returned top-k}| / k
Quantization: Lossy compression of vectors (Scalar Quantization, Product Quantization)
nprobe / efSearch: The recall-speed knob in IVF and HNSW respectively
Voronoi cell: The geometric region assigned to one IVF centroid
Codebook: The set of learned centroids used by PQ for compression
DiskANN: A variant that stores the HNSW graph on disk, enabling billion-scale search with minimal RAM (developed by Microsoft)
HNSW hierarchy: Top layers = long-range shortcuts, bottom layer = dense local graph

Summary: The Vector Search Mental Model

We started with a library organized by meaning and ended with a complete system capable of searching a billion vectors in under 10 milliseconds. The journey followed a clear arc:

Raw Data (text, images, audio)
    │
    ▼
Embedding Model → Dense Vectors in high-D space
    │
    ▼
The problem: O(N·D) exact search is too slow
    │
    ▼
ANN Indexes (reduce the search space):
  - LSH      → hash into similarity buckets (works for low-D)
  - IVF      → partition into Voronoi cells (scalable, tunable)
  - PQ       → compress vectors to save memory (32–64x)
  - HNSW     → navigate a shortcut graph (fastest recall/speed)
    │
    ▼
Vector Database (full production system):
  persistence + sharding + metadata filtering + CRUD + APIs
    │
    ▼
Application: RAG pipelines, semantic search,
             recommendation systems, anomaly detection

The choice of index is not a matter of one being “best” — it is a function of your dataset size, dimensionality, memory budget, recall requirements, and update frequency. The engineer who understands why each algorithm was invented, not just how it works, is the one who can reason from first principles in an interview — and that is the engineer who gets the offer.