system-design.io

Relational Database Schema Design and Indexing

Bridge: Building a City Blueprint

Imagine you are tasked with designing a new city from scratch. You could start by throwing houses, roads, and utilities wherever space allows, but you’d quickly end up with tangled streets, duplicated services, and neighborhoods that are hard to navigate. Instead, you begin with a blueprint: you decide where residential zones go, how they connect to commercial areas, and where utilities run beneath the streets. This blueprint ensures that as the city grows, it remains organized, efficient, and easy to maintain.

Designing a relational database schema follows the same principle. Without a thoughtful blueprint, data becomes duplicated, inconsistent, and slow to retrieve. With a clear schema—tables as zones, relationships as roads, and indexes as express lanes—you create a foundation that scales gracefully and serves queries quickly.

In this article, we’ll walk through the process of drafting that blueprint, layer by layer, using the same iterative approach city planners use: start simple, spot the pain points, add fixes, and evolve toward a modern, high‑performance design.

1. The Atomic Unit: A Single Table

1.1. Starting Simple

We begin with the most basic structure: a single table that holds all the information we need. Suppose we are building a blog platform and want to store posts, authors, and comments in one place:

CREATE TABLE BlogFlat (
    post_id       INTEGER,
    post_title    TEXT,
    post_content  TEXT,
    author_name   TEXT,
    author_email  TEXT,
    comment_id    INTEGER,
    comment_text  TEXT,
    comment_date  DATE
);

At first glance, this seems fine—everything is in one spot.

1.2. The First Pain Point: Redundancy and Update Anomalies

Notice what happens when an author writes multiple posts: their name and email are repeated in every row. If the author changes their email, we must update every row that references them. Forgetting even one leads to inconsistent data—an update anomaly.

Similarly, if a post receives many comments, the post’s title and content are duplicated for each comment row, wasting space and risking insertion anomalies (we can’t add a comment without also repeating post data).

1.3. The Fix: Splitting Into Entities

The solution is to identify the distinct entities in our domain—Post, Author, Comment—and give each its own table. We then link them using foreign keys. This process is the first step of normalization, which we’ll explore formally next.

CREATE TABLE Author (
    author_id   INTEGER PRIMARY KEY,
    name        TEXT    NOT NULL,
    email       TEXT    NOT NULL UNIQUE
);

CREATE TABLE Post (
    post_id     INTEGER PRIMARY KEY,
    title       TEXT    NOT NULL,
    content     TEXT    NOT NULL,
    author_id   INTEGER NOT NULL,
    FOREIGN KEY (author_id) REFERENCES Author(author_id)
);

CREATE TABLE Comment (
    comment_id  INTEGER PRIMARY KEY,
    post_id     INTEGER NOT NULL,
    author_id   INTEGER NOT NULL,
    comment_text TEXT   NOT NULL,
    comment_date DATE   NOT NULL,
    FOREIGN KEY (post_id)     REFERENCES Post(post_id),
    FOREIGN KEY (author_id)   REFERENCES Author(author_id)
);

Now each fact lives in one place. Updating an author’s email touches only the Author table. Adding a comment requires no duplication of post data.

Historical note: This idea of separating entities and linking them with keys was formalized by E.F. Codd in his 1970 paper “A Relational Model of Data for Large Shared Data Banks,” which launched the relational database revolution.

2. Iterative Complexity: Normalization Layers

2.1. First Normal Form (1NF) – Atomicity

The first step ensures every column holds atomic (indivisible) values. No repeating groups or arrays inside a cell.

In our flat table, imagine we stored multiple tags per post as a comma‑separated string: post_tags TEXT. That violates 1NF because the tag value is not atomic. To fix it, we create a separate PostTag table:

CREATE TABLE PostTag (
    post_id INTEGER NOT NULL,
    tag     TEXT    NOT NULL,
    PRIMARY KEY (post_id, tag),
    FOREIGN KEY (post_id) REFERENCES Post(post_id)
);

Now each tag occupies its own row, satisfying 1NF.

2.2. Second Normal Form (2NF) – Full Dependency

2NF builds on 1NF by removing partial dependencies—non‑key attributes that depend on only part of a composite primary key.

Consider a table OrderItems with a composite key (order_id, product_id) and columns like product_name and unit_price. Here product_name depends only on product_id, not the whole key, creating a partial dependency.

We solve it by moving such attributes to a table where they are fully dependent on the key:

-- Before (violates 2NF)
CREATE TABLE OrderItemsBad (
    order_id    INTEGER,
    product_id  INTEGER,
    product_name TEXT,
    unit_price  DECIMAL,
    quantity    INTEGER,
    PRIMARY KEY (order_id, product_id)
);

-- After (2NF)
CREATE TABLE Products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT    NOT NULL,
    unit_price   DECIMAL  NOT NULL
);

CREATE TABLE OrderItems (
    order_id    INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id)  REFERENCES Orders(order_id),
    FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

Now every non‑key column depends on the entire primary key.

2.3. Third Normal Form (3NF) – No Transitive Dependencies

3NF removes transitive dependencies, where a non‑key attribute depends on another non‑key attribute.

Example: a Student table with (student_id, major, advisor_name, advisor_phone). Here advisor_phone depends on advisor_name, which itself depends on student_id (via major). The chain student_id → major → advisor_name → advisor_phone creates a transitive dependency.

We split it:

CREATE TABLE Students (
    student_id INTEGER PRIMARY KEY,
    major      TEXT    NOT NULL,
    advisor_id INTEGER NOT NULL
);

CREATE TABLE Advisors (
    advisor_id INTEGER PRIMARY KEY,
    name       TEXT    NOT NULL,
    phone      TEXT    NOT NULL
);

Now all non‑key attributes depend directly on the primary key of their table.

2.4. Boyce‑Codd Normal Form (BCNF) – Stronger 3NF

BCNF is a stricter version of 3NF that handles cases where a table has multiple overlapping candidate keys. A relation is in BCNF if for every functional dependency X → Y, X is a superkey.

A classic example is a Course table with (student_id, course_id, instructor) where {student_id, course_id} is the primary key, but instructor → course_id (an instructor teaches only one course). Here instructor is not a superkey, violating BCNF.

We decompose:

CREATE TABLE CourseOffering (
    student_id INTEGER NOT NULL,
    course_id  INTEGER NOT NULL,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);

CREATE TABLE Courses (
    course_id   INTEGER PRIMARY KEY,
    instructor  TEXT    NOT NULL
);

Now each table satisfies BCNF.

2.5. When to Denormalize

While normalization eliminates redundancy, it can increase query complexity because related data lives in separate tables, requiring joins. In read‑heavy workloads (e.g., analytics dashboards), we may denormalize intentionally—introducing controlled redundancy to speed up queries.

Common denormalization patterns:

Materialized views that pre‑join tables.
Aggregated tables (e.g., daily sales totals).
Flattened dimensions in star schemas for OLAP.

Denormalization should follow a clear performance need and be accompanied by update strategies (triggers, application logic, or periodic refreshes) to keep data consistent.

3. Visual Literacy: ER Diagrams and Schema Mapping

3.1. From Concepts to Pictures

An Entity‑Relationship (ER) diagram is the city‑planner’s sketch before laying down roads. It shows entities as rectangles, relationships as diamonds, and attributes as ovals.

Below is an ER diagram for our blog platform, expressed with SVGBob markdown (a simple ASCII‑art diagramming tool).

   +----------+       +--------+       +----------+
   |  Author  |       |  Post  |       | Comment  |
   +----------+       +--------+       +----------+
   | author_id|<------| post_id|<------| comment_id|
   | name     |       | title  |       | post_id  |
   | email    |       | content|       | author_id|
   +----------+       | author_id|      | comment_text|
                      +--------+      | comment_date|
                                       +----------+

How to read this diagram:

Each box is a table (entity).
Inside the box are the columns (attributes); the primary key is listed first.
Arrow lines represent foreign‑key relationships.
The “<——” notation indicates the direction of the foreign key (e.g., Post.author_id points to Author.author_id).

3.2. Translating to SQL

The ER diagram maps directly to the SQL we wrote earlier. Notice how the foreign keys enforce the relationships shown by the arrows.

4. Categorical Chunking: Index Types by Intent

Indexes are the express lanes and traffic signals of our database city. They speed up read operations but come with write overhead and storage costs. We group them by the behavior they optimize.

4.1. Tree‑Based Indexes (B‑Tree, B+‑Tree)

Purpose: Efficient equality and range queries, ordered traversal.

B‑Tree: Stores keys in sorted order; each node holds multiple keys and child pointers. Supports <, <=, =, >=, BETWEEN.
B+‑Tree: A variant where all data pointers reside in leaf nodes; internal nodes only guide the search. Leaf nodes are linked, making range scans extremely fast.

When to use: Most general‑purpose columns (IDs, timestamps, names). The default index type in PostgreSQL, MySQL InnoDB, and SQL Server.

4.2. Hash Indexes

Purpose: Lightning‑fast exact‑match lookups (O(1) average).

Uses a hash function to map a key to a bucket.
Excellent for WHERE column = value predicates.
Does not support range queries or ordering because the hash scrambles ordering.

When to use: Columns that are only ever looked up by exact value (e.g., session IDs, UUIDs, cache lookups).

4.3. Bitmap Indexes

Purpose: Efficient filtering on low‑cardinality columns (few distinct values).

Creates a bitmap (bit array) for each distinct value; each bit indicates whether a row has that value.
Combines multiple conditions via bitwise AND/OR, which is extremely fast in CPU‑friendly operations.
Poor choice for high‑cardinality columns (e.g., timestamps) because the bitmap becomes huge and sparse.

When to use: Data‑warehouse scenarios like product categories, status flags, or gender columns.

4.4. Specialized Indexes

Index Type	Intent / Use Case
Filtered	Indexes only rows matching a `WHERE` clause (e.g., `WHERE IsActive = 1`).
Function‑Based	Stores the result of a function (e.g., `UPPER(last_name)`) for case‑insensitive searches.
Spatial	Indexes geometric data (points, polygons) for GIS queries.
Full‑Text (Inverted)	Maps each token to rows containing it; enables fast keyword search.

5. Problem‑Solution Narrative: Indexing in Action

5.1. The Pain Point: Sequential Scans

Without indexes, the database must perform a sequential scan (full table read) for every query that filters on a non‑key column. Imagine searching for all blog posts written in January 2026 by scanning every row—costly as the table grows.

5.2. The Fix: Adding a B‑Tree Index

We create an index on the Post table’s created_at column (assuming we added such a column).

CREATE INDEX idx_post_created_at ON Post(created_at);

Now the optimizer can locate the relevant rows by traversing the B‑tree, dramatically reducing I/O.

5.3. Trade‑Off: Write Overhead

Every INSERT, UPDATE, or DELETE on Post must also update the index. If our workload is write‑heavy (e.g., a logging system), too many indexes can slow throughput. The rule of thumb: index only columns that appear in WHERE, JOIN, ORDER BY, or GROUP BY clauses and monitor the impact.

5.4. Covering Indexes

A covering index includes all columns needed by a query, allowing the engine to satisfy the query entirely from the index without touching the table (a “index‑only scan”).

CREATE INDEX idx_post_covering ON Post(created_at, author_id) INCLUDE (title);

A query like SELECT title FROM Post WHERE created_at >= '2026-01-01' AND author_id = 5 can be served from the index alone.

6. Putting It All Together: Schema Design Checklist

When you walk into an interview, you’ll want to demonstrate a structured approach. Here’s a concise checklist you can articulate:

Gather Requirements – Identify entities, attributes, and business rules.
Draft an ER Diagram – Sketch entities and relationships (one‑to‑many, many‑to‑many, etc.).
Map to Relational Schema – Convert each entity to a table, each relationship to foreign keys.
Apply Normalization –
- Ensure 1NF (atomic values).
- Eliminate partial dependencies → 2NF.
- Eliminate transitive dependencies → 3NF.
- Consider BCNF for overlapping keys.
Identify Access Patterns – Determine which columns are filtered, joined, sorted, or aggregated.
Select Index Types –
- B‑tree/B+‑tree for general purpose (range + equality).
- Hash for exact‑match lookups.
- Bitmap for low‑cardinality filter columns.
- Consider filtered, function‑based, spatial, or full‑text indexes as needed.
Document – Provide clear naming conventions, comments, and a data dictionary.
Plan for Evolution – Use schema versioning tools (e.g., Flyway, Liquibase) and design for backward compatibility.

7. Code Example: Python‑Driven Schema Creation

Below is a self‑contained Python script that uses SQLAlchemy (a popular ORM) to define the blog schema, create the tables in an SQLite file, and add a few indexes. This demonstrates how you might prototype a design before committing to production DDL.

# schema_design.py
from sqlalchemy import (
    create_engine, MetaData, Table, Column,
    Integer, String, Text, DateTime, ForeignKey, Index
)
from sqlalchemy.sql import func

# --- 1. Engine & Metadata -------------------------------------------------
engine = create_engine("sqlite:///blog.db", echo=False, future=True)
metadata = MetaData()

# --- 2. Tables ------------------------------------------------------------
author = Table(
    "author", metadata,
    Column("author_id", Integer, primary_key=True),
    Column("name", String(100), nullable=False),
    Column("email", String(255), nullable=False, unique=True),
)

post = Table(
    "post", metadata,
    Column("post_id", Integer, primary_key=True),
    Column("title", String(255), nullable=False),
    Column("content", Text, nullable=False),
    Column("author_id", Integer, ForeignKey("author.author_id"), nullable=False),
    Column("created_at", DateTime(timezone=True), server_default=func.now()),
)

comment = Table(
    "comment", metadata,
    Column("comment_id", Integer, primary_key=True),
    Column("post_id", Integer, ForeignKey("post.post_id"), nullable=False),
    Column("author_id", Integer, ForeignKey("author.author_id"), nullable=False),
    Column("comment_text", Text, nullable=False),
    Column("comment_date", DateTime(timezone=True), server_default=func.now()),
)

# --- 3. Indexes -----------------------------------------------------------
# B‑tree index for date range queries on posts
Index("idx_post_created_at", post.c.created_at)

# Covering index for recent posts by author (includes title)
Index(
    "idx_post_author_created",
    post.c.author_id,
    post.c.created_at,
    post_c_title=post.c.title,  # INCLUDE via sqlite's "post_c_title" column (SQLite ignores extra columns in index definition)
)

# Hash-like index for exact email lookups (SQLite doesn't have native hash;
# we simulate with a unique B‑tree index, which is still O(log n) but fine for demo)
Index("idx_author_email", author.c.email, unique=True)

# --- 4. Create All --------------------------------------------------------
if __name__ == "__main__":
    metadata.create_all(engine)
    print("Schema created in blog.db")

How to run:

pip install sqlalchemy
python schema_design.py

This script creates blog.db with the tables and indexes we discussed. You can then use SQLAlchemy’s ORM layer or raw SQL to insert/query data, observing how the indexes accelerate lookups.

8. Historical Context: From Early Flat Files to Modern Indexing

1970s: Codd’s relational model introduced tables and keys, replacing navigational (network/hierarchical) models.
1980s: Early commercial DBMS (Oracle, IBM DB2) implemented B‑tree indexes as the primary access method.
1990s: Bitmap indexes rose to fame in data warehouses (e.g., Oracle Exadata) for star‑schema queries.
2000s: Full‑text and spatial indexes became standard as applications demanded richer search.
2010s‑Present: Adaptive indexing, in‑memory column stores (e.g., SAP HANA), and learned indexes (using ML models to predict positions) push performance further.

Understanding this lineage helps you appreciate why we still rely on B‑trees for most workloads while embracing specialized indexes for niche cases.

9. Closing Thoughts

Designing a relational database schema is equal parts art and science. The art lies in capturing the real‑world domain with clarity and foresight; the science rests on applying normalization, choosing the right indexes, and validating performance with measurable benchmarks.

By following the iterative “build‑up” method—start with a single table, spot the pain of redundancy, fix it with normalization, then accelerate queries with thoughtful indexing—you create a foundation that is both flexible (easy to change as requirements evolve) and fast (able to serve thousands of queries per second).

When you face an interview question about schema design, walk the interviewer through your thought process exactly as we did here:

Sketch the ER diagram (show you understand entities and relationships).
Explain normalization steps (prove you can eliminate anomalies).
Propose indexes (demonstrate you know how to optimize reads without crippling writes).
Discuss trade‑offs (show maturity in balancing consistency, performance, and maintainability).

With that narrative, you’ll not only answer the question—you’ll convince the interviewer that you can build databases that scale like well‑planned cities, ready for whatever growth comes next.