Imagine you are driving from Milwaukee to Chicago for a job interview. You open Google Maps and type in the destination. In milliseconds, the app doesn’t just find a route — it evaluates dozens of them. It weighs the interstate against city roads, checks for construction, estimates fuel cost, factors in the time of day, and then serves you the optimal path — not just any path.
Now imagine a second version of Maps that ignores all traffic data. It picks the route based purely on a fixed rulebook: “Always take the road with the most lanes.” Sometimes that works fine. Sometimes you sit in traffic for two hours.
This is exactly the difference between a cost-based query optimizer and a rule-based query optimizer in a relational database. And the “turn-by-turn directions” the GPS ultimately generates? That’s your execution plan.
Every time you fire a SQL query at a database like PostgreSQL, MySQL, or SQL Server, there’s a small but brilliant engine working behind the scenes — the Query Optimizer — doing exactly what Google Maps does: evaluating multiple strategies and selecting the fastest, cheapest path to your data.
By the end of this article, you will not only understand how that engine works, but you’ll be able to read its decisions, challenge them, and influence them to write dramatically faster queries.
Before we talk about optimization, we need to understand the full journey a SQL query takes from the moment you press “Run” to the moment rows appear on your screen.
Think of this journey like a manuscript going through a publishing house. The raw text (your SQL) passes through several editorial departments — each one transforming it — before it becomes a printed book (your results).
┌──────────────────────────────────────────────────────────────────────┐
│ THE SQL QUERY PIPELINE │
└──────────────────────────────────────────────────────────────────────┘
┌─────────┐ ┌─────────────┐ ┌───────────────┐ ┌──────────────┐
│ Raw │───▶│ Parser │───▶│ Query Rewriter│───▶│ Optimizer │
│ SQL │ │ (Syntax │ │ (Logical │ │ (Plan │
│ Text │ │ Check) │ │ Transform) │ │ Generator) │
└─────────┘ └─────────────┘ └───────────────┘ └──────┬───────┘
│
▼
┌──────────────┐
│ Executor │
│ (Physical │
│ Operations) │
└──────┬───────┘
│
▼
┌──────────────┐
│ Results │
└──────────────┘
In Figure 1, notice the four distinct stages a query passes through. The optimizer sits squarely in the middle — it receives a logically valid query tree and must decide how to physically execute it.
Let’s walk through each stage:
The parser reads your raw SQL string and checks it for syntactic correctness. It’s analogous to a spell-checker — it doesn’t care what you mean, only whether you’ve used the language correctly. The output is a parse tree, a structured, in-memory representation of your query’s clauses.
-- If you write this:
SELECT name FROM users WEHRE id = 1;
-- ^^^^^ Typo! Parser throws an error immediately.
The parser never touches the disk. It works entirely in memory against the text itself.
The rewriter takes the parse tree and applies logical transformations before any planning happens. This is where views get expanded into their underlying SELECT statements, where IN subqueries sometimes get rewritten as EXISTS, and where permission checks happen.
Think of this stage as a translator who converts shorthand into full prose before it goes to the editor.
This is where the magic happens. The optimizer takes the logically rewritten query tree and asks: “What is the cheapest physical way to execute this?”
It generates multiple candidate execution plans and estimates the cost of each one — in terms of CPU cycles, I/O operations, and memory usage — then selects the plan with the lowest estimated cost.
The executor is the worker on the floor. It takes the chosen plan and carries it out, step by step, retrieving pages from disk, joining rows, sorting, and aggregating. The executor doesn’t think — it just follows instructions.
The optimizer’s core job is simple to state and complex to execute: find the best execution plan without spending more time searching for it than you’d save by running it.
Historically, databases have approached this problem in two fundamentally different ways.
The earliest query optimizers followed a rigid playbook. Given a query, the optimizer would run through a checklist of fixed rules — like a recipe — and produce a plan by applying them in order, without ever looking at how the data was actually distributed.
Common rules included:
This approach is predictable and fast to plan — the optimizer doesn’t need to run statistics queries. Oracle’s original optimizer (pre-Oracle 7) used this approach exclusively.
The fatal flaw: Rules are blind to data. Imagine a table orders where 99% of all rows have status = 'completed'. A rule-based optimizer might happily use an index on status because “use indexes when available.” In reality, a full table scan would be dramatically faster since nearly every row matches anyway. The RBO doesn’t know what it doesn’t look at.
Modern databases — PostgreSQL, MySQL 8+, SQL Server, Oracle 10g+ — use cost-based optimization. Instead of following rules blindly, the CBO:
┌────────────────────────────────────────────────────────────────────┐
│ COST-BASED OPTIMIZER DECISION PROCESS │
└────────────────────────────────────────────────────────────────────┘
Query Tree
│
▼
┌─────────────────┐
│ Plan Generator │ ──── generates ────▶ Plan A (cost: 450)
│ │ Plan B (cost: 120) ◀── WINNER
│ │ Plan C (cost: 780)
└────────┬────────┘
│ uses
▼
┌────────────────────────────────────┐
│ Statistics Catalog │
│ ┌────────────┐ ┌──────────────┐ │
│ │ Row Counts │ │ Histograms │ │
│ │ per table │ │ per column │ │
│ └────────────┘ └──────────────┘ │
│ ┌────────────┐ ┌──────────────┐ │
│ │ Index Info │ │ Cardinality │ │
│ │ │ │ Estimates │ │
│ └────────────┘ └──────────────┘ │
└────────────────────────────────────┘
In Figure 2, the optimizer is only as good as its statistics. When statistics are stale (not recently updated), the cost estimates can be wildly wrong — leading the optimizer to choose a bad plan with complete confidence.
The cost model typically boils down to a formula roughly like:
\[\text{Plan Cost} = (n_{\text{pages}} \times C_{\text{io}}) + (n_{\text{tuples}} \times C_{\text{cpu}})\]Where $C_{\text{io}}$ is the estimated cost per disk page read and $C_{\text{cpu}}$ is the cost per tuple processed. The exact values are tunable constants in most databases.
An execution plan is the optimizer’s final decision, materialized as a tree of physical operations. Each node in the tree is called a plan node or operator, and each one represents one discrete physical action: scan a table, build a hash table, sort rows, etc.
In PostgreSQL (our reference database), you can reveal a query’s execution plan with EXPLAIN:
import psycopg2
conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()
# Get the estimated plan (does NOT run the query)
cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = 42")
plan = cur.fetchall()
for row in plan:
print(row)
# Get the actual plan (RUNS the query, shows real stats)
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42")
actual_plan = cur.fetchall()
for row in actual_plan:
print(row)
The difference is crucial: [web:7]
EXPLAIN — Shows the estimated plan without running the query. Safe for production.EXPLAIN ANALYZE — Actually executes the query and shows real row counts and timings alongside estimates. Use with caution on mutation queries (UPDATE, DELETE) in production — wrap in a BEGIN/ROLLBACK if needed.Here’s a real example plan for a simple join query:
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.id)
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01'
GROUP BY c.name
ORDER BY COUNT(o.id) DESC;
Sort (cost=1200.45..1202.95 rows=1000 width=40)
(actual time=45.2..45.8 rows=982 loops=1)
Sort Key: (count(o.id)) DESC
-> HashAggregate (cost=1145.00..1155.00 rows=1000 width=40)
(actual time=44.1..44.9 rows=982 loops=1)
Group Key: c.name
-> Hash Join (cost=450.00..1100.00 rows=9000 width=32)
(actual time=12.3..40.2 rows=9144 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders (cost=0.00..520.00 rows=9000 width=16)
(actual time=0.05..15.3 rows=9144 loops=1)
Filter: (order_date > '2024-01-01')
-> Hash (cost=250.00..250.00 rows=16000 width=24)
(actual time=11.8..11.8 rows=16000 loops=1)
-> Seq Scan on customers (cost=0.00..250.00 rows=16000 width=24)
(actual time=0.02..6.4 rows=16000 loops=1)
Now let’s decode this. The plan reads bottom-up, inside-out — just like evaluating a nested function call. The deepest indented nodes execute first.
┌──────────────────────────────────────────────────────────────────────┐
│ EXECUTION PLAN TREE (reads bottom-up) │
└──────────────────────────────────────────────────────────────────────┘
┌──────────┐
│ Sort │ ← Step 5: Sort final results
└────┬─────┘
│
┌───────▼────────┐
│ HashAggregate │ ← Step 4: GROUP BY + COUNT()
└───────┬────────┘
│
┌────────▼─────────┐
│ Hash Join │ ← Step 3: Join customers + orders
└──┬───────────┬───┘
│ │
┌───────▼──┐ ┌────▼──────┐
│ Seq Scan │ │ Hash │ ← Step 2: Build hash table
│ orders │ └────┬──────┘
│(filtered)│ │
└──────────┘ ┌────▼──────┐
│ Seq Scan │ ← Step 1: Scan all customers
│ customers │
└───────────┘
In Figure 3, the execution flows upward. Notice how we scan both tables first (Steps 1 and 2), then join them (Step 3), then aggregate (Step 4), then sort (Step 5). Each arrow represents rows flowing upward from child to parent.
Each plan node shows two key numbers in the format (cost=start..total rows=N width=W):
| Field | Meaning |
|---|---|
cost=0.00..520.00 |
Startup cost .. Total cost (in arbitrary units) |
rows=9000 |
Estimated number of rows output |
width=16 |
Estimated average row size in bytes |
actual time=0.05..15.3 |
Real start time .. Real end time (ms) — only with ANALYZE |
loops=1 |
How many times this node was executed |
The most important flag to watch: when estimated rows diverge wildly from actual rows, your statistics are stale and the optimizer made decisions based on wrong assumptions.
Understanding an execution plan means recognizing its building blocks. Let’s categorize the major operators by their behavioral intent.
These are how the database reads raw data from storage. Think of these as your starting ingredients.
Sequential Scan (Seq Scan)
The database reads every single page of the table, in order, from start to finish. This is the “brute force” approach.
Table Pages on Disk:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ P1 │ P2 │ P3 │ P4 │ P5 │ P6 │ P7 │ P8 │
└────┴────┴────┴────┴────┴────┴────┴────┘
──────────────────────────────────────▶
Reads ALL pages
Cost: O(N) where N = number of pages
When is a Seq Scan good? When you’re fetching a large fraction (typically > 5-10%) of the table. For small tables, it’s almost always the right choice since index lookups have overhead.
Index Scan
The database traverses a B-tree index to find specific row pointers, then fetches those rows from the heap (main table storage).
B-Tree Index Heap (Table)
┌─────────────┐ ┌────────────────┐
│ Root Node │ │ Page 1 │
│ [50|100] │ │ ... row id=42 │◀──┐
└──┬──────┬───┘ │ ... │ │
│ │ └────────────────┘ │
┌──▼─┐ ┌─▼──┐ pointer ┌────────────────┐ │
│Leaf│ │Leaf│ ───────────────▶│ Page 7 │ │
│ 42 │ │ 98 │ │ ... row id=98 │◀──┘
└────┘ └────┘ └────────────────┘
Cost: O(log N) for index + O(K) for heap fetches
where K = number of matching rows
In Figure 5, each leaf of the B-tree points to a specific row location on disk. For small result sets, this is dramatically faster than scanning every page.
Index Only Scan
The best case: all the columns your query needs exist inside the index itself. The database never touches the heap at all.
# This query can be satisfied by an index-only scan
# IF an index exists on (customer_id, order_date)
cur.execute("""
SELECT customer_id, order_date
FROM orders
WHERE customer_id = 42
""")
# The index already has both columns — no heap visit needed!
Bitmap Index Scan + Bitmap Heap Scan
This is the hybrid approach — a middle ground between Index Scan and Seq Scan. PostgreSQL first scans the index to build a bitmap (a memory structure marking which heap pages contain matching rows), then reads only those heap pages in physical order. This avoids random I/O.
Phase 1: Scan Index, Build Bitmap
┌──────────────────────────────────┐
│ Bitmap (in memory) │
│ Page 1: ✓ Page 2: ✗ Page 3: ✓ │
│ Page 4: ✓ Page 5: ✗ Page 6: ✗ │
└──────────────────────────────────┘
Phase 2: Read only marked pages in order
┌────┐ ┌────┐ ┌────┐
│ P1 │─────▶│ P3 │─────▶│ P4 │ (sequential access pattern!)
└────┘ └────┘ └────┘
Joins are the most performance-sensitive part of most queries. The optimizer must choose between three physical join strategies.
Nested Loop Join
The simplest conceptually: for each row in the “outer” table, scan the “inner” table looking for matches.
# Conceptual Python equivalent:
results = []
for outer_row in table_A: # outer loop
for inner_row in table_B: # inner loop
if outer_row.id == inner_row.a_id:
results.append((outer_row, inner_row))
Outer Table (small) Inner Table (large, with index)
┌────────────────┐ ┌────────────────────────────┐
│ Row 1 ─────────┼────▶│ Index Lookup → matching row │
│ Row 2 ─────────┼────▶│ Index Lookup → matching row │
│ Row 3 ─────────┼────▶│ Index Lookup → matching row │
└────────────────┘ └────────────────────────────┘
Best when: outer table is small AND inner has an index
Cost: O(M × log N) where M = outer rows, N = inner rows
Hash Join
Build a hash table from the smaller table (“build side”), then probe it with every row from the larger table (“probe side”).
# Conceptual Python equivalent:
hash_table = {}
for row in smaller_table: # Build phase
key = row['id']
hash_table[key] = row
results = []
for row in larger_table: # Probe phase
key = row['foreign_key']
if key in hash_table:
results.append((row, hash_table[key]))
This is the workhorse for large, un-indexed joins. The initial build phase has cost, but probing is O(1) per row. [web:7]
Merge Join
Requires both inputs to be sorted on the join key. It then performs a single linear pass through both sorted lists simultaneously — like merging two sorted card decks.
Table A (sorted by id): Table B (sorted by a_id):
┌───┐ ┌───┐
│ 1 │──────────────────────│ 1 │ ← Match! Emit row.
│ 2 │──────────────────────│ 2 │ ← Match! Emit row.
│ 3 │ ┐ │ 4 │ ← A=3 < B=4, advance A.
│ 5 │ └───────────────▶│ 4 │ ← A=5 > B=4, advance B.
│ 5 │──────────────────────│ 5 │ ← Match! Emit row.
└───┘ └───┘
Best when: both inputs are already sorted (e.g., from index scans)
Cost: O(M log M + N log N) if sorting needed, O(M + N) if pre-sorted
| Strategy | Best For | Requires | Memory |
|---|---|---|---|
| Nested Loop | Small outer, indexed inner | Index on join key | Low |
| Hash Join | Large, unsorted tables | Nothing | Medium-High |
| Merge Join | Pre-sorted or already indexed | Sort order | Low |
HashAggregate
Groups rows using an in-memory hash table. Fast when the number of distinct groups fits in memory (work_mem in PostgreSQL).
GroupAggregate
Operates on pre-sorted input. If rows are already sorted by the GROUP BY key (from an index or a prior Sort node), GroupAggregate processes them in a single linear pass without building a hash table.
Sort
Sorts rows by one or more keys. If the data fits in work_mem, it’s an in-memory quicksort. If not, it spills to disk — a condition you’ll see flagged as “Batches: N” in the plan output when N > 1.
Here’s the central insight that every senior engineer understands: the optimizer is only as smart as its statistics. Bad statistics → bad cost estimates → bad plan choice → slow queries. The optimizer cannot improve its reasoning beyond the quality of data it receives.
PostgreSQL’s pg_statistic catalog stores, for every column:
n_distinct — Estimated number of distinct values (cardinality)null_frac — Fraction of null valuesmost_common_vals — The N most frequently occurring valuesmost_common_freqs — The frequency of each common valuehistogram_bounds — Bucket boundaries for non-common values (a histogram) Column: order_status
┌─────────────────────────────────────────────────────┐
│ Most Common Values: │
│ 'completed' → 72% │
│ 'pending' → 18% │
│ 'cancelled' → 8% │
│ 'refunded' → 2% │
└─────────────────────────────────────────────────────┘
The optimizer uses this to estimate:
WHERE status = 'completed' → ~72% of rows (Seq Scan likely)
WHERE status = 'refunded' → ~2% of rows (Index Scan likely)
In Figure 9, notice that the same column with different filter values can lead to completely different plan choices. This is the power of histogram-based statistics.
In PostgreSQL, ANALYZE collects fresh statistics:
import psycopg2
conn = psycopg2.connect("dbname=mydb user=postgres")
conn.autocommit = True
cur = conn.cursor()
# Analyze a specific table
cur.execute("ANALYZE orders;")
# Analyze entire database
cur.execute("ANALYZE;")
PostgreSQL also runs autovacuum (which includes auto-analyze) in the background, but on heavily updated tables, manual ANALYZE after bulk operations is a best practice.
Cardinality estimation — predicting how many rows an operation will return — is one of the hardest problems in database engineering. When you have a multi-column filter like WHERE status = 'completed' AND region = 'West', the optimizer typically assumes statistical independence between columns:
If status and region are correlated (e.g., the West region disproportionately has ‘pending’ orders), this assumption breaks down. PostgreSQL 10+ introduced extended statistics to address this:
# Create extended stats for correlated columns
cur.execute("""
CREATE STATISTICS order_stats (dependencies)
ON status, region
FROM orders;
""")
cur.execute("ANALYZE orders;")
Armed with our understanding of how plans work, let’s look at the primary levers we can pull to improve query performance.
Indexes are the single highest-leverage optimization tool. But indexes are not free — each index adds write overhead and storage cost. The goal is targeted, surgical indexing.
Composite Indexes and Column Order
For a composite index, column order matters enormously. The rule is: equality predicates first, then range predicates.
# Query pattern we want to optimize:
# WHERE status = 'completed' AND order_date > '2024-01-01'
# GOOD: equality column first
cur.execute("""
CREATE INDEX idx_orders_status_date
ON orders (status, order_date);
""")
# BAD: range column first — the index is far less useful
# because the range on order_date prevents efficient use
# of the status part of the index for this specific query
cur.execute("""
CREATE INDEX idx_orders_date_status
ON orders (order_date, status);
""")
Partial Indexes
When you frequently query a subset of data, a partial index covers only that subset — smaller, faster, and more maintainable:
# If 90% of queries only look at 'pending' orders:
cur.execute("""
CREATE INDEX idx_pending_orders
ON orders (created_at)
WHERE status = 'pending';
""")
# This index is tiny compared to a full-table index
# but serves the most common query pattern perfectly.
Covering Indexes
Add frequently-accessed non-filter columns to enable Index Only Scans:
# Query: SELECT id, name FROM customers WHERE email = 'x@y.com'
# Regular index: must visit heap for 'name'
# Covering index: heap visit eliminated
cur.execute("""
CREATE INDEX idx_customers_email_covering
ON customers (email)
INCLUDE (name, id);
""")
Sometimes the way we write SQL inadvertently confuses the optimizer or prevents index usage.
The OR trap:
# BAD: OR on different columns often prevents index usage
cur.execute("""
SELECT * FROM orders
WHERE customer_id = 42 OR support_id = 42;
""")
# GOOD: UNION ALL — each branch can use its own index
cur.execute("""
SELECT * FROM orders WHERE customer_id = 42
UNION ALL
SELECT * FROM orders WHERE support_id = 42;
""")
Functions on indexed columns:
# BAD: Function on the column defeats the index!
# The optimizer cannot use an index on order_date
# when you wrap it in a function.
cur.execute("""
SELECT * FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024;
""")
# GOOD: Rewrite as a range query — index-friendly
cur.execute("""
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
""")
Avoiding SELECT *:
# BAD: SELECT * forces the executor to read all columns
# and prevents Index Only Scans
cur.execute("SELECT * FROM customers WHERE email = 'x@y.com'")
# GOOD: Project only what you need
cur.execute("SELECT id, name FROM customers WHERE email = 'x@y.com'")
Sometimes, especially when statistics are hard to keep current, the optimizer chooses a suboptimal plan. Most databases offer escape hatches.
In PostgreSQL, you can disable specific join types or scan methods at the session level:
# Temporarily disable hash joins to force merge or nested loop
cur.execute("SET enable_hashjoin = OFF;")
cur.execute("SET enable_seqscan = OFF;")
# Run your query
cur.execute("EXPLAIN ANALYZE SELECT ...")
# Always re-enable! This is session-scoped but be careful.
cur.execute("SET enable_hashjoin = ON;")
cur.execute("SET enable_seqscan = ON;")
⚠️ Caution: Disabling planner methods is a diagnostic tool, not a permanent fix. If you need the optimizer to consistently make a different choice, the right solution is to improve statistics, add an index, or rewrite the query.
The work_mem setting controls how much memory each sort or hash operation can use before spilling to disk. Spill-to-disk events are catastrophic for performance — often 10-100x slower than in-memory operations.
# Check if a sort or hash operation spilled to disk
# Look for "Batches: N" where N > 1 in EXPLAIN ANALYZE output
# Increase work_mem for a specific heavy query
cur.execute("SET work_mem = '256MB';")
cur.execute("EXPLAIN ANALYZE SELECT ... ORDER BY ...")
cur.execute("SET work_mem = '4MB';") # Reset to default
# Note: work_mem applies PER SORT OPERATION, not per query
# A query with 3 sorts uses up to 3 × work_mem simultaneously
One of the most common performance anti-patterns in application code is the N+1 query problem. It doesn’t appear in a single execution plan — because it’s the result of many separate queries, each individually fast, combined catastrophically.
The Pattern:
# Naive ORM-style code — produces N+1 queries
cur.execute("SELECT id FROM customers WHERE active = true")
customers = cur.fetchall() # Query 1
orders_by_customer = {}
for customer in customers: # N more queries!
cur.execute(
"SELECT * FROM orders WHERE customer_id = %s",
(customer,)
)
orders_by_customer[customer] = cur.fetchall()
If there are 1,000 active customers, this fires 1,001 queries. The solution is a single JOIN:
# GOOD: Single query with a JOIN
cur.execute("""
SELECT c.id, c.name, o.id as order_id, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE c.active = true
""")
results = cur.fetchall()
One execution plan. One round trip. Dramatically faster.
Common Table Expressions (CTEs, the WITH clause) are powerful organizational tools, but their optimization behavior is often misunderstood.
In PostgreSQL prior to version 12, CTEs were optimization fences — the planner treated each CTE as a materialized subquery, executing it fully before the outer query could push predicates into it:
# In PostgreSQL < 12, this CTE would scan ALL orders
# before filtering — the WHERE in the outer query
# couldn't be "pushed down" into the CTE
cur.execute("""
WITH recent_orders AS (
SELECT * FROM orders -- scanned ALL rows
)
SELECT * FROM recent_orders
WHERE order_date > '2024-01-01'; -- filter too late!
""")
From PostgreSQL 12 onward, non-recursive CTEs that are referenced only once are inlined by default, removing this barrier. But explicitly marking a CTE as MATERIALIZED or NOT MATERIALIZED gives you fine-grained control:
# Force materialization (pre-12 behavior) — useful when
# the CTE result is used multiple times and recomputation is expensive
cur.execute("""
WITH MATERIALIZED expensive_calc AS (
SELECT customer_id, SUM(total) as lifetime_value
FROM orders
GROUP BY customer_id
)
SELECT c.name, ec.lifetime_value
FROM customers c
JOIN expensive_calc ec ON ec.customer_id = c.id
WHERE ec.lifetime_value > 10000;
""")
You now have the conceptual framework. Let’s map this to what actually appears in technical interviews.
When asked “How would you investigate a slow query?”, walk through this sequence:
┌───────────────────────────────────────────────────────────┐
│ SLOW QUERY DIAGNOSIS FLOWCHART │
└───────────────────────────────────────────────────────────┘
┌─────────────────┐
│ EXPLAIN ANALYZE │
│ the query │
└────────┬─────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Are estimated rows ≈ actual rows? │
│ NO ─────────────────────────────▶ Run ANALYZE │
│ YES │
└────────┬────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Are there Seq Scans on large tables? │
│ YES ─────────────────────────────▶ Add Index │
│ NO │
└────────┬─────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Are there Sort nodes with Batches > 1? │
│ YES ─────────────────────────────▶ Increase │
│ NO work_mem │
└────────┬─────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Any Hash Join on huge tables? │
│ Consider composite indexes for Merge/NL Join │
└──────────────────────────────────────────────────┘
Memorize these as interview talking points:
ANALYZEBatches: 4 → Sorting spilled to disk; increase work_mem# Interviewer gives you this query and says it's slow:
cur.execute("""
SELECT u.name, COUNT(p.id) as post_count
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
WHERE LOWER(u.email) = 'alice@example.com'
GROUP BY u.name;
""")
Your answer: The LOWER() function on u.email defeats any index on the email column. The fix is either:
# Option 1: Create a functional index
cur.execute("""
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));
""")
# Option 2: Rewrite the query to avoid the function
cur.execute("""
SELECT u.name, COUNT(p.id) as post_count
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
WHERE u.email = 'alice@example.com' -- assume emails stored in lowercase
GROUP BY u.name;
""")
Let’s walk through a complete real-world optimization scenario from start to finish.
Setup: An e-commerce platform is experiencing slow report generation. The offending query:
# SLOW: Report of top 100 customers by revenue in the last 90 days
cur.execute("""
SELECT
c.id,
c.name,
c.email,
SUM(o.total_amount) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
AND o.status = 'completed'
GROUP BY c.id, c.name, c.email
ORDER BY revenue DESC
LIMIT 100;
""")
Step 1 — Run EXPLAIN ANALYZE:
Sort (cost=85000..85000 rows=100 width=64)
(actual time=4821.3..4821.4 rows=100 loops=1)
-> HashAggregate (cost=72000..73000 rows=50000 width=64)
(actual time=4750.2..4820.1 rows=48392 loops=1)
-> Hash Join (cost=1200..55000 rows=2200000 width=48)
(actual time=45.2..3100.4 rows=2314892 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders (cost=0..45000 rows=2200000)
(actual time=0.1..1200.3 rows=2314892 loops=1)
Filter: (created_at >= NOW()-'90 days'
AND status='completed')
Rows Removed by Filter: 4821033
-> Hash on customers (cost=800..800 rows=32000)
(actual time=40.1..40.1 rows=32000 loops=1)
Step 2 — Diagnose:
orders reads 7.1 million rows and discards 4.8 million. Massive waste.(created_at, status) or (status, created_at).Step 3 — Fix:
# Create a composite index: equality filter (status) first,
# then range filter (created_at)
cur.execute("""
CREATE INDEX idx_orders_status_created
ON orders (status, created_at)
INCLUDE (customer_id, total_amount);
""")
# The INCLUDE adds our needed columns for a potential index-only scan
cur.execute("ANALYZE orders;")
Step 4 — Verify:
Sort (cost=12000..12003 rows=100 width=64)
(actual time=312.4..312.5 rows=100 loops=1)
-> HashAggregate (cost=9800..10300 rows=50000 width=64)
(actual time=295.1..311.8 rows=48392 loops=1)
-> Hash Join (cost=1200..7500 rows=92000 width=48)
(actual time=45.2..212.3 rows=2314892 loops=1)
-> Index Scan on orders (cost=0.56..4800 rows=92000)
(actual time=0.09..98.2 rows=92000 loops=1)
Index Cond: (status='completed'
AND created_at >= NOW()-'90 days')
The query went from 4.8 seconds → 312 ms — a 15x speedup — by adding a single well-designed index. The key insight: the Seq Scan that was reading 7.1 million rows is now an Index Scan reading only 92,000.
| Scenario | Root Cause | Fix |
|---|---|---|
| Seq Scan on large table | Missing index | CREATE INDEX on filter columns |
| Index not used | Function wrapping column | Functional index or rewrite predicate |
| Estimated ≠ Actual rows | Stale statistics | ANALYZE table_name |
| Sort “Batches: N” (N>1) | work_mem too low | SET work_mem = '256MB' |
| Slow JOIN | No index on join key | Add index on FK column |
| N+1 queries | Application-level loop | Rewrite as single JOIN |
| CTE not optimized well | Optimization fence (pre-PG12) | Use NOT MATERIALIZED or inline |
| Hash Join on tiny tables | Stats misleading optimizer | Run ANALYZE, adjust statistics target |
We traveled the full arc from the GPS analogy through the four stages of query processing, decoded the two optimizer philosophies, learned to read execution plan trees, catalogued every major operator type, explored how statistics fuel the optimizer’s decisions, and practiced a complete optimization workflow end-to-end.
The recurring theme: the optimizer is a probabilistic reasoner working from imperfect statistics and a finite search budget. Your job as an engineer is not to fight it — it’s to give it better information (statistics, indexes, schema design) so it can make better decisions automatically.
Interview Mindset: When asked about query performance, always follow the evidence: get the plan first, then diagnose, then fix. Never guess and add indexes blindly. The execution plan is the ground truth.