# Database Optimization Guide
Practical strategies for PostgreSQL query optimization, indexing, and performance tuning.
## Guide Index
1. [Query Analysis with EXPLAIN](#1-query-analysis-with-explain)
2. [Indexing Strategies](#2-indexing-strategies)
3. [N+1 Query Problem](#3-n1-query-problem)
4. [Connection Pooling](#4-connection-pooling)
5. [Query Optimization Patterns](#5-query-optimization-patterns)
6. [Database Migrations](#6-database-migrations)
7. [Monitoring and Alerting](#7-monitoring-and-alerting)
---
## 1. Query Analysis with EXPLAIN
### Basic EXPLAIN Usage
```sql
-- Show query plan
EXPLAIN SELECT * FROM orders WHERE user_id = 123;
-- Show plan with actual execution times
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
-- Show buffers and I/O statistics
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE user_id = 123;
```
### Reading EXPLAIN Output
```
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using idx_orders_user_id on orders  (cost=0.43..8.45 rows=10 width=120)
  Index Cond: (user_id = 123)
  Buffers: shared hit=3
Planning Time: 0.152 ms
Execution Time: 0.089 ms
```
**Key metrics:**
- `cost`: Estimated cost in planner units (startup..total); 1.0 corresponds to one sequential page read
- `rows`: Estimated row count
- `width`: Average row size in bytes
- `actual time`: Real execution time (with ANALYZE)
- `Buffers: shared hit`: Pages read from cache
### Scan Types (Best to Worst)
| Scan Type | Description | Performance |
|-----------|-------------|-------------|
| Index Only Scan | Data from index alone | Best |
| Index Scan | Index lookup + heap fetch | Good |
| Bitmap Index Scan | Multiple index conditions | Good |
| Index Scan + Filter | Index + row filtering | Okay |
| Seq Scan (small table) | Full table scan | Okay |
| Seq Scan (large table) | Full table scan | Bad |
| Nested Loop (large) | O(n*m) join | Very Bad |
### Warning Signs
```sql
-- BAD: Sequential scan on large table
Seq Scan on orders  (cost=0.00..1854231.00 rows=50000000 width=120)
  Filter: (status = 'pending')
  Rows Removed by Filter: 49500000

-- BAD: Nested loop with high iterations
Nested Loop  (cost=0.43..2847593.20 rows=12500000 width=240)
  ->  Seq Scan on users  (cost=0.00..1250.00 rows=50000 width=120)
  ->  Index Scan on orders  (cost=0.43..45.73 rows=250 width=120)
        Index Cond: (orders.user_id = users.id)
```
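As a sketch of fixing the first plan above (index name illustrative): only about 1% of rows survive the `status = 'pending'` filter, so a partial index stays small and lets the planner skip the full scan.
```sql
-- Hypothetical fix: partial index covering only the rows the query needs
CREATE INDEX CONCURRENTLY idx_orders_pending
ON orders (created_at)
WHERE status = 'pending';

-- Re-running EXPLAIN should now show an Index Scan or Bitmap Heap Scan
-- on idx_orders_pending instead of the Seq Scan
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';
```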
---
## 2. Indexing Strategies
### Index Types
```sql
-- B-tree (default, most common)
CREATE INDEX idx_users_email ON users(email);
-- Hash (equality only, rarely better than B-tree)
CREATE INDEX idx_users_id_hash ON users USING hash(id);
-- GIN (arrays, JSONB, full-text search)
CREATE INDEX idx_products_tags ON products USING gin(tags);
CREATE INDEX idx_users_data ON users USING gin(metadata jsonb_path_ops);
-- GiST (geometric, range types, full-text)
CREATE INDEX idx_locations_point ON locations USING gist(coordinates);
```
### Composite Indexes
```sql
-- Column order matters: equality (=) columns first, then range/sort columns
CREATE INDEX idx_orders_user_status_date
ON orders(user_id, status, created_at DESC);
-- This index supports:
-- WHERE user_id = ?
-- WHERE user_id = ? AND status = ?
-- WHERE user_id = ? AND status = ? ORDER BY created_at DESC
-- WHERE user_id = ? ORDER BY created_at DESC
-- This index does NOT efficiently support:
-- WHERE status = ? (user_id not in query)
-- WHERE created_at > ? (leftmost column not in query)
```
### Partial Indexes
```sql
-- Index only active users (smaller, faster)
CREATE INDEX idx_users_active_email
ON users(email)
WHERE status = 'active';
-- Index only recent orders. Partial-index predicates must use IMMUTABLE
-- expressions, so CURRENT_DATE is not allowed here; use a constant cutoff
-- and recreate the index periodically
CREATE INDEX idx_orders_recent
ON orders(created_at DESC)
WHERE created_at > '2024-01-01';
-- Index only unprocessed items
CREATE INDEX idx_queue_pending
ON job_queue(priority DESC, created_at)
WHERE processed_at IS NULL;
```
### Covering Indexes (Index-Only Scans)
```sql
-- Include non-indexed columns to avoid heap lookup
CREATE INDEX idx_users_email_covering
ON users(email)
INCLUDE (name, created_at);
-- Query can be satisfied from index alone
SELECT name, created_at FROM users WHERE email = 'test@example.com';
-- Result: Index Only Scan
```
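Index-only scans also depend on the table's visibility map: if many pages changed since the last vacuum, PostgreSQL still visits the heap to check row visibility. `Heap Fetches: 0` in `EXPLAIN ANALYZE` output confirms a true index-only scan:
```sql
-- Refresh the visibility map so index-only scans can skip heap fetches
VACUUM users;

-- Verify: look for "Heap Fetches: 0" in the plan
EXPLAIN ANALYZE
SELECT name, created_at FROM users WHERE email = 'test@example.com';
```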
### Index Maintenance
```sql
-- Check index usage (least-used first)
SELECT
  schemaname,
  relname AS table,
  indexrelname AS index,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Find unused indexes (candidates for removal)
SELECT
  indexrelid::regclass AS index,
  relid::regclass AS table,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelid NOT IN (SELECT conindid FROM pg_constraint);
-- Rebuild bloated indexes without blocking writes (PG 12+)
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
```
---
## 3. N+1 Query Problem
### The Problem
```typescript
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users LIMIT 100');
for (const user of users) {
  // This runs 100 times!
  const orders = await db.query(
    'SELECT * FROM orders WHERE user_id = $1',
    [user.id]
  );
  user.orders = orders;
}
// Total queries: 1 + 100 = 101
```
### Solution 1: JOIN
```typescript
// GOOD: Single query with JOIN
// Note: LIMIT applies to joined rows, not users, so limit users in a subquery
const usersWithOrders = await db.query(`
  SELECT u.*, o.id AS order_id, o.total, o.status
  FROM (SELECT * FROM users LIMIT 100) u
  LEFT JOIN orders o ON o.user_id = u.id
`);
// Total queries: 1 (rows still need regrouping per user in application code)
```
### Solution 2: Batch Loading (DataLoader pattern)
```typescript
// GOOD: Two queries with batch loading
const users = await db.query('SELECT * FROM users LIMIT 100');
const userIds = users.map(u => u.id);
const orders = await db.query(
  'SELECT * FROM orders WHERE user_id = ANY($1)',
  [userIds]
);

// Group orders by user_id (groupBy e.g. from lodash)
const ordersByUser = groupBy(orders, 'user_id');
users.forEach(user => {
  user.orders = ordersByUser[user.id] || [];
});
// Total queries: 2
```
### Solution 3: ORM Eager Loading
```typescript
// Prisma
const users = await prisma.user.findMany({
  take: 100,
  include: { orders: true }
});

// TypeORM
const users = await userRepository.find({
  take: 100,
  relations: ['orders']
});

// Sequelize
const users = await User.findAll({
  limit: 100,
  include: [{ model: Order }]
});
```
### Detecting N+1 in Production
```typescript
// Query-counting wrapper: reset queryCount per request (e.g. in request
// middleware), otherwise the count accumulates across requests
let queryCount = 0;
const originalQuery = db.query.bind(db);
db.query = async (...args) => {
  queryCount++;
  if (queryCount > 10) {
    console.warn(`High query count: ${queryCount} in single request`);
    console.trace();
  }
  return originalQuery(...args);
};
```
---
## 4. Connection Pooling
### Why Pooling Matters
```
Without pooling:
  Request → Create connection → Query → Close connection
  (50-100ms overhead per request)

With pooling:
  Request → Get connection from pool → Query → Return to pool
  (0-1ms overhead per request)
```
### pg-pool Configuration
```typescript
import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  port: 5432,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,

  // Pool settings
  min: 5,                         // Minimum connections kept open
  max: 20,                        // Maximum connections
  idleTimeoutMillis: 30000,       // Close idle connections after 30s
  connectionTimeoutMillis: 5000,  // Fail if can't connect in 5s

  // Statement timeout (cancel long queries)
  statement_timeout: 30000,
});

// Health check
pool.on('error', (err, client) => {
  console.error('Unexpected pool error', err);
});
```
### Pool Sizing Formula
```
Optimal connections = (CPU cores * 2) + effective_spindle_count

For an SSD-backed server with 4 cores:
  connections = (4 * 2) + 1 = 9

Across multiple app servers:
  connections_per_server = total_connections / num_servers
```
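To sanity-check the result against the server, compare the configured limit with the connections actually in use:
```sql
-- Server-side connection limit
SHOW max_connections;

-- Connections currently held, per user and application
SELECT usename, application_name, count(*)
FROM pg_stat_activity
GROUP BY usename, application_name
ORDER BY count(*) DESC;
```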
### PgBouncer for High Scale
```ini
# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
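# Transaction pooling reuses server connections between transactions, but
# breaks session state (session-level prepared statements, SET, LISTEN/NOTIFY)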
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
```
---
## 5. Query Optimization Patterns
### Pagination Optimization
```sql
-- BAD: OFFSET is slow for large values
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
-- Must scan 10,020 rows, discard 10,000
-- GOOD: Cursor-based pagination
SELECT * FROM orders
WHERE created_at < '2024-01-15T10:00:00Z'
ORDER BY created_at DESC
LIMIT 20;
-- Only scans 20 rows
```
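If `created_at` is not unique, rows sharing a timestamp can be skipped or repeated across pages. A common refinement (a sketch, assuming an `id` primary key as tiebreaker) is a composite cursor using a row-value comparison:
```sql
-- Composite cursor on (created_at, id) gives a stable, gap-free ordering
SELECT * FROM orders
WHERE (created_at, id) < ('2024-01-15T10:00:00Z', 987654)
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Backed by a matching index:
-- CREATE INDEX idx_orders_cursor ON orders (created_at DESC, id DESC);
```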
### Batch Updates
```sql
-- BAD: Individual updates
UPDATE orders SET status = 'shipped' WHERE id = 1;
UPDATE orders SET status = 'shipped' WHERE id = 2;
-- ...repeat 1000 times
-- GOOD: Batch update
UPDATE orders
SET status = 'shipped'
WHERE id = ANY(ARRAY[1, 2, 3, ...1000]);
-- GOOD: Update from values
UPDATE orders o
SET status = v.new_status
FROM (VALUES
  (1, 'shipped'),
  (2, 'delivered'),
  (3, 'cancelled')
) AS v(id, new_status)
WHERE o.id = v.id;
```
### Avoiding SELECT *
```sql
-- BAD: Fetches all columns including large text/blob
SELECT * FROM articles WHERE published = true;
-- GOOD: Only fetch needed columns
SELECT id, title, summary, author_id, published_at
FROM articles
WHERE published = true;
```
### Using EXISTS vs IN
```sql
-- For checking existence, EXISTS is often faster
-- BAD
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 1000);
-- GOOD (for large subquery results)
SELECT * FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.user_id = u.id AND o.total > 1000
);
```
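A related trap: `NOT IN` behaves surprisingly when the subquery can return NULL, while `NOT EXISTS` does not:
```sql
-- Returns zero rows if any orders.user_id is NULL
-- (x NOT IN (..., NULL) is never true)
SELECT * FROM users
WHERE id NOT IN (SELECT user_id FROM orders);

-- Safe, and typically planned as an efficient anti-join
SELECT * FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);
```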
### Materialized Views for Complex Aggregations
```sql
-- Create materialized view for expensive aggregations
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
  date_trunc('day', created_at) AS date,
  product_id,
  COUNT(*) AS order_count,
  SUM(quantity) AS total_quantity,
  SUM(total) AS total_revenue
FROM orders
GROUP BY date_trunc('day', created_at), product_id;

-- REFRESH ... CONCURRENTLY requires a UNIQUE index on the view
CREATE UNIQUE INDEX idx_daily_sales_date_product
ON daily_sales_summary(date, product_id);

-- Refresh periodically without blocking readers
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;
```
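If the pg_cron extension is available (an assumption; any external scheduler works just as well), the refresh can be scheduled in-database:
```sql
-- Refresh hourly via pg_cron (job name is illustrative)
SELECT cron.schedule(
  'refresh-daily-sales',
  '0 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary'
);
```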
---
## 6. Database Migrations
### Migration Best Practices
```sql
-- Always include rollback
-- migrations/20240115_001_add_user_status.sql
-- UP
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
CREATE INDEX CONCURRENTLY idx_users_status ON users(status);
-- DOWN (in separate file or comment)
DROP INDEX CONCURRENTLY IF EXISTS idx_users_status;
ALTER TABLE users DROP COLUMN IF EXISTS status;
```
### Safe Column Addition
```sql
-- SAFE: Add nullable column (no table rewrite)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- SAFE on PG 11+: non-volatile defaults (constants, now()) are stored as
-- metadata, so no table rewrite is needed
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW();

-- UNSAFE before PG 11: any default forced a full table rewrite
-- ALTER TABLE users ADD COLUMN score INTEGER DEFAULT 0;

-- SAFE alternative on older versions: add, backfill, then set default
ALTER TABLE users ADD COLUMN score INTEGER;
UPDATE users SET score = 0 WHERE score IS NULL;
ALTER TABLE users ALTER COLUMN score SET DEFAULT 0;
ALTER TABLE users ALTER COLUMN score SET NOT NULL;
```
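On a large table, the single `UPDATE` above rewrites every row in one long transaction. A batched backfill keeps each transaction short (a sketch; the batch size is illustrative):
```sql
-- Backfill in small batches; repeat until UPDATE reports 0 rows
UPDATE users SET score = 0
WHERE id IN (
  SELECT id FROM users
  WHERE score IS NULL
  LIMIT 10000
);

-- Only then enforce the constraint:
-- ALTER TABLE users ALTER COLUMN score SET NOT NULL;
```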
### Safe Index Creation
```sql
-- UNSAFE: Locks table
CREATE INDEX idx_orders_user ON orders(user_id);
-- SAFE: Non-blocking
CREATE INDEX CONCURRENTLY idx_orders_user ON orders(user_id);
-- Note: CONCURRENTLY cannot run in a transaction
```
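If a `CREATE INDEX CONCURRENTLY` build fails partway (deadlock, duplicate key, cancelled statement), it leaves behind an `INVALID` index that still adds write overhead. Find and drop leftovers before retrying:
```sql
-- List invalid indexes left behind by failed CONCURRENTLY builds
SELECT indexrelid::regclass AS index
FROM pg_index
WHERE NOT indisvalid;

-- Then: DROP INDEX CONCURRENTLY <the_invalid_index>; and retry the build
```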
### Safe Column Removal
```sql
-- Step 1: Stop writing to column (application change)
-- Step 2: Wait for all deployments
-- Step 3: Drop column
ALTER TABLE users DROP COLUMN IF EXISTS legacy_field;
```
---
## 7. Monitoring and Alerting
### Key Metrics to Monitor
```sql
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Connection by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
-- Long-running queries
SELECT
  pid,
  now() - query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes'
  AND state != 'idle';
-- Largest tables (total, heap, and index sizes)
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
  pg_size_pretty(pg_indexes_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
```
### pg_stat_statements for Query Analysis
```sql
-- Enable extension (also requires shared_preload_libraries = 'pg_stat_statements'
-- in postgresql.conf and a server restart)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slowest queries
SELECT
  round(total_exec_time::numeric, 2) AS total_time_ms,
  calls,
  round(mean_exec_time::numeric, 2) AS avg_time_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage,
  query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Find most frequent queries
SELECT
  calls,
  round(total_exec_time::numeric, 2) AS total_time_ms,
  round(mean_exec_time::numeric, 2) AS avg_time_ms,
  query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```
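After an optimization pass, resetting the counters makes before/after comparisons easier:
```sql
-- Clear accumulated statistics (requires superuser or an explicit grant
-- on pg_stat_statements_reset)
SELECT pg_stat_statements_reset();
```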
### Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| Connection usage | > 70% | > 90% |
| Query time P95 | > 500ms | > 2s |
| Replication lag | > 30s | > 5m |
| Disk usage | > 70% | > 85% |
| Cache hit ratio | < 95% | < 90% |
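The cache hit ratio in the table above can be computed from `pg_stat_database`:
```sql
-- Share of block reads served from shared_buffers (alert below ~0.95)
SELECT
  sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
```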
---
## Quick Reference: PostgreSQL Commands
```sql
-- Check table sizes
SELECT pg_size_pretty(pg_total_relation_size('orders'));
-- Check index sizes
SELECT pg_size_pretty(pg_indexes_size('orders'));
-- Kill a query
SELECT pg_cancel_backend(pid); -- Graceful
SELECT pg_terminate_backend(pid); -- Force
-- Check locks
SELECT * FROM pg_locks WHERE granted = false;
-- Vacuum analyze (update statistics)
VACUUM ANALYZE orders;
-- Check autovacuum status
SELECT * FROM pg_stat_user_tables WHERE relname = 'orders';
```