feat(engineering,ra-qm): add secrets-vault-manager, sql-database-assistant, gcp-cloud-architect, soc2-compliance

secrets-vault-manager (403-line SKILL.md, 3 scripts, 3 references):
- HashiCorp Vault, AWS SM, Azure KV, GCP SM integration
- Secret rotation, dynamic secrets, audit logging, emergency procedures

sql-database-assistant (457-line SKILL.md, 3 scripts, 3 references):
- Query optimization, migration generation, schema exploration
- Multi-DB support (PostgreSQL, MySQL, SQLite, SQL Server)
- ORM patterns (Prisma, Drizzle, TypeORM, SQLAlchemy)

gcp-cloud-architect (418-line SKILL.md, 3 scripts, 3 references):
- 6-step workflow mirroring aws-solution-architect for GCP
- Cloud Run, GKE, BigQuery, Cloud Functions, cost optimization
- Completes cloud trifecta (AWS + Azure + GCP)

soc2-compliance (417-line SKILL.md, 3 scripts, 3 references):
- SOC 2 Type I & II preparation, Trust Service Criteria mapping
- Control matrix generation, evidence tracking, gap analysis
- First SOC 2 skill in ra-qm-team (joins GDPR, ISO 27001, ISO 13485)

All 12 scripts pass --help. Docs generated, mkdocs.yml nav updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Reza Rezvani
2026-03-25 14:05:11 +01:00
parent 7a2189fa21
commit 87f3a007c9
36 changed files with 13450 additions and 6 deletions


@@ -0,0 +1,457 @@
---
name: "sql-database-assistant"
description: "Use when the user asks to write SQL queries, optimize database performance, generate migrations, explore database schemas, or work with ORMs like Prisma, Drizzle, TypeORM, or SQLAlchemy."
---
# SQL Database Assistant - POWERFUL Tier Skill
## Overview
The operational companion to database design. While **database-designer** focuses on schema architecture and **database-schema-designer** handles ERD modeling, this skill covers the day-to-day: writing queries, optimizing performance, generating migrations, and bridging the gap between application code and database engines.
### Core Capabilities
- **Natural Language to SQL** — translate requirements into correct, performant queries
- **Schema Exploration** — introspect live databases across PostgreSQL, MySQL, SQLite, SQL Server
- **Query Optimization** — EXPLAIN analysis, index recommendations, N+1 detection, rewrite patterns
- **Migration Generation** — up/down scripts, zero-downtime strategies, rollback plans
- **ORM Integration** — Prisma, Drizzle, TypeORM, SQLAlchemy patterns and escape hatches
- **Multi-Database Support** — dialect-aware SQL with compatibility guidance
### Tools
| Script | Purpose |
|--------|---------|
| `scripts/query_optimizer.py` | Static analysis of SQL queries for performance issues |
| `scripts/migration_generator.py` | Generate migration file templates from change descriptions |
| `scripts/schema_explorer.py` | Generate schema documentation from introspection queries |
---
## Natural Language to SQL
### Translation Patterns
When converting requirements to SQL, follow this sequence:
1. **Identify entities** — map nouns to tables
2. **Identify relationships** — map verbs to JOINs or subqueries
3. **Identify filters** — map adjectives/conditions to WHERE clauses
4. **Identify aggregations** — map "total", "average", "count" to GROUP BY
5. **Identify ordering** — map "top", "latest", "highest" to ORDER BY + LIMIT
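As a worked example, the five-step mapping for "show total order amount per customer, highest first" can be traced with Python's built-in `sqlite3` (the `orders` schema here is hypothetical):

```python
import sqlite3

# Step 1 (entities): "orders" -> orders table.
# Step 4 (aggregation): "total" -> SUM + GROUP BY.
# Step 5 (ordering): "highest first" -> ORDER BY ... DESC.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 30), ('bob', 10), ('alice', 20);
""")
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('alice', 50.0), ('bob', 10.0)]
```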
### Common Query Templates
**Top-N per group (window function)**
```sql
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
FROM employees
) ranked WHERE rn <= 3;
```
**Running totals**
```sql
SELECT date, amount,
SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;
```
**Gap detection**
```sql
SELECT curr.id, curr.seq_num, prev.seq_num AS prev_seq
FROM records curr
LEFT JOIN records prev ON prev.seq_num = curr.seq_num - 1
WHERE prev.id IS NULL AND curr.seq_num > 1;
```
**UPSERT (PostgreSQL)**
```sql
INSERT INTO settings (key, value, updated_at)
VALUES ('theme', 'dark', NOW())
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value, updated_at = EXCLUDED.updated_at;
```
**UPSERT (MySQL)**
```sql
INSERT INTO settings (key_name, value, updated_at)
VALUES ('theme', 'dark', NOW())
ON DUPLICATE KEY UPDATE value = VALUES(value), updated_at = VALUES(updated_at);
```
> See references/query_patterns.md for JOINs, CTEs, window functions, JSON operations, and more.
---
## Schema Exploration
### Introspection Queries
**PostgreSQL — list tables and columns**
```sql
SELECT table_name, column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```
**PostgreSQL — foreign keys**
```sql
SELECT tc.table_name, kcu.column_name,
ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
```
**MySQL — table sizes**
```sql
SELECT table_name, table_rows,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_length DESC;
```
**SQLite — schema dump**
```sql
SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name;
```
**SQL Server — columns with types**
```sql
SELECT t.name AS table_name, c.name AS column_name,
ty.name AS data_type, c.max_length, c.is_nullable
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
ORDER BY t.name, c.column_id;
```
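The same introspection can be driven programmatically. A minimal sketch using Python's built-in `sqlite3` and its `PRAGMA` interface (table names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id));
""")
# Tables, as in the sqlite_master query above
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
# Columns: PRAGMA table_info returns (cid, name, type, notnull, default, pk)
cols = [(r[1], r[2], bool(r[3])) for r in conn.execute("PRAGMA table_info(users)")]
# Foreign keys: each row starts (id, seq, referenced_table, from_col, to_col, ...)
fks = conn.execute("PRAGMA foreign_key_list(posts)").fetchall()
print(tables)  # ['posts', 'users']
```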
### Generating Documentation from Schema
Use `scripts/schema_explorer.py` to produce markdown or JSON documentation:
```bash
python scripts/schema_explorer.py --dialect postgres --tables all --format md
python scripts/schema_explorer.py --dialect mysql --tables users,orders --format json
```
---
## Query Optimization
### EXPLAIN Analysis Workflow
1. **Run EXPLAIN ANALYZE** (PostgreSQL) or **EXPLAIN FORMAT=JSON** (MySQL)
2. **Identify the costliest node** — Seq Scan on large tables, Nested Loop with high row estimates
3. **Check for missing indexes** — sequential scans on filtered columns
4. **Look for estimation errors** — planned vs actual rows divergence signals stale statistics
5. **Evaluate JOIN order** — ensure the smallest result set drives the join
### Index Recommendation Checklist
- Columns in WHERE clauses with high selectivity
- Columns in JOIN conditions (foreign keys)
- Columns in ORDER BY when combined with LIMIT
- Composite indexes matching multi-column WHERE predicates (most selective column first)
- Partial indexes for queries with constant filters (e.g., `WHERE status = 'active'`)
- Covering indexes to avoid table lookups for read-heavy queries
### Query Rewriting Patterns
| Anti-Pattern | Rewrite |
|-------------|---------|
| `SELECT * FROM orders` | `SELECT id, status, total FROM orders` (explicit columns) |
| `WHERE YEAR(created_at) = 2025` | `WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'` (sargable) |
| Correlated subquery in SELECT | LEFT JOIN with aggregation |
| `NOT IN (SELECT ...)` with NULLs | `NOT EXISTS (SELECT 1 ...)` |
| `UNION` (dedup) when not needed | `UNION ALL` |
| `LIKE '%search%'` | Full-text search index (GIN/FULLTEXT) |
| `ORDER BY RAND()` | Application-side random sampling or `TABLESAMPLE` |
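The sargability rewrite in the table above can be verified empirically with `EXPLAIN QUERY PLAN`, shown here via Python's built-in `sqlite3` (using `strftime` as the SQLite analogue of `YEAR()`; the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, total REAL);
    CREATE INDEX idx_orders_created ON orders (created_at);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the plan detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Non-sargable: wrapping the indexed column in a function forces a full scan.
p_fn = plan("SELECT total FROM orders WHERE strftime('%Y', created_at) = '2025'")
# Sargable: a plain range predicate on the bare column can use the index.
p_range = plan("SELECT total FROM orders "
               "WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'")
print(p_fn)     # a SCAN node
print(p_range)  # a SEARCH node using idx_orders_created
```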
### N+1 Detection
**Symptoms:**
- Application loop that executes one query per parent row
- ORM lazy-loading related entities inside a loop
- Query log shows hundreds of identical SELECT patterns with different IDs
**Fixes:**
- Use eager loading (`include` in Prisma, `joinedload` in SQLAlchemy)
- Batch queries with `WHERE id IN (...)`
- Use DataLoader pattern for GraphQL resolvers
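A minimal before/after sketch of the batching fix, using Python's built-in `sqlite3` (schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")
user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]

# N+1 anti-pattern: one query per user inside the loop.
# for uid in user_ids:
#     conn.execute("SELECT title FROM posts WHERE user_id = ?", (uid,))

# Fix: a single batched query, grouped in application code.
placeholders = ",".join("?" * len(user_ids))
posts_by_user = {}
for uid, title in conn.execute(
        f"SELECT user_id, title FROM posts WHERE user_id IN ({placeholders})",
        user_ids):
    posts_by_user.setdefault(uid, []).append(title)
print(posts_by_user)  # {1: ['a', 'b'], 2: ['c']}
```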
### Static Analysis Tool
```bash
python scripts/query_optimizer.py --query "SELECT * FROM orders WHERE status = 'pending'" --dialect postgres
python scripts/query_optimizer.py --query queries.sql --dialect mysql --json
```
> See references/optimization_guide.md for EXPLAIN plan reading, index types, and connection pooling.
---
## Migration Generation
### Zero-Downtime Migration Patterns
**Adding a column (safe)**
```sql
-- Up
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Down
ALTER TABLE users DROP COLUMN phone;
```
**Renaming a column (expand-contract)**
```sql
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Step 2: Backfill
UPDATE users SET full_name = name;
-- Step 3: Deploy app reading both columns
-- Step 4: Deploy app writing only new column
-- Step 5: Drop old column
ALTER TABLE users DROP COLUMN name;
```
**Adding a NOT NULL column (safe sequence)**
```sql
-- Step 1: Add nullable
ALTER TABLE orders ADD COLUMN region VARCHAR(50);
-- Step 2: Backfill with default
UPDATE orders SET region = 'unknown' WHERE region IS NULL;
-- Step 3: Add constraint
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
ALTER TABLE orders ALTER COLUMN region SET DEFAULT 'unknown';
```
**Index creation (non-blocking, PostgreSQL)**
```sql
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
```
### Data Backfill Strategies
- **Batch updates** — process in chunks of 1000-10000 rows to avoid lock contention
- **Background jobs** — run backfills asynchronously with progress tracking
- **Dual-write** — write to old and new columns during transition period
- **Validation queries** — verify row counts and data integrity after each batch
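The batch-update strategy above, sketched with Python's built-in `sqlite3` (batch size shrunk for the demo; the `orders`/`region` schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany("INSERT INTO orders (region) VALUES (?)", [(None,)] * 25)

BATCH = 10  # use 1000-10000 in production
backfilled = 0
while True:
    # Touch at most BATCH rows per transaction to keep lock windows short.
    cur = conn.execute(
        "UPDATE orders SET region = 'unknown' WHERE id IN "
        "(SELECT id FROM orders WHERE region IS NULL LIMIT ?)", (BATCH,))
    conn.commit()
    if cur.rowcount == 0:
        break
    backfilled += cur.rowcount

# Validation query: verify nothing was missed.
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE region IS NULL").fetchone()[0]
print(backfilled, remaining)  # 25 0
```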
### Rollback Strategies
Every migration must have a reversible down script. For irreversible changes:
1. **Backup before execution** — `pg_dump` the affected tables
2. **Feature flags** — application can switch between old/new schema reads
3. **Shadow tables** — keep a copy of the original table during migration window
### Migration Generator Tool
```bash
python scripts/migration_generator.py --change "add email_verified boolean to users" --dialect postgres --format sql
python scripts/migration_generator.py --change "rename column name to full_name in customers" --dialect mysql --format alembic --json
```
---
## Multi-Database Support
### Dialect Differences
| Feature | PostgreSQL | MySQL | SQLite | SQL Server |
|---------|-----------|-------|--------|------------|
| UPSERT | `ON CONFLICT DO UPDATE` | `ON DUPLICATE KEY UPDATE` | `ON CONFLICT DO UPDATE` | `MERGE` |
| Boolean | Native `BOOLEAN` | `TINYINT(1)` | `INTEGER` | `BIT` |
| Auto-increment | `SERIAL` / `GENERATED` | `AUTO_INCREMENT` | `INTEGER PRIMARY KEY` | `IDENTITY` |
| JSON | `JSONB` (indexed) | `JSON` | Text (ext) | `NVARCHAR(MAX)` |
| Array | Native `ARRAY` | Not supported | Not supported | Not supported |
| CTE (recursive) | Full support | 8.0+ | 3.8.3+ | Full support |
| Window functions | Full support | 8.0+ | 3.25.0+ | Full support |
| Full-text search | `tsvector` + GIN | `FULLTEXT` index | FTS5 extension | Full-text catalog |
| LIMIT/OFFSET | `LIMIT n OFFSET m` | `LIMIT n OFFSET m` | `LIMIT n OFFSET m` | `OFFSET m ROWS FETCH NEXT n ROWS ONLY` |
### Compatibility Tips
- **Always use parameterized queries** — prevents SQL injection across all dialects
- **Avoid dialect-specific functions in shared code** — wrap in adapter layer
- **Test migrations on target engine** — `information_schema` varies between engines
- **Use ISO date format** — `'YYYY-MM-DD'` works everywhere
- **Quote identifiers** — use double quotes (SQL standard) or backticks (MySQL)
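A quick demonstration of why parameterization is non-negotiable, using Python's built-in `sqlite3` (placeholder syntax varies by driver: `?` here, `%s` in psycopg, named binds elsewhere):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

evil = "' OR '1'='1"
# Bound as a parameter, the input is inert data: no rows match.
safe = conn.execute("SELECT * FROM users WHERE email = ?", (evil,)).fetchall()
# Concatenated into the SQL string, it rewrites the WHERE clause.
unsafe = conn.execute(f"SELECT * FROM users WHERE email = '{evil}'").fetchall()
print(len(safe), len(unsafe))  # 0 1
```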
---
## ORM Patterns
### Prisma
**Schema definition**
```prisma
model User {
id Int @id @default(autoincrement())
email String @unique
name String?
posts Post[]
createdAt DateTime @default(now())
}
model Post {
id Int @id @default(autoincrement())
title String
author User @relation(fields: [authorId], references: [id])
authorId Int
}
```
**Migrations**: `npx prisma migrate dev --name add_user_email`
**Query API**: `prisma.user.findMany({ where: { email: { contains: '@' } }, include: { posts: true } })`
**Raw SQL escape hatch**: ``prisma.$queryRaw`SELECT * FROM users WHERE id = ${userId}` ``
### Drizzle
**Schema-first definition**
```typescript
export const users = pgTable('users', {
id: serial('id').primaryKey(),
email: varchar('email', { length: 255 }).notNull().unique(),
name: text('name'),
createdAt: timestamp('created_at').defaultNow(),
});
```
**Query builder**: `db.select().from(users).where(eq(users.email, email))`
**Migrations**: `npx drizzle-kit generate:pg` then `npx drizzle-kit push:pg`
### TypeORM
**Entity decorators**
```typescript
@Entity()
export class User {
@PrimaryGeneratedColumn()
id: number;
@Column({ unique: true })
email: string;
@OneToMany(() => Post, post => post.author)
posts: Post[];
}
```
**Repository pattern**: `userRepo.find({ where: { email }, relations: ['posts'] })`
**Migrations**: `npx typeorm migration:generate -n AddUserEmail`
### SQLAlchemy
**Declarative models**
```python
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
email = Column(String(255), unique=True, nullable=False)
name = Column(String(255))
posts = relationship('Post', back_populates='author')
```
**Session management**: Always use `with Session() as session:` context manager
**Alembic migrations**: `alembic revision --autogenerate -m "add user email"`
> See references/orm_patterns.md for side-by-side comparisons and migration workflows per ORM.
---
## Data Integrity
### Constraint Strategy
- **Primary keys** — every table must have one; prefer surrogate keys (serial/UUID)
- **Foreign keys** — enforce referential integrity; define ON DELETE behavior explicitly
- **UNIQUE constraints** — for business-level uniqueness (email, slug, API key)
- **CHECK constraints** — validate ranges, enums, and business rules at the DB level
- **NOT NULL** — default to NOT NULL; make nullable only when genuinely optional
### Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use Case |
|-------|-----------|-------------------|-------------|----------|
| READ UNCOMMITTED | Yes | Yes | Yes | Never recommended |
| READ COMMITTED | No | Yes | Yes | Default for PostgreSQL, general OLTP |
| REPEATABLE READ | No | No | Yes (InnoDB: No) | Financial calculations |
| SERIALIZABLE | No | No | No | Critical consistency (billing, inventory) |
### Deadlock Prevention
1. **Consistent lock ordering** — always acquire locks in the same table/row order
2. **Short transactions** — minimize time between first lock and commit
3. **Advisory locks** — use `pg_advisory_lock()` for application-level coordination
4. **Retry logic** — catch deadlock errors and retry with exponential backoff
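Point 4 can be sketched as a small retry wrapper (Python; `sqlite3.OperationalError` stands in here for whatever deadlock error class your driver raises):

```python
import random
import sqlite3
import time

def run_with_retry(txn, max_attempts=5, base_delay=0.05):
    """Run a transactional callable, retrying on lock/deadlock errors
    with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return txn()
        except sqlite3.OperationalError:  # e.g. "database is locked"
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.01))

# Simulate a transaction that deadlocks twice before succeeding.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise sqlite3.OperationalError("database is locked")
    return "committed"

result = run_with_retry(flaky_txn)
print(result)  # committed
```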
---
## Backup & Restore
### PostgreSQL
```bash
# Full backup
pg_dump -Fc --no-owner dbname > backup.dump
# Restore
pg_restore -d dbname --clean --no-owner backup.dump
# Point-in-time recovery: configure WAL archiving + restore_command
```
### MySQL
```bash
# Full backup
mysqldump --single-transaction --routines --triggers dbname > backup.sql
# Restore
mysql dbname < backup.sql
# Binary log for PITR: mysqlbinlog --start-datetime="2025-01-01 00:00:00" binlog.000001
```
### SQLite
```bash
# Backup (safe with concurrent reads)
sqlite3 dbname ".backup backup.db"
```
### Backup Best Practices
- **Automate** — cron or systemd timer, never manual-only
- **Test restores** — untested backups are not backups
- **Offsite copies** — S3, GCS, or separate region
- **Retention policy** — daily for 7 days, weekly for 4 weeks, monthly for 12 months
- **Monitor backup size and duration** — sudden changes signal issues
---
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| `SELECT *` | Transfers unnecessary data, breaks on schema changes | Explicit column list |
| Missing indexes on FK columns | Slow JOINs and cascading deletes | Add indexes on all foreign keys |
| N+1 queries | 1 + N round trips to database | Eager loading or batch queries |
| Implicit type coercion | `WHERE id = '123'` prevents index use | Match types in predicates |
| No connection pooling | Exhausts connections under load | PgBouncer, ProxySQL, or ORM pool |
| Unbounded queries | No LIMIT risks returning millions of rows | Always paginate |
| Storing money as FLOAT | Rounding errors | Use `DECIMAL(19,4)` or integer cents |
| God tables | One table with 50+ columns | Normalize or use vertical partitioning |
| Soft deletes everywhere | Complicates every query with `WHERE deleted_at IS NULL` | Archive tables or event sourcing |
| Raw string concatenation | SQL injection | Parameterized queries always |
---
## Cross-References
| Skill | Relationship |
|-------|-------------|
| **database-designer** | Schema architecture, normalization analysis, ERD generation |
| **database-schema-designer** | Visual ERD modeling, relationship mapping |
| **migration-architect** | Complex multi-step migration orchestration |
| **api-design-reviewer** | Ensuring API endpoints align with query patterns |
| **observability-platform** | Query performance monitoring, slow query alerts |


@@ -0,0 +1,330 @@
# Query Optimization Guide
How to read EXPLAIN plans, choose the right index types, understand query plan operators, and configure connection pooling.
---
## Reading EXPLAIN Plans
### PostgreSQL — EXPLAIN ANALYZE
```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE status = 'paid' ORDER BY created_at DESC LIMIT 20;
```
**Sample output:**
```
Limit (cost=0.43..12.87 rows=20 width=128) (actual time=0.052..0.089 rows=20 loops=1)
-> Index Scan Backward using idx_orders_status_created on orders (cost=0.43..4521.33 rows=7284 width=128) (actual time=0.051..0.085 rows=20 loops=1)
Index Cond: (status = 'paid')
Buffers: shared hit=4
Planning Time: 0.156 ms
Execution Time: 0.112 ms
```
**Key fields to check:**
| Field | What it tells you |
|-------|-------------------|
| `cost` | Estimated startup..total cost (arbitrary units) |
| `rows` | Estimated row count at that node |
| `actual time` | Real wall-clock time in milliseconds |
| `actual rows` | Real row count — compare against estimate |
| `Buffers: shared hit` | Pages read from cache (good) |
| `Buffers: shared read` | Pages read from disk (slow) |
| `loops` | How many times the node executed |
**Red flags:**
- `Seq Scan` on a large table with a WHERE clause — missing index
- `actual rows` >> `rows` (estimated) — stale statistics, run `ANALYZE`
- `Nested Loop` with high loop count — consider hash join or add index
- `Sort` with `external merge` — not enough `work_mem`, spilling to disk
- `Buffers: shared read` much higher than `shared hit` — cold cache or table too large for memory
### MySQL — EXPLAIN FORMAT=JSON
```sql
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE status = 'paid' ORDER BY created_at DESC LIMIT 20;
```
**Key fields:**
- `query_block.select_id` — identifies subqueries
- `table.access_type` — `ALL` (full scan), `ref` (index lookup), `range`, `index`, `const`
- `table.rows_examined_per_scan` — how many rows the engine reads
- `table.using_index` — covering index (no table lookup needed)
- `table.attached_condition` — the WHERE filter applied
**Access types ranked (best to worst):**
`system` > `const` > `eq_ref` > `ref` > `range` > `index` > `ALL`
---
## Index Types
### B-tree (default)
The workhorse index. Supports equality, range, prefix, and ORDER BY operations.
**Best for:** `=`, `<`, `>`, `<=`, `>=`, `BETWEEN`, `LIKE 'prefix%'`, `ORDER BY`, `MIN()`, `MAX()`
```sql
CREATE INDEX idx_orders_created ON orders (created_at);
```
**Composite B-tree:** Column order matters. The index is useful for queries that filter on a leftmost prefix of the indexed columns.
```sql
-- This index serves: WHERE status = ... AND created_at > ...
-- Also serves: WHERE status = ...
-- Does NOT serve: WHERE created_at > ... (without status)
CREATE INDEX idx_orders_status_created ON orders (status, created_at);
```
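The leftmost-prefix rule can be checked empirically. A sketch using `EXPLAIN QUERY PLAN` via Python's built-in `sqlite3`, which follows the same B-tree prefix rule (schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT,
                         created_at TEXT, total REAL);
    CREATE INDEX idx_orders_status_created ON orders (status, created_at);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN is the human-readable detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Leftmost column alone: index is usable.
p1 = plan("SELECT total FROM orders WHERE status = 'paid'")
# Leftmost prefix plus range on the second column: index is usable.
p2 = plan("SELECT total FROM orders "
          "WHERE status = 'paid' AND created_at > '2025-01-01'")
# Second column alone skips the leftmost prefix: full scan.
p3 = plan("SELECT total FROM orders WHERE created_at > '2025-01-01'")
```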
### Hash
Equality-only lookups. Faster than B-tree for exact matches but no range support.
**Best for:** `=` lookups on high-cardinality columns
```sql
-- PostgreSQL
CREATE INDEX idx_sessions_token ON sessions USING hash (token);
```
**Limitations:** No range queries, no ORDER BY, not WAL-logged before PostgreSQL 10.
### GIN (Generalized Inverted Index)
For multi-valued data: arrays, JSONB, full-text search vectors.
```sql
-- JSONB containment
CREATE INDEX idx_products_tags ON products USING gin (tags);
-- Query: SELECT * FROM products WHERE tags @> '["sale"]';
-- Full-text search
CREATE INDEX idx_articles_search ON articles USING gin (to_tsvector('english', title || ' ' || body));
```
### GiST (Generalized Search Tree)
For geometric, range, and proximity data.
```sql
-- Range type (e.g., date ranges)
CREATE INDEX idx_bookings_period ON bookings USING gist (during);
-- Query: SELECT * FROM bookings WHERE during && '[2025-01-01, 2025-01-31]';
-- PostGIS geometry
CREATE INDEX idx_locations_geom ON locations USING gist (geom);
```
### BRIN (Block Range INdex)
Tiny index for naturally ordered data (e.g., time-series append-only tables).
```sql
CREATE INDEX idx_events_created ON events USING brin (created_at);
```
**Best for:** Large tables where the indexed column correlates with physical row order. Much smaller than B-tree but less precise.
### Partial Index
Index only rows matching a condition. Smaller and faster for targeted queries.
```sql
-- Only index active users (skip millions of inactive)
CREATE INDEX idx_users_active_email ON users (email) WHERE status = 'active';
```
### Covering Index (INCLUDE)
Store extra columns in the index to avoid table lookups (index-only scans).
```sql
-- PostgreSQL 11+
CREATE INDEX idx_orders_status ON orders (status) INCLUDE (total, created_at);
-- Query can be answered entirely from the index:
-- SELECT total, created_at FROM orders WHERE status = 'paid';
```
### Expression Index
Index the result of a function or expression.
```sql
CREATE INDEX idx_users_lower_email ON users (LOWER(email));
-- Query: SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
```
---
## Query Plan Operators
### Scan operators
| Operator | Description | Performance |
|----------|-------------|-------------|
| **Seq Scan** | Full table scan, reads every row | Slow on large tables |
| **Index Scan** | B-tree lookup + table fetch | Fast for selective queries |
| **Index Only Scan** | Reads only the index (covering) | Fastest for covered queries |
| **Bitmap Index Scan** | Builds a bitmap of matching pages | Good for medium selectivity |
| **Bitmap Heap Scan** | Fetches pages identified by bitmap | Pairs with bitmap index scan |
### Join operators
| Operator | Description | Best when |
|----------|-------------|-----------|
| **Nested Loop** | For each outer row, scan inner | Small outer set, indexed inner |
| **Hash Join** | Build hash table on inner, probe with outer | Medium-large sets, no index |
| **Merge Join** | Merge two sorted inputs | Both inputs already sorted |
### Other operators
| Operator | Description |
|----------|-------------|
| **Sort** | Sorts rows (may spill to disk if work_mem exceeded) |
| **Hash Aggregate** | GROUP BY using hash table |
| **Group Aggregate** | GROUP BY on pre-sorted input |
| **Limit** | Stops after N rows |
| **Materialize** | Caches subquery results in memory |
| **Gather / Gather Merge** | Collects results from parallel workers |
---
## Connection Pooling
### Why pool connections?
Each database connection consumes memory (5-10 MB in PostgreSQL). Without pooling:
- Application creates a new connection per request (slow: TCP + TLS + auth)
- Under load, connection count spikes past `max_connections`
- Database OOM or connection refused errors
### PgBouncer (PostgreSQL)
The standard external connection pooler for PostgreSQL.
**Modes:**
- **Session** — connection assigned for entire client session (safest, least efficient)
- **Transaction** — connection returned to pool after each transaction (recommended)
- **Statement** — connection returned after each statement (cannot use transactions)
```ini
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 300
```
**Sizing formula:**
```
default_pool_size = num_cpu_cores * 2 + effective_spindle_count
```
For SSDs, start with `num_cpu_cores * 2` (typically 4-16 connections is optimal).
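To make the checkout/checkin mechanics concrete, here is a toy fixed-size pool in Python (illustration only — real applications should rely on PgBouncer or their driver's built-in pool):

```python
import queue
import sqlite3

class SimplePool:
    """Fixed-size pool: pre-opens N connections, blocks when exhausted."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    def acquire(self, timeout=5):
        # Fail fast under pressure instead of opening unbounded connections;
        # raises queue.Empty when no connection frees up within the timeout.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)

pool = SimplePool(lambda: sqlite3.connect(":memory:"), size=2)
c1, c2 = pool.acquire(), pool.acquire()   # pool now exhausted
pool.release(c1)                          # return one connection
c3 = pool.acquire()                       # reuses c1 instead of reconnecting
```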
### ProxySQL (MySQL)
```ini
mysql_servers = ({ address="127.0.0.1", port=3306, hostgroup=0, max_connections=100 })
mysql_query_rules = ({ rule_id=1, match_pattern="^SELECT.*FOR UPDATE", destination_hostgroup=0 })
```
### Application-Level Pooling
Most ORMs and drivers include built-in pooling:
| Platform | Pool Configuration |
|----------|--------------------|
| **node-postgres** | `new Pool({ max: 20, idleTimeoutMillis: 30000 })` |
| **SQLAlchemy** | `create_engine(url, pool_size=20, max_overflow=5)` |
| **HikariCP (Java)** | `maximumPoolSize=20, minimumIdle=5, idleTimeout=300000` |
| **Prisma** | `connection_limit=20` in connection string |
### Pool Sizing Guidelines
| Metric | Guideline |
|--------|-----------|
| **Minimum** | Number of always-active background workers |
| **Maximum** | 2-4x CPU cores for OLTP; lower for OLAP |
| **Idle timeout** | 30-300 seconds (reclaim unused connections) |
| **Connection timeout** | 3-10 seconds (fail fast under pressure) |
| **Queue size** | 2-5x pool max (buffer bursts before rejecting) |
**Warning:** More connections does not mean better performance. Beyond the optimal point (usually 20-50), contention on locks, CPU, and I/O causes throughput to decrease.
---
## Statistics and Maintenance
### PostgreSQL
```sql
-- Update statistics for the query planner
ANALYZE orders;
ANALYZE; -- All tables
-- Check table bloat and dead tuples
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
-- Identify unused indexes
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;
```
### MySQL
```sql
-- Update statistics
ANALYZE TABLE orders;
-- Check index usage
SELECT * FROM sys.schema_unused_indexes;
SELECT * FROM sys.schema_redundant_indexes;
-- Identify long-running queries
SELECT * FROM information_schema.processlist WHERE time > 10;
```
---
## Performance Checklist
Before deploying any query to production:
1. Run `EXPLAIN ANALYZE` and verify no unexpected sequential scans
2. Check that estimated rows are within 10x of actual rows
3. Verify index usage on all WHERE, JOIN, and ORDER BY columns
4. Ensure LIMIT is present for user-facing list queries
5. Confirm parameterized queries (no string concatenation)
6. Test with production-like data volume (not just 10 rows)
7. Monitor query time in application metrics after deployment
8. Set up slow query log alerting (> 100ms for OLTP, > 5s for reports)
---
## Quick Reference: When to Use Which Index
| Query Pattern | Index Type |
|--------------|-----------|
| `WHERE col = value` | B-tree or Hash |
| `WHERE col > value` | B-tree |
| `WHERE col LIKE 'prefix%'` | B-tree |
| `WHERE col LIKE '%substring%'` | GIN (full-text) or trigram |
| `WHERE jsonb_col @> '{...}'` | GIN |
| `WHERE array_col && ARRAY[...]` | GIN |
| `WHERE range_col && '[a,b]'` | GiST |
| `WHERE ST_DWithin(geom, ...)` | GiST |
| `WHERE col = value` (append-only) | BRIN |
| `WHERE col = value AND status = 'active'` | Partial B-tree |
| `SELECT a, b WHERE c = value` | Covering (INCLUDE) |


@@ -0,0 +1,451 @@
# ORM Patterns Reference
Side-by-side comparison of Prisma, Drizzle, TypeORM, and SQLAlchemy patterns for common database operations.
---
## Schema Definition
### Prisma (schema.prisma)
```prisma
model User {
id Int @id @default(autoincrement())
email String @unique
name String?
role Role @default(USER)
posts Post[]
profile Profile?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@index([email])
@@map("users")
}
model Post {
id Int @id @default(autoincrement())
title String
body String?
published Boolean @default(false)
author User @relation(fields: [authorId], references: [id], onDelete: Cascade)
authorId Int
tags Tag[]
createdAt DateTime @default(now())
@@index([authorId])
@@index([published, createdAt])
@@map("posts")
}
enum Role {
USER
ADMIN
MODERATOR
}
```
### Drizzle (schema.ts)
```typescript
import { pgTable, serial, varchar, text, boolean, timestamp, integer, pgEnum } from 'drizzle-orm/pg-core';
export const roleEnum = pgEnum('role', ['USER', 'ADMIN', 'MODERATOR']);
export const users = pgTable('users', {
id: serial('id').primaryKey(),
email: varchar('email', { length: 255 }).notNull().unique(),
name: varchar('name', { length: 255 }),
role: roleEnum('role').default('USER').notNull(),
createdAt: timestamp('created_at').defaultNow().notNull(),
updatedAt: timestamp('updated_at').defaultNow().notNull(),
});
export const posts = pgTable('posts', {
id: serial('id').primaryKey(),
title: varchar('title', { length: 255 }).notNull(),
body: text('body'),
published: boolean('published').default(false).notNull(),
authorId: integer('author_id').notNull().references(() => users.id, { onDelete: 'cascade' }),
createdAt: timestamp('created_at').defaultNow().notNull(),
}, (table) => ({
authorIdx: index('idx_posts_author').on(table.authorId),
publishedIdx: index('idx_posts_published').on(table.published, table.createdAt),
}));
```
### TypeORM (entities)
```typescript
import { Entity, PrimaryGeneratedColumn, Column, ManyToOne, OneToMany, CreateDateColumn, UpdateDateColumn, Index } from 'typeorm';
export enum Role { USER = 'USER', ADMIN = 'ADMIN', MODERATOR = 'MODERATOR' }
@Entity('users')
export class User {
@PrimaryGeneratedColumn()
id: number;
@Column({ unique: true })
@Index()
email: string;
@Column({ nullable: true })
name: string;
@Column({ type: 'enum', enum: Role, default: Role.USER })
role: Role;
@OneToMany(() => Post, post => post.author)
posts: Post[];
@CreateDateColumn()
createdAt: Date;
@UpdateDateColumn()
updatedAt: Date;
}
@Entity('posts')
@Index(['published', 'createdAt'])
export class Post {
@PrimaryGeneratedColumn()
id: number;
@Column()
title: string;
@Column({ nullable: true, type: 'text' })
body: string;
@Column({ default: false })
published: boolean;
@ManyToOne(() => User, user => user.posts, { onDelete: 'CASCADE' })
author: User;
@Column()
authorId: number;
@CreateDateColumn()
createdAt: Date;
}
```
### SQLAlchemy (models.py)
```python
import enum
from datetime import datetime
from sqlalchemy import Column, Integer, String, Text, Boolean, DateTime, Enum, ForeignKey, Index
from sqlalchemy.orm import relationship, DeclarativeBase
class Base(DeclarativeBase):
pass
class Role(enum.Enum):
USER = "USER"
ADMIN = "ADMIN"
MODERATOR = "MODERATOR"
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True, autoincrement=True)
email = Column(String(255), unique=True, nullable=False, index=True)
name = Column(String(255), nullable=True)
role = Column(Enum(Role), default=Role.USER, nullable=False)
posts = relationship('Post', back_populates='author', cascade='all, delete-orphan')
created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow, nullable=False)
class Post(Base):
__tablename__ = 'posts'
__table_args__ = (
Index('idx_posts_published', 'published', 'created_at'),
)
id = Column(Integer, primary_key=True, autoincrement=True)
title = Column(String(255), nullable=False)
body = Column(Text, nullable=True)
published = Column(Boolean, default=False, nullable=False)
author_id = Column(Integer, ForeignKey('users.id', ondelete='CASCADE'), nullable=False, index=True)
author = relationship('User', back_populates='posts')
created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
```
---
## CRUD Operations
### Create
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.create({ data: { email, name } })` |
| **Drizzle** | `await db.insert(users).values({ email, name }).returning()` |
| **TypeORM** | `await userRepo.save(userRepo.create({ email, name }))` |
| **SQLAlchemy** | `session.add(User(email=email, name=name)); session.commit()` |
### Read (with filter)
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.findMany({ where: { role: 'ADMIN' }, orderBy: { createdAt: 'desc' } })` |
| **Drizzle** | `await db.select().from(users).where(eq(users.role, 'ADMIN')).orderBy(desc(users.createdAt))` |
| **TypeORM** | `await userRepo.find({ where: { role: Role.ADMIN }, order: { createdAt: 'DESC' } })` |
| **SQLAlchemy** | `session.query(User).filter(User.role == Role.ADMIN).order_by(User.created_at.desc()).all()` |
### Update
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.update({ where: { id }, data: { name } })` |
| **Drizzle** | `await db.update(users).set({ name }).where(eq(users.id, id))` |
| **TypeORM** | `await userRepo.update(id, { name })` |
| **SQLAlchemy** | `session.query(User).filter(User.id == id).update({User.name: name}); session.commit()` |
### Delete
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.delete({ where: { id } })` |
| **Drizzle** | `await db.delete(users).where(eq(users.id, id))` |
| **TypeORM** | `await userRepo.delete(id)` |
| **SQLAlchemy** | `session.query(User).filter(User.id == id).delete(); session.commit()` |
---
## Relations and Eager Loading
### Prisma — include / select
```typescript
// Eager load posts with user
const user = await prisma.user.findUnique({
where: { id: 1 },
include: { posts: { where: { published: true }, orderBy: { createdAt: 'desc' } } },
});
// Nested create
await prisma.user.create({
data: {
email: 'new@example.com',
posts: { create: [{ title: 'First post' }] },
},
});
```
### Drizzle — relational queries
```typescript
const result = await db.query.users.findFirst({
where: eq(users.id, 1),
with: { posts: { where: eq(posts.published, true), orderBy: [desc(posts.createdAt)] } },
});
```
### TypeORM — relations / query builder
```typescript
// FindOptions
const user = await userRepo.findOne({ where: { id: 1 }, relations: ['posts'] });
// QueryBuilder for complex joins
const result = await userRepo.createQueryBuilder('u')
.leftJoinAndSelect('u.posts', 'p', 'p.published = :pub', { pub: true })
.where('u.id = :id', { id: 1 })
.getOne();
```
### SQLAlchemy — joinedload / selectinload
```python
from sqlalchemy.orm import joinedload, selectinload
# Eager load in one JOIN query
user = session.query(User).options(joinedload(User.posts)).filter(User.id == 1).first()
# Eager load in a separate IN query (better for collections)
users = session.query(User).options(selectinload(User.posts)).all()
```
---
## Raw SQL Escape Hatches
Every ORM provides an escape hatch for executing raw SQL when the query builder falls short:
| ORM | Pattern |
|-----|---------|
| **Prisma** | `` prisma.$queryRaw`SELECT * FROM users WHERE id = ${id}` `` |
| **Drizzle** | `` db.execute(sql`SELECT * FROM users WHERE id = ${id}`) `` |
| **TypeORM** | `dataSource.query('SELECT * FROM users WHERE id = $1', [id])` |
| **SQLAlchemy** | `session.execute(text('SELECT * FROM users WHERE id = :id'), {'id': id})` |
Always use parameterized queries in raw SQL to prevent injection.
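To make the injection risk concrete, here is a minimal sketch using Python's stdlib `sqlite3` (the table and payload are invented for the demo); the same placeholder discipline applies to every driver listed above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# A classic payload: closes the string literal and appends a tautology
user_input = "' OR '1'='1"

# Parameterized: the driver passes the value out-of-band, so the quotes
# inside it are plain data and match nothing
safe_count = len(conn.execute(
    "SELECT * FROM users WHERE email = ?", (user_input,)
).fetchall())

# Concatenated: the payload rewrites the WHERE clause and matches every row
unsafe_count = len(conn.execute(
    "SELECT * FROM users WHERE email = '" + user_input + "'"
).fetchall())
```

`safe_count` is 0 while `unsafe_count` is 1: the concatenated form returned rows the caller never asked for.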
---
## Transaction Patterns
### Prisma
```typescript
await prisma.$transaction(async (tx) => {
const user = await tx.user.create({ data: { email } });
await tx.post.create({ data: { title: 'Welcome', authorId: user.id } });
});
```
### Drizzle
```typescript
await db.transaction(async (tx) => {
const [user] = await tx.insert(users).values({ email }).returning();
await tx.insert(posts).values({ title: 'Welcome', authorId: user.id });
});
```
### TypeORM
```typescript
await dataSource.transaction(async (manager) => {
const user = await manager.save(User, { email });
await manager.save(Post, { title: 'Welcome', authorId: user.id });
});
```
### SQLAlchemy
```python
with Session() as session:
try:
user = User(email=email)
session.add(user)
session.flush() # Get user.id without committing
session.add(Post(title='Welcome', author_id=user.id))
session.commit()
except Exception:
session.rollback()
raise
```
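The four snippets above share one shape: group the writes, commit on success, roll back on any error. A bare DB-API sketch with Python's stdlib `sqlite3` (invented `accounts` table) shows the rollback guarantee the ORMs build on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # First write of the transfer
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
    (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
    if bal < 0:
        raise ValueError("insufficient funds")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
    conn.commit()
except ValueError:
    # Both writes are undone together
    conn.rollback()

(balance_1,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
```

After the rollback, `balance_1` is back to 100; neither half of the failed transfer is visible.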
---
## Migration Workflows
### Prisma
```bash
# Generate migration from schema changes
npx prisma migrate dev --name add_posts_table
# Apply in production
npx prisma migrate deploy
# Reset database (dev only)
npx prisma migrate reset
# Generate client after schema change
npx prisma generate
```
**Files:** `prisma/migrations/<timestamp>_<name>/migration.sql`
### Drizzle
```bash
# Generate migration SQL from schema diff
npx drizzle-kit generate:pg
# Push schema directly (dev only, no migration files)
npx drizzle-kit push:pg
# Apply migrations
npx drizzle-kit migrate
```
**Files:** `drizzle/<timestamp>_<name>.sql`
### TypeORM
```bash
# Auto-generate migration from entity changes
npx typeorm migration:generate src/migrations/AddPostsTable -d data-source.ts
# Create empty migration
npx typeorm migration:create src/migrations/CustomMigration
# Run pending migrations
npx typeorm migration:run -d data-source.ts
# Revert last migration
npx typeorm migration:revert -d data-source.ts
```
**Files:** `src/migrations/<timestamp>-<Name>.ts`
### SQLAlchemy (Alembic)
```bash
# Initialize Alembic
alembic init alembic
# Auto-generate migration from model changes
alembic revision --autogenerate -m "add posts table"
# Apply all pending
alembic upgrade head
# Revert one step
alembic downgrade -1
# Show current state
alembic current
```
**Files:** `alembic/versions/<hash>_<slug>.py`
---
## N+1 Prevention Cheat Sheet
| ORM | Lazy (N+1 risk) | Eager (fixed) |
|-----|-----------------|---------------|
| **Prisma** | Separate queries per row (no `include`) | `include: { posts: true }` |
| **Drizzle** | Separate queries | `with: { posts: true }` |
| **TypeORM** | `@ManyToOne(() => ..., { lazy: true })` | `relations: ['posts']` or `leftJoinAndSelect` |
| **SQLAlchemy** | Default `lazy='select'` | `joinedload()` or `selectinload()` |
**Rule of thumb:** If you access a relation inside a loop, you have an N+1 problem. Always load relations before the loop.
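To see the rule of thumb in numbers, this sketch uses Python's stdlib `sqlite3` and its `set_trace_callback` hook to count statements (schema and data are invented): the loop issues one query per user, while the JOIN issues one in total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob'), (3, 'cid');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 2, 'p2'), (3, 3, 'p3');
""")

statements = []
conn.set_trace_callback(statements.append)  # record every executed statement

# N+1: one query for the users, then one per user inside the loop
users = conn.execute("SELECT id, name FROM users").fetchall()
for uid, _name in users:
    conn.execute("SELECT title FROM posts WHERE user_id = ?", (uid,)).fetchall()
n_plus_one = len(statements)  # 1 + 3

statements.clear()
# Eager: a single JOIN fetches users and posts together
conn.execute(
    "SELECT u.name, p.title FROM users u LEFT JOIN posts p ON p.user_id = u.id"
).fetchall()
eager = len(statements)
```

With 3 users the lazy loop already runs 4 statements; at 10,000 users it runs 10,001, while the eager version still runs one.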
---
## Connection Pooling
### Prisma
```
# In .env or connection string
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=20&pool_timeout=10"
```
### Drizzle (with node-postgres)
```typescript
import { Pool } from 'pg';
const pool = new Pool({ max: 20, idleTimeoutMillis: 30000, connectionTimeoutMillis: 5000 });
const db = drizzle(pool);
```
### TypeORM
```typescript
const dataSource = new DataSource({
type: 'postgres',
extra: { max: 20, idleTimeoutMillis: 30000 },
});
```
### SQLAlchemy
```python
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host/db', pool_size=20, max_overflow=5, pool_timeout=30)
```
---
## Best Practices Summary
1. **Always use migrations** — never modify production schemas by hand
2. **Eager load relations** — prevent N+1 in every list/collection query
3. **Use transactions** — group related writes to maintain consistency
4. **Parameterize raw SQL** — never concatenate user input into queries
5. **Connection pooling** — configure pool size matching your workload
6. **Index foreign keys** — ORMs often skip this; add manually if needed
7. **Review generated SQL** — enable query logging in development to catch inefficiencies
8. **Type-safe queries** — leverage TypeScript/Python typing for compile-time checks
9. **Separate read/write models** — use views or read replicas for heavy reporting queries
10. **Test migrations both ways** — always verify that down migrations actually reverse up migrations


@@ -0,0 +1,406 @@
# SQL Query Patterns Reference
Common query patterns for everyday database operations. All examples use PostgreSQL syntax with dialect notes where they differ.
---
## JOIN Patterns
### INNER JOIN — matching rows in both tables
```sql
SELECT u.name, o.id AS order_id, o.total
FROM users u
INNER JOIN orders o ON o.user_id = u.id
WHERE o.status = 'paid';
```
### LEFT JOIN — all rows from left, matching from right
```sql
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name;
```
Returns users even if they have zero orders.
### Self JOIN — comparing rows within the same table
```sql
-- Find employees who earn more than their manager
SELECT e.name AS employee, m.name AS manager, e.salary, m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```
### CROSS JOIN — every combination (cartesian product)
```sql
-- Generate a calendar grid
SELECT d.date, s.shift_name
FROM dates d
CROSS JOIN shifts s;
```
Use intentionally. Accidental cartesian joins are a performance killer.
### LATERAL JOIN (PostgreSQL) — correlated subquery as a table
```sql
-- Top 3 orders per user
SELECT u.name, top_orders.*
FROM users u
CROSS JOIN LATERAL (
SELECT id, total FROM orders
WHERE user_id = u.id
ORDER BY total DESC LIMIT 3
) top_orders;
```
MySQL 8.0.14+ also supports `LATERAL` derived tables; on older versions, emulate with `ROW_NUMBER()` in a subquery.
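As a portable illustration of the same top-N-per-group result, here is a `ROW_NUMBER()` version run through Python's stdlib `sqlite3` (window functions need SQLite 3.25+; the table and data are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO orders VALUES
        (1, 1, 50), (2, 1, 90), (3, 1, 20), (4, 1, 70),
        (5, 2, 10), (6, 2, 30);
""")

# Rank each user's orders by total, then keep the top 2 per user
rows = conn.execute("""
    SELECT user_id, id, total FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY total DESC
        ) AS rn
        FROM orders
    ) AS ranked
    WHERE rn <= 2
    ORDER BY user_id, total DESC
""").fetchall()
```

`rows` contains the two largest orders for each user, equivalent to what the `LATERAL` form returns.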
---
## Common Table Expressions (CTEs)
### Basic CTE — readable subquery
```sql
WITH active_users AS (
SELECT id, name, email
FROM users
WHERE last_login > CURRENT_DATE - INTERVAL '30 days'
)
SELECT au.name, COUNT(o.id) AS recent_orders
FROM active_users au
JOIN orders o ON o.user_id = au.id
GROUP BY au.name;
```
### Multiple CTEs — chaining transformations
```sql
WITH monthly_revenue AS (
SELECT DATE_TRUNC('month', created_at) AS month, SUM(total) AS revenue
FROM orders WHERE status = 'paid'
GROUP BY 1
),
growth AS (
SELECT month, revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month) * 100, 1) AS growth_pct
FROM monthly_revenue
)
SELECT * FROM growth ORDER BY month;
```
### Recursive CTE — hierarchical data
```sql
-- Organization tree
WITH RECURSIVE org_tree AS (
-- Base case: top-level managers
SELECT id, name, manager_id, 0 AS depth
FROM employees WHERE manager_id IS NULL
UNION ALL
-- Recursive case: subordinates
SELECT e.id, e.name, e.manager_id, ot.depth + 1
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth, name;
```
### Recursive CTE — path traversal
```sql
-- Category breadcrumb
WITH RECURSIVE breadcrumb AS (
SELECT id, name, parent_id, name::TEXT AS path
FROM categories WHERE id = 42
UNION ALL
SELECT c.id, c.name, c.parent_id, c.name || ' > ' || b.path
FROM categories c
JOIN breadcrumb b ON c.id = b.parent_id
)
SELECT path FROM breadcrumb WHERE parent_id IS NULL;
```
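The org-tree CTE above runs unchanged on SQLite, so it can be smoke-tested from Python's stdlib `sqlite3` (three invented employees):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'alice', NULL),
        (2, 'bob',   1),
        (3, 'carol', 2);
""")

rows = conn.execute("""
    WITH RECURSIVE org_tree AS (
        -- Base case: top-level managers
        SELECT id, name, manager_id, 0 AS depth
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive case: subordinates, one level deeper each pass
        SELECT e.id, e.name, e.manager_id, ot.depth + 1
        FROM employees e
        JOIN org_tree ot ON e.manager_id = ot.id
    )
    SELECT name, depth FROM org_tree ORDER BY depth
""").fetchall()
```

Each row carries its depth in the hierarchy: alice at 0, bob at 1, carol at 2.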
---
## Window Functions
### ROW_NUMBER — assign unique rank per partition
```sql
SELECT *, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```
### RANK and DENSE_RANK — handle ties
```sql
-- RANK: 1, 2, 2, 4 (skips after tie)
-- DENSE_RANK: 1, 2, 2, 3 (no skip)
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;
```
### Running total and moving average
```sql
SELECT date, amount,
SUM(amount) OVER (ORDER BY date) AS running_total,
AVG(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d
FROM daily_revenue;
```
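A quick check of the running total with Python's stdlib `sqlite3` (window functions require SQLite 3.25+, bundled with CPython 3.8+; the revenue figures are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (date TEXT PRIMARY KEY, amount INTEGER);
    INSERT INTO daily_revenue VALUES
        ('2025-01-01', 100), ('2025-01-02', 50), ('2025-01-03', 200);
""")

# Cumulative sum: each row's window covers all rows up to and including it
rows = conn.execute("""
    SELECT date, amount,
           SUM(amount) OVER (ORDER BY date) AS running_total
    FROM daily_revenue
""").fetchall()
running = [r[2] for r in rows]
```

`running` accumulates day by day: 100, then 150, then 350.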
### LAG / LEAD — access adjacent rows
```sql
SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_day,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;
```
### NTILE — divide into buckets
```sql
-- Split customers into quartiles by total spend
SELECT customer_id, total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_summary;
```
### FIRST_VALUE / LAST_VALUE
```sql
SELECT department_id, name, salary,
FIRST_VALUE(name) OVER (PARTITION BY department_id ORDER BY salary DESC) AS highest_paid
FROM employees;
```
---
## Subquery Patterns
### EXISTS — correlated existence check
```sql
-- Users who have placed at least one order
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
```
### NOT EXISTS — safer than NOT IN for NULLs
```sql
-- Users who have never ordered
SELECT u.* FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
```
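The NULL trap is easy to reproduce. In this `sqlite3` sketch (invented data, one order with a NULL `user_id`), `NOT IN` silently returns nothing while `NOT EXISTS` finds the user who never ordered:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, NULL);
""")

# NOT IN: the NULL in the subquery makes every comparison UNKNOWN,
# so the predicate is never true and no rows come back
not_in = conn.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT user_id FROM orders)"
).fetchall()

# NOT EXISTS: the NULL row simply never matches, so user 2 is found
not_exists = conn.execute(
    "SELECT id FROM users WHERE NOT EXISTS "
    "(SELECT 1 FROM orders o WHERE o.user_id = users.id)"
).fetchall()
```

`not_in` is empty even though user 2 has no orders; `not_exists` correctly returns user 2.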
### Scalar subquery — single value
```sql
SELECT name, salary,
salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;
```
### Derived table — subquery in FROM
```sql
SELECT dept, avg_salary
FROM (
SELECT department_id AS dept, AVG(salary) AS avg_salary
FROM employees GROUP BY department_id
) dept_avg
WHERE avg_salary > 100000;
```
---
## Aggregation Patterns
### GROUP BY with HAVING
```sql
-- Departments with more than 10 employees
SELECT department_id, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 10;
```
### GROUPING SETS — multiple grouping levels
```sql
SELECT region, product_category, SUM(revenue)
FROM sales
GROUP BY GROUPING SETS (
(region, product_category),
(region),
(product_category),
()
);
```
### ROLLUP — hierarchical subtotals
```sql
SELECT region, city, SUM(revenue)
FROM sales
GROUP BY ROLLUP (region, city);
-- Produces: (region, city), (region), ()
```
### CUBE — all combinations
```sql
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY CUBE (region, product);
```
### FILTER clause (PostgreSQL) — conditional aggregation
```sql
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = 'paid') AS paid,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled,
SUM(total) FILTER (WHERE status = 'paid') AS paid_revenue
FROM orders;
```
MySQL/SQL Server equivalent: `SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END)`.
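The portable `CASE` form can be sanity-checked with Python's stdlib `sqlite3` (invented orders; SQLite 3.30+ also supports `FILTER`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL);
    INSERT INTO orders VALUES
        (1, 'paid', 10), (2, 'paid', 20), (3, 'cancelled', 5);
""")

# One pass over the table computes the overall count, the paid count,
# and the paid revenue via conditional aggregation
total, paid, paid_revenue = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END),
           SUM(CASE WHEN status = 'paid' THEN total ELSE 0 END)
    FROM orders
""").fetchone()
```

Three orders total, two paid, 30.0 in paid revenue, all from a single scan.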
---
## UPSERT Patterns
### PostgreSQL — ON CONFLICT
```sql
INSERT INTO user_settings (user_id, key, value, updated_at)
VALUES (1, 'theme', 'dark', NOW())
ON CONFLICT (user_id, key)
DO UPDATE SET value = EXCLUDED.value, updated_at = EXCLUDED.updated_at;
```
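SQLite (3.24+) adopted the same `ON CONFLICT ... DO UPDATE` syntax, which makes the pattern easy to verify from Python's stdlib `sqlite3` (invented settings table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_settings (
        user_id INTEGER, key TEXT, value TEXT,
        PRIMARY KEY (user_id, key)
    )
""")

upsert = """
    INSERT INTO user_settings (user_id, key, value) VALUES (?, ?, ?)
    ON CONFLICT (user_id, key) DO UPDATE SET value = excluded.value
"""
conn.execute(upsert, (1, "theme", "light"))
conn.execute(upsert, (1, "theme", "dark"))  # second call updates in place

rows = conn.execute("SELECT user_id, key, value FROM user_settings").fetchall()
```

After both calls there is still a single row, holding the last value written.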
### MySQL — ON DUPLICATE KEY
```sql
INSERT INTO user_settings (user_id, key_name, value, updated_at)
VALUES (1, 'theme', 'dark', NOW())
ON DUPLICATE KEY UPDATE value = VALUES(value), updated_at = VALUES(updated_at);
-- MySQL 8.0.20+ deprecates VALUES() in this position; prefer a row alias:
-- INSERT ... VALUES (...) AS new ON DUPLICATE KEY UPDATE value = new.value;
```
### SQL Server — MERGE
```sql
MERGE INTO user_settings AS target
USING (VALUES (1, 'theme', 'dark')) AS source (user_id, key_name, value)
ON target.user_id = source.user_id AND target.key_name = source.key_name
WHEN MATCHED THEN UPDATE SET value = source.value, updated_at = GETDATE()
WHEN NOT MATCHED THEN INSERT (user_id, key_name, value, updated_at)
VALUES (source.user_id, source.key_name, source.value, GETDATE());
```
---
## JSON Operations
### PostgreSQL JSONB
```sql
-- Extract field
SELECT data->>'name' AS name FROM products WHERE data->>'category' = 'electronics';
-- Array contains
SELECT * FROM products WHERE data->'tags' ? 'sale';
-- Update nested field
UPDATE products SET data = jsonb_set(data, '{price}', '29.99') WHERE id = 1;
-- Aggregate into JSON array
SELECT jsonb_agg(jsonb_build_object('id', id, 'name', name)) FROM users;
```
### MySQL JSON
```sql
-- Extract field
SELECT JSON_EXTRACT(data, '$.name') AS name FROM products;
-- Shorthand: SELECT data->>'$.name'
-- Search in array
SELECT * FROM products WHERE JSON_CONTAINS(data->"$.tags", '"sale"');
-- Update
UPDATE products SET data = JSON_SET(data, '$.price', 29.99) WHERE id = 1;
```
---
## Pagination Patterns
### Offset pagination (simple but slow for deep pages)
```sql
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 40;
```
### Keyset pagination (fast, requires ordered unique column)
```sql
-- Page after the last seen id
SELECT * FROM products WHERE id > :last_seen_id ORDER BY id LIMIT 20;
```
### Keyset with composite sort
```sql
SELECT * FROM products
WHERE (created_at, id) < (:last_created_at, :last_id)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```
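A minimal keyset walk with Python's stdlib `sqlite3` (seven invented products, page size 3): each page starts strictly after the last id seen, so no offset scan is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(i, f"product-{i}") for i in range(1, 8)],
)

def page(last_seen_id, size=3):
    # The index on id lets this seek directly to the page start
    return conn.execute(
        "SELECT id FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, size),
    ).fetchall()

page1 = [r[0] for r in page(0)]
page2 = [r[0] for r in page(page1[-1])]  # resume after the last id seen
```

`page1` holds ids 1-3 and `page2` holds 4-6; deep pages cost the same as the first one.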
---
## Bulk Operations
### Batch INSERT
```sql
INSERT INTO events (type, payload, created_at) VALUES
('click', '{"page": "/home"}', NOW()),
('view', '{"page": "/pricing"}', NOW()),
('click', '{"page": "/signup"}', NOW());
```
### Batch UPDATE with VALUES
```sql
UPDATE products AS p SET price = v.price
FROM (VALUES (1, 29.99), (2, 49.99), (3, 9.99)) AS v(id, price)
WHERE p.id = v.id;
```
### DELETE with subquery
```sql
DELETE FROM sessions
WHERE user_id IN (SELECT id FROM users WHERE deleted_at IS NOT NULL);
```
### COPY (PostgreSQL bulk load)
```sql
COPY products (name, price, category) FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);
```
---
## Utility Patterns
### Generate series (PostgreSQL)
```sql
-- Fill date gaps
SELECT d::date FROM generate_series('2025-01-01'::date, '2025-12-31', '1 day') d;
```
### Deduplicate rows
```sql
DELETE FROM events a USING events b
WHERE a.id > b.id AND a.user_id = b.user_id AND a.event_type = b.event_type
AND a.created_at = b.created_at;
```
### Pivot (manual)
```sql
SELECT user_id,
SUM(CASE WHEN month = 1 THEN revenue END) AS jan,
SUM(CASE WHEN month = 2 THEN revenue END) AS feb,
SUM(CASE WHEN month = 3 THEN revenue END) AS mar
FROM monthly_revenue
GROUP BY user_id;
```
### Conditional INSERT (skip if exists)
```sql
INSERT INTO tags (name) SELECT 'new-tag'
WHERE NOT EXISTS (SELECT 1 FROM tags WHERE name = 'new-tag');
```


@@ -0,0 +1,442 @@
#!/usr/bin/env python3
"""
Migration Generator
Generates database migration file templates (up/down) from natural-language
schema change descriptions.
Supported operations:
- Add column, drop column, rename column
- Add table, drop table, rename table
- Add index, drop index
- Add constraint, drop constraint
- Change column type
Usage:
python migration_generator.py --change "add email_verified boolean to users" --dialect postgres
python migration_generator.py --change "rename column name to full_name in customers" --format alembic
python migration_generator.py --change "add index on orders(status, created_at)" --output 001_add_index.sql
python migration_generator.py --change "create table reviews with id, user_id, rating, body" --json
"""
import argparse
import json
import re
import sys
import textwrap
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List, Optional, Tuple
@dataclass
class Migration:
"""A generated migration with up and down scripts."""
description: str
dialect: str
format: str
up: str
down: str
warnings: List[str]
def to_dict(self):
return asdict(self)
# ---------------------------------------------------------------------------
# Change parsers — extract structured intent from natural language
# ---------------------------------------------------------------------------
def parse_add_column(desc: str) -> Optional[dict]:
"""Parse: add <column> <type> to <table>"""
m = re.match(
r'add\s+(?:column\s+)?(\w+)\s+(\w[\w(),.]*)\s+(?:to|on)\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "add_column", "column": m.group(1), "type": m.group(2), "table": m.group(3)}
return None
def parse_drop_column(desc: str) -> Optional[dict]:
"""Parse: drop/remove <column> from <table>"""
m = re.match(
r'(?:drop|remove)\s+(?:column\s+)?(\w+)\s+from\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "drop_column", "column": m.group(1), "table": m.group(2)}
return None
def parse_rename_column(desc: str) -> Optional[dict]:
"""Parse: rename column <old> to <new> in <table>"""
m = re.match(
r'rename\s+column\s+(\w+)\s+to\s+(\w+)\s+in\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "rename_column", "old": m.group(1), "new": m.group(2), "table": m.group(3)}
return None
def parse_add_table(desc: str) -> Optional[dict]:
"""Parse: create table <name> with <col1>, <col2>, ..."""
m = re.match(
r'create\s+table\s+(\w+)\s+with\s+(.+)',
desc, re.IGNORECASE,
)
if m:
cols = [c.strip() for c in m.group(2).split(",")]
return {"op": "add_table", "table": m.group(1), "columns": cols}
return None
def parse_drop_table(desc: str) -> Optional[dict]:
"""Parse: drop table <name>"""
m = re.match(r'drop\s+table\s+(\w+)', desc, re.IGNORECASE)
if m:
return {"op": "drop_table", "table": m.group(1)}
return None
def parse_add_index(desc: str) -> Optional[dict]:
"""Parse: add index on <table>(<col1>, <col2>)"""
m = re.match(
r'add\s+(?:unique\s+)?index\s+(?:on\s+)?(\w+)\s*\(([^)]+)\)',
desc, re.IGNORECASE,
)
if m:
unique = "unique" in desc.lower()
cols = [c.strip() for c in m.group(2).split(",")]
return {"op": "add_index", "table": m.group(1), "columns": cols, "unique": unique}
return None
def parse_change_type(desc: str) -> Optional[dict]:
"""Parse: change <column> type to <type> in <table>"""
m = re.match(
r'change\s+(?:column\s+)?(\w+)\s+type\s+to\s+(\w[\w(),.]*)\s+in\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "change_type", "column": m.group(1), "new_type": m.group(2), "table": m.group(3)}
return None
PARSERS = [
parse_add_column,
parse_drop_column,
parse_rename_column,
parse_add_table,
parse_drop_table,
parse_add_index,
parse_change_type,
]
def parse_change(desc: str) -> Optional[dict]:
for parser in PARSERS:
result = parser(desc)
if result:
return result
return None
# ---------------------------------------------------------------------------
# SQL generators per dialect
# ---------------------------------------------------------------------------
TYPE_MAP = {
"boolean": {"postgres": "BOOLEAN", "mysql": "TINYINT(1)", "sqlite": "INTEGER", "sqlserver": "BIT"},
"text": {"postgres": "TEXT", "mysql": "TEXT", "sqlite": "TEXT", "sqlserver": "NVARCHAR(MAX)"},
"integer": {"postgres": "INTEGER", "mysql": "INT", "sqlite": "INTEGER", "sqlserver": "INT"},
"int": {"postgres": "INTEGER", "mysql": "INT", "sqlite": "INTEGER", "sqlserver": "INT"},
"serial": {"postgres": "SERIAL", "mysql": "INT AUTO_INCREMENT", "sqlite": "INTEGER", "sqlserver": "INT IDENTITY(1,1)"},
"varchar": {"postgres": "VARCHAR(255)", "mysql": "VARCHAR(255)", "sqlite": "TEXT", "sqlserver": "NVARCHAR(255)"},
"timestamp": {"postgres": "TIMESTAMP", "mysql": "DATETIME", "sqlite": "TEXT", "sqlserver": "DATETIME2"},
"uuid": {"postgres": "UUID", "mysql": "CHAR(36)", "sqlite": "TEXT", "sqlserver": "UNIQUEIDENTIFIER"},
"json": {"postgres": "JSONB", "mysql": "JSON", "sqlite": "TEXT", "sqlserver": "NVARCHAR(MAX)"},
"decimal": {"postgres": "DECIMAL(19,4)", "mysql": "DECIMAL(19,4)", "sqlite": "REAL", "sqlserver": "DECIMAL(19,4)"},
"float": {"postgres": "DOUBLE PRECISION", "mysql": "DOUBLE", "sqlite": "REAL", "sqlserver": "FLOAT"},
}
def map_type(type_name: str, dialect: str) -> str:
"""Map a generic type name to a dialect-specific type."""
key = type_name.lower().rstrip("()")
if key in TYPE_MAP and dialect in TYPE_MAP[key]:
return TYPE_MAP[key][dialect]
return type_name.upper()
def gen_add_column(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
col_type = map_type(change["type"], dialect)
table = change["table"]
col = change["column"]
up = f"ALTER TABLE {table} ADD COLUMN {col} {col_type};"
down = f"ALTER TABLE {table} DROP COLUMN {col};"
return up, down, []
def gen_drop_column(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
col = change["column"]
up = f"ALTER TABLE {table} DROP COLUMN {col};"
down = f"-- WARNING: Cannot fully reverse DROP COLUMN. Provide the original type.\nALTER TABLE {table} ADD COLUMN {col} TEXT;"
return up, down, ["Down migration uses TEXT as placeholder. Replace with the original column type."]
def gen_rename_column(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
old, new = change["old"], change["new"]
warnings = []
if dialect == "postgres":
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
elif dialect == "mysql":
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
elif dialect == "sqlite":
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
warnings.append("SQLite RENAME COLUMN requires version 3.25.0+.")
elif dialect == "sqlserver":
up = f"EXEC sp_rename '{table}.{old}', '{new}', 'COLUMN';"
down = f"EXEC sp_rename '{table}.{new}', '{old}', 'COLUMN';"
else:
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
return up, down, warnings
def gen_add_table(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
cols = change["columns"]
col_defs = []
has_id = False
for col in cols:
col = col.strip()
if col.lower() == "id":
has_id = True
if dialect == "postgres":
col_defs.append(" id SERIAL PRIMARY KEY")
elif dialect == "mysql":
col_defs.append(" id INT AUTO_INCREMENT PRIMARY KEY")
elif dialect == "sqlite":
col_defs.append(" id INTEGER PRIMARY KEY AUTOINCREMENT")
elif dialect == "sqlserver":
col_defs.append(" id INT IDENTITY(1,1) PRIMARY KEY")
else:
# Check if type is specified (e.g., "rating int")
parts = col.split()
if len(parts) >= 2:
col_defs.append(f" {parts[0]} {map_type(parts[1], dialect)}")
else:
col_defs.append(f" {col} TEXT")
cols_sql = ",\n".join(col_defs)
up = f"CREATE TABLE {table} (\n{cols_sql}\n);"
down = f"DROP TABLE {table};"
warnings = []
if not has_id:
warnings.append("Table has no explicit primary key. Consider adding an 'id' column.")
return up, down, warnings
def gen_drop_table(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
up = f"DROP TABLE {table};"
down = f"-- WARNING: Cannot reverse DROP TABLE without original DDL.\nCREATE TABLE {table} (id INTEGER PRIMARY KEY);"
return up, down, ["Down migration is a placeholder. Replace with the original CREATE TABLE statement."]
def gen_add_index(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
cols = change["columns"]
unique = "UNIQUE " if change.get("unique") else ""
idx_name = f"idx_{table}_{'_'.join(cols)}"
if dialect == "postgres":
up = f"CREATE {unique}INDEX CONCURRENTLY {idx_name} ON {table} ({', '.join(cols)});"
else:
up = f"CREATE {unique}INDEX {idx_name} ON {table} ({', '.join(cols)});"
down = f"DROP INDEX {idx_name};" if dialect != "mysql" else f"DROP INDEX {idx_name} ON {table};"
warnings = []
if dialect == "postgres":
warnings.append("CONCURRENTLY cannot run inside a transaction. Run outside migration transaction.")
return up, down, warnings
def gen_change_type(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
col = change["column"]
new_type = map_type(change["new_type"], dialect)
warnings = ["Down migration uses TEXT as placeholder. Replace with the original column type."]
if dialect == "postgres":
up = f"ALTER TABLE {table} ALTER COLUMN {col} TYPE {new_type};"
down = f"ALTER TABLE {table} ALTER COLUMN {col} TYPE TEXT;"
elif dialect == "mysql":
up = f"ALTER TABLE {table} MODIFY COLUMN {col} {new_type};"
down = f"ALTER TABLE {table} MODIFY COLUMN {col} TEXT;"
elif dialect == "sqlserver":
up = f"ALTER TABLE {table} ALTER COLUMN {col} {new_type};"
down = f"ALTER TABLE {table} ALTER COLUMN {col} NVARCHAR(MAX);"
else:
up = f"-- SQLite does not support ALTER COLUMN. Recreate the table."
down = f"-- SQLite does not support ALTER COLUMN. Recreate the table."
warnings.append("SQLite requires table recreation for type changes.")
return up, down, warnings
GENERATORS = {
"add_column": gen_add_column,
"drop_column": gen_drop_column,
"rename_column": gen_rename_column,
"add_table": gen_add_table,
"drop_table": gen_drop_table,
"add_index": gen_add_index,
"change_type": gen_change_type,
}
# ---------------------------------------------------------------------------
# Format wrappers
# ---------------------------------------------------------------------------
def wrap_sql(up: str, down: str, description: str) -> Tuple[str, str]:
"""Wrap as plain SQL migration files."""
header = f"-- Migration: {description}\n-- Generated: {datetime.now().isoformat()}\n\n"
return header + "-- Up\n" + up, header + "-- Down\n" + down
def wrap_prisma(up: str, down: str, description: str) -> Tuple[str, str]:
"""Format as Prisma migration SQL (Prisma uses raw SQL in migration.sql)."""
header = f"-- Migration: {description}\n-- Format: Prisma (migration.sql)\n\n"
return header + up, header + "-- Rollback\n" + down
def wrap_alembic(up: str, down: str, description: str) -> Tuple[str, str]:
"""Format as Alembic Python migration."""
slug = re.sub(r'\W+', '_', description.lower())[:40]
revision = datetime.now().strftime("%Y%m%d%H%M")
template = textwrap.dedent(f'''\
"""
{description}
Revision ID: {revision}
"""
from alembic import op
import sqlalchemy as sa
revision = '{revision}'
down_revision = None # Set to previous revision
def upgrade():
op.execute("""
{textwrap.indent(up, " ")}
""")
def downgrade():
op.execute("""
{textwrap.indent(down, " ")}
""")
''')
return template, ""
FORMATTERS = {
"sql": wrap_sql,
"prisma": wrap_prisma,
"alembic": wrap_alembic,
}
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Generate database migration templates from change descriptions.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Supported change descriptions:
"add email_verified boolean to users"
"drop column legacy_flag from accounts"
"rename column name to full_name in customers"
"create table reviews with id, user_id, rating int, body text"
"drop table temp_imports"
"add index on orders(status, created_at)"
"add unique index on users(email)"
"change email type to varchar in users"
Examples:
%(prog)s --change "add phone varchar to users" --dialect postgres
%(prog)s --change "create table reviews with id, user_id, rating int, body" --format prisma
%(prog)s --change "add index on orders(status)" --output migrations/001.sql --json
""",
)
parser.add_argument("--change", required=True, help="Natural-language description of the schema change")
parser.add_argument("--dialect", choices=["postgres", "mysql", "sqlite", "sqlserver"],
default="postgres", help="Target database dialect (default: postgres)")
parser.add_argument("--format", choices=["sql", "prisma", "alembic"], default="sql",
dest="fmt", help="Output format (default: sql)")
parser.add_argument("--output", help="Write migration to file instead of stdout")
parser.add_argument("--json", action="store_true", dest="json_output", help="Output as JSON")
args = parser.parse_args()
change = parse_change(args.change)
if not change:
print(f"Error: Could not parse change description: '{args.change}'", file=sys.stderr)
print("Run with --help to see supported patterns.", file=sys.stderr)
sys.exit(1)
gen_fn = GENERATORS.get(change["op"])
if not gen_fn:
print(f"Error: No generator for operation '{change['op']}'", file=sys.stderr)
sys.exit(1)
up, down, warnings = gen_fn(change, args.dialect)
fmt_fn = FORMATTERS[args.fmt]
up_formatted, down_formatted = fmt_fn(up, down, args.change)
migration = Migration(
description=args.change,
dialect=args.dialect,
format=args.fmt,
up=up_formatted,
down=down_formatted,
warnings=warnings,
)
if args.json_output:
print(json.dumps(migration.to_dict(), indent=2))
else:
if args.output:
with open(args.output, "w") as f:
f.write(migration.up)
print(f"Migration written to {args.output}")
if migration.down:
down_path = args.output.replace(".sql", "_down.sql")
with open(down_path, "w") as f:
f.write(migration.down)
print(f"Rollback written to {down_path}")
else:
print(migration.up)
if migration.down:
print("\n" + "=" * 40 + " ROLLBACK " + "=" * 40 + "\n")
print(migration.down)
if warnings:
print("\nWarnings:")
for w in warnings:
print(f" - {w}")
if __name__ == "__main__":
main()


@@ -0,0 +1,348 @@
#!/usr/bin/env python3
"""
SQL Query Optimizer — Static Analysis
Analyzes SQL queries for common performance issues:
- SELECT * usage
- Missing WHERE clauses on UPDATE/DELETE
- Cartesian joins (missing JOIN conditions)
- Subqueries in SELECT list
- Missing LIMIT on unbounded SELECTs
- Function calls on indexed columns (non-sargable)
- LIKE with leading wildcard
- ORDER BY RAND()
- UNION instead of UNION ALL
- NOT IN with subquery (NULL-unsafe)
Usage:
python query_optimizer.py --query "SELECT * FROM users"
python query_optimizer.py --query queries.sql --dialect postgres
python query_optimizer.py --query "SELECT * FROM orders" --json
"""
import argparse
import json
import os
import re
import sys
from dataclasses import dataclass, asdict
from typing import List, Optional
@dataclass
class Issue:
"""A single optimization issue found in a query."""
severity: str # critical, warning, info
rule: str
message: str
suggestion: str
line: Optional[int] = None
@dataclass
class QueryAnalysis:
"""Analysis result for one SQL query."""
query: str
issues: List[Issue]
score: int # 0-100, higher is better
def to_dict(self):
return {
"query": self.query[:200] + ("..." if len(self.query) > 200 else ""),
"issues": [asdict(i) for i in self.issues],
"issue_count": len(self.issues),
"score": self.score,
}
# ---------------------------------------------------------------------------
# Rule checkers
# ---------------------------------------------------------------------------
def check_select_star(sql: str) -> Optional[Issue]:
"""Detect SELECT * usage."""
if re.search(r'\bSELECT\s+\*(?:\s|,|$)', sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="select-star",
message="SELECT * transfers unnecessary data and breaks on schema changes.",
suggestion="List only the columns you need: SELECT col1, col2, ...",
)
return None
def check_missing_where(sql: str) -> Optional[Issue]:
"""Detect UPDATE/DELETE without WHERE."""
upper = sql.upper().strip()
for keyword in ("UPDATE", "DELETE"):
if upper.startswith(keyword) and "WHERE" not in upper:
return Issue(
severity="critical",
rule="missing-where",
message=f"{keyword} without WHERE affects every row in the table.",
suggestion=f"Add a WHERE clause to restrict the {keyword} scope.",
)
return None
def check_cartesian_join(sql: str) -> Optional[Issue]:
"""Detect comma-separated tables without explicit JOIN or WHERE join condition."""
upper = sql.upper()
if "SELECT" not in upper:
return None
from_match = re.search(r'\bFROM\s+(.+?)(?:\bWHERE\b|\bGROUP\b|\bORDER\b|\bLIMIT\b|\bHAVING\b|;|$)',
sql, re.IGNORECASE | re.DOTALL)
if not from_match:
return None
from_clause = from_match.group(1)
# Skip if explicit JOINs are used
if re.search(r'\bJOIN\b', from_clause, re.IGNORECASE):
return None
# Count comma-separated tables
tables = [t.strip() for t in from_clause.split(",") if t.strip()]
if len(tables) > 1 and "WHERE" not in upper:
return Issue(
severity="critical",
rule="cartesian-join",
message="Multiple tables in FROM without JOIN or WHERE creates a cartesian product.",
suggestion="Use explicit JOIN syntax with ON conditions.",
)
return None
def check_subquery_in_select(sql: str) -> Optional[Issue]:
"""Detect correlated subqueries in SELECT list."""
select_match = re.search(r'\bSELECT\b(.+?)\bFROM\b', sql, re.IGNORECASE | re.DOTALL)
if select_match:
select_clause = select_match.group(1)
if re.search(r'\(\s*SELECT\b', select_clause, re.IGNORECASE):
return Issue(
severity="warning",
rule="subquery-in-select",
message="Subquery in SELECT list executes once per row (correlated subquery).",
suggestion="Rewrite as a LEFT JOIN with aggregation.",
)
return None
def check_missing_limit(sql: str) -> Optional[Issue]:
"""Detect unbounded SELECT without LIMIT."""
upper = sql.upper().strip()
if not upper.startswith("SELECT"):
return None
# Skip aggregate-only queries: they return a single row regardless
if re.search(r'\b(COUNT|SUM|AVG|MIN|MAX)\s*\(', upper) and "GROUP BY" not in upper:
return None
if "LIMIT" not in upper and "FETCH" not in upper and "TOP " not in upper:
return Issue(
severity="info",
rule="missing-limit",
message="SELECT without LIMIT may return unbounded rows.",
suggestion="Add LIMIT to prevent returning excessive data.",
)
return None
def check_function_on_column(sql: str) -> Optional[Issue]:
"""Detect function calls on columns in WHERE (non-sargable)."""
where_match = re.search(r'\bWHERE\b(.+?)(?:\bGROUP\b|\bORDER\b|\bLIMIT\b|\bHAVING\b|;|$)',
sql, re.IGNORECASE | re.DOTALL)
if not where_match:
return None
where_clause = where_match.group(1)
non_sargable = re.search(
r'\b(YEAR|MONTH|DAY|DATE|UPPER|LOWER|TRIM|CAST|COALESCE|IFNULL|NVL)\s*\(',
where_clause, re.IGNORECASE
)
if non_sargable:
func = non_sargable.group(1).upper()
return Issue(
severity="warning",
rule="non-sargable",
message=f"Function {func}() on column in WHERE prevents index usage.",
suggestion="Rewrite to compare the raw column against transformed constants.",
)
return None
def check_leading_wildcard(sql: str) -> Optional[Issue]:
"""Detect LIKE '%...' patterns."""
if re.search(r"LIKE\s+'%", sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="leading-wildcard",
message="LIKE with leading wildcard prevents index usage.",
suggestion="Use full-text search (GIN index, FULLTEXT, FTS5) for substring matching.",
)
return None
def check_order_by_rand(sql: str) -> Optional[Issue]:
"""Detect ORDER BY RAND() / RANDOM()."""
if re.search(r'ORDER\s+BY\s+(RAND|RANDOM)\s*\(\)', sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="order-by-rand",
message="ORDER BY RAND() scans and sorts the entire table.",
suggestion="Use application-side random sampling or TABLESAMPLE.",
)
return None
def check_union_vs_union_all(sql: str) -> Optional[Issue]:
"""Detect UNION without ALL (unnecessary dedup)."""
if re.search(r'\bUNION\b(?!\s+ALL\b)', sql, re.IGNORECASE):
return Issue(
severity="info",
rule="union-without-all",
message="UNION performs deduplication sort; use UNION ALL if duplicates are acceptable.",
suggestion="Replace UNION with UNION ALL unless you specifically need deduplication.",
)
return None
def check_not_in_subquery(sql: str) -> Optional[Issue]:
"""Detect NOT IN (SELECT ...) which is NULL-unsafe."""
if re.search(r'\bNOT\s+IN\s*\(\s*SELECT\b', sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="not-in-subquery",
message="NOT IN with subquery returns no rows if any subquery result is NULL.",
suggestion="Use NOT EXISTS (SELECT 1 ...) instead.",
)
return None
ALL_CHECKS = [
check_select_star,
check_missing_where,
check_cartesian_join,
check_subquery_in_select,
check_missing_limit,
check_function_on_column,
check_leading_wildcard,
check_order_by_rand,
check_union_vs_union_all,
check_not_in_subquery,
]
# ---------------------------------------------------------------------------
# Analysis engine
# ---------------------------------------------------------------------------
def analyze_query(sql: str, dialect: str = "postgres") -> QueryAnalysis:
"""Run all checks against a single SQL query."""
issues: List[Issue] = []
for check_fn in ALL_CHECKS:
issue = check_fn(sql)
if issue:
issues.append(issue)
# Score: start at 100, deduct per severity
score = 100
for issue in issues:
if issue.severity == "critical":
score -= 25
elif issue.severity == "warning":
score -= 10
else:
score -= 5
score = max(0, score)
return QueryAnalysis(query=sql.strip(), issues=issues, score=score)
def split_queries(text: str) -> List[str]:
"""Split SQL text into individual statements."""
queries = []
for stmt in text.split(";"):
stmt = stmt.strip()
if stmt and len(stmt) > 5:
queries.append(stmt + ";")
return queries
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
SEVERITY_ICONS = {"critical": "[CRITICAL]", "warning": "[WARNING]", "info": "[INFO]"}
def format_text(analyses: List[QueryAnalysis]) -> str:
"""Format analysis results as human-readable text."""
lines = []
for i, analysis in enumerate(analyses, 1):
lines.append(f"{'='*60}")
lines.append(f"Query {i} (Score: {analysis.score}/100)")
lines.append(f" {analysis.query[:120]}{'...' if len(analysis.query) > 120 else ''}")
lines.append("")
if not analysis.issues:
lines.append(" No issues detected.")
for issue in analysis.issues:
icon = SEVERITY_ICONS.get(issue.severity, "")
lines.append(f" {icon} {issue.rule}: {issue.message}")
lines.append(f" -> {issue.suggestion}")
lines.append("")
return "\n".join(lines)
def format_json(analyses: List[QueryAnalysis]) -> str:
"""Format analysis results as JSON."""
return json.dumps(
{"analyses": [a.to_dict() for a in analyses], "total_queries": len(analyses)},
indent=2,
)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Analyze SQL queries for common performance issues.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --query "SELECT * FROM users"
%(prog)s --query queries.sql --dialect mysql
%(prog)s --query "DELETE FROM orders" --json
""",
)
parser.add_argument(
"--query", required=True,
help="SQL query string or path to a .sql file",
)
parser.add_argument(
"--dialect", choices=["postgres", "mysql", "sqlite", "sqlserver"],
default="postgres", help="SQL dialect (default: postgres)",
)
parser.add_argument(
"--json", action="store_true", dest="json_output",
help="Output results as JSON",
)
args = parser.parse_args()
# Determine if query is a file path or inline SQL
sql_text = args.query
if os.path.isfile(args.query):
with open(args.query, "r") as f:
sql_text = f.read()
queries = split_queries(sql_text)
if not queries:
# Treat the whole input as a single query
queries = [sql_text.strip()]
analyses = [analyze_query(q, args.dialect) for q in queries]
if args.json_output:
print(format_json(analyses))
else:
print(format_text(analyses))
if __name__ == "__main__":
main()
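The `not-in-subquery` rule is worth seeing live. A self-contained stdlib `sqlite3` snippet demonstrating why the check flags it: one NULL in the subquery result makes `NOT IN` return nothing, while the suggested `NOT EXISTS` rewrite behaves as intended:

```python
import sqlite3

# One order row has a NULL user_id, mimicking an optional FK.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (user_id INTEGER);
    INSERT INTO users (id) VALUES (1), (2), (3);
    INSERT INTO orders (user_id) VALUES (1), (NULL);
""")
not_in = conn.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT user_id FROM orders)"
).fetchall()
not_exists = conn.execute(
    "SELECT id FROM users WHERE NOT EXISTS "
    "(SELECT 1 FROM orders o WHERE o.user_id = users.id)"
).fetchall()
print(not_in)      # [] -- the NULL makes every NOT IN comparison unknown
print(not_exists)  # [(2,), (3,)]
```

Users 2 and 3 have no orders, yet `NOT IN` excludes them too, which is exactly the silent failure the rule warns about.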


@@ -0,0 +1,315 @@
#!/usr/bin/env python3
"""
Schema Explorer
Generates schema documentation from database introspection queries.
Outputs the introspection SQL and sample documentation templates
for PostgreSQL, MySQL, SQLite, and SQL Server.
Since this tool runs without a live database connection, it generates:
1. The introspection queries you need to run
2. Documentation templates from the results
3. Sample schema docs for common table patterns
Usage:
python schema_explorer.py --dialect postgres --tables all --format md
python schema_explorer.py --dialect mysql --tables users,orders --format json
python schema_explorer.py --dialect sqlite --tables all --json
"""
import argparse
import json
import sys
import textwrap
from dataclasses import dataclass, asdict
from typing import List, Optional, Dict
# ---------------------------------------------------------------------------
# Introspection query templates per dialect
# ---------------------------------------------------------------------------
INTROSPECTION_QUERIES: Dict[str, Dict[str, str]] = {
"postgres": {
"tables": textwrap.dedent("""\
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
ORDER BY table_name;"""),
"columns": textwrap.dedent("""\
SELECT table_name, column_name, data_type, character_maximum_length,
is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public' {table_filter}
ORDER BY table_name, ordinal_position;"""),
"primary_keys": textwrap.dedent("""\
SELECT tc.table_name, kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'PRIMARY KEY' AND tc.table_schema = 'public'
ORDER BY tc.table_name;"""),
"foreign_keys": textwrap.dedent("""\
SELECT tc.table_name, kcu.column_name,
ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
ORDER BY tc.table_name;"""),
"indexes": textwrap.dedent("""\
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;"""),
"table_sizes": textwrap.dedent("""\
SELECT relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS data_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;"""),
},
"mysql": {
"tables": textwrap.dedent("""\
SELECT table_name
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_type = 'BASE TABLE'
ORDER BY table_name;"""),
"columns": textwrap.dedent("""\
SELECT table_name, column_name, column_type, is_nullable,
column_default, column_key, extra
FROM information_schema.columns
WHERE table_schema = DATABASE() {table_filter}
ORDER BY table_name, ordinal_position;"""),
"foreign_keys": textwrap.dedent("""\
SELECT table_name, column_name, referenced_table_name, referenced_column_name
FROM information_schema.key_column_usage
WHERE table_schema = DATABASE() AND referenced_table_name IS NOT NULL
ORDER BY table_name;"""),
"indexes": textwrap.dedent("""\
SELECT table_name, index_name, non_unique, column_name, seq_in_index
FROM information_schema.statistics
WHERE table_schema = DATABASE()
ORDER BY table_name, index_name, seq_in_index;"""),
"table_sizes": textwrap.dedent("""\
SELECT table_name, table_rows,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_length DESC;"""),
},
"sqlite": {
"tables": textwrap.dedent("""\
SELECT name FROM sqlite_master
WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
ORDER BY name;"""),
"columns": textwrap.dedent("""\
-- Run for each table:
PRAGMA table_info({table_name});"""),
"foreign_keys": textwrap.dedent("""\
-- Run for each table:
PRAGMA foreign_key_list({table_name});"""),
"indexes": textwrap.dedent("""\
SELECT name, tbl_name, sql FROM sqlite_master
WHERE type = 'index'
ORDER BY tbl_name, name;"""),
"schema_dump": textwrap.dedent("""\
SELECT name, sql FROM sqlite_master
WHERE type = 'table'
ORDER BY name;"""),
},
"sqlserver": {
"tables": textwrap.dedent("""\
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME;"""),
"columns": textwrap.dedent("""\
SELECT t.name AS table_name, c.name AS column_name,
ty.name AS data_type, c.max_length, c.precision, c.scale,
c.is_nullable, dc.definition AS default_value
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
LEFT JOIN sys.default_constraints dc ON c.default_object_id = dc.object_id
{table_filter}
ORDER BY t.name, c.column_id;"""),
"foreign_keys": textwrap.dedent("""\
SELECT fk.name AS fk_name,
tp.name AS parent_table, cp.name AS parent_column,
tr.name AS referenced_table, cr.name AS referenced_column
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc ON fk.object_id = fkc.constraint_object_id
JOIN sys.tables tp ON fkc.parent_object_id = tp.object_id
JOIN sys.columns cp ON fkc.parent_object_id = cp.object_id AND fkc.parent_column_id = cp.column_id
JOIN sys.tables tr ON fkc.referenced_object_id = tr.object_id
JOIN sys.columns cr ON fkc.referenced_object_id = cr.object_id AND fkc.referenced_column_id = cr.column_id
ORDER BY tp.name;"""),
"indexes": textwrap.dedent("""\
SELECT t.name AS table_name, i.name AS index_name,
i.type_desc, i.is_unique, c.name AS column_name,
ic.key_ordinal
FROM sys.indexes i
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.tables t ON i.object_id = t.object_id
WHERE i.name IS NOT NULL
ORDER BY t.name, i.name, ic.key_ordinal;"""),
},
}
# ---------------------------------------------------------------------------
# Documentation generators
# ---------------------------------------------------------------------------
SAMPLE_TABLES = {
"users": {
"columns": [
{"name": "id", "type": "SERIAL / INT", "nullable": "NO", "default": "auto", "notes": "Primary key"},
{"name": "email", "type": "VARCHAR(255)", "nullable": "NO", "default": "-", "notes": "Unique, indexed"},
{"name": "name", "type": "VARCHAR(255)", "nullable": "YES", "default": "NULL", "notes": "Display name"},
{"name": "password_hash", "type": "VARCHAR(255)", "nullable": "NO", "default": "-", "notes": "bcrypt hash"},
{"name": "created_at", "type": "TIMESTAMP", "nullable": "NO", "default": "NOW()", "notes": ""},
{"name": "updated_at", "type": "TIMESTAMP", "nullable": "NO", "default": "NOW()", "notes": ""},
],
"indexes": ["PRIMARY KEY (id)", "UNIQUE INDEX (email)"],
"foreign_keys": [],
},
"orders": {
"columns": [
{"name": "id", "type": "SERIAL / INT", "nullable": "NO", "default": "auto", "notes": "Primary key"},
{"name": "user_id", "type": "INTEGER", "nullable": "NO", "default": "-", "notes": "FK -> users.id"},
{"name": "status", "type": "VARCHAR(50)", "nullable": "NO", "default": "'pending'", "notes": "pending/paid/shipped/cancelled"},
{"name": "total", "type": "DECIMAL(19,4)", "nullable": "NO", "default": "0", "notes": "Order total in cents"},
{"name": "created_at", "type": "TIMESTAMP", "nullable": "NO", "default": "NOW()", "notes": ""},
],
"indexes": ["PRIMARY KEY (id)", "INDEX (user_id)", "INDEX (status, created_at)"],
"foreign_keys": ["user_id -> users.id ON DELETE CASCADE"],
},
}
def generate_md(dialect: str, tables: List[str]) -> str:
"""Generate markdown schema documentation."""
lines = [f"# Database Schema Documentation ({dialect.upper()})\n"]
lines.append("Generated by sql-database-assistant schema_explorer.\n")
# Introspection queries section
lines.append("## Introspection Queries\n")
lines.append("Run these queries against your database to extract schema information:\n")
queries = INTROSPECTION_QUERIES.get(dialect, {})
for qname, qsql in queries.items():
table_filter = ""
if "all" not in tables:
tlist = ", ".join(f"'{t}'" for t in tables)
# SQL Server's sys.columns template has no WHERE clause and aliases the
# table name as t.name, so it needs a full WHERE instead of an AND.
table_filter = f"WHERE t.name IN ({tlist})" if dialect == "sqlserver" else f"AND table_name IN ({tlist})"
qsql = qsql.replace("{table_filter}", table_filter)
qsql = qsql.replace("{table_name}", tables[0] if tables and tables[0] != "all" else "TABLE_NAME")
lines.append(f"### {qname.replace('_', ' ').title()}\n")
lines.append(f"```sql\n{qsql}\n```\n")
# Sample documentation
lines.append("## Sample Table Documentation\n")
lines.append("Below is an example of the documentation format produced from query results:\n")
show_tables = tables if "all" not in tables else list(SAMPLE_TABLES.keys())
for tname in show_tables:
sample = SAMPLE_TABLES.get(tname)
if not sample:
lines.append(f"### {tname}\n")
lines.append("_No sample data available. Run introspection queries above._\n")
continue
lines.append(f"### {tname}\n")
lines.append("| Column | Type | Nullable | Default | Notes |")
lines.append("|--------|------|----------|---------|-------|")
for col in sample["columns"]:
lines.append(f"| {col['name']} | {col['type']} | {col['nullable']} | {col['default']} | {col['notes']} |")
lines.append("")
if sample["indexes"]:
lines.append("**Indexes:** " + ", ".join(sample["indexes"]))
if sample["foreign_keys"]:
lines.append("**Foreign Keys:** " + ", ".join(sample["foreign_keys"]))
lines.append("")
return "\n".join(lines)
def generate_json_output(dialect: str, tables: List[str]) -> dict:
"""Generate JSON schema documentation."""
queries = INTROSPECTION_QUERIES.get(dialect, {})
processed = {}
for qname, qsql in queries.items():
table_filter = ""
if "all" not in tables:
tlist = ", ".join(f"'{t}'" for t in tables)
# SQL Server's sys.columns template has no WHERE clause (see generate_md).
table_filter = f"WHERE t.name IN ({tlist})" if dialect == "sqlserver" else f"AND table_name IN ({tlist})"
processed[qname] = qsql.replace("{table_filter}", table_filter).replace(
"{table_name}", tables[0] if tables and tables[0] != "all" else "TABLE_NAME"
)
show_tables = tables if "all" not in tables else list(SAMPLE_TABLES.keys())
sample_docs = {}
for tname in show_tables:
sample = SAMPLE_TABLES.get(tname)
if sample:
sample_docs[tname] = sample
return {
"dialect": dialect,
"requested_tables": tables,
"introspection_queries": processed,
"sample_documentation": sample_docs,
"instructions": "Run the introspection queries against your database, then use the results to populate documentation in the sample format shown.",
}
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Generate schema documentation from database introspection.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --dialect postgres --tables all --format md
%(prog)s --dialect mysql --tables users,orders --format json
%(prog)s --dialect sqlite --tables all --json
""",
)
parser.add_argument(
"--dialect", required=True, choices=["postgres", "mysql", "sqlite", "sqlserver"],
help="Target database dialect",
)
parser.add_argument(
"--tables", default="all",
help="Comma-separated table names or 'all' (default: all)",
)
parser.add_argument(
"--format", choices=["md", "json"], default="md", dest="fmt",
help="Output format (default: md)",
)
parser.add_argument(
"--json", action="store_true", dest="json_output",
help="Output as JSON (overrides --format)",
)
args = parser.parse_args()
tables = [t.strip() for t in args.tables.split(",")]
if args.json_output or args.fmt == "json":
result = generate_json_output(args.dialect, tables)
print(json.dumps(result, indent=2))
else:
print(generate_md(args.dialect, tables))
if __name__ == "__main__":
main()
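As a quick sanity check of the SQLite introspection path, the `PRAGMA table_info` query this script emits can be run directly against an in-memory database; each returned row is a `(cid, name, type, notnull, dflt_value, pk)` tuple:

```python
import sqlite3

# Build a tiny schema and introspect it the way the generated docs suggest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
rows = conn.execute("PRAGMA table_info(users)").fetchall()
# Keep (name, declared type, not-null flag, primary-key flag) per column.
cols = [(r[1], r[2], bool(r[3]), bool(r[5])) for r in rows]
print(cols)  # [('id', 'INTEGER', False, True), ('email', 'TEXT', True, False)]
```

Note that SQLite reports `notnull = 0` for an `INTEGER PRIMARY KEY` column unless `NOT NULL` is declared explicitly, which is why the docs template tracks nullability and key status as separate fields.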