# System Design Workflows

Step-by-step workflows for common system design tasks.

## Workflows Index

1. [System Design Interview Approach](#1-system-design-interview-approach)
2. [Capacity Planning Workflow](#2-capacity-planning-workflow)
3. [API Design Workflow](#3-api-design-workflow)
4. [Database Schema Design Workflow](#4-database-schema-design-workflow)
5. [Scalability Assessment Workflow](#5-scalability-assessment-workflow)
6. [Migration Planning Workflow](#6-migration-planning-workflow)

---

## 1. System Design Interview Approach

Use when designing a system from scratch or explaining architecture decisions.

### Step 1: Clarify Requirements (3-5 minutes)

**Functional requirements:**
- What are the core features?
- Who are the users?
- What actions can users take?

**Non-functional requirements:**
- Expected scale (users, requests/sec, data size)
- Latency requirements
- Availability requirements (99.9%? 99.99%?)
- Consistency requirements (strong? eventual?)

**Example questions to ask:**
```
- How many users? Daily active users?
- Read/write ratio?
- Data retention period?
- Geographic distribution?
- Peak vs average load?
```

### Step 2: Estimate Scale (2-3 minutes)

**Calculate key metrics:**
```
Users:        10M monthly active users
DAU:          1M daily active users
Requests:     100 req/user/day = 100M req/day
              ≈ 1,200 req/sec (avg; 100M / 86,400 s ≈ 1,157, rounded up)
              ≈ 3,600 req/sec (peak, 3x)

Storage:      1KB/request × 100M = 100GB/day
              ≈ 36TB/year

Bandwidth:    100GB/day ≈ 1.2 MB/sec (avg)
```
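The arithmetic above can be reproduced with a small script. This is a minimal sketch, assuming decimal units (1 KB = 1,000 bytes) and the 3x peak factor from the example; the function name `estimate_scale` is hypothetical. The unrounded results are about 1,157 req/sec average and 3,472 req/sec peak, which the example rounds up to 1,200 and 3,600.

```python
SECONDS_PER_DAY = 86_400


def estimate_scale(dau, req_per_user_per_day, bytes_per_request, peak_factor=3):
    """Back-of-envelope traffic, storage, and bandwidth estimate (decimal units)."""
    req_per_day = dau * req_per_user_per_day
    avg_rps = req_per_day / SECONDS_PER_DAY
    bytes_per_day = req_per_day * bytes_per_request
    return {
        "req_per_day": req_per_day,
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        "storage_gb_per_day": bytes_per_day / 1e9,
        "storage_tb_per_year": bytes_per_day * 365 / 1e12,
        "avg_bandwidth_mb_per_sec": bytes_per_day / SECONDS_PER_DAY / 1e6,
    }


# Inputs taken from the worked example: 1M DAU, 100 requests each, 1 KB per request.
est = estimate_scale(dau=1_000_000, req_per_user_per_day=100, bytes_per_request=1_000)
for name, value in est.items():
    print(f"{name}: {value:,.2f}")
```

Rounding up at each step, as the example does, keeps the estimate conservative.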

### Step 3: Design High-Level Architecture (5-10 minutes)

**Start with basic components:**
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Client  │────▶│   API    │────▶│ Database │
└──────────┘     └──────────┘     └──────────┘
```

**Add components as needed:**
- Load balancer for traffic distribution
- Cache for read-heavy workloads
- CDN for static content
- Message queue for async processing
- Search index for complex queries

### Step 4: Deep Dive into Components (10-15 minutes)

**For each major component, discuss:**
- Why this technology choice?
- How does it handle failures?
- How does it scale?
- What are the trade-offs?

### Step 5: Address Bottlenecks (5 minutes)

**Common bottlenecks:**
- Database read/write capacity
- Network bandwidth
- Single points of failure
- Hot spots in data distribution

**Solutions:**
- Caching (Redis, Memcached)
- Database sharding
- Read replicas
- CDN for static content
- Async processing for non-critical paths

---

## 2. Capacity Planning Workflow

Use when estimating infrastructure requirements for a new system or feature.

### Step 1: Gather Requirements

| Metric | Current | 6 months | 1 year |
|--------|---------|----------|--------|
| Monthly active users | | | |
| Peak concurrent users | | | |
| Requests per second | | | |
| Data storage (GB) | | | |
| Bandwidth (Mbps) | | | |

### Step 2: Calculate Compute Requirements

**Web/API servers:**
```
Peak RPS:           3,600
Requests per server: 500 (conservative)
Servers needed:     3,600 / 500 = 7.2 → 8 servers (round up)

With redundancy (N+2): 10 servers
```

**CPU estimation:**
```
Per request: 50ms CPU time
Peak RPS:    3,600
CPU cores:   3,600 × 0.05 = 180 cores

With headroom (70% target utilization):
             180 / 0.7 = 257 cores
             257 / 8 = 32.1 → 33 servers × 8 cores (round up)
```
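Both estimates depend on rounding up, since a fractional server cannot be provisioned. The sketch below redoes the two calculations with `math.ceil`; the function names are hypothetical and the inputs come from the examples above. Note that 257 cores do not fit in 32 × 8 = 256, so the CPU-based estimate rounds up to 33 servers.

```python
import math


def servers_for_throughput(peak_rps, rps_per_server, spare=2):
    """Servers needed to carry peak_rps, plus `spare` extra for N+spare redundancy."""
    return math.ceil(peak_rps / rps_per_server) + spare


def servers_for_cpu(peak_rps, cpu_seconds_per_request, cores_per_server,
                    target_utilization=0.7):
    """Servers needed so that busy cores stay at or below the target utilization."""
    busy_cores = peak_rps * cpu_seconds_per_request
    cores_needed = busy_cores / target_utilization
    return math.ceil(cores_needed / cores_per_server)


by_throughput = servers_for_throughput(peak_rps=3_600, rps_per_server=500)
by_cpu = servers_for_cpu(peak_rps=3_600, cpu_seconds_per_request=0.05,
                         cores_per_server=8)
print(by_throughput, by_cpu)
```

When the two methods disagree, as they do here, the assumptions behind them (500 requests per server versus 50 ms of CPU per request) are inconsistent and worth measuring before sizing.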

### Step 3: Calculate Storage Requirements

**Database storage:**
```
Records per day:    100,000
Record size:        2KB
Daily growth:       200MB

With indexes (2x):  400MB/day
Retention (1 year): 146GB

With replication (3x): 438GB
```

**File storage:**
```
Files per day:      10,000
Average file size:  500KB
Daily growth:       5GB

Retention (1 year): 1.8TB
```
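The two storage projections follow the same formula: daily volume, multiplied by overhead factors and the retention period. A minimal sketch, assuming decimal units (1 GB = 1,000,000 KB) and a 365-day year; `yearly_storage_gb` is a hypothetical helper.

```python
def yearly_storage_gb(items_per_day, item_size_kb, index_factor=1,
                      replication_factor=1, days=365):
    """Storage after `days` of growth, in GB (decimal units)."""
    daily_kb = items_per_day * item_size_kb * index_factor
    return daily_kb * days * replication_factor / 1e6


db_single_copy = yearly_storage_gb(100_000, 2, index_factor=2)
db_replicated = yearly_storage_gb(100_000, 2, index_factor=2, replication_factor=3)
file_storage = yearly_storage_gb(10_000, 500)
print(db_single_copy, db_replicated, file_storage)
```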

### Step 4: Calculate Network Requirements

**Bandwidth:**
```
Response size:      10KB average
Peak RPS:           3,600
Outbound:           3,600 × 10KB = 36MB/s = 288 Mbps

With headroom (50%): 432 Mbps → provision a 500 Mbps connection
```
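A common slip in this calculation is mixing bytes and bits. The sketch below keeps the conversion explicit; it assumes decimal units and treats headroom as a fraction added on top of peak traffic. The function name is hypothetical.

```python
def peak_bandwidth_mbps(peak_rps, response_kb, headroom=0.5):
    """Outbound bandwidth in megabits per second, including headroom."""
    megabytes_per_sec = peak_rps * response_kb / 1000  # KB/s -> MB/s
    megabits_per_sec = megabytes_per_sec * 8           # bytes -> bits
    return megabits_per_sec * (1 + headroom)


raw = peak_bandwidth_mbps(3_600, 10, headroom=0)
with_headroom = peak_bandwidth_mbps(3_600, 10)
print(raw, with_headroom)
```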

### Step 5: Document and Review

**Create capacity plan document:**
- Current requirements
- Growth projections
- Infrastructure recommendations
- Cost estimates
- Review triggers (when to re-evaluate)

---

## 3. API Design Workflow

Use when designing new APIs or refactoring existing ones.

### Step 1: Identify Resources

**List the nouns in your domain:**
```
E-commerce example:
- Users
- Products
- Orders
- Payments
- Reviews
```

### Step 2: Define Operations

**Map CRUD to HTTP methods:**
| Operation | HTTP Method | URL Pattern |
|-----------|-------------|-------------|
| List | GET | /resources |
| Get one | GET | /resources/{id} |
| Create | POST | /resources |
| Update | PUT/PATCH | /resources/{id} |
| Delete | DELETE | /resources/{id} |

### Step 3: Design Request/Response Formats

**Request example:**
```http
POST /api/v1/orders
Content-Type: application/json

{
  "customer_id": "cust-123",
  "items": [
    {"product_id": "prod-456", "quantity": 2}
  ],
  "shipping_address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94102"
  }
}
```

**Response example:**
```http
HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "ord-789",
  "status": "pending",
  "customer_id": "cust-123",
  "items": [...],
  "total": 99.99,
  "created_at": "2024-01-15T10:30:00Z",
  "_links": {
    "self": "/api/v1/orders/ord-789",
    "customer": "/api/v1/customers/cust-123"
  }
}
```

### Step 4: Handle Errors Consistently

**Error response format:**
```http
HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid request parameters",
    "details": [
      {
        "field": "quantity",
        "message": "must be greater than 0"
      }
    ]
  },
  "request_id": "req-abc123"
}
```
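Consistency is easier to keep when every handler builds errors through one helper. The following is a minimal, framework-free sketch of that idea; `validate_items`, `error_response`, and `handle_create_order` are hypothetical names, and the success body is abbreviated.

```python
def validate_items(items):
    """Return one detail entry per invalid item; an empty list means valid."""
    details = []
    for item in items:
        quantity = item.get("quantity")
        # type(...) is int also rejects booleans, which are ints in Python.
        if type(quantity) is not int or quantity <= 0:
            details.append({"field": "quantity",
                            "message": "must be greater than 0"})
    return details


def error_response(code, message, details, request_id):
    """Build the error envelope shown above."""
    return {
        "error": {"code": code, "message": message, "details": details},
        "request_id": request_id,
    }


def handle_create_order(payload, request_id):
    """Return (http_status, body) for a hypothetical POST /api/v1/orders handler."""
    details = validate_items(payload.get("items", []))
    if details:
        return 400, error_response("VALIDATION_ERROR",
                                   "Invalid request parameters",
                                   details, request_id)
    return 201, {"status": "pending"}


status, body = handle_create_order(
    {"items": [{"product_id": "prod-456", "quantity": 0}]}, "req-abc123")
print(status, body)
```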

**Standard error codes:**
| HTTP Status | Use Case |
|-------------|----------|
| 400 | Validation errors |
| 401 | Authentication required |
| 403 | Permission denied |
| 404 | Resource not found |
| 409 | Conflict (duplicate, etc.) |
| 429 | Rate limit exceeded |
| 500 | Internal server error |

### Step 5: Document the API

**Include:**
- Authentication method
- Base URL and versioning
- Endpoints with examples
- Error codes and meanings
- Rate limits
- Pagination format

---

## 4. Database Schema Design Workflow

Use when designing a new database or planning major schema changes.

### Step 1: Identify Entities

**List the things you need to store:**
```
E-commerce:
- User (id, email, name, created_at)
- Product (id, name, price, stock)
- Order (id, user_id, status, total)
- OrderItem (id, order_id, product_id, quantity, price)
```

### Step 2: Define Relationships

**Relationship types:**
```
User ──1:N──▶ Order       (one user, many orders)
Order ──1:N──▶ OrderItem  (one order, many items)
Product ──1:N──▶ OrderItem (one product, many order items)
```

### Step 3: Choose Primary Keys

**Options:**
| Type | Pros | Cons |
|------|------|------|
| Auto-increment | Simple, ordered | Not distributed-friendly |
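| UUID (random, v4) | Globally unique | Larger (16 bytes); random values scatter index inserts |
| ULID | Globally unique, sortable by creation time | Larger than an integer key (16 bytes) |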
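To illustrate the "sortable" column, the sketch below generates a random UUID with the standard library and a ULID-style identifier by hand: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It is an illustration only, not a conformant ULID library (for example, it does not guarantee ordering for IDs created within the same millisecond); `ulid_like` is a hypothetical name.

```python
import os
import time
import uuid

CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_like(timestamp_ms=None):
    """Return a 26-character, time-prefixed identifier."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    # Timestamp in the high 48 bits, randomness in the low 80 bits.
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD_ALPHABET[value & 31])
        value >>= 5
    return "".join(reversed(chars))


random_id = str(uuid.uuid4())   # globally unique, no ordering
earlier = ulid_like(1_000)      # explicit timestamps for demonstration
later = ulid_like(2_000)
print(random_id, earlier, later)
```

Because the timestamp occupies the leading characters, sorting these strings sorts them by creation time, which keeps inserts near the end of a B-tree index.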

### Step 4: Add Indexes

**Index selection rules:**
```sql
-- Index columns used in WHERE clauses
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Index columns used in JOINs
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

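-- Index columns used in ORDER BY with WHERE (filter column first, sort column second)
-- Query: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC
-- (assumes orders has a created_at column, which the entity list above omits)
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);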

-- Consider composite indexes for common queries
-- Query: SELECT * FROM orders WHERE user_id = ? AND status = 'active'
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
```
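Whether a query actually uses an index is worth checking rather than assuming. The sketch below does so with SQLite from the Python standard library, used here only as a convenient stand-in; the exact plan text differs between database engines and versions, and PostgreSQL has its own `EXPLAIN`. The table layout follows the Order entity from Step 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        status TEXT,
        total REAL
    );
    CREATE INDEX idx_orders_user_status ON orders(user_id, status);
""")


def query_plan(sql, params=()):
    """Return SQLite's query plan as one string (last column holds the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Filters on both indexed columns: the composite index applies.
with_index = query_plan(
    "SELECT * FROM orders WHERE user_id = ? AND status = ?", (1, "active"))

# Filters on a column with no index: the plan falls back to a table scan.
without_index = query_plan("SELECT * FROM orders WHERE total > ?", (100,))

print(with_index)
print(without_index)
```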

### Step 5: Plan for Scale

**Partitioning strategies (PostgreSQL syntax):**
```sql
-- Partition by date (time-series data)
CREATE TABLE events (
  id BIGINT,
  created_at TIMESTAMP,
  data JSONB
) PARTITION BY RANGE (created_at);

-- Partition by hash (distribute evenly)
CREATE TABLE users (
  id BIGINT,
  email VARCHAR(255)
) PARTITION BY HASH (id);
```

**Sharding considerations:**
- Shard key selection (user_id, tenant_id, etc.)
- Cross-shard query limitations
- Rebalancing strategy
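The rebalancing concern is easy to see with the simplest scheme, hashing the shard key modulo the shard count. In the sketch below (hypothetical helper names, SHA-256 chosen only because it is stable across processes), growing from 4 to 5 shards moves about four in five keys, which is why schemes such as consistent hashing or a lookup directory are often preferred.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard index with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys
                if shard_for(k, old_shards) != shard_for(k, new_shards))
    return moved / len(keys)


user_ids = [f"user-{i}" for i in range(1000)]
fraction_moved = moved_fraction(user_ids, old_shards=4, new_shards=5)
print(f"{fraction_moved:.0%} of keys change shard when going from 4 to 5 shards")
```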

---

## 5. Scalability Assessment Workflow

Use when evaluating whether the current architecture can handle growth.

### Step 1: Profile Current System

**Metrics to collect:**
```
Current load:
- Average requests/sec: ___
- Peak requests/sec: ___
- Average latency: ___ ms
- P99 latency: ___ ms
- Error rate: ___%

Resource utilization:
- CPU: ___%
- Memory: ___%
- Disk I/O: ___%
- Network: ___%
```
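Average and P99 latency can tell very different stories, which is why the template asks for both. A minimal sketch using the nearest-rank definition of a percentile; the sample data is invented for illustration and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [20] * 98 + [500, 900]

average_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(average_ms, p50_ms, p99_ms)
```

Here the average stays low while P99 exposes the slow tail that a fraction of users actually experience.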

### Step 2: Identify Bottlenecks

**Check each layer:**
| Layer | Bottleneck Signs |
|-------|------------------|
| Web servers | High CPU, connection limits |
| Application | Slow requests, thread pool exhaustion |
| Database | Slow queries, lock contention |
| Cache | High miss rate, memory pressure |
| Network | Bandwidth saturation, latency |

### Step 3: Load Test

**Test scenarios:**
```
1. Baseline: Current production load
2. 2x load: Expected growth in 6 months
3. 5x load: Stress test
4. Spike: Sudden 10x for 5 minutes
```

**Tools:**
- k6, Locust, JMeter for HTTP
- pgbench for PostgreSQL
- redis-benchmark for Redis

### Step 4: Identify Scaling Strategy

**Vertical scaling (scale up):**
- Add more CPU, memory, disk
- Simpler but has limits
- Use when: the workload still fits on a single, larger server

**Horizontal scaling (scale out):**
- Add more servers
- Requires stateless design
- Use when: load exceeds what one server can handle, or capacity must grow roughly in step with server count

### Step 5: Create Scaling Plan

**Document:**
```
Trigger: When average CPU > 70% for 15 minutes

Action:
1. Add 2 more web servers
2. Update load balancer
3. Verify health checks pass

Rollback:
1. Remove added servers
2. Update load balancer
3. Investigate issue
```
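The trigger in the plan above can be made unambiguous by writing it as a check. This sketch assumes one CPU sample per minute (each already an average over that minute) and interprets "for 15 minutes" as every sample in the most recent 15 being above the threshold; both are assumptions, and `should_scale_out` is a hypothetical name.

```python
def should_scale_out(cpu_samples, threshold=70.0, window=15):
    """True when the most recent `window` samples all exceed the threshold."""
    if len(cpu_samples) < window:
        return False  # not enough history to decide
    return all(sample > threshold for sample in cpu_samples[-window:])


sustained = should_scale_out([82.0] * 15)
brief_spike = should_scale_out([55.0] * 10 + [95.0] * 5)
print(sustained, brief_spike)
```

Requiring the whole window to exceed the threshold avoids adding servers in response to a short spike.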

---

## 6. Migration Planning Workflow

Use when migrating to new infrastructure, database, or architecture.

### Step 1: Assess Current State

**Document:**
- Current architecture diagram
- Data volumes
- Dependencies
- Integration points
- Performance baselines

### Step 2: Define Target State

**Document:**
- New architecture diagram
- Technology changes
- Expected improvements
- Success criteria

### Step 3: Plan Migration Strategy

**Strategies:**

| Strategy | Risk | Downtime | Complexity |
|----------|------|----------|------------|
| Big bang | High | Yes | Low |
| Blue-green | Medium | Minimal | Medium |
| Canary | Low | None | High |
| Strangler fig | Low | None | High |

**Strangler fig pattern (recommended for large systems):**
```
1. Add facade in front of old system
2. Route a small slice of traffic (for example, one feature or endpoint) to new system
3. Gradually increase the share of traffic and features served by new system
4. Retire old system when 100% migrated
```
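The facade in step 1 needs a rule for deciding which system serves a request. A minimal sketch, assuming routing by URL path prefix; the prefixes and the `backend_for` function are hypothetical, and a real facade would usually live in a reverse proxy or API gateway.

```python
# Hypothetical list of endpoints that have already been moved to the new system.
MIGRATED_PREFIXES = ("/api/v1/reviews", "/api/v1/products")


def backend_for(path, migrated_prefixes=MIGRATED_PREFIXES):
    """Return "new" for migrated endpoints and "old" for everything else."""
    for prefix in migrated_prefixes:
        # Match the prefix itself or anything beneath it, but not look-alikes
        # such as /api/v1/reviews-archive.
        if path == prefix or path.startswith(prefix + "/"):
            return "new"
    return "old"


print(backend_for("/api/v1/reviews/42"))
print(backend_for("/api/v1/orders/ord-789"))
```

Migrating another feature then means adding one prefix, and rolling it back means removing it.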

### Step 4: Create Rollback Plan

**For each step, define:**
```
Step: Migrate user service to new database

Rollback trigger:
- Error rate > 1%
- Latency > 500ms P99
- Data inconsistency detected

Rollback steps:
1. Route traffic back to old database
2. Sync any new data back
3. Investigate root cause

Rollback time estimate: 15 minutes
```
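Rollback triggers are most useful when they can be evaluated mechanically. The sketch below encodes the three triggers from the example; the function name and the idea of passing a count of inconsistent records are assumptions.

```python
def rollback_reasons(error_rate_pct, p99_latency_ms, inconsistent_records,
                     max_error_rate_pct=1.0, max_p99_ms=500):
    """Return the list of triggers that fired; an empty list means keep going."""
    reasons = []
    if error_rate_pct > max_error_rate_pct:
        reasons.append(f"error rate {error_rate_pct}% > {max_error_rate_pct}%")
    if p99_latency_ms > max_p99_ms:
        reasons.append(f"P99 latency {p99_latency_ms}ms > {max_p99_ms}ms")
    if inconsistent_records > 0:
        reasons.append(f"{inconsistent_records} inconsistent records detected")
    return reasons


healthy = rollback_reasons(error_rate_pct=0.2, p99_latency_ms=310,
                           inconsistent_records=0)
failing = rollback_reasons(error_rate_pct=2.5, p99_latency_ms=310,
                           inconsistent_records=3)
print(healthy, failing)
```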

### Step 5: Execute with Checkpoints

**Migration checklist:**
```
□ Backup current system
□ Verify backup restoration works
□ Deploy new infrastructure
□ Run smoke tests on new system
□ Migrate small percentage (1%)
□ Monitor for 24 hours
□ Increase to 10%
□ Monitor for 24 hours
□ Increase to 50%
□ Monitor for 24 hours
□ Complete migration (100%)
□ Decommission old system
□ Document lessons learned
```
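The percentage steps in the checklist work best when a given user stays on the same side as the percentage grows. One way to get that property is to hash a stable identifier into a bucket from 0 to 99 and compare it with the rollout percentage. A minimal sketch; `routes_to_new_system` and the user ID format are hypothetical.

```python
import hashlib


def routes_to_new_system(user_id, rollout_percent):
    """Deterministically place a user in the first `rollout_percent` of 100 buckets."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    on_new = sum(routes_to_new_system(u, percent) for u in users)
    print(f"{percent:>3}% rollout -> {on_new} of {len(users)} users on new system")
```

Because the bucket depends only on the user ID, anyone routed to the new system at 10% is still routed there at 50%, and lowering the percentage moves the most recently added users back first.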

---

## Quick Reference

| Task | Start Here |
|------|------------|
| New system design | [System Design Interview Approach](#1-system-design-interview-approach) |
| Infrastructure sizing | [Capacity Planning](#2-capacity-planning-workflow) |
| New API | [API Design](#3-api-design-workflow) |
| Database design | [Database Schema Design](#4-database-schema-design-workflow) |
| Handle growth | [Scalability Assessment](#5-scalability-assessment-workflow) |
| System migration | [Migration Planning](#6-migration-planning-workflow) |