Addresses SkillzWave feedback and Anthropic best practices: SKILL.md (343 lines): - Third-person description with trigger phrases - Added Table of Contents for navigation - Concrete tool descriptions with usage examples - Decision workflows: Database, Architecture Pattern, Monolith vs Microservices - Removed marketing fluff, added actionable content References (rewritten with real content): - architecture_patterns.md: 9 patterns with trade-offs, code examples (Monolith, Modular Monolith, Microservices, Event-Driven, CQRS, Event Sourcing, Hexagonal, Clean Architecture, API Gateway) - system_design_workflows.md: 6 step-by-step workflows (System Design Interview, Capacity Planning, API Design, Database Schema, Scalability Assessment, Migration Planning) - tech_decision_guide.md: 7 decision frameworks with matrices (Database, Cache, Message Queue, Auth, Frontend, Cloud, API) Scripts (fully functional, standard library only): - architecture_diagram_generator.py: Mermaid + PlantUML + ASCII output Scans project structure, detects components, relationships - dependency_analyzer.py: npm/pip/go/cargo support Circular dependency detection, coupling score calculation - project_architect.py: Pattern detection (7 patterns) Layer violation detection, code quality metrics All scripts tested and working. Closes #48 Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
12 KiB
System Design Workflows
Step-by-step workflows for common system design tasks.
Workflows Index
- System Design Interview Approach
- Capacity Planning Workflow
- API Design Workflow
- Database Schema Design
- Scalability Assessment
- Migration Planning
1. System Design Interview Approach
Use when designing a system from scratch or explaining architecture decisions.
Step 1: Clarify Requirements (3-5 minutes)
Functional requirements:
- What are the core features?
- Who are the users?
- What actions can users take?
Non-functional requirements:
- Expected scale (users, requests/sec, data size)
- Latency requirements
- Availability requirements (99.9%? 99.99%?)
- Consistency requirements (strong? eventual?)
Example questions to ask:
- How many users? Daily active users?
- Read/write ratio?
- Data retention period?
- Geographic distribution?
- Peak vs average load?
Step 2: Estimate Scale (2-3 minutes)
Calculate key metrics:
Users: 10M monthly active users
DAU: 1M daily active users
Requests: 100 req/user/day = 100M req/day
= 1,200 req/sec (avg)
= 3,600 req/sec (peak, 3x)
Storage: 1KB/request × 100M = 100GB/day
= 36TB/year
Bandwidth: 100GB/day = 1.2 MB/sec (avg)
Step 3: Design High-Level Architecture (5-10 minutes)
Start with basic components:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Client │────▶│ API │────▶│ Database │
└──────────┘ └──────────┘ └──────────┘
Add components as needed:
- Load balancer for traffic distribution
- Cache for read-heavy workloads
- CDN for static content
- Message queue for async processing
- Search index for complex queries
Step 4: Deep Dive into Components (10-15 minutes)
For each major component, discuss:
- Why this technology choice?
- How does it handle failures?
- How does it scale?
- What are the trade-offs?
Step 5: Address Bottlenecks (5 minutes)
Common bottlenecks:
- Database read/write capacity
- Network bandwidth
- Single points of failure
- Hot spots in data distribution
Solutions:
- Caching (Redis, Memcached)
- Database sharding
- Read replicas
- CDN for static content
- Async processing for non-critical paths
2. Capacity Planning Workflow
Use when estimating infrastructure requirements for a new system or feature.
Step 1: Gather Requirements
| Metric | Current | 6 months | 1 year |
|---|---|---|---|
| Monthly active users | |||
| Peak concurrent users | |||
| Requests per second | |||
| Data storage (GB) | |||
| Bandwidth (Mbps) |
Step 2: Calculate Compute Requirements
Web/API servers:
Peak RPS: 3,600
Requests per server: 500 (conservative)
Servers needed: 3,600 / 500 = 8 servers
With redundancy (N+2): 10 servers
CPU estimation:
Per request: 50ms CPU time
Peak RPS: 3,600
CPU cores: 3,600 × 0.05 = 180 cores
With headroom (70% target utilization):
180 / 0.7 = 257 cores
= 32 servers × 8 cores
Step 3: Calculate Storage Requirements
Database storage:
Records per day: 100,000
Record size: 2KB
Daily growth: 200MB
With indexes (2x): 400MB/day
Retention (1 year): 146GB
With replication (3x): 438GB
File storage:
Files per day: 10,000
Average file size: 500KB
Daily growth: 5GB
Retention (1 year): 1.8TB
Step 4: Calculate Network Requirements
Bandwidth:
Response size: 10KB average
Peak RPS: 3,600
Outbound: 3,600 × 10KB = 36MB/s = 288 Mbps
With headroom (50%): 432 Mbps ≈ 500 Mbps connection
Step 5: Document and Review
Create capacity plan document:
- Current requirements
- Growth projections
- Infrastructure recommendations
- Cost estimates
- Review triggers (when to re-evaluate)
3. API Design Workflow
Use when designing new APIs or refactoring existing ones.
Step 1: Identify Resources
List the nouns in your domain:
E-commerce example:
- Users
- Products
- Orders
- Payments
- Reviews
Step 2: Define Operations
Map CRUD to HTTP methods:
| Operation | HTTP Method | URL Pattern |
|---|---|---|
| List | GET | /resources |
| Get one | GET | /resources/{id} |
| Create | POST | /resources |
| Update | PUT/PATCH | /resources/{id} |
| Delete | DELETE | /resources/{id} |
Step 3: Design Request/Response Formats
Request example:
POST /api/v1/orders
Content-Type: application/json
{
"customer_id": "cust-123",
"items": [
{"product_id": "prod-456", "quantity": 2}
],
"shipping_address": {
"street": "123 Main St",
"city": "San Francisco",
"state": "CA",
"zip": "94102"
}
}
Response example:
HTTP/1.1 201 Created
Content-Type: application/json
{
"id": "ord-789",
"status": "pending",
"customer_id": "cust-123",
"items": [...],
"total": 99.99,
"created_at": "2024-01-15T10:30:00Z",
"_links": {
"self": "/api/v1/orders/ord-789",
"customer": "/api/v1/customers/cust-123"
}
}
Step 4: Handle Errors Consistently
Error response format:
HTTP/1.1 400 Bad Request
Content-Type: application/json
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid request parameters",
"details": [
{
"field": "quantity",
"message": "must be greater than 0"
}
]
},
"request_id": "req-abc123"
}
Standard error codes:
| HTTP Status | Use Case |
|---|---|
| 400 | Validation errors |
| 401 | Authentication required |
| 403 | Permission denied |
| 404 | Resource not found |
| 409 | Conflict (duplicate, etc.) |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
Step 5: Document the API
Include:
- Authentication method
- Base URL and versioning
- Endpoints with examples
- Error codes and meanings
- Rate limits
- Pagination format
4. Database Schema Design Workflow
Use when designing a new database or major schema changes.
Step 1: Identify Entities
List the things you need to store:
E-commerce:
- User (id, email, name, created_at)
- Product (id, name, price, stock)
- Order (id, user_id, status, total)
- OrderItem (id, order_id, product_id, quantity, price)
Step 2: Define Relationships
Relationship types:
User ──1:N──▶ Order (one user, many orders)
Order ──1:N──▶ OrderItem (one order, many items)
Product ──1:N──▶ OrderItem (one product, many order items)
Step 3: Choose Primary Keys
Options:
| Type | Pros | Cons |
|---|---|---|
| Auto-increment | Simple, ordered | Not distributed-friendly |
| UUID | Globally unique | Larger, random |
| ULID | Globally unique, sortable | Larger |
Step 4: Add Indexes
Index selection rules:
-- Index columns used in WHERE clauses
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Index columns used in JOINs
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- Index columns used in ORDER BY with WHERE
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Consider composite indexes for common queries
-- Query: SELECT * FROM orders WHERE user_id = ? AND status = 'active'
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
Step 5: Plan for Scale
Partitioning strategies:
-- Partition by date (time-series data)
CREATE TABLE events (
id BIGINT,
created_at TIMESTAMP,
data JSONB
) PARTITION BY RANGE (created_at);
-- Partition by hash (distribute evenly)
CREATE TABLE users (
id BIGINT,
email VARCHAR(255)
) PARTITION BY HASH (id);
Sharding considerations:
- Shard key selection (user_id, tenant_id, etc.)
- Cross-shard query limitations
- Rebalancing strategy
5. Scalability Assessment Workflow
Use when evaluating if current architecture can handle growth.
Step 1: Profile Current System
Metrics to collect:
Current load:
- Average requests/sec: ___
- Peak requests/sec: ___
- Average latency: ___ ms
- P99 latency: ___ ms
- Error rate: ___%
Resource utilization:
- CPU: ___%
- Memory: ___%
- Disk I/O: ___%
- Network: ___%
Step 2: Identify Bottlenecks
Check each layer:
| Layer | Bottleneck Signs |
|---|---|
| Web servers | High CPU, connection limits |
| Application | Slow requests, thread pool exhaustion |
| Database | Slow queries, lock contention |
| Cache | High miss rate, memory pressure |
| Network | Bandwidth saturation, latency |
Step 3: Load Test
Test scenarios:
1. Baseline: Current production load
2. 2x load: Expected growth in 6 months
3. 5x load: Stress test
4. Spike: Sudden 10x for 5 minutes
Tools:
- k6, Locust, JMeter for HTTP
- pgbench for PostgreSQL
- redis-benchmark for Redis
Step 4: Identify Scaling Strategy
Vertical scaling (scale up):
- Add more CPU, memory, disk
- Simpler but has limits
- Use when: Single server can handle more
Horizontal scaling (scale out):
- Add more servers
- Requires stateless design
- Use when: Need linear scaling
Step 5: Create Scaling Plan
Document:
Trigger: When average CPU > 70% for 15 minutes
Action:
1. Add 2 more web servers
2. Update load balancer
3. Verify health checks pass
Rollback:
1. Remove added servers
2. Update load balancer
3. Investigate issue
6. Migration Planning Workflow
Use when migrating to new infrastructure, database, or architecture.
Step 1: Assess Current State
Document:
- Current architecture diagram
- Data volumes
- Dependencies
- Integration points
- Performance baselines
Step 2: Define Target State
Document:
- New architecture diagram
- Technology changes
- Expected improvements
- Success criteria
Step 3: Plan Migration Strategy
Strategies:
| Strategy | Risk | Downtime | Complexity |
|---|---|---|---|
| Big bang | High | Yes | Low |
| Blue-green | Medium | Minimal | Medium |
| Canary | Low | None | High |
| Strangler fig | Low | None | High |
Strangler fig pattern (recommended for large systems):
1. Add facade in front of old system
2. Route small percentage of traffic to new system
3. Gradually increase traffic to new system
4. Retire old system when 100% migrated
Step 4: Create Rollback Plan
For each step, define:
Step: Migrate user service to new database
Rollback trigger:
- Error rate > 1%
- Latency > 500ms P99
- Data inconsistency detected
Rollback steps:
1. Route traffic back to old database
2. Sync any new data back
3. Investigate root cause
Rollback time estimate: 15 minutes
Step 5: Execute with Checkpoints
Migration checklist:
□ Backup current system
□ Verify backup restoration works
□ Deploy new infrastructure
□ Run smoke tests on new system
□ Migrate small percentage (1%)
□ Monitor for 24 hours
□ Increase to 10%
□ Monitor for 24 hours
□ Increase to 50%
□ Monitor for 24 hours
□ Complete migration (100%)
□ Decommission old system
□ Document lessons learned
Quick Reference
| Task | Start Here |
|---|---|
| New system design | System Design Interview Approach |
| Infrastructure sizing | Capacity Planning |
| New API | API Design |
| Database design | Database Schema Design |
| Handle growth | Scalability Assessment |
| System migration | Migration Planning |