Merge branch 'feature/sprint-phase-3-gaps' into dev

# Conflicts:
#	docs/skills/engineering-team/index.md
#	docs/skills/engineering/index.md
#	mkdocs.yml
Committed by Reza Rezvani, 2026-03-25 14:23:21 +01:00
36 changed files with 13448 additions and 4 deletions


@@ -0,0 +1,429 @@
---
title: "GCP Cloud Architect — Agent Skill & Codex Plugin"
description: "Design GCP architectures for startups and enterprises. Use when asked to design Google Cloud infrastructure, deploy to GKE or Cloud Run, configure. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---
# GCP Cloud Architect
<div class="page-meta" markdown>
<span class="meta-badge">:material-code-braces: Engineering - Core</span>
<span class="meta-badge">:material-identifier: `gcp-cloud-architect`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/gcp-cloud-architect/SKILL.md">Source</a></span>
</div>
<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install engineering-skills</code>
</div>
Design scalable, cost-effective Google Cloud architectures for startups and enterprises with infrastructure-as-code templates.
---
## Workflow
### Step 1: Gather Requirements
Collect application specifications:
```
- Application type (web app, mobile backend, data pipeline, SaaS)
- Expected users and requests per second
- Budget constraints (monthly spend limit)
- Team size and GCP experience level
- Compliance requirements (GDPR, HIPAA, SOC 2)
- Availability requirements (SLA, RPO/RTO)
```
### Step 2: Design Architecture
Run the architecture designer to get pattern recommendations:
```bash
python scripts/architecture_designer.py --input requirements.json
```
**Example output:**
```json
{
"recommended_pattern": "serverless_web",
"service_stack": ["Cloud Storage", "Cloud CDN", "Cloud Run", "Firestore", "Identity Platform"],
"estimated_monthly_cost_usd": 30,
  "pros": ["Low ops overhead", "Pay-per-use", "Auto-scaling", "No cold starts with Cloud Run min instances"],
"cons": ["Vendor lock-in", "Regional limitations", "Eventual consistency with Firestore"]
}
```
Select from recommended patterns:
- **Serverless Web**: Cloud Storage + Cloud CDN + Cloud Run + Firestore
- **Microservices on GKE**: GKE Autopilot + Cloud SQL + Memorystore + Cloud Pub/Sub
- **Serverless Data Pipeline**: Pub/Sub + Dataflow + BigQuery + Looker
- **ML Platform**: Vertex AI + Cloud Storage + BigQuery + Cloud Functions
See `references/architecture_patterns.md` for detailed pattern specifications.
**Validation checkpoint:** Confirm the recommended pattern matches the team's operational maturity and compliance requirements before proceeding to Step 3.
### Step 3: Estimate Cost
Analyze estimated costs and optimization opportunities:
```bash
python scripts/cost_optimizer.py --resources current_setup.json --monthly-spend 2000
```
**Example output:**
```json
{
"current_monthly_usd": 2000,
"recommendations": [
{ "action": "Right-size Cloud SQL db-custom-4-16384 to db-custom-2-8192", "savings_usd": 380, "priority": "high" },
{ "action": "Purchase 1-yr committed use discount for GKE nodes", "savings_usd": 290, "priority": "high" },
{ "action": "Move Cloud Storage objects >90 days to Nearline", "savings_usd": 75, "priority": "medium" }
],
"total_potential_savings_usd": 745
}
```
Output includes:
- Monthly cost breakdown by service
- Right-sizing recommendations
- Committed use discount opportunities
- Sustained use discount analysis
- Potential monthly savings
Use the [GCP Pricing Calculator](https://cloud.google.com/products/calculator) for detailed estimates.
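The shape of the recommendation output lends itself to simple post-processing. A minimal sketch of aggregating such a list by priority and totaling the projected savings (the field names mirror the example output above; this is not the actual `cost_optimizer.py` implementation):

```python
# Illustrative aggregation of cost recommendations. Field names follow the
# example JSON above; the real cost_optimizer.py may differ.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def summarize(recommendations):
    """Sort recommendations by priority and total the projected savings."""
    ordered = sorted(recommendations, key=lambda r: PRIORITY_ORDER[r["priority"]])
    total = sum(r["savings_usd"] for r in ordered)
    return ordered, total

recs = [
    {"action": "Right-size Cloud SQL", "savings_usd": 380, "priority": "high"},
    {"action": "1-yr committed use discount", "savings_usd": 290, "priority": "high"},
    {"action": "Nearline storage transition", "savings_usd": 75, "priority": "medium"},
]
ordered, total = summarize(recs)
print(total)  # 745
```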
### Step 4: Generate IaC
Create infrastructure-as-code for the selected pattern:
```bash
python scripts/deployment_manager.py --app-name my-app --pattern serverless_web --region us-central1
```
**Example Terraform HCL output (Cloud Run + Firestore):**
```hcl
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
provider "google" {
project = var.project_id
region = var.region
}
variable "project_id" {
description = "GCP project ID"
type = string
}
variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}
variable "app_name" {
  description = "Application name used in resource names"
  type        = string
}
variable "environment" {
  description = "Deployment environment (e.g. dev, staging, prod)"
  type        = string
  default     = "dev"
}
resource "google_cloud_run_v2_service" "api" {
name = "${var.environment}-${var.app_name}-api"
location = var.region
template {
containers {
image = "gcr.io/${var.project_id}/${var.app_name}:latest"
resources {
limits = {
cpu = "1000m"
memory = "512Mi"
}
}
env {
name = "FIRESTORE_PROJECT"
value = var.project_id
}
}
scaling {
min_instance_count = 0
max_instance_count = 10
}
}
}
resource "google_firestore_database" "default" {
project = var.project_id
name = "(default)"
location_id = var.region
type = "FIRESTORE_NATIVE"
}
```
**Example gcloud CLI deployment:**
```bash
# Deploy Cloud Run service
gcloud run deploy my-app-api \
--image gcr.io/$PROJECT_ID/my-app:latest \
--region us-central1 \
--platform managed \
--allow-unauthenticated \
--memory 512Mi \
--cpu 1 \
--min-instances 0 \
--max-instances 10
# Create Firestore database
gcloud firestore databases create --location=us-central1
```
> Full templates including Cloud CDN, Identity Platform, IAM, and Cloud Monitoring are generated by `deployment_manager.py` and also available in `references/architecture_patterns.md`.
### Step 5: Configure CI/CD
Set up automated deployment with Cloud Build or GitHub Actions:
```yaml
# cloudbuild.yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA', '.']
- name: 'gcr.io/cloud-builders/docker'
args: ['push', 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA']
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: gcloud
args:
- 'run'
- 'deploy'
- 'my-app-api'
- '--image=gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA'
- '--region=us-central1'
- '--platform=managed'
images:
- 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA'
```
```bash
# Connect repo and create trigger
gcloud builds triggers create github \
--repo-name=my-app \
--repo-owner=my-org \
--branch-pattern="^main$" \
--build-config=cloudbuild.yaml
```
### Step 6: Security Review
Verify security configuration:
```bash
# Review IAM bindings
gcloud projects get-iam-policy $PROJECT_ID --format=json
# Check service account permissions
gcloud iam service-accounts list --project=$PROJECT_ID
# Verify VPC Service Controls (if applicable)
gcloud access-context-manager perimeters list --policy=$POLICY_ID
```
**Security checklist:**
- IAM roles follow least privilege (prefer predefined roles over basic roles)
- Service accounts use Workload Identity for GKE
- VPC Service Controls configured for sensitive APIs
- Cloud KMS encryption keys for customer-managed encryption
- Cloud Audit Logs enabled for all admin activity
- Organization policies restrict public access
- Secret Manager used for all credentials
**If deployment fails:**
1. Check the failure reason:
```bash
gcloud run services describe my-app-api --region us-central1
gcloud logging read "resource.type=cloud_run_revision" --limit=20
```
2. Review Cloud Logging for application errors.
3. Fix the configuration or container image.
4. Redeploy:
```bash
gcloud run deploy my-app-api --image gcr.io/$PROJECT_ID/my-app:latest --region us-central1
```
**Common failure causes:**
- IAM permission errors — verify service account roles and the `--allow-unauthenticated` flag
- Quota exceeded — request a quota increase via IAM & Admin > Quotas
- Container startup failure — check container logs and health check configuration
- Required API not enabled — enable it with `gcloud services enable`
---
## Tools
### architecture_designer.py
Recommends GCP services based on workload requirements.
```bash
python scripts/architecture_designer.py --input requirements.json --output design.json
```
**Input:** JSON with app type, scale, budget, compliance needs
**Output:** Recommended pattern, service stack, cost estimate, pros/cons
### cost_optimizer.py
Analyzes GCP resources for cost savings.
```bash
python scripts/cost_optimizer.py --resources inventory.json --monthly-spend 5000
```
**Output:** Recommendations for:
- Idle resource removal
- Machine type right-sizing
- Committed use discounts
- Storage class transitions
- Network egress optimization
### deployment_manager.py
Generates gcloud CLI deployment scripts and Terraform configurations.
```bash
python scripts/deployment_manager.py --app-name my-app --pattern serverless_web --region us-central1
```
**Output:** Production-ready deployment scripts with:
- Cloud Run or GKE deployment
- Firestore or Cloud SQL setup
- Identity Platform configuration
- IAM roles with least privilege
- Cloud Monitoring and Logging
---
## Quick Start
### Web App on Cloud Run (< $100/month)
```
Ask: "Design a serverless web backend for a mobile app with 1000 users"
Result:
- Cloud Run for API (auto-scaling, no cold start with min instances)
- Firestore for data (pay-per-operation)
- Identity Platform for authentication
- Cloud Storage + Cloud CDN for static assets
- Estimated: $15-40/month
```
### Microservices on GKE ($500-2000/month)
```
Ask: "Design a scalable architecture for a SaaS platform with 50k users"
Result:
- GKE Autopilot for containerized workloads
- Cloud SQL (PostgreSQL) with read replicas
- Memorystore (Redis) for session caching
- Cloud CDN for global delivery
- Cloud Build for CI/CD
- Multi-zone deployment
```
### Serverless Data Pipeline
```
Ask: "Design a real-time analytics pipeline for event data"
Result:
- Pub/Sub for event ingestion
- Dataflow (Apache Beam) for stream processing
- BigQuery for analytics and warehousing
- Looker for dashboards
- Cloud Functions for lightweight transforms
```
### ML Platform
```
Ask: "Design a machine learning platform for model training and serving"
Result:
- Vertex AI for training and prediction
- Cloud Storage for datasets and model artifacts
- BigQuery for feature store
- Cloud Functions for preprocessing triggers
- Cloud Monitoring for model drift detection
```
---
## Input Requirements
Provide these details for architecture design:
| Requirement | Description | Example |
|-------------|-------------|---------|
| Application type | What you're building | SaaS platform, mobile backend |
| Expected scale | Users, requests/sec | 10k users, 100 RPS |
| Budget | Monthly GCP limit | $500/month max |
| Team context | Size, GCP experience | 3 devs, intermediate |
| Compliance | Regulatory needs | HIPAA, GDPR, SOC 2 |
| Availability | Uptime requirements | 99.9% SLA, 1hr RPO |
**JSON Format:**
```json
{
"application_type": "saas_platform",
"expected_users": 10000,
"requests_per_second": 100,
"budget_monthly_usd": 500,
"team_size": 3,
"gcp_experience": "intermediate",
"compliance": ["SOC2"],
"availability_sla": "99.9%"
}
```
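A quick sanity check of the requirements file can catch missing fields before invoking the designer. A hedged sketch (the required keys mirror the example above; the actual `architecture_designer.py` may validate differently):

```python
# Minimal validation of a requirements JSON document. The REQUIRED set
# mirrors the example above, not necessarily the script's own checks.
import json

REQUIRED = {
    "application_type", "expected_users", "requests_per_second",
    "budget_monthly_usd", "team_size", "gcp_experience",
    "compliance", "availability_sla",
}

def missing_fields(raw):
    """Return the required keys absent from the JSON document."""
    data = json.loads(raw)
    return sorted(REQUIRED - data.keys())

doc = '{"application_type": "saas_platform", "expected_users": 10000}'
print(missing_fields(doc))  # the six absent keys, sorted
```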
---
## Output Formats
### Architecture Design
- Pattern recommendation with rationale
- Service stack diagram (ASCII)
- Monthly cost estimate and trade-offs
### IaC Templates
- **Terraform HCL**: Production-ready Google provider configs
- **gcloud CLI**: Scripted deployment commands
- **Cloud Build YAML**: CI/CD pipeline definitions
### Cost Analysis
- Current spend breakdown with optimization recommendations
- Priority action list (high/medium/low) and implementation checklist
---
## Reference Documentation
| Document | Contents |
|----------|----------|
| `references/architecture_patterns.md` | 6 patterns: serverless, GKE microservices, three-tier, data pipeline, ML platform, multi-region |
| `references/service_selection.md` | Decision matrices for compute, database, storage, messaging |
| `references/best_practices.md` | Naming, labels, IAM, networking, monitoring, disaster recovery |


@@ -1,13 +1,13 @@
---
title: "Engineering - Core Skills — Agent Skills & Codex Plugins"
description: "43 engineering - core skills — engineering agent skill and Claude Code plugin for code generation, DevOps, architecture, and testing. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
description: "44 engineering - core skills — engineering agent skill and Claude Code plugin for code generation, DevOps, architecture, and testing. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
---
<div class="domain-header" markdown>
# :material-code-braces: Engineering - Core
<p class="domain-count">43 skills in this domain</p>
<p class="domain-count">44 skills in this domain</p>
</div>
@@ -59,6 +59,12 @@ description: "43 engineering - core skills — engineering agent skill and Claud
You are now a world-class epic design expert. You build cinematic, immersive websites that feel premium and alive — u...
- **[GCP Cloud Architect](gcp-cloud-architect.md)**
---
Design scalable, cost-effective Google Cloud architectures for startups and enterprises with infrastructure-as-code t...
- **[Google Workspace CLI](google-workspace-cli.md)**
---


@@ -191,6 +191,12 @@ description: "46 engineering - powerful skills — advanced agent-native skill a
Tier: POWERFUL
- **[Secrets Vault Manager](secrets-vault-manager.md)**
---
Tier: POWERFUL
- **[Skill Security Auditor](skill-security-auditor.md)**
---
@@ -209,6 +215,12 @@ description: "46 engineering - powerful skills — advanced agent-native skill a
Spec-driven workflow enforces a single, non-negotiable rule: write the specification BEFORE you write any code. Not a...
- **[SQL Database Assistant - POWERFUL Tier Skill](sql-database-assistant.md)**
---
The operational companion to database design. While database-designer focuses on schema architecture and database-sch...
- **[Tech Debt Tracker](tech-debt-tracker.md)**
---


@@ -0,0 +1,414 @@
---
title: "Secrets Vault Manager — Agent Skill for Codex & OpenClaw"
description: "Use when the user asks to set up secret management infrastructure, integrate HashiCorp Vault, configure cloud secret stores (AWS Secrets Manager. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---
# Secrets Vault Manager
<div class="page-meta" markdown>
<span class="meta-badge">:material-rocket-launch: Engineering - POWERFUL</span>
<span class="meta-badge">:material-identifier: `secrets-vault-manager`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering/secrets-vault-manager/SKILL.md">Source</a></span>
</div>
<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install engineering-advanced-skills</code>
</div>
**Tier:** POWERFUL
**Category:** Engineering
**Domain:** Security / Infrastructure / DevOps
---
## Overview
Production secret infrastructure management for teams running HashiCorp Vault, cloud-native secret stores, or hybrid architectures. This skill covers policy authoring, auth method configuration, automated rotation, dynamic secrets, audit logging, and incident response.
**Distinct from env-secrets-manager** which handles local `.env` file hygiene and leak detection. This skill operates at the infrastructure layer — Vault clusters, cloud KMS, certificate authorities, and CI/CD secret injection.
### When to Use
- Standing up a new Vault cluster or migrating to a managed secret store
- Designing auth methods for services, CI runners, and human operators
- Implementing automated credential rotation (database, API keys, certificates)
- Auditing secret access patterns for compliance (SOC 2, ISO 27001, HIPAA)
- Responding to a secret leak that requires mass revocation
- Integrating secrets into Kubernetes workloads or CI/CD pipelines
---
## HashiCorp Vault Patterns
### Architecture Decisions
| Decision | Recommendation | Rationale |
|----------|---------------|-----------|
| Deployment mode | HA with Raft storage | No external dependency, built-in leader election |
| Auto-unseal | Cloud KMS (AWS KMS / Azure Key Vault / GCP KMS) | Eliminates manual unseal, enables automated restarts |
| Namespaces | One per environment (dev/staging/prod) | Blast-radius isolation, independent policies |
| Audit devices | File + syslog (dual) | Vault refuses requests if all audit devices fail — dual prevents outages |
### Auth Methods
**AppRole** — Machine-to-machine authentication for services and batch jobs.
```hcl
# Policy allowing an operator to manage AppRole roles
path "auth/approle/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
```
```bash
# Enable the AppRole auth method, then create an application-specific role
vault auth enable approle

vault write auth/approle/role/payment-service \
    token_ttl=1h \
    token_max_ttl=4h \
    secret_id_num_uses=1 \
    secret_id_ttl=10m \
    token_policies="payment-service-read"
```
**Kubernetes** — Pod-native authentication via service account tokens.
```bash
vault write auth/kubernetes/role/api-server \
bound_service_account_names=api-server \
bound_service_account_namespaces=production \
policies=api-server-secrets \
ttl=1h
```
**OIDC** — Human operator access via SSO provider (Okta, Azure AD, Google Workspace).
```bash
vault write auth/oidc/role/engineering \
bound_audiences="vault" \
allowed_redirect_uris="https://vault.example.com/ui/vault/auth/oidc/oidc/callback" \
user_claim="email" \
oidc_scopes="openid,profile,email" \
policies="engineering-read" \
ttl=8h
```
### Secret Engines
| Engine | Use Case | TTL Strategy |
|--------|----------|-------------|
| KV v2 | Static secrets (API keys, config) | Versioned, manual rotation |
| Database | Dynamic DB credentials | 1h default, 24h max |
| PKI | TLS certificates | 90d leaf certs, 5y intermediate CA |
| Transit | Encryption-as-a-service | Key rotation every 90d |
| SSH | Signed SSH certificates | 30m for interactive, 8h for automation |
### Policy Design
Follow least-privilege with path-based granularity:
```hcl
# payment-service-read policy
path "secret/data/production/payment/*" {
capabilities = ["read"]
}
path "database/creds/payment-readonly" {
capabilities = ["read"]
}
# Deny access to admin paths explicitly
path "sys/*" {
capabilities = ["deny"]
}
```
**Policy naming convention:** `{service}-{access-level}` (e.g., `payment-service-read`, `api-gateway-admin`).
---
## Cloud Secret Store Integration
### Comparison Matrix
| Feature | AWS Secrets Manager | Azure Key Vault | GCP Secret Manager |
|---------|--------------------|-----------------|--------------------|
| Rotation | Built-in Lambda | Custom logic via Functions | Cloud Functions |
| Versioning | Automatic | Manual or automatic | Automatic |
| Encryption | AWS KMS (default or CMK) | HSM-backed | Google-managed or CMEK |
| Access control | IAM policies + resource policy | RBAC + Access Policies | IAM bindings |
| Cross-region | Replication supported | Geo-redundant by default | Replication supported |
| Audit | CloudTrail | Azure Monitor + Diagnostic Logs | Cloud Audit Logs |
| Pricing model | Per-secret + per-API call | Per-operation + per-key | Per-secret version + per-access |
### When to Use Which
- **AWS Secrets Manager**: RDS/Aurora credential rotation out of the box. Best when fully on AWS.
- **Azure Key Vault**: Certificate management strength. Required for Azure AD integrated workloads.
- **GCP Secret Manager**: Simplest API surface. Best for GKE-native workloads with Workload Identity.
- **HashiCorp Vault**: Multi-cloud, dynamic secrets, PKI, transit encryption. Best for complex or hybrid environments.
### SDK Access Patterns
**Principle:** Always fetch secrets at startup or via sidecar — never bake into images or config files.
```python
# AWS Secrets Manager pattern
import boto3, json
def get_secret(secret_name, region="us-east-1"):
client = boto3.client("secretsmanager", region_name=region)
response = client.get_secret_value(SecretId=secret_name)
return json.loads(response["SecretString"])
```
```python
# GCP Secret Manager pattern
from google.cloud import secretmanager
def get_secret(project_id, secret_id, version="latest"):
client = secretmanager.SecretManagerServiceClient()
name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
response = client.access_secret_version(request={"name": name})
return response.payload.data.decode("UTF-8")
```
```python
# Azure Key Vault pattern
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
def get_secret(vault_url, secret_name):
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)
return client.get_secret(secret_name).value
```
---
## Secret Rotation Workflows
### Rotation Strategy by Secret Type
| Secret Type | Rotation Frequency | Method | Downtime Risk |
|-------------|-------------------|--------|---------------|
| Database passwords | 30 days | Dual-account swap | Zero (A/B rotation) |
| API keys | 90 days | Generate new, deprecate old | Zero (overlap window) |
| TLS certificates | 60 days before expiry | ACME or Vault PKI | Zero (graceful reload) |
| SSH keys | 90 days | Vault-signed certificates | Zero (CA-based) |
| Service tokens | 24 hours | Dynamic generation | Zero (short-lived) |
| Encryption keys | 90 days | Key versioning (rewrap) | Zero (version coexistence) |
### Database Credential Rotation (Dual-Account)
1. Two database accounts exist: `app_user_a` and `app_user_b`
2. Application currently uses `app_user_a`
3. Rotation rotates `app_user_b` password, updates secret store
4. Application switches to `app_user_b` on next credential fetch
5. After grace period, `app_user_a` password is rotated
6. Cycle repeats
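The cycle above is effectively a two-state machine. A toy simulation of the A/B swap, not tied to any particular database or secret store API:

```python
# Toy model of dual-account (A/B) credential rotation. Account names and
# the password source are illustrative only.
import secrets

class DualAccountRotator:
    def __init__(self):
        self.passwords = {"app_user_a": "initial_a", "app_user_b": "initial_b"}
        self.active = "app_user_a"  # account the application currently uses

    def rotate(self):
        """Rotate the standby account's password, then switch to it."""
        standby = "app_user_b" if self.active == "app_user_a" else "app_user_a"
        self.passwords[standby] = secrets.token_urlsafe(24)  # new credential
        self.active = standby  # the app picks this up on its next fetch
        return self.active

rot = DualAccountRotator()
print(rot.rotate())  # app_user_b
print(rot.rotate())  # app_user_a
```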
### API Key Rotation (Overlap Window)
1. Generate new API key with provider
2. Store new key in secret store as `current`, move old to `previous`
3. Deploy applications — they read `current`
4. After all instances restarted (or TTL expired), revoke `previous`
5. Monitoring confirms zero usage of old key before revocation
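The `current`/`previous` slots from the steps above can be sketched as a small state holder (hypothetical key names; any real secret store would persist these slots server-side):

```python
# Toy model of the overlap window used during API key rotation.
class KeyStore:
    def __init__(self, initial_key):
        self.current = initial_key
        self.previous = None

    def rotate(self, new_key):
        """Promote a new key; the old one stays valid during the overlap."""
        self.previous = self.current
        self.current = new_key

    def revoke_previous(self):
        """Call only after monitoring confirms zero usage of the old key."""
        self.previous = None

store = KeyStore("key-v1")
store.rotate("key-v2")
print(store.current, store.previous)  # key-v2 key-v1
store.revoke_previous()
```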
---
## Dynamic Secrets
Dynamic secrets are generated on-demand with automatic expiration. Prefer dynamic secrets over static credentials wherever possible.
### Database Dynamic Credentials (Vault)
```bash
# Configure database engine
vault write database/config/postgres \
plugin_name=postgresql-database-plugin \
connection_url="postgresql://{{username}}:{{password}}@db.example.com:5432/app" \
allowed_roles="app-readonly,app-readwrite" \
username="vault_admin" \
password="<admin-password>"
# Create role with TTL
vault write database/roles/app-readonly \
db_name=postgres \
creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
default_ttl=1h \
max_ttl=24h
```
### Cloud IAM Dynamic Credentials
Vault can generate short-lived AWS IAM credentials, Azure service principal passwords, or GCP service account keys — eliminating long-lived cloud credentials entirely.
### SSH Certificate Authority
Replace SSH key distribution with a Vault-signed certificate model:
1. Vault acts as SSH CA
2. Users/machines request signed certificates with short TTL (30 min)
3. SSH servers trust the CA public key — no `authorized_keys` management
4. Certificates expire automatically — no revocation needed for normal operations
---
## Audit Logging
### What to Log
| Event | Priority | Retention |
|-------|----------|-----------|
| Secret read access | HIGH | 1 year minimum |
| Secret creation/update | HIGH | 1 year minimum |
| Auth method login | MEDIUM | 90 days |
| Policy changes | CRITICAL | 2 years (compliance) |
| Failed access attempts | CRITICAL | 1 year |
| Token creation/revocation | MEDIUM | 90 days |
| Seal/unseal operations | CRITICAL | Indefinite |
### Anomaly Detection Signals
- Secret accessed from new IP/CIDR range
- Access volume spike (>3x baseline for a path)
- Off-hours access for human auth methods
- Service accessing secrets outside its policy scope (denied requests)
- Multiple failed auth attempts from single source
- Token created with unusually long TTL
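The volume-spike signal above is straightforward to compute from parsed audit records. A hedged sketch (paths, baselines, and the 3x threshold are illustrative; `audit_log_analyzer.py` may implement this differently):

```python
# Hypothetical anomaly check: flag secret paths whose access count in the
# current window exceeds 3x a rolling baseline.
from collections import Counter

def spike_paths(events, baseline, factor=3.0):
    """events: iterable of secret paths accessed in the current window.
    baseline: dict mapping path -> average accesses per window."""
    counts = Counter(events)
    return sorted(
        path for path, n in counts.items()
        if n > factor * baseline.get(path, 0)
    )

baseline = {"secret/data/prod/api": 10, "secret/data/prod/db": 5}
window = ["secret/data/prod/api"] * 35 + ["secret/data/prod/db"] * 6
print(spike_paths(window, baseline))  # ['secret/data/prod/api']
```

Note that a path absent from the baseline has an effective baseline of zero, so any access to it is flagged, which is usually the desired behavior for never-before-seen paths.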
### Compliance Reporting
Generate periodic reports covering:
1. **Access inventory** — Which identities accessed which secrets, when
2. **Rotation compliance** — Secrets overdue for rotation
3. **Policy drift** — Policies modified since last review
4. **Orphaned secrets** — Secrets with no recent access (>90 days)
Use `audit_log_analyzer.py` to parse Vault or cloud audit logs for these signals.
---
## Emergency Procedures
### Secret Leak Response (Immediate)
**Time target: Contain within 15 minutes of detection.**
1. **Identify scope** — Which secret(s) leaked, where (repo, log, error message, third party)
2. **Revoke immediately** — Rotate the compromised credential at the source (provider API, Vault, cloud SM)
3. **Invalidate tokens** — Revoke all Vault tokens that accessed the leaked secret
4. **Audit blast radius** — Query audit logs for usage of the compromised secret in the exposure window
5. **Notify stakeholders** — Security team, affected service owners, compliance (if PII/regulated data)
6. **Post-mortem** — Document root cause, update controls to prevent recurrence
### Vault Seal Operations
**When to seal:** Active security incident affecting Vault infrastructure, suspected key compromise.
**Sealing** stops all Vault operations. Use only as last resort.
**Unseal procedure:**
1. Gather quorum of unseal key holders (Shamir threshold)
2. Or confirm auto-unseal KMS key is accessible
3. Unseal via `vault operator unseal` or restart with auto-unseal
4. Verify audit devices reconnected
5. Check active leases and token validity
See `references/emergency_procedures.md` for complete playbooks.
---
## CI/CD Integration
### Vault Agent Sidecar (Kubernetes)
Vault Agent runs alongside application pods, handles authentication and secret rendering:
```yaml
# Pod annotation for Vault Agent Injector
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "api-server"
vault.hashicorp.com/agent-inject-secret-db: "database/creds/app-readonly"
vault.hashicorp.com/agent-inject-template-db: |
{{- with secret "database/creds/app-readonly" -}}
postgresql://{{ .Data.username }}:{{ .Data.password }}@db:5432/app
{{- end }}
```
### External Secrets Operator (Kubernetes)
For teams preferring declarative GitOps over agent sidecars:
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: api-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: api-credentials
data:
- secretKey: api-key
remoteRef:
key: secret/data/production/api
property: key
```
### GitHub Actions OIDC
Eliminate long-lived secrets in CI by using OIDC federation:
```yaml
- name: Authenticate to Vault
uses: hashicorp/vault-action@v2
with:
url: https://vault.example.com
method: jwt
role: github-ci
jwtGithubAudience: https://vault.example.com
secrets: |
secret/data/ci/deploy api_key | DEPLOY_API_KEY ;
secret/data/ci/deploy db_password | DB_PASSWORD
```
---
## Anti-Patterns
| Anti-Pattern | Risk | Correct Approach |
|-------------|------|-----------------|
| Hardcoded secrets in source code | Leak via repo, logs, error output | Fetch from secret store at runtime |
| Long-lived static tokens (>30 days) | Stale credentials, no accountability | Dynamic secrets or short TTL + rotation |
| Shared service accounts | No audit trail per consumer | Per-service identity with unique credentials |
| No rotation policy | Compromised creds persist indefinitely | Automated rotation on schedule |
| Secrets in environment variables on CI | Visible in build logs, process table | Vault Agent or OIDC-based injection |
| Single unseal key holder | Bus factor of 1, recovery blocked | Shamir split (3-of-5) or auto-unseal |
| No audit device configured | Zero visibility into access | Dual audit devices (file + syslog) |
| Wildcard policies (`path "*"`) | Over-permissioned, violates least privilege | Explicit path-based policies per service |
---
## Tools
| Script | Purpose |
|--------|---------|
| `vault_config_generator.py` | Generate Vault policy and auth config from application requirements |
| `rotation_planner.py` | Create rotation schedule from a secret inventory file |
| `audit_log_analyzer.py` | Analyze audit logs for anomalies and compliance gaps |
---
## Cross-References
- **env-secrets-manager** — Local `.env` file hygiene, leak detection, drift awareness
- **senior-secops** — Security operations, incident response, threat modeling
- **ci-cd-pipeline-builder** — Pipeline design where secrets are consumed
- **docker-development** — Container secret injection patterns
- **helm-chart-builder** — Kubernetes secret management in Helm charts


@@ -0,0 +1,468 @@
---
title: "SQL Database Assistant - POWERFUL Tier Skill — Agent Skill for Codex & OpenClaw"
description: "Use when the user asks to write SQL queries, optimize database performance, generate migrations, explore database schemas, or work with ORMs like. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---
# SQL Database Assistant - POWERFUL Tier Skill
<div class="page-meta" markdown>
<span class="meta-badge">:material-rocket-launch: Engineering - POWERFUL</span>
<span class="meta-badge">:material-identifier: `sql-database-assistant`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering/sql-database-assistant/SKILL.md">Source</a></span>
</div>
<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install engineering-advanced-skills</code>
</div>
## Overview
The operational companion to database design. While **database-designer** focuses on schema architecture and **database-schema-designer** handles ERD modeling, this skill covers the day-to-day: writing queries, optimizing performance, generating migrations, and bridging the gap between application code and database engines.
### Core Capabilities
- **Natural Language to SQL** — translate requirements into correct, performant queries
- **Schema Exploration** — introspect live databases across PostgreSQL, MySQL, SQLite, SQL Server
- **Query Optimization** — EXPLAIN analysis, index recommendations, N+1 detection, rewrite patterns
- **Migration Generation** — up/down scripts, zero-downtime strategies, rollback plans
- **ORM Integration** — Prisma, Drizzle, TypeORM, SQLAlchemy patterns and escape hatches
- **Multi-Database Support** — dialect-aware SQL with compatibility guidance
### Tools
| Script | Purpose |
|--------|---------|
| `scripts/query_optimizer.py` | Static analysis of SQL queries for performance issues |
| `scripts/migration_generator.py` | Generate migration file templates from change descriptions |
| `scripts/schema_explorer.py` | Generate schema documentation from introspection queries |
---
## Natural Language to SQL
### Translation Patterns
When converting requirements to SQL, follow this sequence:
1. **Identify entities** — map nouns to tables
2. **Identify relationships** — map verbs to JOINs or subqueries
3. **Identify filters** — map adjectives/conditions to WHERE clauses
4. **Identify aggregations** — map "total", "average", "count" to GROUP BY
5. **Identify ordering** — map "top", "latest", "highest" to ORDER BY + LIMIT
### Common Query Templates
**Top-N per group (window function)**
```sql
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
FROM employees
) ranked WHERE rn <= 3;
```
**Running totals**
```sql
SELECT date, amount,
SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;
```
**Gap detection**
```sql
SELECT curr.id, curr.seq_num, prev.seq_num AS prev_seq
FROM records curr
LEFT JOIN records prev ON prev.seq_num = curr.seq_num - 1
WHERE prev.id IS NULL AND curr.seq_num > 1;
```
**UPSERT (PostgreSQL)**
```sql
INSERT INTO settings (key, value, updated_at)
VALUES ('theme', 'dark', NOW())
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value, updated_at = EXCLUDED.updated_at;
```
**UPSERT (MySQL)**
```sql
INSERT INTO settings (key_name, value, updated_at)
VALUES ('theme', 'dark', NOW())
ON DUPLICATE KEY UPDATE value = VALUES(value), updated_at = VALUES(updated_at);
```
> See `references/query_patterns.md` for JOINs, CTEs, window functions, JSON operations, and more.
---
## Schema Exploration
### Introspection Queries
**PostgreSQL — list tables and columns**
```sql
SELECT table_name, column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```
**PostgreSQL — foreign keys**
```sql
SELECT tc.table_name, kcu.column_name,
ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
```
**MySQL — table sizes**
```sql
SELECT table_name, table_rows,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_length DESC;
```
**SQLite — schema dump**
```sql
SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name;
```
**SQL Server — columns with types**
```sql
SELECT t.name AS table_name, c.name AS column_name,
ty.name AS data_type, c.max_length, c.is_nullable
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
ORDER BY t.name, c.column_id;
```
### Generating Documentation from Schema
Use `scripts/schema_explorer.py` to produce markdown or JSON documentation:
```bash
python scripts/schema_explorer.py --dialect postgres --tables all --format md
python scripts/schema_explorer.py --dialect mysql --tables users,orders --format json
```
---
## Query Optimization
### EXPLAIN Analysis Workflow
1. **Run EXPLAIN ANALYZE** (PostgreSQL) or **EXPLAIN FORMAT=JSON** (MySQL)
2. **Identify the costliest node** — Seq Scan on large tables, Nested Loop with high row estimates
3. **Check for missing indexes** — sequential scans on filtered columns
4. **Look for estimation errors** — planned vs actual rows divergence signals stale statistics
5. **Evaluate JOIN order** — ensure the smallest result set drives the join
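As a concrete starting point for step 1, the two commands look like this (the `orders` table and columns are illustrative):

```sql
-- PostgreSQL: plan plus actual timings and row counts
EXPLAIN ANALYZE
SELECT id, status, total
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 20;

-- MySQL: detailed plan as JSON
EXPLAIN FORMAT=JSON
SELECT id, status, total FROM orders WHERE customer_id = 42;
```

A Seq Scan here on a large `orders` table would point to a composite index on `(customer_id, created_at)`, which serves both the filter and the sort.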
### Index Recommendation Checklist
- Columns in WHERE clauses with high selectivity
- Columns in JOIN conditions (foreign keys)
- Columns in ORDER BY when combined with LIMIT
- Composite indexes matching multi-column WHERE predicates (most selective column first)
- Partial indexes for queries with constant filters (e.g., `WHERE status = 'active'`)
- Covering indexes to avoid table lookups for read-heavy queries
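The composite, partial, and covering items can be sketched in PostgreSQL syntax (table, column, and index names are hypothetical; `INCLUDE` requires PostgreSQL 11+):

```sql
-- Composite index: most selective column first
CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at);

-- Partial index: only rows matching the constant filter are stored
CREATE INDEX idx_orders_active_created ON orders (created_at) WHERE status = 'active';

-- Covering index: INCLUDE columns let reads skip the table lookup
CREATE INDEX idx_orders_status_covering ON orders (status) INCLUDE (id, total);
```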
### Query Rewriting Patterns
| Anti-Pattern | Rewrite |
|-------------|---------|
| `SELECT * FROM orders` | `SELECT id, status, total FROM orders` (explicit columns) |
| `WHERE YEAR(created_at) = 2025` | `WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'` (sargable) |
| Correlated subquery in SELECT | LEFT JOIN with aggregation |
| `NOT IN (SELECT ...)` with NULLs | `NOT EXISTS (SELECT 1 ...)` |
| `UNION` (dedup) when not needed | `UNION ALL` |
| `LIKE '%search%'` | Full-text search index (GIN/FULLTEXT) |
| `ORDER BY RAND()` | Application-side random sampling or `TABLESAMPLE` |
### N+1 Detection
**Symptoms:**
- Application loop that executes one query per parent row
- ORM lazy-loading related entities inside a loop
- Query log shows hundreds of identical SELECT patterns with different IDs
**Fixes:**
- Use eager loading (`include` in Prisma, `joinedload` in SQLAlchemy)
- Batch queries with `WHERE id IN (...)`
- Use DataLoader pattern for GraphQL resolvers
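A minimal sketch of the batching fix, using Python's built-in sqlite3 with illustrative tables; one parameterized `IN` query replaces N per-parent queries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
""")

user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]

# The N+1 anti-pattern would run one SELECT per user id inside a loop.
# The batched fix fetches all children in a single parameterized query.
placeholders = ",".join("?" * len(user_ids))
rows = conn.execute(
    f"SELECT user_id, title FROM posts WHERE user_id IN ({placeholders}) "
    "ORDER BY user_id, id",
    user_ids,
).fetchall()

posts_by_user = {}
for user_id, title in rows:
    posts_by_user.setdefault(user_id, []).append(title)

print(posts_by_user)  # {1: ['first', 'second'], 2: ['third']}
```

The same shape is what ORM eager loading generates under the hood: one query for parents, one batched query for children.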
### Static Analysis Tool
```bash
python scripts/query_optimizer.py --query "SELECT * FROM orders WHERE status = 'pending'" --dialect postgres
python scripts/query_optimizer.py --query queries.sql --dialect mysql --json
```
> See references/optimization_guide.md for EXPLAIN plan reading, index types, and connection pooling.
---
## Migration Generation
### Zero-Downtime Migration Patterns
**Adding a column (safe)**
```sql
-- Up
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Down
ALTER TABLE users DROP COLUMN phone;
```
**Renaming a column (expand-contract)**
```sql
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Step 2: Backfill
UPDATE users SET full_name = name;
-- Step 3: Deploy app reading both columns
-- Step 4: Deploy app writing only new column
-- Step 5: Drop old column
ALTER TABLE users DROP COLUMN name;
```
**Adding a NOT NULL column (safe sequence)**
```sql
-- Step 1: Add nullable
ALTER TABLE orders ADD COLUMN region VARCHAR(50);
-- Step 2: Backfill with default
UPDATE orders SET region = 'unknown' WHERE region IS NULL;
-- Step 3: Add constraint
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
ALTER TABLE orders ALTER COLUMN region SET DEFAULT 'unknown';
```
**Index creation (non-blocking, PostgreSQL)**
```sql
-- Builds the index without blocking writes; cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
```
### Data Backfill Strategies
- **Batch updates** — process in chunks of 1000-10000 rows to avoid lock contention
- **Background jobs** — run backfills asynchronously with progress tracking
- **Dual-write** — write to old and new columns during transition period
- **Validation queries** — verify row counts and data integrity after each batch
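The batch-update strategy can be sketched like this (sqlite3, a hypothetical `users` schema, and a chunk size shrunk for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 26)])

BATCH_SIZE = 10  # production values are typically 1000-10000

while True:
    # One bounded chunk per transaction keeps lock time short.
    with conn:
        cur = conn.execute(
            """UPDATE users SET full_name = name
               WHERE id IN (SELECT id FROM users
                            WHERE full_name IS NULL LIMIT ?)""",
            (BATCH_SIZE,),
        )
    if cur.rowcount == 0:
        break

# Validation query: confirm the backfill reached every row.
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL").fetchone()[0]
print(remaining)  # 0
```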
### Rollback Strategies
Every migration must have a reversible down script. For irreversible changes:
1. **Backup before execution** — `pg_dump` the affected tables
2. **Feature flags** — application can switch between old/new schema reads
3. **Shadow tables** — keep a copy of the original table during migration window
### Migration Generator Tool
```bash
python scripts/migration_generator.py --change "add email_verified boolean to users" --dialect postgres --format sql
python scripts/migration_generator.py --change "rename column name to full_name in customers" --dialect mysql --format alembic --json
```
---
## Multi-Database Support
### Dialect Differences
| Feature | PostgreSQL | MySQL | SQLite | SQL Server |
|---------|-----------|-------|--------|------------|
| UPSERT | `ON CONFLICT DO UPDATE` | `ON DUPLICATE KEY UPDATE` | `ON CONFLICT DO UPDATE` | `MERGE` |
| Boolean | Native `BOOLEAN` | `TINYINT(1)` | `INTEGER` | `BIT` |
| Auto-increment | `SERIAL` / `GENERATED` | `AUTO_INCREMENT` | `INTEGER PRIMARY KEY` | `IDENTITY` |
| JSON | `JSONB` (indexed) | `JSON` | Text (JSON1 extension) | `NVARCHAR(MAX)` |
| Array | Native `ARRAY` | Not supported | Not supported | Not supported |
| CTE (recursive) | Full support | 8.0+ | 3.8.3+ | Full support |
| Window functions | Full support | 8.0+ | 3.25.0+ | Full support |
| Full-text search | `tsvector` + GIN | `FULLTEXT` index | FTS5 extension | Full-text catalog |
| LIMIT/OFFSET | `LIMIT n OFFSET m` | `LIMIT n OFFSET m` | `LIMIT n OFFSET m` | `OFFSET m ROWS FETCH NEXT n ROWS ONLY` |
### Compatibility Tips
- **Always use parameterized queries** — prevents SQL injection across all dialects
- **Avoid dialect-specific functions in shared code** — wrap in adapter layer
- **Test migrations on target engine** — `information_schema` varies between engines
- **Use ISO date format** — `'YYYY-MM-DD'` works everywhere
- **Quote identifiers** — use double quotes (SQL standard) or backticks (MySQL)
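A minimal illustration of the first tip, using sqlite3's `?` placeholder style (other drivers use `%s` or named parameters, per their `paramstyle`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO accounts (email) VALUES (?)", ("a@example.com",))

# The value is bound as a parameter, never concatenated into the SQL string,
# so a malicious input stays inert data instead of becoming executable SQL.
user_input = "a@example.com' OR '1'='1"
attack = conn.execute(
    "SELECT id FROM accounts WHERE email = ?", (user_input,)
).fetchall()
legit = conn.execute(
    "SELECT id FROM accounts WHERE email = ?", ("a@example.com",)
).fetchall()
print(attack, legit)  # [] [(1,)]
```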
---
## ORM Patterns
### Prisma
**Schema definition**
```prisma
model User {
id Int @id @default(autoincrement())
email String @unique
name String?
posts Post[]
createdAt DateTime @default(now())
}
model Post {
id Int @id @default(autoincrement())
title String
author User @relation(fields: [authorId], references: [id])
authorId Int
}
```
**Migrations**: `npx prisma migrate dev --name add_user_email`
**Query API**: `prisma.user.findMany({ where: { email: { contains: '@' } }, include: { posts: true } })`
**Raw SQL escape hatch**: `prisma.$queryRaw\`SELECT * FROM users WHERE id = ${userId}\``
### Drizzle
**Schema-first definition**
```typescript
export const users = pgTable('users', {
id: serial('id').primaryKey(),
email: varchar('email', { length: 255 }).notNull().unique(),
name: text('name'),
createdAt: timestamp('created_at').defaultNow(),
});
```
**Query builder**: `db.select().from(users).where(eq(users.email, email))`
**Migrations**: `npx drizzle-kit generate:pg` then `npx drizzle-kit push:pg`
### TypeORM
**Entity decorators**
```typescript
@Entity()
export class User {
@PrimaryGeneratedColumn()
id: number;
@Column({ unique: true })
email: string;
@OneToMany(() => Post, post => post.author)
posts: Post[];
}
```
**Repository pattern**: `userRepo.find({ where: { email }, relations: ['posts'] })`
**Migrations**: `npx typeorm migration:generate -n AddUserEmail`
### SQLAlchemy
**Declarative models**
```python
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
email = Column(String(255), unique=True, nullable=False)
name = Column(String(255))
posts = relationship('Post', back_populates='author')
```
**Session management**: Always use `with Session() as session:` context manager
**Alembic migrations**: `alembic revision --autogenerate -m "add user email"`
> See references/orm_patterns.md for side-by-side comparisons and migration workflows per ORM.
---
## Data Integrity
### Constraint Strategy
- **Primary keys** — every table must have one; prefer surrogate keys (serial/UUID)
- **Foreign keys** — enforce referential integrity; define ON DELETE behavior explicitly
- **UNIQUE constraints** — for business-level uniqueness (email, slug, API key)
- **CHECK constraints** — validate ranges, enums, and business rules at the DB level
- **NOT NULL** — default to NOT NULL; make nullable only when genuinely optional
### Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use Case |
|-------|-----------|-------------------|-------------|----------|
| READ UNCOMMITTED | Yes | Yes | Yes | Never recommended |
| READ COMMITTED | No | Yes | Yes | Default for PostgreSQL, general OLTP |
| REPEATABLE READ | No | No | Yes (InnoDB: No) | Financial calculations |
| SERIALIZABLE | No | No | No | Critical consistency (billing, inventory) |
### Deadlock Prevention
1. **Consistent lock ordering** — always acquire locks in the same table/row order
2. **Short transactions** — minimize time between first lock and commit
3. **Advisory locks** — use `pg_advisory_lock()` for application-level coordination
4. **Retry logic** — catch deadlock errors and retry with exponential backoff
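Step 4 can be sketched as a small retry helper; the exception type here is a stand-in, since real code would catch the driver's deadlock or serialization error:

```python
import random
import time

class DeadlockDetected(Exception):
    """Stand-in for a driver-specific deadlock/serialization error."""

def with_deadlock_retry(fn, max_attempts=4, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return fn()
        except DeadlockDetected:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter spreads out competing retries.
            time.sleep(base_delay * (2 ** attempt) * random.random())

attempts = {"n": 0}

def flaky_transaction():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockDetected()
    return "committed"

result = with_deadlock_retry(flaky_transaction)
print(result, attempts["n"])  # committed 3
```

The retried function must be the whole transaction, not a single statement, because the database rolls back the entire transaction on deadlock.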
---
## Backup & Restore
### PostgreSQL
```bash
# Full backup
pg_dump -Fc --no-owner dbname > backup.dump
# Restore
pg_restore -d dbname --clean --no-owner backup.dump
# Point-in-time recovery: configure WAL archiving + restore_command
```
### MySQL
```bash
# Full backup
mysqldump --single-transaction --routines --triggers dbname > backup.sql
# Restore
mysql dbname < backup.sql
# Binary log for PITR: mysqlbinlog --start-datetime="2025-01-01 00:00:00" binlog.000001
```
### SQLite
```bash
# Backup (safe with concurrent reads)
sqlite3 dbname ".backup backup.db"
```
### Backup Best Practices
- **Automate** — cron or systemd timer, never manual-only
- **Test restores** — untested backups are not backups
- **Offsite copies** — S3, GCS, or separate region
- **Retention policy** — daily for 7 days, weekly for 4 weeks, monthly for 12 months
- **Monitor backup size and duration** — sudden changes signal issues
---
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| `SELECT *` | Transfers unnecessary data, breaks on schema changes | Explicit column list |
| Missing indexes on FK columns | Slow JOINs and cascading deletes | Add indexes on all foreign keys |
| N+1 queries | 1 + N round trips to database | Eager loading or batch queries |
| Implicit type coercion | `WHERE id = '123'` prevents index use | Match types in predicates |
| No connection pooling | Exhausts connections under load | PgBouncer, ProxySQL, or ORM pool |
| Unbounded queries | No LIMIT risks returning millions of rows | Always paginate |
| Storing money as FLOAT | Rounding errors | Use `DECIMAL(19,4)` or integer cents |
| God tables | One table with 50+ columns | Normalize or use vertical partitioning |
| Soft deletes everywhere | Complicates every query with `WHERE deleted_at IS NULL` | Archive tables or event sourcing |
| Raw string concatenation | SQL injection | Parameterized queries always |
---
## Cross-References
| Skill | Relationship |
|-------|-------------|
| **database-designer** | Schema architecture, normalization analysis, ERD generation |
| **database-schema-designer** | Visual ERD modeling, relationship mapping |
| **migration-architect** | Complex multi-step migration orchestration |
| **api-design-reviewer** | Ensuring API endpoints align with query patterns |
| **observability-platform** | Query performance monitoring, slow query alerts |


@@ -1,13 +1,13 @@
---
title: "Regulatory & Quality Skills — Agent Skills & Codex Plugins"
description: "13 regulatory & quality skills — regulatory and quality management agent skill for ISO 13485, MDR, FDA, and GDPR compliance. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
description: "14 regulatory & quality skills — regulatory and quality management agent skill for ISO 13485, MDR, FDA, and GDPR compliance. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
---
<div class="domain-header" markdown>
# :material-shield-check-outline: Regulatory & Quality
<p class="domain-count">13 skills in this domain</p>
<p class="domain-count">14 skills in this domain</p>
</div>
@@ -95,4 +95,10 @@ description: "13 regulatory & quality skills — regulatory and quality manageme
ISO 14971:2019 risk management implementation throughout the medical device lifecycle.
- **[SOC 2 Compliance](soc2-compliance.md)**
SOC 2 Type I and Type II compliance preparation for SaaS companies. Covers Trust Service Criteria mapping, control ma...
---
</div>


@@ -0,0 +1,428 @@
---
title: "SOC 2 Compliance — Agent Skill for Compliance"
description: "Use when the user asks to prepare for SOC 2 audits, map Trust Service Criteria, build control matrices, collect audit evidence, perform gap analysis. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---
# SOC 2 Compliance
<div class="page-meta" markdown>
<span class="meta-badge">:material-shield-check-outline: Regulatory & Quality</span>
<span class="meta-badge">:material-identifier: `soc2-compliance`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/soc2-compliance/SKILL.md">Source</a></span>
</div>
<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install ra-qm-skills</code>
</div>
SOC 2 Type I and Type II compliance preparation for SaaS companies. Covers Trust Service Criteria mapping, control matrix generation, evidence collection, gap analysis, and audit readiness assessment.
## Table of Contents
- [Overview](#overview)
- [Trust Service Criteria](#trust-service-criteria)
- [Control Matrix Generation](#control-matrix-generation)
- [Gap Analysis Workflow](#gap-analysis-workflow)
- [Evidence Collection](#evidence-collection)
- [Audit Readiness Checklist](#audit-readiness-checklist)
- [Vendor Management](#vendor-management)
- [Continuous Compliance](#continuous-compliance)
- [Anti-Patterns](#anti-patterns)
- [Tools](#tools)
- [References](#references)
- [Cross-References](#cross-references)
---
## Overview
### What Is SOC 2?
SOC 2 (System and Organization Controls 2) is an auditing framework developed by the AICPA that evaluates how a service organization manages customer data. It applies to any technology company that stores, processes, or transmits customer information — primarily SaaS, cloud infrastructure, and managed service providers.
### Type I vs Type II
| Aspect | Type I | Type II |
|--------|--------|---------|
| **Scope** | Design of controls at a point in time | Design AND operating effectiveness over a period |
| **Duration** | Snapshot (single date) | Observation window (3-12 months, typically 6) |
| **Evidence** | Control descriptions, policies | Control descriptions + operating evidence (logs, tickets, screenshots) |
| **Cost** | $20K-$50K (audit fees) | $30K-$100K+ (audit fees) |
| **Timeline** | 1-2 months (audit phase) | 6-12 months (observation + audit) |
| **Best For** | First-time compliance, rapid market need | Mature organizations, enterprise customers |
### Who Needs SOC 2?
- **SaaS companies** selling to enterprise customers
- **Cloud infrastructure providers** handling customer workloads
- **Data processors** managing PII, PHI, or financial data
- **Managed service providers** with access to client systems
- **Any vendor** whose customers require third-party assurance
### Typical Journey
```
Gap Assessment → Remediation → Type I Audit → Observation Period → Type II Audit → Annual Renewal
(4-8 wk) (8-16 wk) (4-6 wk) (6-12 mo) (4-6 wk) (ongoing)
```
---
## Trust Service Criteria
SOC 2 is organized around five Trust Service Criteria (TSC) categories. **Security** is required for every SOC 2 report; the remaining four are optional and selected based on business need.
### Security (Common Criteria CC1-CC9) — Required
The foundation of every SOC 2 report. Maps to COSO 2013 principles.
| Criteria | Domain | Key Controls |
|----------|--------|-------------|
| **CC1** | Control Environment | Integrity/ethics, board oversight, org structure, competence, accountability |
| **CC2** | Communication & Information | Internal/external communication, information quality |
| **CC3** | Risk Assessment | Risk identification, fraud risk, change impact analysis |
| **CC4** | Monitoring Activities | Ongoing monitoring, deficiency evaluation, corrective actions |
| **CC5** | Control Activities | Policies/procedures, technology controls, deployment through policies |
| **CC6** | Logical & Physical Access | Access provisioning, authentication, encryption, physical restrictions |
| **CC7** | System Operations | Vulnerability management, anomaly detection, incident response |
| **CC8** | Change Management | Change authorization, testing, approval, emergency changes |
| **CC9** | Risk Mitigation | Vendor/business partner risk management |
### Availability (A1) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **A1.1** | Capacity management | Infrastructure scaling, resource monitoring, capacity planning |
| **A1.2** | Recovery operations | Backup procedures, disaster recovery, BCP testing |
| **A1.3** | Recovery testing | DR drills, failover testing, RTO/RPO validation |
**Select when:** Customers depend on your uptime; you have SLAs; downtime causes direct business impact.
### Confidentiality (C1) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **C1.1** | Identification | Data classification policy, confidential data inventory |
| **C1.2** | Protection | Encryption at rest and in transit, DLP, access restrictions |
| **C1.3** | Disposal | Secure deletion procedures, media sanitization, retention enforcement |
**Select when:** You handle trade secrets, proprietary data, or contractually confidential information.
### Processing Integrity (PI1) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **PI1.1** | Accuracy | Input validation, processing checks, output verification |
| **PI1.2** | Completeness | Transaction monitoring, reconciliation, error handling |
| **PI1.3** | Timeliness | SLA monitoring, processing delay alerts, batch job monitoring |
| **PI1.4** | Authorization | Processing authorization controls, segregation of duties |
**Select when:** Data accuracy is critical (financial processing, healthcare records, analytics platforms).
### Privacy (P1-P8) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **P1** | Notice | Privacy policy, data collection notice, purpose limitation |
| **P2** | Choice & Consent | Opt-in/opt-out, consent management, preference tracking |
| **P3** | Collection | Minimal collection, lawful basis, purpose specification |
| **P4** | Use, Retention, Disposal | Purpose limitation, retention schedules, secure disposal |
| **P5** | Access | Data subject access requests, correction rights |
| **P6** | Disclosure & Notification | Third-party sharing, breach notification |
| **P7** | Quality | Data accuracy verification, correction mechanisms |
| **P8** | Monitoring & Enforcement | Privacy program monitoring, complaint handling |
**Select when:** You process PII and customers expect privacy assurance (complements GDPR compliance).
---
## Control Matrix Generation
A control matrix maps each TSC criterion to specific controls, owners, evidence, and testing procedures.
### Matrix Structure
| Field | Description |
|-------|-------------|
| **Control ID** | Unique identifier (e.g., SEC-001, AVL-003) |
| **TSC Mapping** | Which criteria the control addresses (e.g., CC6.1, A1.2) |
| **Control Description** | What the control does |
| **Control Type** | Preventive, Detective, or Corrective |
| **Owner** | Responsible person/team |
| **Frequency** | Continuous, Daily, Weekly, Monthly, Quarterly, Annual |
| **Evidence Type** | Screenshot, Log, Policy, Config, Ticket |
| **Testing Procedure** | How the auditor verifies the control |
### Control Naming Convention
```
{CATEGORY}-{NUMBER}
SEC-001 through SEC-NNN → Security
AVL-001 through AVL-NNN → Availability
CON-001 through CON-NNN → Confidentiality
PRI-001 through PRI-NNN → Processing Integrity
PRV-001 through PRV-NNN → Privacy
```
### Workflow
1. Select applicable TSC categories based on business needs
2. Run `control_matrix_builder.py` to generate the baseline matrix
3. Customize controls to match your actual environment
4. Assign owners and evidence requirements
5. Validate coverage — every selected TSC criterion must have at least one control
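A matrix row from the structure above can be kept as plain data, which also makes the coverage check in step 5 trivial (the control content here is illustrative, not a baseline):

```python
import csv
import io

controls = [
    {
        "control_id": "SEC-001",
        "tsc_mapping": "CC6.1",
        "description": "Production access is provisioned via ticketed approval",
        "control_type": "Preventive",
        "owner": "Platform Team",
        "frequency": "Continuous",
        "evidence_type": "Ticket",
        "testing_procedure": "Sample provisioning tickets; verify approvals",
    },
]

# Coverage check: every selected criterion needs at least one control.
selected_criteria = {"CC6.1", "CC8.1"}
covered = {c["tsc_mapping"] for c in controls}
uncovered = sorted(selected_criteria - covered)
print(uncovered)  # ['CC8.1'] is a gap to remediate

# Export the matrix as CSV for auditors or spreadsheet review.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(controls[0]))
writer.writeheader()
writer.writerows(controls)
```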
---
## Gap Analysis Workflow
### Phase 1: Current State Assessment
1. **Document existing controls** — inventory all security policies, procedures, and technical controls
2. **Map to TSC** — align existing controls to Trust Service Criteria
3. **Collect evidence samples** — gather proof that controls exist and operate
4. **Interview control owners** — verify understanding and execution
### Phase 2: Gap Identification
Run `gap_analyzer.py` against your current controls to identify:
- **Missing controls** — TSC criteria with no corresponding control
- **Partially implemented** — Control exists but lacks evidence or consistency
- **Design gaps** — Control designed but does not adequately address the criteria
- **Operating gaps** (Type II only) — Control designed correctly but not operating effectively
### Phase 3: Remediation Planning
For each gap, define:
| Field | Description |
|-------|-------------|
| Gap ID | Reference identifier |
| TSC Criteria | Affected criteria |
| Gap Description | What is missing or insufficient |
| Remediation Action | Specific steps to close the gap |
| Owner | Person responsible for remediation |
| Priority | Critical / High / Medium / Low |
| Target Date | Completion deadline |
| Dependencies | Other gaps or projects that must complete first |
### Phase 4: Timeline Planning
| Priority | Target Remediation |
|----------|--------------------|
| Critical | 2-4 weeks |
| High | 4-8 weeks |
| Medium | 8-12 weeks |
| Low | 12-16 weeks |
---
## Evidence Collection
### Evidence Types by Control Category
| Control Area | Primary Evidence | Secondary Evidence |
|--------------|-----------------|-------------------|
| Access Management | User access reviews, provisioning tickets | Role matrix, access logs |
| Change Management | Change tickets, approval records | Deployment logs, test results |
| Incident Response | Incident tickets, postmortems | Runbooks, escalation records |
| Vulnerability Management | Scan reports, patch records | Remediation timelines |
| Encryption | Configuration screenshots, certificate inventory | Key rotation logs |
| Backup & Recovery | Backup logs, DR test results | Recovery time measurements |
| Monitoring | Alert configurations, dashboard screenshots | On-call schedules, escalation records |
| Policy Management | Signed policies, version history | Training completion records |
| Vendor Management | Vendor assessments, SOC 2 reports | Contract reviews, risk registers |
### Automation Opportunities
| Area | Automation Approach |
|------|-------------------|
| Access reviews | Integrate IAM with ticketing (automatic quarterly review triggers) |
| Configuration evidence | Infrastructure-as-code snapshots, compliance-as-code tools |
| Vulnerability scans | Scheduled scanning with auto-generated reports |
| Change management | Git-based audit trail (commits, PRs, approvals) |
| Uptime monitoring | Automated SLA dashboards with historical data |
| Backup verification | Automated restore tests with success/failure logging |
### Continuous Monitoring
Move from point-in-time evidence collection to continuous compliance:
1. **Automated evidence gathering** — scripts that pull evidence on schedule
2. **Control dashboards** — real-time visibility into control status
3. **Alert-based monitoring** — notify when a control drifts out of compliance
4. **Evidence repository** — centralized, timestamped evidence storage
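Alert-based monitoring (step 3) reduces to comparing each control's newest evidence timestamp against its collection frequency; the data and thresholds below are illustrative:

```python
from datetime import datetime, timedelta

FREQUENCY_DAYS = {"Daily": 1, "Weekly": 7, "Monthly": 31, "Quarterly": 92}

def stale_controls(evidence, now):
    """Return control IDs whose newest evidence is older than its frequency allows."""
    stale = []
    for control_id, (frequency, last_collected) in evidence.items():
        if now - last_collected > timedelta(days=FREQUENCY_DAYS[frequency]):
            stale.append(control_id)
    return sorted(stale)

now = datetime(2026, 3, 1)
evidence = {
    "SEC-001": ("Weekly", datetime(2026, 2, 27)),   # 2 days old: fresh
    "SEC-002": ("Monthly", datetime(2026, 1, 10)),  # 50 days old: stale
}
print(stale_controls(evidence, now))  # ['SEC-002']
```

A scheduled job running this check against the evidence repository turns drift into an alert instead of an audit finding.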
---
## Audit Readiness Checklist
### Pre-Audit Preparation (4-6 Weeks Before)
- [ ] All controls documented with descriptions, owners, and frequencies
- [ ] Evidence collected for the entire observation period (Type II)
- [ ] Control matrix reviewed and gaps remediated
- [ ] Policies signed and distributed within the last 12 months
- [ ] Access reviews completed within the required frequency
- [ ] Vulnerability scans current (no critical/high unpatched > SLA)
- [ ] Incident response plan tested within the last 12 months
- [ ] Vendor risk assessments current for all subservice organizations
- [ ] DR/BCP tested and documented within the last 12 months
- [ ] Employee security training completed for all staff
### Readiness Scoring
| Score | Rating | Meaning |
|-------|--------|---------|
| 90-100% | Audit Ready | Proceed with confidence |
| 75-89% | Minor Gaps | Address before scheduling audit |
| 50-74% | Significant Gaps | Remediation required |
| < 50% | Not Ready | Major program build-out needed |
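The scoring bands map directly to a small helper (thresholds taken from the table above):

```python
def readiness_rating(score_pct: float) -> str:
    """Map an audit-readiness score (0-100) to the rating bands above."""
    if score_pct >= 90:
        return "Audit Ready"
    if score_pct >= 75:
        return "Minor Gaps"
    if score_pct >= 50:
        return "Significant Gaps"
    return "Not Ready"

# A simple score: fraction of checklist items completed.
completed, total = 8, 10
score = 100 * completed / total
print(score, readiness_rating(score))  # 80.0 Minor Gaps
```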
### Common Audit Findings
| Finding | Root Cause | Prevention |
|---------|-----------|-----------|
| Incomplete access reviews | Manual process, no reminders | Automate quarterly review triggers |
| Missing change approvals | Emergency changes bypass process | Define emergency change procedure with post-hoc approval |
| Stale vulnerability scans | Scanner misconfigured | Automated weekly scans with alerting |
| Policy not acknowledged | No tracking mechanism | Annual e-signature workflow |
| Missing vendor assessments | No vendor inventory | Maintain vendor register with review schedule |
---
## Vendor Management
### Third-Party Risk Assessment
Every vendor that accesses, stores, or processes customer data must be assessed:
1. **Vendor inventory** — maintain a register of all service providers
2. **Risk classification** — categorize vendors by data access level
3. **Due diligence** — collect SOC 2 reports, security questionnaires, certifications
4. **Contractual protections** — ensure DPAs, security requirements, breach notification clauses
5. **Ongoing monitoring** — annual reassessment, continuous news monitoring
### Vendor Risk Tiers
| Tier | Data Access | Assessment Frequency | Requirements |
|------|-------------|---------------------|-------------|
| Critical | Processes/stores customer data | Annual + continuous monitoring | SOC 2 Type II, penetration test, security review |
| High | Accesses customer environment | Annual | SOC 2 Type II or equivalent, questionnaire |
| Medium | Indirect access, support tools | Annual questionnaire | Security certifications, questionnaire |
| Low | No data access | Biennial questionnaire | Basic security questionnaire |
### Subservice Organizations
When your SOC 2 report relies on controls at a subservice organization (e.g., AWS, GCP, Azure):
- **Inclusive method** — your report covers the subservice org's controls (requires their cooperation)
- **Carve-out method** — your report excludes their controls but references their SOC 2 report
- Most companies use **carve-out** and include complementary user entity controls (CUECs)
---
## Continuous Compliance
### From Point-in-Time to Continuous
| Aspect | Point-in-Time | Continuous |
|--------|---------------|-----------|
| Evidence collection | Manual, before audit | Automated, ongoing |
| Control monitoring | Periodic review | Real-time dashboards |
| Drift detection | Found during audit | Alert-based, immediate |
| Remediation | Reactive | Proactive |
| Audit preparation | 4-8 week scramble | Always ready |
### Implementation Steps
1. **Automate evidence gathering** — cron jobs, API integrations, IaC snapshots
2. **Build control dashboards** — aggregate control status into a single view
3. **Configure drift alerts** — notify when controls fall out of compliance
4. **Establish review cadence** — weekly control owner check-ins, monthly steering
5. **Maintain evidence repository** — centralized, timestamped, auditor-accessible
### Annual Re-Assessment Cycle
| Quarter | Activities |
|---------|-----------|
| Q1 | Annual risk assessment, policy refresh, vendor reassessment launch |
| Q2 | Internal control testing, remediation of findings |
| Q3 | Pre-audit readiness review, evidence completeness check |
| Q4 | External audit, management assertion, report distribution |
---
## Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|--------------|-------------|----------------|
| Point-in-time compliance | Controls degrade between audits; gaps found during audit | Implement continuous monitoring and automated evidence |
| Manual evidence collection | Time-consuming, inconsistent, error-prone | Automate with scripts, IaC, and compliance platforms |
| Missing vendor assessments | Auditors flag incomplete vendor due diligence | Maintain vendor register with risk-tiered assessment schedule |
| Copy-paste policies | Generic policies don't match actual operations | Tailor policies to your actual environment and technology stack |
| Security theater | Controls exist on paper but aren't followed | Verify operating effectiveness; build controls into workflows |
| Skipping Type I | Jumping to Type II without foundational readiness | Start with Type I to validate control design before observation |
| Over-scoping TSC | Including all 5 categories when only Security is needed | Select categories based on actual customer/business requirements |
| Treating audit as a project | Compliance degrades after the report is issued | Build compliance into daily operations and engineering culture |
---
## Tools
### Control Matrix Builder
Generates a SOC 2 control matrix from selected TSC categories.
```bash
# Generate full security matrix in markdown
python scripts/control_matrix_builder.py --categories security --format md
# Generate matrix for multiple categories as JSON
python scripts/control_matrix_builder.py --categories security,availability,confidentiality --format json
# All categories, CSV output
python scripts/control_matrix_builder.py --categories security,availability,confidentiality,processing-integrity,privacy --format csv
```
### Evidence Tracker
Tracks evidence collection status per control.
```bash
# Check evidence status from a control matrix
python scripts/evidence_tracker.py --matrix controls.json --status
# JSON output for integration
python scripts/evidence_tracker.py --matrix controls.json --status --json
```
### Gap Analyzer
Analyzes current controls against SOC 2 requirements and identifies gaps.
```bash
# Type I gap analysis
python scripts/gap_analyzer.py --controls current_controls.json --type type1
# Type II gap analysis (includes operating effectiveness)
python scripts/gap_analyzer.py --controls current_controls.json --type type2 --json
```
---
## References
- [Trust Service Criteria Reference](https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/soc2-compliance/references/trust_service_criteria.md) — All 5 TSC categories with sub-criteria, control objectives, and evidence examples
- [Evidence Collection Guide](https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/soc2-compliance/references/evidence_collection_guide.md) — Evidence types per control, automation tools, documentation requirements
- [Type I vs Type II Comparison](https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/soc2-compliance/references/type1_vs_type2.md) — Detailed comparison, timeline, cost analysis, and upgrade path
---
## Cross-References
- **[gdpr-dsgvo-expert](https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/gdpr-dsgvo-expert/SKILL.md)** — SOC 2 Privacy criteria overlaps significantly with GDPR requirements; use together when processing EU personal data
- **[information-security-manager-iso27001](https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/information-security-manager-iso27001/SKILL.md)** — ISO 27001 Annex A controls map closely to SOC 2 Security criteria; organizations pursuing both can share evidence
- **[isms-audit-expert](https://github.com/alirezarezvani/claude-skills/tree/main/ra-qm-team/isms-audit-expert/SKILL.md)** — Audit methodology and finding management patterns transfer directly to SOC 2 audit preparation
---
name: "gcp-cloud-architect"
description: "Design GCP architectures for startups and enterprises. Use when asked to design Google Cloud infrastructure, deploy to GKE or Cloud Run, configure BigQuery pipelines, optimize GCP costs, or migrate to GCP. Covers Cloud Run, GKE, Cloud Functions, Cloud SQL, BigQuery, and cost optimization."
---
# GCP Cloud Architect
Design scalable, cost-effective Google Cloud architectures for startups and enterprises with infrastructure-as-code templates.
---
## Workflow
### Step 1: Gather Requirements
Collect application specifications:
```
- Application type (web app, mobile backend, data pipeline, SaaS)
- Expected users and requests per second
- Budget constraints (monthly spend limit)
- Team size and GCP experience level
- Compliance requirements (GDPR, HIPAA, SOC 2)
- Availability requirements (SLA, RPO/RTO)
```
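Before running the designer, it can help to sanity-check the requirements file. A minimal sketch, assuming the field names follow the JSON format shown under Input Requirements (`validate_requirements` is an illustrative helper, not part of the shipped scripts):

```python
# Sketch: validate a requirements dict before feeding it to architecture_designer.py.
# Field names mirror the Input Requirements JSON format; the helper itself is hypothetical.
REQUIRED_FIELDS = {
    "application_type": str,
    "expected_users": int,
    "budget_monthly_usd": (int, float),
}

def validate_requirements(req: dict) -> list[str]:
    """Return a list of problems found in a requirements dict (empty if OK)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in req:
            problems.append(f"missing field: {field}")
        elif not isinstance(req[field], expected):
            problems.append(f"wrong type for {field}")
    return problems

req = {"application_type": "saas_platform", "expected_users": 10000,
       "budget_monthly_usd": 500}
print(validate_requirements(req))  # → []
```

Running this before Step 2 catches a malformed `requirements.json` early instead of surfacing a cryptic error from the designer.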
### Step 2: Design Architecture
Run the architecture designer to get pattern recommendations:
```bash
python scripts/architecture_designer.py --input requirements.json
```
**Example output:**
```json
{
"recommended_pattern": "serverless_web",
"service_stack": ["Cloud Storage", "Cloud CDN", "Cloud Run", "Firestore", "Identity Platform"],
"estimated_monthly_cost_usd": 30,
"pros": ["Low ops overhead", "Pay-per-use", "Auto-scaling", "No cold starts on Cloud Run min instances"],
"cons": ["Vendor lock-in", "Regional limitations", "Eventual consistency with Firestore"]
}
```
Select from recommended patterns:
- **Serverless Web**: Cloud Storage + Cloud CDN + Cloud Run + Firestore
- **Microservices on GKE**: GKE Autopilot + Cloud SQL + Memorystore + Cloud Pub/Sub
- **Serverless Data Pipeline**: Pub/Sub + Dataflow + BigQuery + Looker
- **ML Platform**: Vertex AI + Cloud Storage + BigQuery + Cloud Functions
See `references/architecture_patterns.md` for detailed pattern specifications.
**Validation checkpoint:** Confirm the recommended pattern matches the team's operational maturity and compliance requirements before proceeding to Step 3.
### Step 3: Estimate Cost
Analyze estimated costs and optimization opportunities:
```bash
python scripts/cost_optimizer.py --resources current_setup.json --monthly-spend 2000
```
**Example output:**
```json
{
"current_monthly_usd": 2000,
"recommendations": [
{ "action": "Right-size Cloud SQL db-custom-4-16384 to db-custom-2-8192", "savings_usd": 380, "priority": "high" },
{ "action": "Purchase 1-yr committed use discount for GKE nodes", "savings_usd": 290, "priority": "high" },
{ "action": "Move Cloud Storage objects >90 days to Nearline", "savings_usd": 75, "priority": "medium" }
],
"total_potential_savings_usd": 745
}
```
Output includes:
- Monthly cost breakdown by service
- Right-sizing recommendations
- Committed use discount opportunities
- Sustained use discount analysis
- Potential monthly savings
Use the [GCP Pricing Calculator](https://cloud.google.com/products/calculator) for detailed estimates.
### Step 4: Generate IaC
Create infrastructure-as-code for the selected pattern:
```bash
python scripts/deployment_manager.py --app-name my-app --pattern serverless_web --region us-central1
```
**Example Terraform HCL output (Cloud Run + Firestore):**
```hcl
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
provider "google" {
project = var.project_id
region = var.region
}
variable "project_id" {
  description = "GCP project ID"
  type        = string
}
variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}
variable "app_name" {
  description = "Application name used in resource names"
  type        = string
}
variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  type        = string
  default     = "prod"
}
resource "google_cloud_run_v2_service" "api" {
name = "${var.environment}-${var.app_name}-api"
location = var.region
template {
containers {
image = "gcr.io/${var.project_id}/${var.app_name}:latest"
resources {
limits = {
cpu = "1000m"
memory = "512Mi"
}
}
env {
name = "FIRESTORE_PROJECT"
value = var.project_id
}
}
scaling {
min_instance_count = 0
max_instance_count = 10
}
}
}
resource "google_firestore_database" "default" {
project = var.project_id
name = "(default)"
location_id = var.region
type = "FIRESTORE_NATIVE"
}
```
**Example gcloud CLI deployment:**
```bash
# Deploy Cloud Run service
gcloud run deploy my-app-api \
--image gcr.io/$PROJECT_ID/my-app:latest \
--region us-central1 \
--platform managed \
--allow-unauthenticated \
--memory 512Mi \
--cpu 1 \
--min-instances 0 \
--max-instances 10
# Create Firestore database
gcloud firestore databases create --location=us-central1
```
> Full templates including Cloud CDN, Identity Platform, IAM, and Cloud Monitoring are generated by `deployment_manager.py` and also available in `references/architecture_patterns.md`.
### Step 5: Configure CI/CD
Set up automated deployment with Cloud Build or GitHub Actions:
```yaml
# cloudbuild.yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA', '.']
- name: 'gcr.io/cloud-builders/docker'
args: ['push', 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA']
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: gcloud
args:
- 'run'
- 'deploy'
- 'my-app-api'
- '--image=gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA'
- '--region=us-central1'
- '--platform=managed'
images:
- 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA'
```
```bash
# Connect repo and create trigger
gcloud builds triggers create github \
--repo-name=my-app \
--repo-owner=my-org \
--branch-pattern="^main$" \
--build-config=cloudbuild.yaml
```
### Step 6: Security Review
Verify security configuration:
```bash
# Review IAM bindings
gcloud projects get-iam-policy $PROJECT_ID --format=json
# Check service account permissions
gcloud iam service-accounts list --project=$PROJECT_ID
# Verify VPC Service Controls (if applicable)
gcloud access-context-manager perimeters list --policy=$POLICY_ID
```
**Security checklist:**
- IAM roles follow least privilege (prefer predefined roles over basic roles)
- Service accounts use Workload Identity for GKE
- VPC Service Controls configured for sensitive APIs
- Cloud KMS encryption keys for customer-managed encryption
- Cloud Audit Logs enabled for all admin activity
- Organization policies restrict public access
- Secret Manager used for all credentials
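The first checklist item (basic roles replaced by predefined roles) can be spot-checked programmatically against the JSON produced by `gcloud projects get-iam-policy $PROJECT_ID --format=json`. A sketch, assuming the standard IAM policy shape with a top-level `bindings` array:

```python
# Sketch: flag IAM bindings that use overly broad basic roles.
# Input is the parsed JSON from `gcloud projects get-iam-policy --format=json`.
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}

def find_basic_role_bindings(policy: dict) -> list[tuple[str, str]]:
    """Return (role, member) pairs that violate least privilege."""
    violations = []
    for binding in policy.get("bindings", []):
        if binding["role"] in BASIC_ROLES:
            for member in binding.get("members", []):
                violations.append((binding["role"], member))
    return violations

policy = {"bindings": [
    {"role": "roles/editor", "members": ["user:dev@example.com"]},
    {"role": "roles/run.developer", "members": ["user:ops@example.com"]},
]}
print(find_basic_role_bindings(policy))  # → [('roles/editor', 'user:dev@example.com')]
```

A non-empty result means at least one principal should be moved to a narrower predefined role.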
**If deployment fails:**
1. Check the failure reason:
```bash
gcloud run services describe my-app-api --region us-central1
gcloud logging read "resource.type=cloud_run_revision" --limit=20
```
2. Review Cloud Logging for application errors.
3. Fix the configuration or container image.
4. Redeploy:
```bash
gcloud run deploy my-app-api --image gcr.io/$PROJECT_ID/my-app:latest --region us-central1
```
**Common failure causes:**
- IAM permission errors: verify service account roles and the `--allow-unauthenticated` flag
- Quota exceeded: request a quota increase via IAM & Admin > Quotas
- Container startup failure: check container logs and health check configuration
- Required APIs not enabled: enable them with `gcloud services enable run.googleapis.com`
---
## Tools
### architecture_designer.py
Recommends GCP services based on workload requirements.
```bash
python scripts/architecture_designer.py --input requirements.json --output design.json
```
**Input:** JSON with app type, scale, budget, compliance needs
**Output:** Recommended pattern, service stack, cost estimate, pros/cons
### cost_optimizer.py
Analyzes GCP resources for cost savings.
```bash
python scripts/cost_optimizer.py --resources inventory.json --monthly-spend 5000
```
**Output:** Recommendations for:
- Idle resource removal
- Machine type right-sizing
- Committed use discounts
- Storage class transitions
- Network egress optimization
### deployment_manager.py
Generates gcloud CLI deployment scripts and Terraform configurations.
```bash
python scripts/deployment_manager.py --app-name my-app --pattern serverless_web --region us-central1
```
**Output:** Production-ready deployment scripts with:
- Cloud Run or GKE deployment
- Firestore or Cloud SQL setup
- Identity Platform configuration
- IAM roles with least privilege
- Cloud Monitoring and Logging
---
## Quick Start
### Web App on Cloud Run (< $100/month)
```
Ask: "Design a serverless web backend for a mobile app with 1000 users"
Result:
- Cloud Run for API (auto-scaling, no cold start with min instances)
- Firestore for data (pay-per-operation)
- Identity Platform for authentication
- Cloud Storage + Cloud CDN for static assets
- Estimated: $15-40/month
```
### Microservices on GKE ($500-2000/month)
```
Ask: "Design a scalable architecture for a SaaS platform with 50k users"
Result:
- GKE Autopilot for containerized workloads
- Cloud SQL (PostgreSQL) with read replicas
- Memorystore (Redis) for session caching
- Cloud CDN for global delivery
- Cloud Build for CI/CD
- Multi-zone deployment
```
### Serverless Data Pipeline
```
Ask: "Design a real-time analytics pipeline for event data"
Result:
- Pub/Sub for event ingestion
- Dataflow (Apache Beam) for stream processing
- BigQuery for analytics and warehousing
- Looker for dashboards
- Cloud Functions for lightweight transforms
```
### ML Platform
```
Ask: "Design a machine learning platform for model training and serving"
Result:
- Vertex AI for training and prediction
- Cloud Storage for datasets and model artifacts
- BigQuery for feature store
- Cloud Functions for preprocessing triggers
- Cloud Monitoring for model drift detection
```
---
## Input Requirements
Provide these details for architecture design:
| Requirement | Description | Example |
|-------------|-------------|---------|
| Application type | What you're building | SaaS platform, mobile backend |
| Expected scale | Users, requests/sec | 10k users, 100 RPS |
| Budget | Monthly GCP limit | $500/month max |
| Team context | Size, GCP experience | 3 devs, intermediate |
| Compliance | Regulatory needs | HIPAA, GDPR, SOC 2 |
| Availability | Uptime requirements | 99.9% SLA, 1hr RPO |
**JSON Format:**
```json
{
"application_type": "saas_platform",
"expected_users": 10000,
"requests_per_second": 100,
"budget_monthly_usd": 500,
"team_size": 3,
"gcp_experience": "intermediate",
"compliance": ["SOC2"],
"availability_sla": "99.9%"
}
```
---
## Output Formats
### Architecture Design
- Pattern recommendation with rationale
- Service stack diagram (ASCII)
- Monthly cost estimate and trade-offs
### IaC Templates
- **Terraform HCL**: Production-ready Google provider configs
- **gcloud CLI**: Scripted deployment commands
- **Cloud Build YAML**: CI/CD pipeline definitions
### Cost Analysis
- Current spend breakdown with optimization recommendations
- Priority action list (high/medium/low) and implementation checklist
---
## Reference Documentation
| Document | Contents |
|----------|----------|
| `references/architecture_patterns.md` | 6 patterns: serverless, GKE microservices, three-tier, data pipeline, ML platform, multi-region |
| `references/service_selection.md` | Decision matrices for compute, database, storage, messaging |
| `references/best_practices.md` | Naming, labels, IAM, networking, monitoring, disaster recovery |
# GCP Architecture Patterns
Reference guide for selecting the right GCP architecture pattern based on application requirements.
---
## Table of Contents
- [Pattern Selection Matrix](#pattern-selection-matrix)
- [Pattern 1: Serverless Web Application](#pattern-1-serverless-web-application)
- [Pattern 2: Microservices on GKE](#pattern-2-microservices-on-gke)
- [Pattern 3: Three-Tier Application](#pattern-3-three-tier-application)
- [Pattern 4: Serverless Data Pipeline](#pattern-4-serverless-data-pipeline)
- [Pattern 5: ML Platform](#pattern-5-ml-platform)
- [Pattern 6: Multi-Region High Availability](#pattern-6-multi-region-high-availability)
---
## Pattern Selection Matrix
| Pattern | Best For | Users | Monthly Cost | Complexity |
|---------|----------|-------|--------------|------------|
| Serverless Web | MVP, SaaS, mobile backend | <50K | $30-400 | Low |
| Microservices on GKE | Complex services, enterprise | 10K-500K | $400-2500 | Medium |
| Three-Tier | Traditional web, e-commerce | 10K-200K | $300-1500 | Medium |
| Data Pipeline | Analytics, ETL, streaming | Any | $100-2000 | Medium-High |
| ML Platform | Training, serving, MLOps | Any | $200-5000 | High |
| Multi-Region HA | Global apps, DR | >100K | 2x single | High |
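The matrix above can be approximated in code. A rough sketch with thresholds taken from the table (the application-type strings and pattern identifiers are illustrative, not a fixed API):

```python
# Sketch: rough pattern suggestion based on the selection matrix above.
# Thresholds come from the table; names are illustrative.
def suggest_pattern(app_type: str, users: int, budget_usd: int) -> str:
    """Suggest a GCP architecture pattern for a workload."""
    if app_type in ("analytics", "etl", "streaming"):
        return "data_pipeline"
    if app_type in ("ml", "mlops"):
        return "ml_platform"
    if users > 100_000:
        return "multi_region_ha"
    if users < 50_000 and budget_usd <= 400:
        return "serverless_web"
    return "microservices_gke"

print(suggest_pattern("saas", 10_000, 300))    # → serverless_web
print(suggest_pattern("saas", 200_000, 5000))  # → multi_region_ha
```

Real selection should also weigh complexity and team experience, which the matrix captures only qualitatively.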
---
## Pattern 1: Serverless Web Application
### Use Case
SaaS platforms, mobile backends, low-traffic websites, MVPs
### Architecture Diagram
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Cloud CDN │────▶│ Cloud │ │ Identity │
│ (CDN) │ │ Storage │ │ Platform │
└─────────────┘ │ (Static) │ │ (Auth) │
└─────────────┘ └──────┬──────┘
┌─────────────┐ ┌─────────────┐ ┌──────▼──────┐
│ Cloud DNS │────▶│ Cloud │────▶│ Cloud Run │
│ (DNS) │ │ Load Bal. │ │ (API) │
└─────────────┘ └─────────────┘ └──────┬──────┘
┌──────▼──────┐
│ Firestore │
│ (Database) │
└─────────────┘
```
### Service Stack
| Layer | Service | Configuration |
|-------|---------|---------------|
| Frontend | Cloud Storage + Cloud CDN | Static hosting with HTTPS |
| API | Cloud Run | Containerized API with auto-scaling |
| Database | Firestore | Native mode, pay-per-operation |
| Auth | Identity Platform | Multi-provider authentication |
| CI/CD | Cloud Build | Automated container deployments |
### Terraform Example
```hcl
resource "google_cloud_run_v2_service" "api" {
name = "my-app-api"
location = "us-central1"
template {
containers {
image = "gcr.io/my-project/my-app:latest"
resources {
limits = {
cpu = "1000m"
memory = "512Mi"
}
}
}
scaling {
min_instance_count = 0
max_instance_count = 10
}
}
}
```
### Cost Breakdown (10K users)
| Service | Monthly Cost |
|---------|-------------|
| Cloud Run | $5-25 |
| Firestore | $5-30 |
| Cloud CDN | $5-15 |
| Cloud Storage | $1-5 |
| Identity Platform | $0-10 |
| **Total** | **$16-85** |
### Pros and Cons
**Pros:**
- Scale-to-zero (pay nothing when idle)
- Container-based (no runtime restrictions)
- Built-in HTTPS and custom domains
- Auto-scaling with no configuration
**Cons:**
- Cold starts if min instances = 0
- Firestore query limitations vs SQL
- Vendor lock-in to GCP
---
## Pattern 2: Microservices on GKE
### Use Case
Complex business systems, enterprise applications, platform engineering
### Architecture Diagram
```
┌─────────────┐ ┌─────────────┐
│ Cloud CDN │────▶│ Global │
│ (CDN) │ │ Load Bal. │
└─────────────┘ └──────┬──────┘
┌──────▼──────┐
│ GKE │
│ Autopilot │
└──────┬──────┘
┌──────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Cloud SQL │ │ Memorystore │ │ Pub/Sub │
│ (Postgres) │ │ (Redis) │ │ (Messaging) │
└─────────────┘ └─────────────┘ └─────────────┘
```
### Service Stack
| Layer | Service | Configuration |
|-------|---------|---------------|
| CDN | Cloud CDN | Edge caching, HTTPS |
| Load Balancer | Global Application LB | Backend services, health checks |
| Compute | GKE Autopilot | Managed node provisioning |
| Database | Cloud SQL PostgreSQL | Regional HA, read replicas |
| Cache | Memorystore Redis | Session, query caching |
| Messaging | Pub/Sub | Async service communication |
### GKE Autopilot Configuration
```yaml
# Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
spec:
replicas: 2
selector:
matchLabels:
app: api-service
template:
metadata:
labels:
app: api-service
spec:
serviceAccountName: api-workload-sa
containers:
- name: api
image: us-central1-docker.pkg.dev/my-project/my-app/api:latest
ports:
- containerPort: 8080
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "1Gi"
env:
- name: DB_HOST
valueFrom:
secretKeyRef:
name: db-credentials
key: host
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-service-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-service
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
```
### Cost Breakdown (50K users)
| Service | Monthly Cost |
|---------|-------------|
| GKE Autopilot | $150-400 |
| Cloud Load Balancing | $25-50 |
| Cloud SQL | $100-300 |
| Memorystore | $40-80 |
| Pub/Sub | $5-20 |
| **Total** | **$320-850** |
---
## Pattern 3: Three-Tier Application
### Use Case
Traditional web apps, e-commerce, CMS, applications with complex queries
### Architecture Diagram
```
┌─────────────┐ ┌─────────────┐
│ Cloud CDN │────▶│ Global │
│ (CDN) │ │ Load Bal. │
└─────────────┘ └──────┬──────┘
┌──────▼──────┐
│ Cloud Run │
│ (or MIG) │
└──────┬──────┘
┌──────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Cloud SQL │ │ Memorystore │ │ Cloud │
│ (Database) │ │ (Redis) │ │ Storage │
└─────────────┘ └─────────────┘ └─────────────┘
```
### Service Stack
| Layer | Service | Configuration |
|-------|---------|---------------|
| CDN | Cloud CDN | Edge caching, compression |
| Load Balancer | External Application LB | SSL termination, health checks |
| Compute | Cloud Run or Managed Instance Group | Auto-scaling containers or VMs |
| Database | Cloud SQL (MySQL/PostgreSQL) | Regional HA, automated backups |
| Cache | Memorystore Redis | Session store, query cache |
| Storage | Cloud Storage | Uploads, static assets, backups |
### Cost Breakdown (50K users)
| Service | Monthly Cost |
|---------|-------------|
| Cloud Run / MIG | $80-200 |
| Cloud Load Balancing | $25-50 |
| Cloud SQL | $100-250 |
| Memorystore | $30-60 |
| Cloud Storage | $10-30 |
| **Total** | **$245-590** |
---
## Pattern 4: Serverless Data Pipeline
### Use Case
Analytics, IoT data ingestion, log processing, real-time streaming, ETL
### Architecture Diagram
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Sources │────▶│ Pub/Sub │────▶│ Dataflow │
│ (Apps/IoT) │ │ (Ingest) │ │ (Process) │
└─────────────┘ └─────────────┘ └──────┬──────┘
┌─────────────┐ ┌─────────────┐ ┌──────▼──────┐
│ Looker │◀────│ BigQuery │◀────│ Cloud │
│ (Dashbd) │ │(Warehouse) │ │ Storage │
└─────────────┘ └─────────────┘ │ (Data Lake) │
└─────────────┘
```
### Service Stack
| Layer | Service | Purpose |
|-------|---------|---------|
| Ingestion | Pub/Sub | Real-time event capture |
| Processing | Dataflow (Apache Beam) | Stream/batch transforms |
| Warehouse | BigQuery | SQL analytics at scale |
| Storage | Cloud Storage | Raw data lake |
| Visualization | Looker / Looker Studio | Dashboards and reports |
| Orchestration | Cloud Composer (Airflow) | Pipeline scheduling |
### Dataflow Pipeline Example
```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions([
'--runner=DataflowRunner',
'--project=my-project',
'--region=us-central1',
'--temp_location=gs://my-bucket/temp',
'--streaming'
])
with beam.Pipeline(options=options) as p:
(p
| 'ReadPubSub' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
| 'ParseJSON' >> beam.Map(lambda x: json.loads(x))
| 'WindowInto' >> beam.WindowInto(beam.window.FixedWindows(60))
| 'WriteBQ' >> beam.io.WriteToBigQuery(
'my-project:analytics.events',
schema='event_id:STRING,event_type:STRING,timestamp:TIMESTAMP',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
))
```
### Cost Breakdown
| Service | Monthly Cost |
|---------|-------------|
| Pub/Sub | $5-30 |
| Dataflow | $30-200 |
| BigQuery (on-demand) | $10-100 |
| Cloud Storage | $5-30 |
| Looker Studio | $0 (free) |
| **Total** | **$50-360** |
---
## Pattern 5: ML Platform
### Use Case
Model training, serving, MLOps, feature engineering
### Architecture Diagram
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ BigQuery │────▶│ Vertex AI │────▶│ Vertex AI │
│ (Features) │ │ (Training) │ │ (Endpoints) │
└─────────────┘ └──────┬──────┘ └─────────────┘
┌─────────────┐ ┌──────▼──────┐ ┌─────────────┐
│ Cloud │◀────│ Cloud │────▶│ Vertex AI │
│ Functions │ │ Storage │ │ Pipelines │
│ (Triggers) │ │ (Artifacts) │ │ (MLOps) │
└─────────────┘ └─────────────┘ └─────────────┘
```
### Service Stack
| Layer | Service | Purpose |
|-------|---------|---------|
| Data | BigQuery | Feature engineering, exploration |
| Training | Vertex AI Training | Custom or AutoML training |
| Serving | Vertex AI Endpoints | Online/batch prediction |
| Storage | Cloud Storage | Datasets, model artifacts |
| Orchestration | Vertex AI Pipelines | ML workflow automation |
| Monitoring | Vertex AI Model Monitoring | Drift and skew detection |
### Vertex AI Training Example
```python
from google.cloud import aiplatform
aiplatform.init(project='my-project', location='us-central1')
job = aiplatform.CustomTrainingJob(
display_name='my-model-training',
script_path='train.py',
container_uri='us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12:latest',
requirements=['pandas', 'scikit-learn'],
)
model = job.run(
replica_count=1,
machine_type='n1-standard-8',
accelerator_type='NVIDIA_TESLA_T4',
accelerator_count=1,
)
endpoint = model.deploy(
deployed_model_display_name='my-model-v1',
machine_type='n1-standard-4',
min_replica_count=1,
max_replica_count=5,
)
```
### Cost Breakdown
| Service | Monthly Cost |
|---------|-------------|
| Vertex AI Training (T4 GPU) | $50-500 |
| Vertex AI Prediction | $30-200 |
| BigQuery | $10-50 |
| Cloud Storage | $5-30 |
| **Total** | **$95-780** |
---
## Pattern 6: Multi-Region High Availability
### Use Case
Global applications, disaster recovery, data sovereignty compliance
### Architecture Diagram
```
┌─────────────┐
│ Cloud DNS │
│(Geo routing)│
└──────┬──────┘
┌────────────────┼────────────────┐
│ │
┌──────▼──────┐ ┌──────▼──────┐
│us-central1 │ │europe-west1 │
│ Cloud Run │ │ Cloud Run │
└──────┬──────┘ └──────┬──────┘
│ │
┌──────▼──────┐ ┌──────▼──────┐
│Cloud Spanner│◀── Replication ──▶│Cloud Spanner│
│ (Region) │ │ (Region) │
└─────────────┘ └─────────────┘
```
### Service Stack
| Component | Service | Configuration |
|-----------|---------|---------------|
| DNS | Cloud DNS | Geolocation or latency routing |
| CDN | Cloud CDN | Multiple regional origins |
| Compute | Cloud Run (multi-region) | Deployed in each region |
| Database | Cloud Spanner (multi-region) | Strong global consistency |
| Storage | Cloud Storage (multi-region) | Automatic geo-redundancy |
### Cloud DNS Geolocation Policy
```bash
# Create geolocation routing policy (A records per region; IPs are examples)
gcloud dns record-sets create api.example.com. \
  --zone=my-zone \
  --type=A \
  --ttl=300 \
  --routing-policy-type=GEO \
  --routing-policy-data="us-central1=203.0.113.10;europe-west1=203.0.113.20"
```
### Cost Considerations
| Factor | Impact |
|--------|--------|
| Compute | 2x (each region) |
| Cloud Spanner | Multi-region is roughly 3x the regional price |
| Data Transfer | Cross-region replication egress costs |
| Cloud DNS | Geolocation routing queries are billed at a premium |
| **Total** | **2-3x single region** |
---
## Pattern Comparison Summary
### Latency
| Pattern | Typical Latency |
|---------|-----------------|
| Serverless Web | 30-150ms (Cloud Run) |
| GKE Microservices | 15-80ms |
| Three-Tier | 20-100ms |
| Multi-Region | <50ms (regional) |
### Scaling Characteristics
| Pattern | Scale Limit | Scale Speed |
|---------|-------------|-------------|
| Serverless Web | 1000 instances/service | Seconds |
| GKE Microservices | Cluster node limits | Minutes |
| Data Pipeline | Unlimited (Dataflow) | Seconds |
| Multi-Region | Regional limits | Seconds |
### Operational Complexity
| Pattern | Setup | Maintenance | Debugging |
|---------|-------|-------------|-----------|
| Serverless Web | Low | Low | Medium |
| GKE Microservices | Medium | Medium | Medium |
| Three-Tier | Medium | Medium | Low |
| Data Pipeline | High | Medium | High |
| ML Platform | High | High | High |
| Multi-Region | High | High | High |
# GCP Best Practices
Production-ready practices for naming, labels, IAM, networking, monitoring, and disaster recovery.
---
## Table of Contents
- [Naming Conventions](#naming-conventions)
- [Labels and Organization](#labels-and-organization)
- [IAM and Security](#iam-and-security)
- [Networking](#networking)
- [Monitoring and Logging](#monitoring-and-logging)
- [Cost Optimization](#cost-optimization)
- [Disaster Recovery](#disaster-recovery)
- [Common Pitfalls](#common-pitfalls)
---
## Naming Conventions
### Resource Naming Pattern
```
{environment}-{project}-{resource-type}-{purpose}
Examples:
prod-myapp-gke-cluster
dev-myapp-sql-primary
staging-myapp-run-api
prod-myapp-gcs-uploads
```
### Project Naming
```
{org}-{team}-{environment}
Examples:
acme-platform-prod
acme-platform-dev
acme-data-prod
```
### Naming Rules
| Resource | Format | Max Length | Example |
|----------|--------|-----------|---------|
| Project ID | lowercase, hyphens | 30 chars | acme-platform-prod |
| GKE Cluster | lowercase, hyphens | 40 chars | prod-api-cluster |
| Cloud Run | lowercase, hyphens | 49 chars | prod-myapp-api |
| Cloud SQL | lowercase, hyphens | 84 chars | prod-myapp-sql-primary |
| GCS Bucket | lowercase, hyphens, dots | 63 chars | acme-prod-myapp-uploads |
| Service Account | lowercase, hyphens | 30 chars | myapp-run-sa |
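The naming pattern can be enforced with a small helper. A sketch, with length limits taken from the table above (the resource-type abbreviations are the ones used in the examples):

```python
# Sketch: build and validate '{environment}-{project}-{resource-type}-{purpose}' names.
# Length limits follow the naming rules table above.
MAX_LENGTHS = {"run": 49, "sql": 84, "gke": 40, "gcs": 63}

def resource_name(environment: str, project: str, rtype: str, purpose: str) -> str:
    """Assemble a resource name and reject it if it exceeds the type's limit."""
    name = f"{environment}-{project}-{rtype}-{purpose}".lower()
    limit = MAX_LENGTHS.get(rtype, 63)
    if len(name) > limit:
        raise ValueError(f"{name!r} exceeds {limit} chars for {rtype}")
    return name

print(resource_name("prod", "myapp", "run", "api"))  # → prod-myapp-run-api
```

Wiring this into CI or a Terraform `validation` block keeps names consistent across teams.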
---
## Labels and Organization
### Required Labels
Apply these labels to all resources:
```
labels:
environment: "prod" # dev, staging, prod
team: "platform" # team owning the resource
app: "myapp" # application name
cost-center: "eng-001" # billing allocation
managed-by: "terraform" # terraform, gcloud, console
```
### Label-Based Cost Reporting
```sql
-- After exporting billing data to BigQuery with labels, query spend by label:
SELECT
  labels.value AS environment,
  SUM(cost) AS total_cost
FROM `billing_export.gcp_billing_export_v1_*`
CROSS JOIN UNNEST(labels) AS labels
WHERE labels.key = 'environment'
GROUP BY environment
ORDER BY total_cost DESC
```
### Organization Hierarchy
```
Organization
├── Folder: Production
│ ├── Project: platform-prod
│ ├── Project: data-prod
│ └── Project: ml-prod
├── Folder: Non-Production
│ ├── Project: platform-dev
│ ├── Project: platform-staging
│ └── Project: data-dev
└── Folder: Shared Services
├── Project: shared-networking
├── Project: shared-security
└── Project: shared-monitoring
```
---
## IAM and Security
### Principle of Least Privilege
```bash
# BAD: Basic roles are too broad
gcloud projects add-iam-policy-binding my-project \
--member="user:dev@example.com" \
--role="roles/editor"
# GOOD: Use predefined roles
gcloud projects add-iam-policy-binding my-project \
--member="user:dev@example.com" \
--role="roles/run.developer"
```
### Service Account Best Practices
```bash
# 1. Create dedicated SA per workload
gcloud iam service-accounts create myapp-api-sa \
--display-name="MyApp API Service Account"
# 2. Grant only required roles
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:myapp-api-sa@my-project.iam.gserviceaccount.com" \
--role="roles/datastore.user"
# 3. Use Workload Identity for GKE (no key files)
gcloud iam service-accounts add-iam-policy-binding \
myapp-api-sa@my-project.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:my-project.svc.id.goog[default/myapp-api-ksa]"
# 4. NEVER download SA key files in production
# Instead, use attached service accounts or impersonation
```
### VPC Service Controls
```bash
# Create a service perimeter to restrict data exfiltration
gcloud access-context-manager perimeters create my-perimeter \
--title="Production Data Perimeter" \
--resources="projects/123456" \
--restricted-services="bigquery.googleapis.com,storage.googleapis.com" \
--policy=$POLICY_ID
```
### Organization Policies
```bash
# Restrict external IPs on VMs (applies policy.yaml below)
gcloud resource-manager org-policies set-policy policy.yaml \
  --project=my-project

# policy.yaml contents:
#   constraint: compute.vmExternalIpAccess
#   listPolicy:
#     allValues: DENY

# A second policy file can enforce public access prevention on Cloud Storage:
#   constraint: storage.publicAccessPrevention
#   booleanPolicy:
#     enforced: true
```
### Encryption
| Layer | Service | Default |
|-------|---------|---------|
| At rest | Google-managed keys | Always enabled |
| At rest | CMEK (Cloud KMS) | Optional, recommended |
| In transit | TLS 1.3 | Always enabled |
| Application | Cloud KMS | Encrypt sensitive fields |
```bash
# Create CMEK key for Cloud SQL
gcloud kms keys create myapp-sql-key \
--keyring=myapp-keyring \
--location=us-central1 \
--purpose=encryption
# Use CMEK with Cloud SQL
gcloud sql instances create myapp-db \
--disk-encryption-key=projects/my-project/locations/us-central1/keyRings/myapp-keyring/cryptoKeys/myapp-sql-key
```
---
## Networking
### VPC Design
```bash
# Create custom VPC (avoid default network)
gcloud compute networks create myapp-vpc \
--subnet-mode=custom
# Create subnets with secondary ranges for GKE
gcloud compute networks subnets create myapp-subnet \
--network=myapp-vpc \
--region=us-central1 \
--range=10.0.0.0/20 \
--secondary-range pods=10.4.0.0/14,services=10.8.0.0/20 \
--enable-private-google-access
```
### Shared VPC
Use Shared VPC for multi-project environments:
```
Host Project (shared-networking)
├── VPC: shared-vpc
│ ├── Subnet: prod-us-central1 → Service Project: platform-prod
│ ├── Subnet: prod-europe-west1 → Service Project: platform-prod
│ └── Subnet: dev-us-central1 → Service Project: platform-dev
```
### Firewall Rules
```bash
# Allow internal traffic
gcloud compute firewall-rules create allow-internal \
--network=myapp-vpc \
--allow=tcp,udp,icmp \
--source-ranges=10.0.0.0/8
# Allow health checks from Google load balancers
gcloud compute firewall-rules create allow-health-checks \
--network=myapp-vpc \
--allow=tcp:8080 \
--source-ranges=35.191.0.0/16,130.211.0.0/22 \
--target-tags=allow-health-check
# Deny all other ingress (implicit, but be explicit)
gcloud compute firewall-rules create deny-all-ingress \
--network=myapp-vpc \
--action=DENY \
--rules=all \
--direction=INGRESS \
--priority=65534
```
### Private Google Access
Always enable Private Google Access to reach GCP APIs without public IPs:
```bash
gcloud compute networks subnets update myapp-subnet \
--region=us-central1 \
--enable-private-google-access
```
---
## Monitoring and Logging
### Cloud Monitoring Setup
```bash
# Create uptime check (HTTP path, 1-minute period)
gcloud monitoring uptime create api-health-check \
  --resource-type=cloud-run-revision \
  --resource-labels="service_name=myapp-api,location=us-central1" \
  --path="/health" \
  --period=1
# Create alerting policy
gcloud alpha monitoring policies create \
--display-name="High Error Rate" \
--condition-display-name="Cloud Run 5xx > 1%" \
--condition-filter='resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
--condition-threshold-value=1 \
--notification-channels="projects/my-project/notificationChannels/12345"
```
### Key Metrics to Monitor
| Service | Metric | Alert Threshold |
|---------|--------|-----------------|
| Cloud Run | request_latencies (p99) | >2s |
| Cloud Run | request_count (5xx) | >1% of total |
| Cloud SQL | cpu/utilization | >80% |
| Cloud SQL | disk/utilization | >85% |
| GKE | container/cpu/utilization | >80% |
| GKE | node/cpu/allocatable_utilization | >85% |
| Pub/Sub | subscription/oldest_unacked_message_age | >300s |
| BigQuery | query/execution_time | >60s |
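The thresholds above can be encoded as a small lookup for custom alerting scripts. This is an illustrative sketch: the keys are shorthand, not full Cloud Monitoring metric types, and the values mirror the table.

```python
# Illustrative alert thresholds keyed by (service, metric), mirroring the
# table above. Metric names are shorthand, not Cloud Monitoring metric types.
THRESHOLDS = {
    ("cloud_run", "p99_latency_s"): 2.0,
    ("cloud_run", "5xx_ratio"): 0.01,
    ("cloud_sql", "cpu_utilization"): 0.80,
    ("cloud_sql", "disk_utilization"): 0.85,
    ("gke", "container_cpu_utilization"): 0.80,
    ("pubsub", "oldest_unacked_age_s"): 300,
}

def breaches(service: str, metric: str, value: float) -> bool:
    """Return True when an observed value exceeds its alert threshold."""
    limit = THRESHOLDS.get((service, metric))
    return limit is not None and value > limit
```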
### Log-Based Metrics
```bash
# Create a metric for application errors
gcloud logging metrics create app-errors \
--description="Application error count" \
--log-filter='resource.type="cloud_run_revision" AND severity>=ERROR'
# Create log sink to BigQuery for analysis
gcloud logging sinks create audit-logs-bq \
bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
--log-filter='logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"'
```
### Log Exclusion (Cost Reduction)
```bash
# Exclude verbose debug logs to save on Cloud Logging costs
gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-debug,filter='severity=DEBUG'
```
---
## Cost Optimization
### Committed Use Discounts
| Term | Compute Discount | Memory Discount |
|------|-----------------|-----------------|
| 1 year | 37% | 37% |
| 3 years | 55% | 55% |
```bash
# Check recommendations
gcloud recommender recommendations list \
--project=my-project \
--location=us-central1 \
--recommender=google.compute.commitment.UsageCommitmentRecommender
```
### Sustained Use Discounts
Automatic incremental discounts for eligible machine types (e.g. N1; E2 and several newer series are excluded) running more than 25% of the month. The schedule below applies per usage quartile, so full-month usage nets roughly a 30% overall discount:
| Usage | Discount |
|-------|----------|
| 25-50% | 20% |
| 50-75% | 40% |
| 75-100% | 60% |
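The quartile schedule above blends into an effective discount. A minimal sketch of that arithmetic, assuming the N1-style schedule from the table:

```python
def sustained_use_discount(usage_fraction: float) -> float:
    """Effective blended discount for a VM running usage_fraction of the month.

    Applies the incremental quartile discounts (0%, 20%, 40%, 60%) from the
    table above to each slice of usage; full-month usage nets ~30%.
    """
    tiers = [0.0, 0.20, 0.40, 0.60]  # discount per 25% usage quartile
    discounted = 0.0
    for i, rate in enumerate(tiers):
        lo, hi = i * 0.25, (i + 1) * 0.25
        slice_used = max(0.0, min(usage_fraction, hi) - lo)
        discounted += slice_used * rate
    return discounted / usage_fraction if usage_fraction else 0.0
```

For example, a VM running half the month gets a blended discount of 10%, while a full month reaches 30%.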
### BigQuery Cost Control
```sql
-- Use partitioning to limit data scanned
CREATE TABLE my_dataset.events
PARTITION BY DATE(timestamp)
CLUSTER BY event_type
AS SELECT * FROM raw_events;
-- Estimate query cost before running
-- Use --dry_run flag
bq query --dry_run --use_legacy_sql=false \
'SELECT * FROM my_dataset.events WHERE DATE(timestamp) = "2026-01-01"'
```
### Cloud Storage Optimization
```bash
# Enable Autoclass for automatic class management
gsutil mb -l us-central1 --autoclass gs://my-bucket/
# Set lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket/
```
---
## Disaster Recovery
### RPO/RTO Targets
| Tier | RPO | RTO | Strategy |
|------|-----|-----|----------|
| Tier 1 (Critical) | 0 | <1 hour | Multi-region active-active |
| Tier 2 (Important) | <1 hour | <4 hours | Regional HA + cross-region backup |
| Tier 3 (Standard) | <24 hours | <24 hours | Automated backups + restore |
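The tier table above can be turned into a simple selector for classifying workloads; a sketch, with targets expressed in hours:

```python
def dr_tier(rpo_hours: float, rto_hours: float) -> str:
    """Map RPO/RTO targets (in hours) to the DR tiers in the table above."""
    if rpo_hours == 0 and rto_hours <= 1:
        return "Tier 1: multi-region active-active"
    if rpo_hours <= 1 and rto_hours <= 4:
        return "Tier 2: regional HA + cross-region backup"
    return "Tier 3: automated backups + restore"
```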
### Backup Strategy
```bash
# Cloud SQL automated backups
gcloud sql instances patch myapp-db \
--backup-start-time=02:00 \
--enable-point-in-time-recovery
# Firestore scheduled exports
gcloud firestore export gs://myapp-backups/firestore/$(date +%Y%m%d)
# GKE cluster backup with Backup for GKE
gcloud beta container backup-restore backup-plans create myapp-plan \
--project=my-project \
--location=us-central1 \
--cluster=projects/my-project/locations/us-central1/clusters/myapp-cluster \
--all-namespaces \
--cron-schedule="0 2 * * *"
```
### Multi-Region Failover
```bash
# Cloud SQL cross-region replica for DR
gcloud sql instances create myapp-db-replica \
--master-instance-name=myapp-db \
--region=us-east1
# Promote replica during failover
gcloud sql instances promote-replica myapp-db-replica
```
---
## Common Pitfalls
### Technical Debt
| Pitfall | Solution |
|---------|----------|
| Using default VPC | Always create custom VPCs |
| Not enabling audit logs | Enable Cloud Audit Logs from day one |
| Single-region deployment | Plan for multi-zone at minimum |
| No IaC | Use Terraform from the start |
### Security Mistakes
| Mistake | Prevention |
|---------|------------|
| SA key files in code | Use Workload Identity, attached SAs |
| Public GCS buckets | Enable org policy for public access prevention |
| Basic roles (Owner/Editor) | Use predefined or custom roles |
| No encryption key management | Use CMEK for sensitive data |
| Default service account | Create dedicated SAs per workload |
### Performance Issues
| Issue | Solution |
|-------|----------|
| Cold starts on Cloud Run | Set min-instances=1 for latency-critical services |
| Slow BigQuery queries | Partition tables, use clustering, avoid SELECT * |
| GKE pod scheduling delays | Use PodDisruptionBudget, pre-provision with Autopilot |
| Firestore hotspots | Distribute writes across document IDs evenly |
### Cost Surprises
| Surprise | Prevention |
|----------|------------|
| Undeleted resources | Label everything, review weekly |
| Egress costs | Keep traffic in same region, use Private Google Access |
| Cloud NAT charges | Use Private Google Access for GCP service traffic |
| Log ingestion costs | Set exclusion filters for debug/verbose logs |
| BigQuery full scans | Always use partitioning and clustering |
| Idle GKE clusters | Delete dev clusters nightly, use Autopilot |

---
# GCP Service Selection Guide
Quick reference for choosing the right GCP service based on requirements.
---
## Table of Contents
- [Compute Services](#compute-services)
- [Database Services](#database-services)
- [Storage Services](#storage-services)
- [Messaging and Events](#messaging-and-events)
- [API and Integration](#api-and-integration)
- [Networking](#networking)
- [Security and Identity](#security-and-identity)
---
## Compute Services
### Decision Matrix
| Requirement | Recommended Service |
|-------------|---------------------|
| HTTP-triggered containers, auto-scaling | Cloud Run |
| Event-driven, short tasks (<9 min) | Cloud Functions (2nd gen) |
| Kubernetes workloads, microservices | GKE Autopilot |
| Custom VMs, GPU/TPU | Compute Engine |
| Batch processing, HPC | Batch |
| Kubernetes with full control | GKE Standard |
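The decision matrix above can be sketched as a rough rule-of-thumb function. This is a simplification for illustration only; real service selection involves more dimensions (state, networking, team skills):

```python
def pick_compute(http: bool, kubernetes: bool, needs_gpu: bool,
                 max_runtime_min: int) -> str:
    """Rough mapping of the compute decision matrix above."""
    if needs_gpu and not kubernetes:
        return "Compute Engine"       # GPU/TPU outside Kubernetes
    if kubernetes:
        return "GKE Autopilot"        # Kubernetes workloads, managed nodes
    if http:
        return "Cloud Run"            # HTTP-triggered containers
    if max_runtime_min <= 9:
        return "Cloud Functions (2nd gen)"  # short event-driven tasks
    return "Compute Engine"           # everything else: custom VMs
```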
### Cloud Run
**Best for:** Containerized HTTP services, APIs, web backends
```
Limits:
- vCPU: 1-8 per instance
- Memory: 128 MiB - 32 GiB
- Request timeout: 3600 seconds
- Concurrency: 1-1000 per instance
- Min instances: 0 (scale-to-zero)
- Max instances: 1000
Pricing: Per vCPU-second + GiB-second (free tier: 2M requests/month)
```
**Use when:**
- Containerized apps with HTTP endpoints
- Variable/unpredictable traffic
- Want scale-to-zero capability
- No Kubernetes expertise needed
**Avoid when:**
- Non-HTTP workloads (use Cloud Functions or GKE)
- Need GPU/TPU (use Compute Engine or GKE)
- Require persistent local storage
### Cloud Functions (2nd gen)
**Best for:** Event-driven functions, lightweight triggers, webhooks
```
Limits:
- Execution: up to 60 min for HTTP functions (2nd gen); 9 min for event-driven and all 1st gen
- Memory: 128 MB - 32 GB
- Concurrency: Up to 1000 per instance (2nd gen)
- Runtimes: Node.js, Python, Go, Java, .NET, Ruby, PHP
Pricing: $0.40 per million invocations + compute time
```
**Use when:**
- Event-driven processing (Pub/Sub, Cloud Storage, Firestore)
- Lightweight API endpoints
- Scheduled tasks (Cloud Scheduler triggers)
- Minimal infrastructure management
**Avoid when:**
- Long-running processes (>9 min)
- Complex multi-container apps
- Need fine-grained scaling control
### GKE Autopilot
**Best for:** Kubernetes workloads with managed node provisioning
```
Limits:
- Pod resources: 0.25-112 vCPU, 0.5-896 GiB memory
- GPU support: NVIDIA T4, L4, A100, H100
- Management fee: $0.10/hour per cluster ($74.40/month)
Pricing: Per pod vCPU-hour + GiB-hour (no node management)
```
**Use when:**
- Team has Kubernetes expertise
- Need pod-level resource control
- Multi-container services
- GPU workloads
### Compute Engine
**Best for:** Custom configurations, specialized hardware
```
Machine Types:
- General: e2, n2, n2d, c3
- Compute: c2, c2d
- Memory: m1, m2, m3
- Accelerator: a2 (GPU), a3 (GPU)
- Storage: z3
Pricing Options:
- On-demand, Spot (60-91% discount), Committed Use (37-55% discount)
```
**Use when:**
- Need GPU/TPU
- Windows workloads
- Specific hardware requirements
- Lift-and-shift migrations
---
## Database Services
### Decision Matrix
| Data Type | Query Pattern | Scale | Recommended |
|-----------|--------------|-------|-------------|
| Key-value, document | Simple lookups, real-time | Any | Firestore |
| Wide-column | High-throughput reads/writes | >1TB | Cloud Bigtable |
| Relational | Complex joins, ACID | Variable | Cloud SQL |
| Relational, global | Strong consistency, global | Large | Cloud Spanner |
| Time-series | Time-based queries | Any | Bigtable or BigQuery |
| Analytics, warehouse | SQL analytics | Petabytes | BigQuery |
### Firestore
**Best for:** Document data, mobile/web apps, real-time sync
```
Limits:
- Document size: 1 MiB max
- Field depth: 20 nested levels
- Write rate: 10,000 writes/sec per database
- Indexes: Automatic single-field, manual composite
Pricing:
- Reads: $0.036 per 100K reads
- Writes: $0.108 per 100K writes
- Storage: $0.108 per GiB/month
- Free tier: 50K reads and 20K writes per day, 1 GiB total storage
```
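The per-operation pricing listed above makes a quick back-of-envelope estimator easy. A sketch using those list prices, ignoring the free tier and network egress (so it is an upper bound):

```python
def firestore_monthly_cost(reads: int, writes: int, storage_gib: float) -> float:
    """Estimate monthly Firestore cost (USD) from the list prices above.

    Ignores the daily free tier, deletes, and network egress.
    """
    return (reads / 100_000 * 0.036      # $0.036 per 100K reads
            + writes / 100_000 * 0.108   # $0.108 per 100K writes
            + storage_gib * 0.108)       # $0.108 per GiB-month
```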
**Use when:**
- Mobile/web apps needing offline sync
- Real-time data updates
- Flexible schema
- Serverless architecture
**Avoid when:**
- Complex SQL queries with joins
- Heavy analytics workloads
- Data >1 MiB per document
### Cloud SQL
**Best for:** Relational data with familiar SQL
| Engine | Version | Max Storage | Max Connections |
|--------|---------|-------------|-----------------|
| PostgreSQL | 15 | 64 TB | Instance-dependent |
| MySQL | 8.0 | 64 TB | Instance-dependent |
| SQL Server | 2022 | 64 TB | Instance-dependent |
```
Pricing:
- Machine type + storage + networking
- HA: 2x cost (regional instance)
- Read replicas: Per-replica pricing
```
**Use when:**
- Relational data with complex queries
- Existing SQL expertise
- Need ACID transactions
- Migration from on-premises databases
### Cloud Spanner
**Best for:** Globally distributed relational data
```
Limits:
- Storage: Unlimited
- Nodes: 1-100+ per instance
- Consistency: Strong global consistency
Pricing:
- Regional: $0.90/node-hour (~$657/month per node)
- Multi-region: $2.70/node-hour (~$1,971/month per node)
- Storage: $0.30/GiB/month
```
**Use when:**
- Global applications needing strong consistency
- Relational data at massive scale
- 99.999% availability requirement
- Horizontal scaling with SQL
### BigQuery
**Best for:** Analytics, data warehouse, SQL on massive datasets
```
Limits:
- Query: 6-hour timeout
- Concurrent queries: 100 default
- Streaming inserts: 100K rows/sec per table
Pricing:
- On-demand: $6.25 per TB queried (first 1 TB free/month)
- Editions: Autoscale slots starting at $0.04/slot-hour
- Storage: $0.02/GiB (active), $0.01/GiB (long-term)
```
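The on-demand pricing above translates to a one-line cost estimate. A sketch assuming only the first-TiB-free tier and the $6.25/TiB list price:

```python
def bq_on_demand_cost(tib_scanned_month: float) -> float:
    """On-demand query cost in USD: first 1 TiB/month free, then $6.25/TiB."""
    return max(0.0, tib_scanned_month - 1.0) * 6.25
```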
### Firestore vs Cloud SQL vs Spanner
| Factor | Firestore | Cloud SQL | Cloud Spanner |
|--------|-----------|-----------|---------------|
| Query flexibility | Document-based | Full SQL | Full SQL |
| Scaling | Automatic | Vertical + read replicas | Horizontal |
| Consistency | Strong | ACID | Strong (global) |
| Cost model | Per-operation | Per-hour | Per-node-hour |
| Operational | Zero management | Managed (some ops) | Managed |
| Best for | Mobile/web apps | Traditional apps | Global scale |
---
## Storage Services
### Cloud Storage Classes
| Class | Access Pattern | Min Duration | Cost (GiB/mo) |
|-------|---------------|--------------|----------------|
| Standard | Frequent | None | $0.020 |
| Nearline | Monthly access | 30 days | $0.010 |
| Coldline | Quarterly access | 90 days | $0.004 |
| Archive | Annual access | 365 days | $0.0012 |
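The class prices above can be compared programmatically when sizing archival tiers. A sketch covering at-rest cost only (retrieval, operations, and early-delete fees are excluded):

```python
GCS_CLASS_PRICE = {  # USD per GiB-month, from the table above
    "STANDARD": 0.020,
    "NEARLINE": 0.010,
    "COLDLINE": 0.004,
    "ARCHIVE": 0.0012,
}

def storage_cost(gib: float, storage_class: str = "STANDARD") -> float:
    """Monthly at-rest cost; excludes retrieval, ops, and early-delete fees."""
    return gib * GCS_CLASS_PRICE[storage_class]
```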
### Lifecycle Policy Example
```json
{
"lifecycle": {
"rule": [
{
"action": { "type": "SetStorageClass", "storageClass": "NEARLINE" },
"condition": { "age": 30, "matchesStorageClass": ["STANDARD"] }
},
{
"action": { "type": "SetStorageClass", "storageClass": "COLDLINE" },
"condition": { "age": 90, "matchesStorageClass": ["NEARLINE"] }
},
{
"action": { "type": "SetStorageClass", "storageClass": "ARCHIVE" },
"condition": { "age": 365, "matchesStorageClass": ["COLDLINE"] }
},
{
"action": { "type": "Delete" },
"condition": { "age": 2555 }
}
]
}
}
```
### Autoclass
Automatically transitions objects between storage classes based on access patterns. Recommended for mixed or unknown access patterns.
```bash
gsutil mb -l us-central1 --autoclass gs://my-bucket/
```
### Block and File Storage
| Service | Use Case | Access |
|---------|----------|--------|
| Persistent Disk | GCE/GKE block storage | Single instance (RW) or multi (RO) |
| Filestore | NFS shared file system | Multiple instances |
| Parallelstore | HPC parallel file system | High throughput |
| Cloud Storage FUSE | Mount GCS as filesystem | Any compute |
---
## Messaging and Events
### Decision Matrix
| Pattern | Service | Use Case |
|---------|---------|----------|
| Pub/sub messaging | Pub/Sub | Event streaming, microservice decoupling |
| Task queue | Cloud Tasks | Asynchronous task execution with retries |
| Workflow orchestration | Workflows | Multi-step service orchestration |
| Batch orchestration | Cloud Composer | Complex DAG-based pipelines (Airflow) |
| Event triggers | Eventarc | Route events to Cloud Run, GKE, Workflows |
### Pub/Sub
**Best for:** Event-driven architectures, stream processing
```
Limits:
- Message size: 10 MB max
- Throughput: Unlimited (auto-scaling)
- Retention: 7 days default (up to 31 days)
- Ordering: Per ordering key
Pricing: $40/TiB for message delivery
```
```python
# Pub/Sub publisher example
from google.cloud import pubsub_v1
import json
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'events')
def publish_event(event_type, payload):
data = json.dumps(payload).encode('utf-8')
future = publisher.publish(
topic_path,
data,
event_type=event_type
)
return future.result()
```
### Cloud Tasks
**Best for:** Asynchronous task execution with delivery guarantees
```
Features:
- Configurable retry policies
- Rate limiting
- Scheduled delivery
- HTTP and App Engine targets
Pricing: $0.40 per million operations
```
### Eventarc
**Best for:** Routing cloud events to services
```bash
# Eventarc routes events from 130+ Google Cloud sources
# to Cloud Run, GKE, or Workflows.
# Example: trigger Cloud Run on a Cloud Storage upload
gcloud eventarc triggers create my-trigger \
  --location=us-central1 \
  --destination-run-service=my-service \
  --event-filters="type=google.cloud.storage.object.v1.finalized" \
  --event-filters="bucket=my-bucket"
```
---
## API and Integration
### API Gateway vs Cloud Endpoints vs Cloud Run
| Factor | API Gateway | Cloud Endpoints | Cloud Run (direct) |
|--------|-------------|-----------------|---------------------|
| Protocol | REST, gRPC | REST, gRPC | Any HTTP |
| Auth | API keys, JWT, Firebase | API keys, JWT | IAM, custom |
| Rate limiting | Built-in | Built-in | Manual |
| Cost | Per-call pricing | Per-call pricing | Per-request |
| Best for | External APIs | Internal APIs | Simple services |
### Cloud Endpoints Configuration
```yaml
# openapi.yaml
swagger: "2.0"
info:
title: "My API"
version: "1.0.0"
host: "my-api.endpoints.my-project.cloud.goog"
schemes:
- "https"
paths:
/users:
get:
summary: "List users"
operationId: "listUsers"
x-google-backend:
address: "https://my-app-api-xyz.a.run.app"
security:
- api_key: []
securityDefinitions:
api_key:
type: "apiKey"
name: "key"
in: "query"
```
### Workflows
**Best for:** Orchestrating multi-service processes
```yaml
# workflow.yaml
main:
steps:
- processOrder:
call: http.post
args:
url: https://orders-service.run.app/process
body:
orderId: ${args.orderId}
result: orderResult
- checkInventory:
switch:
- condition: ${orderResult.body.inStock}
next: shipOrder
next: backOrder
- shipOrder:
call: http.post
args:
url: https://shipping-service.run.app/ship
body:
orderId: ${args.orderId}
result: shipResult
- backOrder:
call: http.post
args:
url: https://inventory-service.run.app/backorder
body:
orderId: ${args.orderId}
```
---
## Networking
### VPC Components
| Component | Purpose |
|-----------|---------|
| VPC | Isolated network (global resource) |
| Subnet | Regional network segment |
| Cloud NAT | Outbound internet for private instances |
| Cloud Router | Dynamic routing (BGP) |
| Private Google Access | Access GCP APIs without public IP |
| VPC Peering | Connect two VPC networks |
| Shared VPC | Share VPC across projects |
### VPC Design Pattern
```
VPC: myapp-vpc (global; a VPC itself has no CIDR, addresses belong to subnets)
Subnet us-central1:
10.0.0.0/20 (primary)
10.4.0.0/14 (pods - secondary)
10.8.0.0/20 (services - secondary)
- GKE cluster, Cloud Run (VPC connector)
Subnet us-east1:
10.0.16.0/20 (primary)
- Cloud SQL (private IP), Memorystore
Subnet europe-west1:
10.0.32.0/20 (primary)
- DR / multi-region workloads
```
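When planning a layout like the one above, it is worth verifying that primary and secondary ranges are disjoint before creating subnets. A minimal check with the standard-library `ipaddress` module, using the ranges from the sketch:

```python
import ipaddress

# Subnet plan from the layout above; since a GCP VPC has no CIDR of its own,
# only the subnet and secondary ranges need to be mutually disjoint.
RANGES = [
    "10.0.0.0/20",   # us-central1 primary
    "10.4.0.0/14",   # us-central1 pods (secondary)
    "10.8.0.0/20",   # us-central1 services (secondary)
    "10.0.16.0/20",  # us-east1 primary
    "10.0.32.0/20",  # europe-west1 primary
]

def no_overlaps(cidrs):
    """Return True if no two CIDR ranges in the list overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return all(not a.overlaps(b)
               for i, a in enumerate(nets) for b in nets[i + 1:])
```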
### Private Google Access
```bash
# Enable Private Google Access on a subnet
gcloud compute networks subnets update my-subnet \
--region=us-central1 \
--enable-private-google-access
```
---
## Security and Identity
### IAM Best Practices
```bash
# Prefer predefined roles over basic roles
# BAD: roles/editor (too broad)
# GOOD: roles/run.invoker (specific)
# Grant a narrowly scoped role to a service account
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/datastore.user"
# IAM conditions can narrow a binding further on supported services, but note
# that Firestore does not enforce document-level conditions
```
### Service Account Best Practices
| Practice | Description |
|----------|-------------|
| One SA per service | Separate service accounts per workload |
| Workload Identity | Bind K8s SAs to GCP SAs in GKE |
| Short-lived tokens | Use impersonation instead of key files |
| No SA keys | Avoid downloading JSON key files |
### Secret Manager vs Environment Variables
| Factor | Secret Manager | Env Variables |
|--------|---------------|---------------|
| Rotation | Automatic versioning | Manual redeploy |
| Audit | Cloud Audit Logs | No audit trail |
| Access control | IAM per-secret | Per-service |
| Pricing | $0.06/10K access ops | Free |
| Use case | Credentials, API keys | Non-sensitive config |
### Secret Manager Usage
```python
from google.cloud import secretmanager
def get_secret(project_id, secret_id, version="latest"):
client = secretmanager.SecretManagerServiceClient()
name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
response = client.access_secret_version(request={"name": name})
return response.payload.data.decode("UTF-8")
# Usage
db_password = get_secret("my-project", "db-password")
```

---
"""
GCP architecture design and service recommendation module.
Generates architecture patterns based on application requirements.
"""
import argparse
import json
import sys
from typing import Dict, List, Any
from enum import Enum
class ApplicationType(Enum):
"""Types of applications supported."""
WEB_APP = "web_application"
MOBILE_BACKEND = "mobile_backend"
DATA_PIPELINE = "data_pipeline"
MICROSERVICES = "microservices"
SAAS_PLATFORM = "saas_platform"
ML_PLATFORM = "ml_platform"
class ArchitectureDesigner:
"""Design GCP architectures based on requirements."""
def __init__(self, requirements: Dict[str, Any]):
"""
Initialize with application requirements.
Args:
requirements: Dictionary containing app type, traffic, budget, etc.
"""
self.app_type = requirements.get('application_type', 'web_application')
self.expected_users = requirements.get('expected_users', 1000)
self.requests_per_second = requirements.get('requests_per_second', 10)
self.budget_monthly = requirements.get('budget_monthly_usd', 500)
self.team_size = requirements.get('team_size', 3)
self.gcp_experience = requirements.get('gcp_experience', 'beginner')
self.compliance_needs = requirements.get('compliance', [])
self.data_size_gb = requirements.get('data_size_gb', 10)
def recommend_architecture_pattern(self) -> Dict[str, Any]:
"""
Recommend architecture pattern based on requirements.
Returns:
Dictionary with recommended pattern and services
"""
if self.app_type in ['web_application', 'saas_platform']:
if self.expected_users < 10000:
return self._serverless_web_architecture()
elif self.expected_users < 100000:
return self._gke_microservices_architecture()
else:
return self._multi_region_architecture()
elif self.app_type == 'mobile_backend':
return self._serverless_mobile_backend()
elif self.app_type == 'data_pipeline':
return self._data_pipeline_architecture()
elif self.app_type == 'microservices':
return self._gke_microservices_architecture()
elif self.app_type == 'ml_platform':
return self._ml_platform_architecture()
else:
return self._serverless_web_architecture()
def _serverless_web_architecture(self) -> Dict[str, Any]:
"""Serverless web application pattern using Cloud Run."""
return {
'pattern_name': 'Serverless Web Application',
'description': 'Fully serverless architecture with Cloud Run and Firestore',
'use_case': 'SaaS platforms, low to medium traffic websites, MVPs',
'services': {
'frontend': {
'service': 'Cloud Storage + Cloud CDN',
'purpose': 'Static website hosting with global CDN',
'configuration': {
'bucket': 'Website bucket with public access',
'cdn': 'Cloud CDN with custom domain and HTTPS',
'caching': 'Cache-Control headers, edge caching'
}
},
'api': {
'service': 'Cloud Run',
'purpose': 'Containerized API backend with auto-scaling',
'configuration': {
'cpu': '1 vCPU',
'memory': '512 Mi',
'min_instances': '0 (scale to zero)',
'max_instances': '10',
'concurrency': '80 requests per instance',
'timeout': '300 seconds'
}
},
'database': {
'service': 'Firestore',
'purpose': 'NoSQL document database with real-time sync',
'configuration': {
'mode': 'Native mode',
'location': 'Regional or multi-region',
'security_rules': 'Firestore security rules',
'backup': 'Scheduled exports to Cloud Storage'
}
},
'authentication': {
'service': 'Identity Platform',
'purpose': 'User authentication and authorization',
'configuration': {
'providers': 'Email/password, Google, Apple, OIDC',
'mfa': 'SMS or TOTP multi-factor authentication',
'token_expiration': '1 hour access, 30 days refresh'
}
},
'cicd': {
'service': 'Cloud Build',
'purpose': 'Automated build and deployment from Git',
'configuration': {
'source': 'GitHub or Cloud Source Repositories',
'build': 'Automatic on commit',
'environments': 'dev, staging, production'
}
}
},
'estimated_cost': {
'monthly_usd': self._calculate_serverless_cost(),
'breakdown': {
'Cloud CDN': '5-20 USD',
'Cloud Run': '5-25 USD',
'Firestore': '5-30 USD',
'Identity Platform': '0-10 USD (free tier: 50k MAU)',
'Cloud Storage': '1-5 USD'
}
},
'pros': [
'No server management',
'Auto-scaling with scale-to-zero',
'Pay only for what you use',
'No cold starts with min instances',
'Container-based (no runtime restrictions)'
],
'cons': [
'Vendor lock-in to GCP',
'Regional availability considerations',
'Debugging distributed systems complex',
'Firestore query limitations vs SQL'
],
'scaling_characteristics': {
'users_supported': '1k - 100k',
'requests_per_second': '100 - 10,000',
'scaling_method': 'Automatic (Cloud Run auto-scaling)'
}
}
def _gke_microservices_architecture(self) -> Dict[str, Any]:
"""GKE-based microservices architecture."""
return {
'pattern_name': 'Microservices on GKE',
'description': 'Kubernetes-native architecture with managed services',
'use_case': 'SaaS platforms, complex microservices, enterprise applications',
'services': {
'load_balancer': {
'service': 'Cloud Load Balancing',
'purpose': 'Global HTTP(S) load balancing',
'configuration': {
'type': 'External Application Load Balancer',
'ssl': 'Google-managed SSL certificate',
'health_checks': '/health endpoint, 10s interval',
'cdn': 'Cloud CDN enabled for static content'
}
},
'compute': {
'service': 'GKE Autopilot',
'purpose': 'Managed Kubernetes for containerized workloads',
'configuration': {
'mode': 'Autopilot (fully managed node provisioning)',
'scaling': 'Horizontal Pod Autoscaler',
'networking': 'VPC-native with Alias IPs',
'workload_identity': 'Enabled for secure service account binding'
}
},
'database': {
'service': 'Cloud SQL (PostgreSQL)',
'purpose': 'Managed relational database',
'configuration': {
'tier': 'db-custom-2-8192 (2 vCPU, 8 GB RAM)',
'high_availability': 'Regional with automatic failover',
'read_replicas': '1-2 for read scaling',
'backup': 'Automated daily backups, 7-day retention',
'encryption': 'Customer-managed encryption key (CMEK)'
}
},
'cache': {
'service': 'Memorystore (Redis)',
'purpose': 'Session storage, application caching',
'configuration': {
'tier': 'Basic (1 GB) or Standard (HA)',
'version': 'Redis 7.0',
'eviction_policy': 'allkeys-lru'
}
},
'messaging': {
'service': 'Pub/Sub',
'purpose': 'Asynchronous messaging between services',
'configuration': {
'topics': 'Per-domain event topics',
'subscriptions': 'Pull or push delivery',
'dead_letter': 'Dead letter topic after 5 retries',
'ordering': 'Ordering keys for ordered delivery'
}
},
'storage': {
'service': 'Cloud Storage',
'purpose': 'User uploads, backups, logs',
'configuration': {
'storage_class': 'Standard with lifecycle policies',
'versioning': 'Enabled for important buckets',
'lifecycle': 'Transition to Nearline after 30 days'
}
}
},
'estimated_cost': {
'monthly_usd': self._calculate_gke_cost(),
'breakdown': {
'Cloud Load Balancing': '20-40 USD',
'GKE Autopilot': '75-250 USD',
'Cloud SQL': '80-250 USD',
'Memorystore': '30-80 USD',
'Pub/Sub': '5-20 USD',
'Cloud Storage': '5-20 USD'
}
},
'pros': [
'Kubernetes ecosystem compatibility',
'Fine-grained scaling control',
'Multi-cloud portability',
'Rich service mesh (Anthos Service Mesh)',
'Managed node provisioning with Autopilot'
],
'cons': [
'Higher baseline costs than serverless',
'Kubernetes learning curve',
'More operational complexity',
'GKE management fee ($74.40/month per cluster)'
],
'scaling_characteristics': {
'users_supported': '10k - 500k',
'requests_per_second': '1,000 - 50,000',
'scaling_method': 'HPA + Cluster Autoscaler'
}
}
def _serverless_mobile_backend(self) -> Dict[str, Any]:
"""Serverless mobile backend with Firebase."""
return {
'pattern_name': 'Serverless Mobile Backend',
'description': 'Mobile-first backend with Firebase and Cloud Functions',
'use_case': 'Mobile apps, real-time applications, offline-first apps',
'services': {
'api': {
'service': 'Cloud Functions (2nd gen)',
'purpose': 'Event-driven API handlers',
'configuration': {
'runtime': 'Node.js 20 or Python 3.12',
'memory': '256 MB - 1 GB',
'timeout': '60 seconds',
'concurrency': 'Up to 1000 concurrent'
}
},
'database': {
'service': 'Firestore',
'purpose': 'Real-time NoSQL database with offline sync',
'configuration': {
'mode': 'Native mode',
'multi_region': 'nam5 or eur3 for HA',
'security_rules': 'Client-side access control',
'indexes': 'Composite indexes for queries'
}
},
'file_storage': {
'service': 'Cloud Storage (Firebase)',
'purpose': 'User uploads (images, videos, documents)',
'configuration': {
'access': 'Firebase Security Rules',
'resumable_uploads': 'Enabled for large files',
'cdn': 'Automatic via Firebase Hosting CDN'
}
},
'authentication': {
'service': 'Firebase Authentication',
'purpose': 'User management and federation',
'configuration': {
'providers': 'Email, Google, Apple, Phone',
'anonymous_auth': 'Enabled for guest access',
'custom_claims': 'Role-based access control',
'multi_tenancy': 'Supported via Identity Platform'
}
},
'push_notifications': {
'service': 'Firebase Cloud Messaging (FCM)',
'purpose': 'Push notifications to mobile devices',
'configuration': {
'platforms': 'iOS (APNs), Android, Web',
'topics': 'Topic-based group messaging',
'analytics': 'Notification delivery tracking'
}
},
'analytics': {
'service': 'Google Analytics (Firebase)',
'purpose': 'User analytics and event tracking',
'configuration': {
'events': 'Custom and automatic events',
'audiences': 'User segmentation',
'bigquery_export': 'Raw event export to BigQuery'
}
}
},
'estimated_cost': {
'monthly_usd': 40 + (self.expected_users * 0.004),
'breakdown': {
'Cloud Functions': '5-30 USD',
'Firestore': '10-50 USD',
'Cloud Storage': '5-20 USD',
'Identity Platform': '0-15 USD',
'FCM': '0 USD (free)',
'Analytics': '0 USD (free)'
}
},
'pros': [
'Real-time data sync built-in',
'Offline-first support',
'Firebase SDKs for iOS/Android/Web',
'Free tier covers most MVPs',
'Rapid development with Firebase console'
],
'cons': [
'Firestore query limitations',
'Vendor lock-in to Firebase/GCP',
'Cost scaling can be unpredictable',
'Limited server-side control'
],
'scaling_characteristics': {
'users_supported': '1k - 1M',
'requests_per_second': '100 - 100,000',
'scaling_method': 'Automatic (Firebase managed)'
}
}
def _data_pipeline_architecture(self) -> Dict[str, Any]:
"""Serverless data pipeline with BigQuery."""
return {
'pattern_name': 'Serverless Data Pipeline',
'description': 'Scalable data ingestion, processing, and analytics',
'use_case': 'Analytics, IoT data, log processing, ETL, data warehousing',
'services': {
'ingestion': {
'service': 'Pub/Sub',
'purpose': 'Real-time event and data ingestion',
'configuration': {
'throughput': 'Unlimited (auto-scaling)',
'retention': '7 days (configurable to 31 days)',
'ordering': 'Ordering keys for ordered delivery',
'dead_letter': 'Dead letter topic for failed messages'
}
},
'processing': {
'service': 'Dataflow (Apache Beam)',
'purpose': 'Stream and batch data processing',
'configuration': {
'mode': 'Streaming or batch',
'autoscaling': 'Horizontal autoscaling',
'workers': f'{max(1, self.data_size_gb // 20)} initial workers',
'sdk': 'Python or Java Apache Beam SDK'
}
},
'warehouse': {
'service': 'BigQuery',
'purpose': 'Serverless data warehouse and analytics',
'configuration': {
'pricing': 'On-demand ($6.25/TB queried) or slots',
'partitioning': 'By ingestion time or custom field',
'clustering': 'Up to 4 clustering columns',
'streaming_insert': 'Real-time data availability'
}
},
'storage': {
'service': 'Cloud Storage (Data Lake)',
'purpose': 'Raw data lake and archival storage',
'configuration': {
'format': 'Parquet or Avro (columnar)',
'partitioning': 'By date (year/month/day)',
'lifecycle': 'Transition to Coldline after 90 days',
'catalog': 'Dataplex for data governance'
}
},
'visualization': {
'service': 'Looker / Looker Studio',
'purpose': 'Business intelligence dashboards',
'configuration': {
'source': 'BigQuery direct connection',
'refresh': 'Real-time or scheduled',
'sharing': 'Embedded or web dashboards'
}
},
'orchestration': {
'service': 'Cloud Composer (Airflow)',
'purpose': 'Workflow orchestration for batch pipelines',
'configuration': {
'environment': 'Cloud Composer 2 (auto-scaling)',
'dags': 'Python DAG definitions',
'scheduling': 'Cron-based scheduling'
}
}
},
'estimated_cost': {
'monthly_usd': self._calculate_data_pipeline_cost(),
'breakdown': {
'Pub/Sub': '5-30 USD',
'Dataflow': '20-150 USD',
'BigQuery': '10-100 USD (on-demand)',
'Cloud Storage': '5-30 USD',
'Looker Studio': '0 USD (free)',
'Cloud Composer': '300+ USD (if used)'
}
},
'pros': [
'Fully serverless data stack',
'BigQuery scales to petabytes',
'Real-time and batch in same pipeline',
'Cost-effective with on-demand pricing',
'ML integration via BigQuery ML'
],
'cons': [
'Dataflow has steep learning curve (Beam SDK)',
'BigQuery costs based on data scanned',
'Cloud Composer expensive for small workloads',
'Schema evolution requires planning'
],
'scaling_characteristics': {
'events_per_second': '1,000 - 10,000,000',
'data_volume': '1 GB - 1 PB per day',
'scaling_method': 'Automatic (all services auto-scale)'
}
}
def _ml_platform_architecture(self) -> Dict[str, Any]:
"""ML platform architecture with Vertex AI."""
return {
'pattern_name': 'ML Platform',
'description': 'End-to-end machine learning platform',
'use_case': 'Model training, serving, MLOps, feature engineering',
'services': {
'ml_platform': {
'service': 'Vertex AI',
'purpose': 'Training, tuning, and serving ML models',
'configuration': {
'training': 'Custom or AutoML training jobs',
'prediction': 'Online or batch prediction endpoints',
'pipelines': 'Vertex AI Pipelines for MLOps',
'feature_store': 'Vertex AI Feature Store'
}
},
'data': {
'service': 'BigQuery',
'purpose': 'Feature engineering and data exploration',
'configuration': {
'ml': 'BigQuery ML for in-warehouse models',
'export': 'Export to Cloud Storage for training',
'feature_engineering': 'SQL-based transformations'
}
},
'storage': {
'service': 'Cloud Storage',
'purpose': 'Datasets, model artifacts, experiment logs',
'configuration': {
'buckets': 'Separate buckets for data/models/logs',
'versioning': 'Enabled for model artifacts',
'lifecycle': 'Archive old experiment data'
}
},
'triggers': {
'service': 'Cloud Functions',
'purpose': 'Event-driven preprocessing and triggers',
'configuration': {
'triggers': 'Cloud Storage, Pub/Sub, Scheduler',
'preprocessing': 'Data validation and transforms',
'notifications': 'Training completion alerts'
}
},
'monitoring': {
'service': 'Vertex AI Model Monitoring',
'purpose': 'Detect data drift and model degradation',
'configuration': {
'skew_detection': 'Training-serving skew alerts',
'drift_detection': 'Feature drift monitoring',
'alerting': 'Cloud Monitoring integration'
}
}
},
'estimated_cost': {
'monthly_usd': 200 + (self.data_size_gb * 2),
'breakdown': {
'Vertex AI Training': '50-500 USD (GPU dependent)',
'Vertex AI Prediction': '30-200 USD',
'BigQuery': '20-100 USD',
'Cloud Storage': '10-50 USD',
'Cloud Functions': '5-20 USD'
}
},
'pros': [
'End-to-end ML lifecycle management',
'AutoML for rapid prototyping',
'Integrated with BigQuery and Cloud Storage',
'Managed model serving with autoscaling',
'Built-in experiment tracking'
],
'cons': [
'GPU costs can escalate quickly',
'Vertex AI pricing is complex',
'Limited customization vs self-managed',
'Vendor lock-in for model artifacts'
],
'scaling_characteristics': {
'training': 'Multi-GPU, distributed training',
'prediction': '1 - 1000+ replicas',
'scaling_method': 'Automatic endpoint scaling'
}
}
def _multi_region_architecture(self) -> Dict[str, Any]:
"""Multi-region high availability architecture."""
return {
'pattern_name': 'Multi-Region High Availability',
'description': 'Global deployment with disaster recovery',
'use_case': 'Global applications, 99.99% uptime, compliance',
'services': {
'dns': {
'service': 'Cloud DNS',
'purpose': 'Global DNS with health-checked routing',
'configuration': {
'routing_policy': 'Geolocation or weighted routing',
'health_checks': 'HTTP health checks per region',
'failover': 'Automatic DNS failover'
}
},
'cdn': {
'service': 'Cloud CDN',
'purpose': 'Edge caching and acceleration',
'configuration': {
'origins': 'Multiple regional backends',
'cache_modes': 'CACHE_ALL_STATIC or USE_ORIGIN_HEADERS',
'edge_locations': 'Global (100+ locations)'
}
},
'compute': {
'service': 'Multi-region GKE or Cloud Run',
'purpose': 'Active-active deployment across regions',
'configuration': {
'regions': 'us-central1 (primary), europe-west1 (secondary)',
'deployment': 'Cloud Deploy for multi-region rollout',
'traffic_split': 'Global Load Balancer with traffic management'
}
},
'database': {
'service': 'Cloud Spanner or Firestore multi-region',
'purpose': 'Globally consistent database',
'configuration': {
'spanner': 'Multi-region config (nam-eur-asia1)',
'firestore': 'Multi-region location (nam5, eur3)',
'consistency': 'Strong consistency (Spanner) or eventual (Firestore)',
'replication': 'Automatic cross-region replication'
}
},
'storage': {
'service': 'Cloud Storage (dual-region or multi-region)',
'purpose': 'Geo-redundant object storage',
'configuration': {
'location': 'Dual-region (us-central1+us-east1) or multi-region (US)',
'turbo_replication': '15-minute RPO with turbo replication',
'versioning': 'Enabled for critical data'
}
}
},
'estimated_cost': {
'monthly_usd': self._calculate_gke_cost() * 2.0,
'breakdown': {
'Cloud DNS': '5-15 USD',
'Cloud CDN': '20-100 USD',
'Compute (2 regions)': '150-500 USD',
'Cloud Spanner': '500-2000 USD (multi-region)',
'Data transfer (cross-region)': '50-200 USD'
}
},
'pros': [
'Global low latency',
'High availability (99.99%+)',
'Disaster recovery built-in',
'Data sovereignty compliance',
'Automatic failover'
],
'cons': [
'2x+ costs vs single region',
'Cloud Spanner is expensive',
'Complex deployment pipeline',
'Cross-region data transfer costs',
'Operational overhead'
],
'scaling_characteristics': {
'users_supported': '100k - 100M',
'requests_per_second': '10,000 - 10,000,000',
'scaling_method': 'Per-region auto-scaling + global load balancing'
}
}
def _calculate_serverless_cost(self) -> float:
"""Estimate serverless architecture cost."""
requests_per_month = self.requests_per_second * 2_592_000
cloud_run_cost = max(5, (requests_per_month / 1_000_000) * 0.40)
firestore_cost = max(5, self.data_size_gb * 0.18)
cdn_cost = max(5, self.expected_users * 0.008)
storage_cost = max(1, self.data_size_gb * 0.02)
total = cloud_run_cost + firestore_cost + cdn_cost + storage_cost
return min(total, self.budget_monthly)
def _calculate_gke_cost(self) -> float:
"""Estimate GKE microservices architecture cost."""
gke_management = 74.40 # Autopilot cluster fee
pod_cost = max(2, self.expected_users // 5000) * 35
cloud_sql_cost = 120 # db-custom-2-8192 baseline
memorystore_cost = 35 # Basic 1 GB
lb_cost = 25
total = gke_management + pod_cost + cloud_sql_cost + memorystore_cost + lb_cost
return min(total, self.budget_monthly)
def _calculate_data_pipeline_cost(self) -> float:
"""Estimate data pipeline cost."""
pubsub_cost = max(5, self.data_size_gb * 0.5)
dataflow_cost = max(20, self.data_size_gb * 1.5)
bigquery_cost = max(10, self.data_size_gb * 0.02 * 6.25)
storage_cost = self.data_size_gb * 0.02
total = pubsub_cost + dataflow_cost + bigquery_cost + storage_cost
return min(total, self.budget_monthly)
def generate_service_checklist(self) -> list:
"""Generate implementation checklist for recommended architecture."""
architecture = self.recommend_architecture_pattern()
checklist = [
{
'phase': 'Planning',
'tasks': [
'Review architecture pattern and services',
'Estimate costs using GCP Pricing Calculator',
'Define environment strategy (dev, staging, prod)',
'Set up GCP Organization and projects',
'Define labeling strategy for resources'
]
},
{
'phase': 'Foundation',
'tasks': [
'Create VPC with subnets (if using GKE/Compute)',
'Configure Cloud NAT for private resources',
'Set up IAM roles and service accounts',
'Enable Cloud Audit Logs',
'Configure Organization policies'
]
},
{
'phase': 'Core Services',
'tasks': [
f"Deploy {service['service']}"
for service in architecture['services'].values()
]
},
{
'phase': 'Security',
'tasks': [
'Configure firewall rules and VPC Service Controls',
'Enable encryption (Cloud KMS) for all services',
'Set up Cloud Armor WAF rules',
'Configure Secret Manager for credentials',
'Enable Security Command Center'
]
},
{
'phase': 'Monitoring',
'tasks': [
'Create Cloud Monitoring dashboards',
'Set up alerting policies for critical metrics',
'Configure notification channels (email, Slack, PagerDuty)',
'Enable Cloud Trace for distributed tracing',
'Set up log-based metrics and log sinks'
]
},
{
'phase': 'CI/CD',
'tasks': [
'Set up Cloud Build triggers',
'Configure automated testing',
'Implement canary or rolling deployments',
'Set up rollback procedures',
'Document deployment process'
]
}
]
return checklist
def main():
parser = argparse.ArgumentParser(
description='GCP Architecture Designer - Recommends GCP services based on workload requirements'
)
parser.add_argument(
'--input', '-i',
type=str,
help='Path to JSON file with application requirements'
)
parser.add_argument(
'--output', '-o',
type=str,
help='Path to write design output JSON'
)
parser.add_argument(
'--json',
action='store_true',
help='Output as JSON format'
)
parser.add_argument(
'--app-type',
type=str,
choices=['web_application', 'mobile_backend', 'data_pipeline',
'microservices', 'saas_platform', 'ml_platform'],
default='web_application',
help='Application type (default: web_application)'
)
parser.add_argument(
'--users',
type=int,
default=1000,
help='Expected number of users (default: 1000)'
)
parser.add_argument(
'--budget',
type=float,
default=500,
help='Monthly budget in USD (default: 500)'
)
args = parser.parse_args()
if args.input:
try:
with open(args.input, 'r') as f:
requirements = json.load(f)
except FileNotFoundError:
print(f"Error: File '{args.input}' not found.", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError:
print(f"Error: File '{args.input}' is not valid JSON.", file=sys.stderr)
sys.exit(1)
else:
requirements = {
'application_type': args.app_type,
'expected_users': args.users,
'budget_monthly_usd': args.budget
}
designer = ArchitectureDesigner(requirements)
result = designer.recommend_architecture_pattern()
checklist = designer.generate_service_checklist()
output = {
'architecture': result,
'implementation_checklist': checklist
}
if args.output:
with open(args.output, 'w') as f:
json.dump(output, f, indent=2)
print(f"Design written to {args.output}")
elif args.json:
print(json.dumps(output, indent=2))
else:
print(f"\nRecommended Pattern: {result['pattern_name']}")
print(f"Description: {result['description']}")
print(f"Use Case: {result['use_case']}")
print(f"\nServices:")
for name, svc in result['services'].items():
print(f" - {name}: {svc['service']} ({svc['purpose']})")
print(f"\nEstimated Monthly Cost: ${result['estimated_cost']['monthly_usd']:.2f}")
print(f"\nPros: {', '.join(result['pros'])}")
print(f"Cons: {', '.join(result['cons'])}")
if __name__ == '__main__':
main()
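The serverless cost model above is simple enough to sanity-check by hand. A minimal standalone sketch (the function name and sample inputs are hypothetical, but the constants mirror `_calculate_serverless_cost`):

```python
def serverless_cost(rps: float, users: int, data_gb: float, budget: float) -> float:
    """Standalone mirror of ArchitectureDesigner._calculate_serverless_cost."""
    requests_per_month = rps * 2_592_000                    # ~30 days of seconds
    cloud_run = max(5, requests_per_month / 1_000_000 * 0.40)
    firestore = max(5, data_gb * 0.18)                      # $0.18/GB/month storage
    cdn = max(5, users * 0.008)
    storage = max(1, data_gb * 0.02)
    # Estimates are capped at the stated budget, matching the class behavior
    return min(cloud_run + firestore + cdn + storage, budget)

print(round(serverless_cost(5, 10_000, 50, 500.0), 2))  # → 95.18
```

Note that the CDN term dominates at moderate user counts, which is why the head-of-file example lands near the low end for small apps.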


@@ -0,0 +1,465 @@
"""
GCP cost optimization analyzer.
Provides cost-saving recommendations for GCP resources.
"""
import argparse
import json
import sys
from typing import Dict, List, Any
class CostOptimizer:
"""Analyze GCP costs and provide optimization recommendations."""
def __init__(self, current_resources: Dict[str, Any], monthly_spend: float):
"""
Initialize with current GCP resources and spending.
Args:
current_resources: Dictionary of current GCP resources
monthly_spend: Current monthly GCP spend in USD
"""
self.resources = current_resources
self.monthly_spend = monthly_spend
self.recommendations = []
def analyze_and_optimize(self) -> Dict[str, Any]:
"""
Analyze current setup and generate cost optimization recommendations.
Returns:
Dictionary with recommendations and potential savings
"""
self.recommendations = []
potential_savings = 0.0
compute_savings = self._analyze_compute()
potential_savings += compute_savings
storage_savings = self._analyze_storage()
potential_savings += storage_savings
database_savings = self._analyze_database()
potential_savings += database_savings
network_savings = self._analyze_networking()
potential_savings += network_savings
general_savings = self._analyze_general_optimizations()
potential_savings += general_savings
return {
'current_monthly_spend': self.monthly_spend,
'potential_monthly_savings': round(potential_savings, 2),
'optimized_monthly_spend': round(self.monthly_spend - potential_savings, 2),
'savings_percentage': round((potential_savings / self.monthly_spend) * 100, 2) if self.monthly_spend > 0 else 0,
'recommendations': self.recommendations,
'priority_actions': self._prioritize_recommendations()
}
def _analyze_compute(self) -> float:
"""Analyze compute resources (GCE, GKE, Cloud Run)."""
savings = 0.0
gce_instances = self.resources.get('gce_instances', [])
if gce_instances:
idle_count = sum(1 for inst in gce_instances if inst.get('cpu_utilization', 100) < 10)
if idle_count > 0:
idle_cost = idle_count * 50
savings += idle_cost
self.recommendations.append({
'service': 'Compute Engine',
'type': 'Idle Resources',
'issue': f'{idle_count} GCE instances with <10% CPU utilization',
'recommendation': 'Stop or delete idle instances, or downsize to smaller machine types',
'potential_savings': idle_cost,
'priority': 'high'
})
# Check for committed use discounts
on_demand_count = sum(1 for inst in gce_instances if inst.get('pricing', 'on-demand') == 'on-demand')
if on_demand_count >= 2:
cud_savings = on_demand_count * 50 * 0.37 # 37% savings with 1-yr CUD
savings += cud_savings
self.recommendations.append({
'service': 'Compute Engine',
'type': 'Committed Use Discounts',
'issue': f'{on_demand_count} instances on on-demand pricing',
'recommendation': 'Purchase 1-year committed use discounts for predictable workloads (37% savings) or 3-year (55% savings)',
'potential_savings': cud_savings,
'priority': 'medium'
})
# Check for sustained use discounts awareness
short_lived = sum(1 for inst in gce_instances if inst.get('uptime_hours_month', 730) < 200)
if short_lived > 0:
self.recommendations.append({
'service': 'Compute Engine',
'type': 'Scheduling',
'issue': f'{short_lived} instances running <200 hours/month',
'recommendation': 'Use Instance Scheduler to stop dev/test instances outside business hours',
'potential_savings': short_lived * 20,
'priority': 'medium'
})
savings += short_lived * 20
# GKE optimization
gke_clusters = self.resources.get('gke_clusters', [])
for cluster in gke_clusters:
if cluster.get('mode', 'standard') == 'standard':
node_utilization = cluster.get('avg_node_utilization', 100)
if node_utilization < 40:
autopilot_savings = cluster.get('monthly_cost', 500) * 0.30
savings += autopilot_savings
self.recommendations.append({
'service': 'GKE',
'type': 'Cluster Mode',
                        'issue': 'Standard GKE cluster with <40% node utilization',
'recommendation': 'Migrate to GKE Autopilot to pay only for pod resources, or enable cluster autoscaler',
'potential_savings': autopilot_savings,
'priority': 'high'
})
# Cloud Run optimization
cloud_run_services = self.resources.get('cloud_run_services', [])
for svc in cloud_run_services:
if svc.get('min_instances', 0) > 0 and svc.get('avg_rps', 100) < 1:
min_inst_savings = svc.get('min_instances', 1) * 15
savings += min_inst_savings
self.recommendations.append({
'service': 'Cloud Run',
'type': 'Min Instances',
'issue': f'Service {svc.get("name", "unknown")} has min instances but very low traffic',
'recommendation': 'Set min-instances to 0 for low-traffic services to enable scale-to-zero',
'potential_savings': min_inst_savings,
'priority': 'medium'
})
return savings
def _analyze_storage(self) -> float:
"""Analyze Cloud Storage resources."""
savings = 0.0
gcs_buckets = self.resources.get('gcs_buckets', [])
for bucket in gcs_buckets:
size_gb = bucket.get('size_gb', 0)
storage_class = bucket.get('storage_class', 'STANDARD')
if not bucket.get('has_lifecycle_policy', False) and size_gb > 100:
lifecycle_savings = size_gb * 0.012
savings += lifecycle_savings
self.recommendations.append({
'service': 'Cloud Storage',
'type': 'Lifecycle Policy',
'issue': f'Bucket {bucket.get("name", "unknown")} ({size_gb} GB) has no lifecycle policy',
'recommendation': 'Add lifecycle rule: Transition to Nearline after 30 days, Coldline after 90 days, Archive after 365 days',
'potential_savings': lifecycle_savings,
'priority': 'medium'
})
if storage_class == 'STANDARD' and size_gb > 500:
class_savings = size_gb * 0.006
savings += class_savings
self.recommendations.append({
'service': 'Cloud Storage',
'type': 'Storage Class',
'issue': f'Large bucket ({size_gb} GB) using Standard class',
'recommendation': 'Enable Autoclass for automatic storage class management based on access patterns',
'potential_savings': class_savings,
'priority': 'high'
})
return savings
def _analyze_database(self) -> float:
"""Analyze Cloud SQL, Firestore, and BigQuery costs."""
savings = 0.0
cloud_sql_instances = self.resources.get('cloud_sql_instances', [])
for db in cloud_sql_instances:
if db.get('connections_per_day', 1000) < 10:
db_cost = db.get('monthly_cost', 100)
savings += db_cost * 0.8
self.recommendations.append({
'service': 'Cloud SQL',
'type': 'Idle Resource',
'issue': f'Database {db.get("name", "unknown")} has <10 connections/day',
                    'recommendation': 'Stop the database if not needed, or back it up and delete it',
'potential_savings': db_cost * 0.8,
'priority': 'high'
})
if db.get('utilization', 100) < 30 and not db.get('has_ha', False):
rightsize_savings = db.get('monthly_cost', 200) * 0.35
savings += rightsize_savings
self.recommendations.append({
'service': 'Cloud SQL',
'type': 'Right-sizing',
'issue': f'Cloud SQL instance {db.get("name", "unknown")} has low utilization (<30%)',
'recommendation': 'Downsize to a smaller machine type (e.g., db-custom-2-8192 to db-f1-micro for dev)',
'potential_savings': rightsize_savings,
'priority': 'medium'
})
# BigQuery optimization
bigquery_datasets = self.resources.get('bigquery_datasets', [])
for dataset in bigquery_datasets:
if dataset.get('pricing_model', 'on_demand') == 'on_demand':
monthly_tb_scanned = dataset.get('monthly_tb_scanned', 0)
if monthly_tb_scanned > 10:
slot_savings = (monthly_tb_scanned * 6.25) * 0.30
savings += slot_savings
self.recommendations.append({
'service': 'BigQuery',
'type': 'Pricing Model',
'issue': f'Scanning {monthly_tb_scanned} TB/month on on-demand pricing',
'recommendation': 'Switch to BigQuery editions with slots for predictable costs (30%+ savings at this volume)',
'potential_savings': slot_savings,
'priority': 'high'
})
if not dataset.get('has_partitioning', False):
partition_savings = dataset.get('monthly_query_cost', 50) * 0.50
savings += partition_savings
self.recommendations.append({
'service': 'BigQuery',
'type': 'Table Partitioning',
'issue': f'Tables in {dataset.get("name", "unknown")} lack partitioning',
'recommendation': 'Partition tables by date and add clustering columns to reduce bytes scanned',
'potential_savings': partition_savings,
'priority': 'medium'
})
return savings
def _analyze_networking(self) -> float:
"""Analyze networking costs (egress, Cloud NAT, etc.)."""
savings = 0.0
cloud_nat_gateways = self.resources.get('cloud_nat_gateways', [])
if len(cloud_nat_gateways) > 1:
extra_nats = len(cloud_nat_gateways) - 1
nat_savings = extra_nats * 45
savings += nat_savings
self.recommendations.append({
'service': 'Cloud NAT',
'type': 'Resource Consolidation',
'issue': f'{len(cloud_nat_gateways)} Cloud NAT gateways deployed',
'recommendation': 'Consolidate NAT gateways in dev/staging, or use Private Google Access for GCP services',
'potential_savings': nat_savings,
'priority': 'high'
})
egress_gb = self.resources.get('monthly_egress_gb', 0)
if egress_gb > 1000:
cdn_savings = egress_gb * 0.04 # CDN is cheaper than direct egress
savings += cdn_savings
self.recommendations.append({
'service': 'Networking',
'type': 'CDN Optimization',
'issue': f'High egress volume ({egress_gb} GB/month)',
'recommendation': 'Enable Cloud CDN to serve cached content at lower egress rates',
'potential_savings': cdn_savings,
'priority': 'medium'
})
return savings
def _analyze_general_optimizations(self) -> float:
"""General GCP cost optimizations."""
savings = 0.0
# Log retention
log_sinks = self.resources.get('log_sinks', [])
if not log_sinks:
log_volume_gb = self.resources.get('monthly_log_volume_gb', 0)
if log_volume_gb > 50:
log_savings = log_volume_gb * 0.50 * 0.6
savings += log_savings
self.recommendations.append({
'service': 'Cloud Logging',
'type': 'Log Exclusion',
'issue': f'{log_volume_gb} GB/month of logs without exclusion filters',
'recommendation': 'Create log exclusion filters for verbose/debug logs and route remaining to Cloud Storage via log sinks',
'potential_savings': log_savings,
'priority': 'medium'
})
# Unattached persistent disks
persistent_disks = self.resources.get('persistent_disks', [])
unattached = sum(1 for disk in persistent_disks if not disk.get('attached', True))
if unattached > 0:
disk_savings = unattached * 10 # ~$10/month per 100 GB disk
savings += disk_savings
self.recommendations.append({
'service': 'Compute Engine',
'type': 'Unused Resources',
'issue': f'{unattached} unattached persistent disks',
'recommendation': 'Snapshot and delete unused persistent disks',
'potential_savings': disk_savings,
'priority': 'high'
})
# Static external IPs
static_ips = self.resources.get('static_ips', [])
unused_ips = sum(1 for ip in static_ips if not ip.get('in_use', True))
if unused_ips > 0:
ip_savings = unused_ips * 7.30 # $0.01/hour = $7.30/month
savings += ip_savings
self.recommendations.append({
'service': 'Networking',
'type': 'Unused Resources',
'issue': f'{unused_ips} unused static external IP addresses',
'recommendation': 'Release unused static IPs to avoid hourly charges',
'potential_savings': ip_savings,
'priority': 'high'
})
# Budget alerts
if not self.resources.get('has_budget_alerts', False):
self.recommendations.append({
'service': 'Cloud Billing',
'type': 'Cost Monitoring',
'issue': 'No budget alerts configured',
'recommendation': 'Set up Cloud Billing budgets with alerts at 50%, 80%, 100% of monthly budget',
'potential_savings': 0,
'priority': 'high'
})
# Recommender API
if not self.resources.get('uses_recommender', False):
self.recommendations.append({
'service': 'Active Assist',
'type': 'Visibility',
'issue': 'GCP Recommender not reviewed',
'recommendation': 'Review Active Assist recommendations for right-sizing, idle resources, and committed use discounts',
'potential_savings': 0,
'priority': 'medium'
})
return savings
def _prioritize_recommendations(self) -> List[Dict[str, Any]]:
"""Get top priority recommendations."""
high_priority = [r for r in self.recommendations if r['priority'] == 'high']
high_priority.sort(key=lambda x: x.get('potential_savings', 0), reverse=True)
return high_priority[:5]
def generate_optimization_checklist(self) -> List[Dict[str, Any]]:
"""Generate actionable checklist for cost optimization."""
return [
{
'category': 'Immediate Actions (Today)',
'items': [
'Release unused static IPs',
'Delete unattached persistent disks',
'Stop idle Compute Engine instances',
'Set up billing budget alerts'
]
},
{
'category': 'This Week',
'items': [
'Add Cloud Storage lifecycle policies',
'Create log exclusion filters for verbose logs',
'Right-size Cloud SQL instances',
'Review Active Assist recommendations'
]
},
{
'category': 'This Month',
'items': [
'Evaluate committed use discounts',
'Migrate GKE Standard to Autopilot where applicable',
'Partition and cluster BigQuery tables',
'Enable Cloud CDN for high-egress services'
]
},
{
'category': 'Ongoing',
'items': [
'Review billing reports weekly',
'Label all resources for cost allocation',
'Monitor Active Assist recommendations monthly',
'Conduct quarterly cost optimization reviews'
]
}
]
def main():
parser = argparse.ArgumentParser(
description='GCP Cost Optimizer - Analyzes GCP resources and recommends cost savings'
)
parser.add_argument(
'--resources', '-r',
type=str,
help='Path to JSON file with current GCP resource inventory'
)
parser.add_argument(
'--monthly-spend', '-s',
type=float,
default=1000,
help='Current monthly GCP spend in USD (default: 1000)'
)
parser.add_argument(
'--output', '-o',
type=str,
help='Path to write optimization report JSON'
)
parser.add_argument(
'--json',
action='store_true',
help='Output as JSON format'
)
parser.add_argument(
'--checklist',
action='store_true',
help='Generate optimization checklist'
)
args = parser.parse_args()
if args.resources:
try:
with open(args.resources, 'r') as f:
resources = json.load(f)
except FileNotFoundError:
print(f"Error: File '{args.resources}' not found.", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError:
print(f"Error: File '{args.resources}' is not valid JSON.", file=sys.stderr)
sys.exit(1)
else:
resources = {}
optimizer = CostOptimizer(resources, args.monthly_spend)
result = optimizer.analyze_and_optimize()
if args.checklist:
result['checklist'] = optimizer.generate_optimization_checklist()
if args.output:
with open(args.output, 'w') as f:
json.dump(result, f, indent=2)
print(f"Report written to {args.output}")
elif args.json:
print(json.dumps(result, indent=2))
else:
print(f"\nGCP Cost Optimization Report")
print(f"{'=' * 40}")
print(f"Current Monthly Spend: ${result['current_monthly_spend']:.2f}")
print(f"Potential Savings: ${result['potential_monthly_savings']:.2f}")
print(f"Optimized Spend: ${result['optimized_monthly_spend']:.2f}")
print(f"Savings Percentage: {result['savings_percentage']}%")
print(f"\nTop Priority Actions:")
for i, action in enumerate(result['priority_actions'], 1):
print(f" {i}. [{action['service']}] {action['recommendation']}")
print(f" Savings: ${action['potential_savings']:.2f}/month")
print(f"\nTotal Recommendations: {len(result['recommendations'])}")
if __name__ == '__main__':
main()
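Two of the optimizer's "immediate action" checks reduce to flat per-resource rates, so their savings estimates can be reproduced in isolation. A minimal sketch (function name and inputs are hypothetical; the rates match the constants used in `_analyze_general_optimizations`):

```python
def quick_wins(unattached_disks: int, unused_ips: int) -> float:
    """Standalone version of two high-priority CostOptimizer checks."""
    disk_savings = unattached_disks * 10    # ~$10/month per ~100 GB persistent disk
    ip_savings = unused_ips * 7.30          # unused static IP: $0.01/hour ≈ $7.30/month
    return disk_savings + ip_savings

print(quick_wins(3, 2))  # → 44.6
```

Because these are deletions rather than migrations, they carry essentially no risk and are the first items on the "Immediate Actions" checklist above.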


@@ -0,0 +1,835 @@
"""
GCP deployment script generator.
Creates gcloud CLI scripts and Terraform configurations for GCP architectures.
"""
import argparse
import json
import sys
from typing import Dict, Any
class DeploymentManager:
"""Generate GCP deployment scripts and IaC configurations."""
def __init__(self, app_name: str, requirements: Dict[str, Any]):
"""
Initialize with application requirements.
Args:
app_name: Application name (used for resource naming)
requirements: Dictionary with pattern, region, project requirements
"""
self.app_name = app_name.lower().replace(' ', '-')
self.requirements = requirements
self.region = requirements.get('region', 'us-central1')
self.project_id = requirements.get('project_id', 'my-project')
self.pattern = requirements.get('pattern', 'serverless_web')
def generate_gcloud_script(self) -> str:
"""
Generate gcloud CLI deployment script.
Returns:
Shell script as string
"""
if self.pattern == 'serverless_web':
return self._gcloud_serverless_web()
elif self.pattern == 'gke_microservices':
return self._gcloud_gke_microservices()
elif self.pattern == 'data_pipeline':
return self._gcloud_data_pipeline()
else:
return self._gcloud_serverless_web()
def _gcloud_serverless_web(self) -> str:
"""Generate gcloud script for serverless web pattern."""
return f"""#!/bin/bash
# GCP Serverless Web Deployment Script
# Application: {self.app_name}
# Region: {self.region}
# Pattern: Cloud Run + Firestore + Cloud Storage + Cloud CDN
set -euo pipefail
PROJECT_ID="{self.project_id}"
REGION="{self.region}"
APP_NAME="{self.app_name}"
ENVIRONMENT="${{ENVIRONMENT:-dev}}"
echo "=== Deploying $APP_NAME to GCP ($ENVIRONMENT) ==="
# 1. Set project
gcloud config set project $PROJECT_ID
# 2. Enable required APIs
echo "Enabling required APIs..."
gcloud services enable \\
run.googleapis.com \\
firestore.googleapis.com \\
cloudbuild.googleapis.com \\
artifactregistry.googleapis.com \\
secretmanager.googleapis.com \\
compute.googleapis.com \\
monitoring.googleapis.com \\
logging.googleapis.com
# 3. Create Artifact Registry repository
echo "Creating Artifact Registry repository..."
gcloud artifacts repositories create $APP_NAME \\
--repository-format=docker \\
--location=$REGION \\
--description="Docker images for $APP_NAME" \\
|| echo "Repository already exists"
# 4. Build and push container image
echo "Building container image..."
gcloud builds submit \\
--tag $REGION-docker.pkg.dev/$PROJECT_ID/$APP_NAME/$APP_NAME:latest \\
.
# 5. Create Firestore database
echo "Creating Firestore database..."
gcloud firestore databases create \\
--location=$REGION \\
--type=firestore-native \\
|| echo "Firestore database already exists"
# 6. Create service account for Cloud Run
echo "Creating service account..."
SA_NAME="${{APP_NAME}}-run-sa"
gcloud iam service-accounts create $SA_NAME \\
--display-name="$APP_NAME Cloud Run Service Account" \\
|| echo "Service account already exists"
# Grant Firestore access
gcloud projects add-iam-policy-binding $PROJECT_ID \\
--member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \\
--role="roles/datastore.user" \\
--condition=None
# Grant Secret Manager access
gcloud projects add-iam-policy-binding $PROJECT_ID \\
--member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \\
--role="roles/secretmanager.secretAccessor" \\
--condition=None
# 7. Deploy Cloud Run service
echo "Deploying Cloud Run service..."
gcloud run deploy $APP_NAME-api \\
--image $REGION-docker.pkg.dev/$PROJECT_ID/$APP_NAME/$APP_NAME:latest \\
--region $REGION \\
--platform managed \\
--service-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \\
--memory 512Mi \\
--cpu 1 \\
--min-instances 0 \\
--max-instances 10 \\
--set-env-vars "PROJECT_ID=$PROJECT_ID,ENVIRONMENT=$ENVIRONMENT" \\
--allow-unauthenticated
# 8. Create Cloud Storage bucket for static assets
echo "Creating static assets bucket..."
BUCKET_NAME="${{PROJECT_ID}}-${{APP_NAME}}-static"
gsutil mb -l $REGION gs://$BUCKET_NAME/ || echo "Bucket already exists"
gsutil iam ch allUsers:objectViewer gs://$BUCKET_NAME
# 9. Set up Cloud Monitoring alerting
echo "Setting up monitoring..."
gcloud alpha monitoring policies create \\
--notification-channels="" \\
--display-name="$APP_NAME High Error Rate" \\
--condition-display-name="Cloud Run 5xx Error Rate" \\
--condition-filter='resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \\
--condition-threshold-value=10 \\
--condition-threshold-duration=60s \\
|| echo "Alert policy creation requires additional configuration"
# 10. Output deployment info
echo ""
echo "=== Deployment Complete ==="
SERVICE_URL=$(gcloud run services describe $APP_NAME-api --region $REGION --format 'value(status.url)')
echo "Cloud Run URL: $SERVICE_URL"
echo "Static Bucket: gs://$BUCKET_NAME"
echo "Firestore: https://console.cloud.google.com/firestore?project=$PROJECT_ID"
echo "Monitoring: https://console.cloud.google.com/monitoring?project=$PROJECT_ID"
"""
def _gcloud_gke_microservices(self) -> str:
"""Generate gcloud script for GKE microservices pattern."""
return f"""#!/bin/bash
# GCP GKE Microservices Deployment Script
# Application: {self.app_name}
# Region: {self.region}
# Pattern: GKE Autopilot + Cloud SQL + Memorystore
set -euo pipefail
PROJECT_ID="{self.project_id}"
REGION="{self.region}"
APP_NAME="{self.app_name}"
ENVIRONMENT="${{ENVIRONMENT:-dev}}"
CLUSTER_NAME="${{APP_NAME}}-cluster"
NETWORK_NAME="${{APP_NAME}}-vpc"
echo "=== Deploying $APP_NAME GKE Microservices ($ENVIRONMENT) ==="
# 1. Set project
gcloud config set project $PROJECT_ID
# 2. Enable required APIs
echo "Enabling required APIs..."
gcloud services enable \\
container.googleapis.com \\
sqladmin.googleapis.com \\
redis.googleapis.com \\
cloudbuild.googleapis.com \\
artifactregistry.googleapis.com \\
secretmanager.googleapis.com \\
servicenetworking.googleapis.com \\
compute.googleapis.com
# 3. Create VPC network
echo "Creating VPC network..."
gcloud compute networks create $NETWORK_NAME \\
--subnet-mode=auto \\
|| echo "Network already exists"
# Allocate IP range for private services
gcloud compute addresses create google-managed-services-$NETWORK_NAME \\
--global \\
--purpose=VPC_PEERING \\
--prefix-length=16 \\
--network=$NETWORK_NAME \\
|| echo "IP range already exists"
gcloud services vpc-peerings connect \\
--service=servicenetworking.googleapis.com \\
--ranges=google-managed-services-$NETWORK_NAME \\
--network=$NETWORK_NAME \\
|| echo "VPC peering already exists"
# 4. Create GKE Autopilot cluster
echo "Creating GKE Autopilot cluster..."
gcloud container clusters create-auto $CLUSTER_NAME \\
--region $REGION \\
--network $NETWORK_NAME \\
--release-channel regular \\
--enable-master-authorized-networks \\
--enable-private-nodes \\
|| echo "Cluster already exists"
# 5. Get cluster credentials
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION
# 6. Create Cloud SQL instance
echo "Creating Cloud SQL instance..."
gcloud sql instances create $APP_NAME-db \\
--database-version=POSTGRES_15 \\
--tier=db-custom-2-8192 \\
--region=$REGION \\
--network=$NETWORK_NAME \\
--no-assign-ip \\
--availability-type=regional \\
--backup-start-time=02:00 \\
--storage-auto-increase \\
|| echo "Cloud SQL instance already exists"
# Create database
gcloud sql databases create $APP_NAME \\
--instance=$APP_NAME-db \\
|| echo "Database already exists"
# 7. Create Memorystore Redis instance
echo "Creating Memorystore Redis instance..."
gcloud redis instances create $APP_NAME-cache \\
--size=1 \\
--region=$REGION \\
--redis-version=redis_7_0 \\
--network=$NETWORK_NAME \\
--tier=basic \\
|| echo "Redis instance already exists"
# 8. Configure Workload Identity
echo "Configuring Workload Identity..."
SA_NAME="${{APP_NAME}}-workload"
gcloud iam service-accounts create $SA_NAME \\
--display-name="$APP_NAME Workload Identity SA" \\
|| echo "Service account already exists"
gcloud projects add-iam-policy-binding $PROJECT_ID \\
--member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \\
--role="roles/cloudsql.client"
gcloud iam service-accounts add-iam-policy-binding \\
$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \\
--role="roles/iam.workloadIdentityUser" \\
--member="serviceAccount:$PROJECT_ID.svc.id.goog[default/$SA_NAME]"
echo ""
echo "=== GKE Cluster Ready ==="
echo "Cluster: $CLUSTER_NAME"
echo "Cloud SQL: $APP_NAME-db"
echo "Redis: $APP_NAME-cache"
echo ""
echo "Next: Apply Kubernetes manifests with 'kubectl apply -f k8s/'"
"""
def _gcloud_data_pipeline(self) -> str:
"""Generate gcloud script for data pipeline pattern."""
return f"""#!/bin/bash
# GCP Data Pipeline Deployment Script
# Application: {self.app_name}
# Region: {self.region}
# Pattern: Pub/Sub + Dataflow + BigQuery
set -euo pipefail
PROJECT_ID="{self.project_id}"
REGION="{self.region}"
APP_NAME="{self.app_name}"
echo "=== Deploying $APP_NAME Data Pipeline ==="
# 1. Set project
gcloud config set project $PROJECT_ID
# 2. Enable required APIs
echo "Enabling required APIs..."
gcloud services enable \\
pubsub.googleapis.com \\
dataflow.googleapis.com \\
bigquery.googleapis.com \\
storage.googleapis.com \\
monitoring.googleapis.com
# 3. Create Pub/Sub topic and subscription
echo "Creating Pub/Sub resources..."
gcloud pubsub topics create $APP_NAME-events \\
|| echo "Topic already exists"
gcloud pubsub subscriptions create $APP_NAME-events-sub \\
--topic=$APP_NAME-events \\
--ack-deadline=60 \\
--message-retention-duration=7d \\
|| echo "Subscription already exists"
# Dead letter topic
gcloud pubsub topics create $APP_NAME-events-dlq \\
|| echo "DLQ topic already exists"
gcloud pubsub subscriptions update $APP_NAME-events-sub \\
--dead-letter-topic=$APP_NAME-events-dlq \\
--max-delivery-attempts=5
# 4. Create BigQuery dataset and table
echo "Creating BigQuery resources..."
bq mk --dataset --location=$REGION $PROJECT_ID:${{APP_NAME//-/_}}_analytics \\
|| echo "Dataset already exists"
bq mk --table \\
$PROJECT_ID:${{APP_NAME//-/_}}_analytics.events \\
event_id:STRING,event_type:STRING,payload:STRING,timestamp:TIMESTAMP,processed_at:TIMESTAMP \\
--time_partitioning_type=DAY \\
--time_partitioning_field=timestamp \\
--clustering_fields=event_type \\
|| echo "Table already exists"
# 5. Create Cloud Storage bucket for Dataflow temp/staging
echo "Creating staging bucket..."
STAGING_BUCKET="${{PROJECT_ID}}-${{APP_NAME}}-dataflow"
gsutil mb -l $REGION gs://$STAGING_BUCKET/ || echo "Bucket already exists"
# 6. Create service account for Dataflow
echo "Creating Dataflow service account..."
SA_NAME="${{APP_NAME}}-dataflow-sa"
gcloud iam service-accounts create $SA_NAME \\
--display-name="$APP_NAME Dataflow Worker SA" \\
|| echo "Service account already exists"
for ROLE in roles/dataflow.worker roles/bigquery.dataEditor roles/pubsub.subscriber roles/storage.objectAdmin; do
gcloud projects add-iam-policy-binding $PROJECT_ID \\
--member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \\
--role="$ROLE" \\
--condition=None
done
echo ""
echo "=== Data Pipeline Infrastructure Ready ==="
echo "Pub/Sub Topic: $APP_NAME-events"
echo "BigQuery Dataset: ${{APP_NAME//-/_}}_analytics"
echo "Staging Bucket: gs://$STAGING_BUCKET"
echo ""
echo "Next: Deploy Dataflow job with Apache Beam pipeline"
echo " python -m apache_beam.examples.streaming_wordcount \\\\"
echo " --runner DataflowRunner \\\\"
echo " --project $PROJECT_ID \\\\"
echo " --region $REGION \\\\"
echo " --temp_location gs://$STAGING_BUCKET/temp"
"""
def generate_terraform_configuration(self) -> str:
"""
Generate Terraform configuration for the selected pattern.
Returns:
Terraform HCL configuration as string
"""
if self.pattern == 'serverless_web':
return self._terraform_serverless_web()
elif self.pattern == 'gke_microservices':
return self._terraform_gke_microservices()
else:
return self._terraform_serverless_web()
def _terraform_serverless_web(self) -> str:
"""Generate Terraform for serverless web pattern."""
return f"""terraform {{
required_version = ">= 1.0"
required_providers {{
google = {{
source = "hashicorp/google"
version = "~> 5.0"
}}
}}
}}
provider "google" {{
project = var.project_id
region = var.region
}}
variable "project_id" {{
description = "GCP project ID"
type = string
}}
variable "region" {{
description = "GCP region"
type = string
default = "{self.region}"
}}
variable "environment" {{
description = "Environment name"
type = string
default = "dev"
}}
variable "app_name" {{
description = "Application name"
type = string
default = "{self.app_name}"
}}
# Enable required APIs
resource "google_project_service" "apis" {{
for_each = toset([
"run.googleapis.com",
"firestore.googleapis.com",
"secretmanager.googleapis.com",
"artifactregistry.googleapis.com",
"monitoring.googleapis.com",
])
project = var.project_id
service = each.value
}}
# Service Account for Cloud Run
resource "google_service_account" "cloud_run" {{
account_id = "${{var.app_name}}-run-sa"
display_name = "${{var.app_name}} Cloud Run Service Account"
}}
resource "google_project_iam_member" "firestore_user" {{
project = var.project_id
role = "roles/datastore.user"
member = "serviceAccount:${{google_service_account.cloud_run.email}}"
}}
resource "google_project_iam_member" "secret_accessor" {{
project = var.project_id
role = "roles/secretmanager.secretAccessor"
member = "serviceAccount:${{google_service_account.cloud_run.email}}"
}}
# Firestore Database
resource "google_firestore_database" "default" {{
project = var.project_id
name = "(default)"
location_id = var.region
type = "FIRESTORE_NATIVE"
depends_on = [google_project_service.apis]
}}
# Cloud Run Service
resource "google_cloud_run_v2_service" "api" {{
name = "${{var.environment}}-${{var.app_name}}-api"
location = var.region
template {{
service_account = google_service_account.cloud_run.email
containers {{
image = "${{var.region}}-docker.pkg.dev/${{var.project_id}}/${{var.app_name}}/${{var.app_name}}:latest"
resources {{
limits = {{
cpu = "1000m"
memory = "512Mi"
}}
}}
env {{
name = "PROJECT_ID"
value = var.project_id
}}
env {{
name = "ENVIRONMENT"
value = var.environment
}}
}}
scaling {{
min_instance_count = 0
max_instance_count = 10
}}
}}
depends_on = [google_project_service.apis]
labels = {{
environment = var.environment
app = var.app_name
}}
}}
# Allow unauthenticated access (public API)
resource "google_cloud_run_v2_service_iam_member" "public" {{
project = var.project_id
location = var.region
name = google_cloud_run_v2_service.api.name
role = "roles/run.invoker"
member = "allUsers"
}}
# Cloud Storage bucket for static assets
resource "google_storage_bucket" "static" {{
name = "${{var.project_id}}-${{var.app_name}}-static"
location = var.region
uniform_bucket_level_access = true
website {{
main_page_suffix = "index.html"
not_found_page = "404.html"
}}
lifecycle_rule {{
condition {{
age = 30
}}
action {{
type = "SetStorageClass"
storage_class = "NEARLINE"
}}
}}
labels = {{
environment = var.environment
app = var.app_name
}}
}}
# Outputs
output "cloud_run_url" {{
description = "Cloud Run service URL"
value = google_cloud_run_v2_service.api.uri
}}
output "static_bucket" {{
description = "Static assets bucket name"
value = google_storage_bucket.static.name
}}
output "service_account" {{
description = "Cloud Run service account email"
value = google_service_account.cloud_run.email
}}
"""
def _terraform_gke_microservices(self) -> str:
"""Generate Terraform for GKE microservices pattern."""
return f"""terraform {{
required_version = ">= 1.0"
required_providers {{
google = {{
source = "hashicorp/google"
version = "~> 5.0"
}}
}}
}}
provider "google" {{
project = var.project_id
region = var.region
}}
variable "project_id" {{
description = "GCP project ID"
type = string
}}
variable "region" {{
description = "GCP region"
type = string
default = "{self.region}"
}}
variable "environment" {{
description = "Environment name"
type = string
default = "dev"
}}
variable "app_name" {{
description = "Application name"
type = string
default = "{self.app_name}"
}}
# Enable required APIs
resource "google_project_service" "apis" {{
for_each = toset([
"container.googleapis.com",
"sqladmin.googleapis.com",
"redis.googleapis.com",
"servicenetworking.googleapis.com",
"secretmanager.googleapis.com",
])
project = var.project_id
service = each.value
}}
# VPC Network
resource "google_compute_network" "main" {{
name = "${{var.app_name}}-vpc"
auto_create_subnetworks = false
}}
resource "google_compute_subnetwork" "main" {{
name = "${{var.app_name}}-subnet"
ip_cidr_range = "10.0.0.0/20"
region = var.region
network = google_compute_network.main.id
secondary_ip_range {{
range_name = "pods"
ip_cidr_range = "10.4.0.0/14"
}}
secondary_ip_range {{
range_name = "services"
ip_cidr_range = "10.8.0.0/20"
}}
}}
# GKE Autopilot Cluster
resource "google_container_cluster" "main" {{
name = "${{var.environment}}-${{var.app_name}}-cluster"
location = var.region
enable_autopilot = true
network = google_compute_network.main.name
subnetwork = google_compute_subnetwork.main.name
ip_allocation_policy {{
cluster_secondary_range_name = "pods"
services_secondary_range_name = "services"
}}
release_channel {{
channel = "REGULAR"
}}
depends_on = [google_project_service.apis]
}}
# Private Services Access for Cloud SQL
resource "google_compute_global_address" "private_ip" {{
name = "private-ip-range"
purpose = "VPC_PEERING"
address_type = "INTERNAL"
prefix_length = 16
network = google_compute_network.main.id
}}
resource "google_service_networking_connection" "private_vpc" {{
network = google_compute_network.main.id
service = "servicenetworking.googleapis.com"
reserved_peering_ranges = [google_compute_global_address.private_ip.name]
}}
# Cloud SQL PostgreSQL
resource "google_sql_database_instance" "main" {{
name = "${{var.environment}}-${{var.app_name}}-db"
database_version = "POSTGRES_15"
region = var.region
settings {{
tier = "db-custom-2-8192"
availability_type = "REGIONAL"
backup_configuration {{
enabled = true
start_time = "02:00"
point_in_time_recovery_enabled = true
}}
ip_configuration {{
ipv4_enabled = false
private_network = google_compute_network.main.id
}}
disk_autoresize = true
}}
depends_on = [google_service_networking_connection.private_vpc]
}}
resource "google_sql_database" "app" {{
name = var.app_name
instance = google_sql_database_instance.main.name
}}
# Memorystore Redis
resource "google_redis_instance" "cache" {{
name = "${{var.environment}}-${{var.app_name}}-cache"
tier = "BASIC"
memory_size_gb = 1
region = var.region
redis_version = "REDIS_7_0"
authorized_network = google_compute_network.main.id
depends_on = [google_project_service.apis]
labels = {{
environment = var.environment
app = var.app_name
}}
}}
# Outputs
output "cluster_name" {{
description = "GKE cluster name"
value = google_container_cluster.main.name
}}
output "cloud_sql_connection" {{
description = "Cloud SQL connection name"
value = google_sql_database_instance.main.connection_name
}}
output "redis_host" {{
description = "Memorystore Redis host"
value = google_redis_instance.cache.host
}}
"""
def main():
parser = argparse.ArgumentParser(
description='GCP Deployment Manager - Generates gcloud CLI scripts and Terraform configurations'
)
parser.add_argument(
'--app-name', '-a',
type=str,
required=True,
help='Application name'
)
parser.add_argument(
'--pattern', '-p',
type=str,
choices=['serverless_web', 'gke_microservices', 'data_pipeline'],
default='serverless_web',
help='Architecture pattern (default: serverless_web)'
)
parser.add_argument(
'--region', '-r',
type=str,
default='us-central1',
help='GCP region (default: us-central1)'
)
parser.add_argument(
'--project-id',
type=str,
default='my-project',
help='GCP project ID (default: my-project)'
)
parser.add_argument(
'--format', '-f',
type=str,
choices=['gcloud', 'terraform', 'both'],
default='both',
help='Output format (default: both)'
)
parser.add_argument(
'--output', '-o',
type=str,
help='Output directory for generated files'
)
parser.add_argument(
'--json',
action='store_true',
help='Output as JSON format'
)
args = parser.parse_args()
requirements = {
'pattern': args.pattern,
'region': args.region,
'project_id': args.project_id
}
manager = DeploymentManager(args.app_name, requirements)
if args.json:
output = {}
if args.format in ('gcloud', 'both'):
output['gcloud_script'] = manager.generate_gcloud_script()
if args.format in ('terraform', 'both'):
output['terraform_config'] = manager.generate_terraform_configuration()
print(json.dumps(output, indent=2))
elif args.output:
import os
os.makedirs(args.output, exist_ok=True)
if args.format in ('gcloud', 'both'):
gcloud_path = os.path.join(args.output, 'deploy.sh')
with open(gcloud_path, 'w') as f:
f.write(manager.generate_gcloud_script())
os.chmod(gcloud_path, 0o755)
print(f"gcloud script written to {gcloud_path}")
if args.format in ('terraform', 'both'):
tf_path = os.path.join(args.output, 'main.tf')
with open(tf_path, 'w') as f:
f.write(manager.generate_terraform_configuration())
print(f"Terraform config written to {tf_path}")
else:
if args.format in ('gcloud', 'both'):
print("# ===== gcloud CLI Script =====")
print(manager.generate_gcloud_script())
if args.format in ('terraform', 'both'):
print("# ===== Terraform Configuration =====")
print(manager.generate_terraform_configuration())
if __name__ == '__main__':
main()


@@ -0,0 +1,403 @@
---
name: "secrets-vault-manager"
description: "Use when the user asks to set up secret management infrastructure, integrate HashiCorp Vault, configure cloud secret stores (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager), implement secret rotation, or audit secret access patterns."
---
# Secrets Vault Manager
**Tier:** POWERFUL
**Category:** Engineering
**Domain:** Security / Infrastructure / DevOps
---
## Overview
Production secret infrastructure management for teams running HashiCorp Vault, cloud-native secret stores, or hybrid architectures. This skill covers policy authoring, auth method configuration, automated rotation, dynamic secrets, audit logging, and incident response.
**Distinct from env-secrets-manager**, which handles local `.env` file hygiene and leak detection. This skill operates at the infrastructure layer — Vault clusters, cloud KMS, certificate authorities, and CI/CD secret injection.
### When to Use
- Standing up a new Vault cluster or migrating to a managed secret store
- Designing auth methods for services, CI runners, and human operators
- Implementing automated credential rotation (database, API keys, certificates)
- Auditing secret access patterns for compliance (SOC 2, ISO 27001, HIPAA)
- Responding to a secret leak that requires mass revocation
- Integrating secrets into Kubernetes workloads or CI/CD pipelines
---
## HashiCorp Vault Patterns
### Architecture Decisions
| Decision | Recommendation | Rationale |
|----------|---------------|-----------|
| Deployment mode | HA with Raft storage | No external dependency, built-in leader election |
| Auto-unseal | Cloud KMS (AWS KMS / Azure Key Vault / GCP KMS) | Eliminates manual unseal, enables automated restarts |
| Namespaces | One per environment (dev/staging/prod) | Blast-radius isolation, independent policies |
| Audit devices | File + syslog (dual) | Vault refuses requests if all audit devices fail — dual prevents outages |
### Auth Methods
**AppRole** — Machine-to-machine authentication for services and batch jobs.
```hcl
# Policy allowing an operator to manage AppRole roles
path "auth/approle/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
```
```shell
# Enable AppRole and create an application-specific role
vault auth enable approle
vault write auth/approle/role/payment-service \
    token_ttl=1h \
    token_max_ttl=4h \
    secret_id_num_uses=1 \
    secret_id_ttl=10m \
    token_policies="payment-service-read"
```
**Kubernetes** — Pod-native authentication via service account tokens.
```shell
vault write auth/kubernetes/role/api-server \
bound_service_account_names=api-server \
bound_service_account_namespaces=production \
policies=api-server-secrets \
ttl=1h
```
**OIDC** — Human operator access via SSO provider (Okta, Azure AD, Google Workspace).
```shell
vault write auth/oidc/role/engineering \
bound_audiences="vault" \
allowed_redirect_uris="https://vault.example.com/ui/vault/auth/oidc/oidc/callback" \
user_claim="email" \
oidc_scopes="openid,profile,email" \
policies="engineering-read" \
ttl=8h
```
### Secret Engines
| Engine | Use Case | TTL Strategy |
|--------|----------|-------------|
| KV v2 | Static secrets (API keys, config) | Versioned, manual rotation |
| Database | Dynamic DB credentials | 1h default, 24h max |
| PKI | TLS certificates | 90d leaf certs, 5y intermediate CA |
| Transit | Encryption-as-a-service | Key rotation every 90d |
| SSH | Signed SSH certificates | 30m for interactive, 8h for automation |
### Policy Design
Follow least-privilege with path-based granularity:
```hcl
# payment-service-read policy
path "secret/data/production/payment/*" {
capabilities = ["read"]
}
path "database/creds/payment-readonly" {
capabilities = ["read"]
}
# Deny access to admin paths explicitly
path "sys/*" {
capabilities = ["deny"]
}
```
**Policy naming convention:** `{service}-{access-level}` (e.g., `payment-service-read`, `api-gateway-admin`).
---
## Cloud Secret Store Integration
### Comparison Matrix
| Feature | AWS Secrets Manager | Azure Key Vault | GCP Secret Manager |
|---------|--------------------|-----------------|--------------------|
| Rotation | Built-in Lambda | Custom logic via Functions | Cloud Functions |
| Versioning | Automatic | Manual or automatic | Automatic |
| Encryption | AWS KMS (default or CMK) | HSM-backed | Google-managed or CMEK |
| Access control | IAM policies + resource policy | RBAC + Access Policies | IAM bindings |
| Cross-region | Replication supported | Geo-redundant by default | Replication supported |
| Audit | CloudTrail | Azure Monitor + Diagnostic Logs | Cloud Audit Logs |
| Pricing model | Per-secret + per-API call | Per-operation + per-key | Per-secret version + per-access |
### When to Use Which
- **AWS Secrets Manager**: RDS/Aurora credential rotation out of the box. Best when fully on AWS.
- **Azure Key Vault**: Certificate management strength. Required for Azure AD integrated workloads.
- **GCP Secret Manager**: Simplest API surface. Best for GKE-native workloads with Workload Identity.
- **HashiCorp Vault**: Multi-cloud, dynamic secrets, PKI, transit encryption. Best for complex or hybrid environments.
### SDK Access Patterns
**Principle:** Always fetch secrets at startup or via sidecar — never bake into images or config files.
```python
# AWS Secrets Manager pattern
import boto3, json
def get_secret(secret_name, region="us-east-1"):
client = boto3.client("secretsmanager", region_name=region)
response = client.get_secret_value(SecretId=secret_name)
return json.loads(response["SecretString"])
```
```python
# GCP Secret Manager pattern
from google.cloud import secretmanager
def get_secret(project_id, secret_id, version="latest"):
client = secretmanager.SecretManagerServiceClient()
name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
response = client.access_secret_version(request={"name": name})
return response.payload.data.decode("UTF-8")
```
```python
# Azure Key Vault pattern
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
def get_secret(vault_url, secret_name):
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)
return client.get_secret(secret_name).value
```
---
## Secret Rotation Workflows
### Rotation Strategy by Secret Type
| Secret Type | Rotation Frequency | Method | Downtime Risk |
|-------------|-------------------|--------|---------------|
| Database passwords | 30 days | Dual-account swap | Zero (A/B rotation) |
| API keys | 90 days | Generate new, deprecate old | Zero (overlap window) |
| TLS certificates | 60 days before expiry | ACME or Vault PKI | Zero (graceful reload) |
| SSH keys | 90 days | Vault-signed certificates | Zero (CA-based) |
| Service tokens | 24 hours | Dynamic generation | Zero (short-lived) |
| Encryption keys | 90 days | Key versioning (rewrap) | Zero (version coexistence) |
### Database Credential Rotation (Dual-Account)
1. Two database accounts exist: `app_user_a` and `app_user_b`
2. Application currently uses `app_user_a`
3. Rotation rotates `app_user_b` password, updates secret store
4. Application switches to `app_user_b` on next credential fetch
5. After grace period, `app_user_a` password is rotated
6. Cycle repeats
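The swap logic above can be sketched as a tiny state machine. This is a minimal illustration, not a production implementation: the dict stands in for Vault or a cloud secret store, and the commented-out database step is where a real rotation would run `ALTER ROLE`.

```python
# Dual-account rotation sketch: two DB accounts alternate as "active".
# The dict below stands in for the secret store (Vault / cloud SM).
import secrets

def rotate(store: dict) -> dict:
    """Rotate the standby account's password and promote it to active."""
    standby = "app_user_b" if store["active"] == "app_user_a" else "app_user_a"
    new_password = secrets.token_urlsafe(24)
    # In production: ALTER ROLE <standby> PASSWORD <new_password> on the DB,
    # then write the credential to the secret store before flipping "active".
    store["credentials"][standby] = new_password
    store["active"] = standby
    return store

store = {"active": "app_user_a",
         "credentials": {"app_user_a": "old-a", "app_user_b": "old-b"}}
rotate(store)   # app_user_b gets a fresh password and becomes active
rotate(store)   # next cycle rotates app_user_a
```

The application only ever reads `active` and its credential at fetch time, which is what makes the swap zero-downtime.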
### API Key Rotation (Overlap Window)
1. Generate new API key with provider
2. Store new key in secret store as `current`, move old to `previous`
3. Deploy applications — they read `current`
4. After all instances restarted (or TTL expired), revoke `previous`
5. Monitoring confirms zero usage of old key before revocation
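The overlap window can be sketched in a few lines. This is a hedged sketch with an in-memory store; the provider's key-issue and revoke APIs are assumed and only marked as comments.

```python
# Overlap-window rotation sketch: promote a new key to "current", keep the
# old key as "previous", and revoke it only once usage has drained to zero.
def start_rotation(store: dict, new_key: str) -> None:
    store["previous"] = store.get("current")
    store["current"] = new_key

def finish_rotation(store: dict, previous_key_usage: int) -> bool:
    """Revoke the old key only when monitoring reports zero usage."""
    if store.get("previous") is None or previous_key_usage > 0:
        return False  # keep the overlap window open
    # In production: call the provider's revoke API here.
    store["previous"] = None
    return True

store = {"current": "key-v1"}
start_rotation(store, "key-v2")              # apps now read "current" -> key-v2
finish_rotation(store, previous_key_usage=3) # old key still in use: no-op
finish_rotation(store, previous_key_usage=0) # safe to revoke key-v1
```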
---
## Dynamic Secrets
Dynamic secrets are generated on-demand with automatic expiration. Prefer dynamic secrets over static credentials wherever possible.
### Database Dynamic Credentials (Vault)
```shell
# Configure database engine
vault write database/config/postgres \
plugin_name=postgresql-database-plugin \
connection_url="postgresql://{{username}}:{{password}}@db.example.com:5432/app" \
allowed_roles="app-readonly,app-readwrite" \
username="vault_admin" \
password="<admin-password>"
# Create role with TTL
vault write database/roles/app-readonly \
db_name=postgres \
creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
default_ttl=1h \
max_ttl=24h
```
### Cloud IAM Dynamic Credentials
Vault can generate short-lived AWS IAM credentials, Azure service principal passwords, or GCP service account keys — eliminating long-lived cloud credentials entirely.
### SSH Certificate Authority
Replace SSH key distribution with a Vault-signed certificate model:
1. Vault acts as SSH CA
2. Users/machines request signed certificates with short TTL (30 min)
3. SSH servers trust the CA public key — no `authorized_keys` management
4. Certificates expire automatically — no revocation needed for normal operations
---
## Audit Logging
### What to Log
| Event | Priority | Retention |
|-------|----------|-----------|
| Secret read access | HIGH | 1 year minimum |
| Secret creation/update | HIGH | 1 year minimum |
| Auth method login | MEDIUM | 90 days |
| Policy changes | CRITICAL | 2 years (compliance) |
| Failed access attempts | CRITICAL | 1 year |
| Token creation/revocation | MEDIUM | 90 days |
| Seal/unseal operations | CRITICAL | Indefinite |
### Anomaly Detection Signals
- Secret accessed from new IP/CIDR range
- Access volume spike (>3x baseline for a path)
- Off-hours access for human auth methods
- Service accessing secrets outside its policy scope (denied requests)
- Multiple failed auth attempts from single source
- Token created with unusually long TTL
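The volume-spike signal can be sketched directly. A minimal illustration, assuming pre-parsed audit entries and a 3x threshold; real baselines would come from historical log aggregation.

```python
# Volume-spike detection sketch: flag any path whose access count in the
# current window exceeds `factor` times its historical baseline.
from collections import Counter

def find_spikes(baseline: dict, window_events: list, factor: float = 3.0) -> list:
    """Return secret paths whose current access volume exceeds factor x baseline."""
    current = Counter(e["path"] for e in window_events)
    return sorted(
        path for path, count in current.items()
        if count > factor * baseline.get(path, 0)
    )

baseline = {"secret/data/prod/api": 10, "secret/data/prod/db": 40}
events = (
    [{"path": "secret/data/prod/api"}] * 45    # 45 > 3 * 10 -> spike
    + [{"path": "secret/data/prod/db"}] * 50   # 50 < 3 * 40 -> normal
)
find_spikes(baseline, events)   # ["secret/data/prod/api"]
```

Note that a path absent from the baseline is flagged on any access, which also covers the "new path" signal.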
### Compliance Reporting
Generate periodic reports covering:
1. **Access inventory** — Which identities accessed which secrets, when
2. **Rotation compliance** — Secrets overdue for rotation
3. **Policy drift** — Policies modified since last review
4. **Orphaned secrets** — Secrets with no recent access (>90 days)
Use `audit_log_analyzer.py` to parse Vault or cloud audit logs for these signals.
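The orphaned-secret check reduces to a cutoff comparison over an access inventory. A minimal sketch, assuming the inventory maps each path to its most recent access timestamp:

```python
# Orphaned-secret report sketch: flag secrets with no access in the last
# `max_age_days` days, from a path -> last-access-timestamp inventory.
from datetime import datetime, timedelta

def orphaned_secrets(last_access: dict, now: datetime, max_age_days: int = 90) -> list:
    """Return paths whose most recent access is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(p for p, ts in last_access.items() if ts < cutoff)

now = datetime(2026, 3, 1)
inventory = {
    "secret/data/prod/api": datetime(2026, 2, 20),    # recent
    "secret/data/prod/legacy": datetime(2025, 10, 1), # stale -> orphaned
}
orphaned_secrets(inventory, now)   # ["secret/data/prod/legacy"]
```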
---
## Emergency Procedures
### Secret Leak Response (Immediate)
**Time target: Contain within 15 minutes of detection.**
1. **Identify scope** — Which secret(s) leaked, where (repo, log, error message, third party)
2. **Revoke immediately** — Rotate the compromised credential at the source (provider API, Vault, cloud SM)
3. **Invalidate tokens** — Revoke all Vault tokens that accessed the leaked secret
4. **Audit blast radius** — Query audit logs for usage of the compromised secret in the exposure window
5. **Notify stakeholders** — Security team, affected service owners, compliance (if PII/regulated data)
6. **Post-mortem** — Document root cause, update controls to prevent recurrence
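Step 4 (blast radius) is a filter over parsed audit entries. A sketch under stated assumptions: the entry shape (`identity`, `path`, `time`) is illustrative, not the literal Vault audit log schema.

```python
# Blast-radius sketch: given the leaked secret path and the exposure window,
# list every identity that read it, from parsed audit log entries.
from datetime import datetime

def blast_radius(entries: list, path: str, start: datetime, end: datetime) -> set:
    """Identities that accessed `path` during the exposure window."""
    return {
        e["identity"] for e in entries
        if e["path"] == path and start <= e["time"] <= end
    }

entries = [
    {"identity": "payment-svc", "path": "secret/data/prod/api",
     "time": datetime(2026, 3, 1, 10)},
    {"identity": "unknown-ip", "path": "secret/data/prod/api",
     "time": datetime(2026, 3, 1, 12)},
    {"identity": "ci-runner", "path": "secret/data/prod/db",
     "time": datetime(2026, 3, 1, 11)},
]
blast_radius(entries, "secret/data/prod/api",
             datetime(2026, 3, 1, 9), datetime(2026, 3, 1, 13))
# -> {"payment-svc", "unknown-ip"}
```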
### Vault Seal Operations
**When to seal:** Active security incident affecting Vault infrastructure, suspected key compromise.
**Sealing** stops all Vault operations. Use it only as a last resort.
**Unseal procedure:**
1. Gather quorum of unseal key holders (Shamir threshold)
2. Or confirm auto-unseal KMS key is accessible
3. Unseal via `vault operator unseal` or restart with auto-unseal
4. Verify audit devices reconnected
5. Check active leases and token validity
See `references/emergency_procedures.md` for complete playbooks.
---
## CI/CD Integration
### Vault Agent Sidecar (Kubernetes)
Vault Agent runs alongside application pods, handles authentication and secret rendering:
```yaml
# Pod annotation for Vault Agent Injector
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "api-server"
vault.hashicorp.com/agent-inject-secret-db: "database/creds/app-readonly"
vault.hashicorp.com/agent-inject-template-db: |
{{- with secret "database/creds/app-readonly" -}}
postgresql://{{ .Data.username }}:{{ .Data.password }}@db:5432/app
{{- end }}
```
### External Secrets Operator (Kubernetes)
For teams preferring declarative GitOps over agent sidecars:
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: api-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: api-credentials
data:
- secretKey: api-key
remoteRef:
key: secret/data/production/api
property: key
```
### GitHub Actions OIDC
Eliminate long-lived secrets in CI by using OIDC federation:
```yaml
- name: Authenticate to Vault
uses: hashicorp/vault-action@v2
with:
url: https://vault.example.com
method: jwt
role: github-ci
jwtGithubAudience: https://vault.example.com
secrets: |
secret/data/ci/deploy api_key | DEPLOY_API_KEY ;
secret/data/ci/deploy db_password | DB_PASSWORD
```
---
## Anti-Patterns
| Anti-Pattern | Risk | Correct Approach |
|-------------|------|-----------------|
| Hardcoded secrets in source code | Leak via repo, logs, error output | Fetch from secret store at runtime |
| Long-lived static tokens (>30 days) | Stale credentials, no accountability | Dynamic secrets or short TTL + rotation |
| Shared service accounts | No audit trail per consumer | Per-service identity with unique credentials |
| No rotation policy | Compromised creds persist indefinitely | Automated rotation on schedule |
| Secrets in environment variables on CI | Visible in build logs, process table | Vault Agent or OIDC-based injection |
| Single unseal key holder | Bus factor of 1, recovery blocked | Shamir split (3-of-5) or auto-unseal |
| No audit device configured | Zero visibility into access | Dual audit devices (file + syslog) |
| Wildcard policies (`path "*"`) | Over-permissioned, violates least privilege | Explicit path-based policies per service |
---
## Tools
| Script | Purpose |
|--------|---------|
| `vault_config_generator.py` | Generate Vault policy and auth config from application requirements |
| `rotation_planner.py` | Create rotation schedule from a secret inventory file |
| `audit_log_analyzer.py` | Analyze audit logs for anomalies and compliance gaps |
---
## Cross-References
- **env-secrets-manager** — Local `.env` file hygiene, leak detection, drift awareness
- **senior-secops** — Security operations, incident response, threat modeling
- **ci-cd-pipeline-builder** — Pipeline design where secrets are consumed
- **docker-development** — Container secret injection patterns
- **helm-chart-builder** — Kubernetes secret management in Helm charts


@@ -0,0 +1,354 @@
# Cloud Secret Store Reference
## Provider Comparison
### Feature Matrix
| Feature | AWS Secrets Manager | Azure Key Vault | GCP Secret Manager |
|---------|--------------------|-----------------|--------------------|
| **Secret types** | String, binary | Secrets, keys, certificates | String, binary |
| **Max secret size** | 64 KB | 25 KB (secret), 200 KB (cert) | 64 KB |
| **Versioning** | Automatic (all versions) | Manual enable per secret | Automatic |
| **Rotation** | Built-in Lambda rotation | Custom via Functions/Logic Apps | Custom via Cloud Functions |
| **Encryption** | AWS KMS (default or CMK) | HSM-backed (FIPS 140-2 L2) | Google-managed or CMEK |
| **Cross-region** | Replication to multiple regions | Geo-redundant by SKU | Replication supported |
| **Access control** | IAM + resource-based policies | RBAC + access policies | IAM bindings |
| **Audit** | CloudTrail | Azure Monitor + Diagnostics | Cloud Audit Logs |
| **Secret references** | ARN | Vault URI + secret name | Resource name |
| **Cost model** | $0.40/secret/mo + $0.05/10K calls | $0.03/10K ops (Standard) | $0.06/10K access ops |
| **Free tier** | No | No | 6 active versions free |
### Decision Guide
**Choose AWS Secrets Manager when:**
- Fully on AWS
- Need native RDS/Aurora/Redshift rotation
- Using ECS/EKS with native AWS IAM integration
- Cross-account secret sharing via resource policies
**Choose Azure Key Vault when:**
- Azure-primary workloads
- Certificate lifecycle management is critical (built-in CA integration)
- Need HSM-backed key protection (Premium SKU)
- Azure AD conditional access integration required
**Choose GCP Secret Manager when:**
- GCP-primary workloads
- Using GKE with Workload Identity
- Want simplest API surface (few concepts, fast to integrate)
- Cost-sensitive (generous free tier)
**Choose HashiCorp Vault when:**
- Multi-cloud or hybrid environments
- Dynamic secrets (database, cloud IAM, SSH) are primary use case
- Need transit encryption, PKI, or SSH CA
- Regulatory requirement for self-hosted secret management
## AWS Secrets Manager
### Access Patterns
```python
import boto3
import json
from botocore.exceptions import ClientError
def get_secret(secret_name, region="us-east-1"):
"""Retrieve secret from AWS Secrets Manager."""
client = boto3.client("secretsmanager", region_name=region)
try:
response = client.get_secret_value(SecretId=secret_name)
except ClientError as e:
code = e.response["Error"]["Code"]
if code == "ResourceNotFoundException":
raise ValueError(f"Secret {secret_name} not found")
elif code == "DecryptionFailureException":
raise RuntimeError("KMS decryption failed — check key permissions")
raise
if "SecretString" in response:
return json.loads(response["SecretString"])
return response["SecretBinary"]
```
### Rotation with Lambda
```python
# rotation_lambda.py — skeleton for custom rotation
# (generate_password, apply_credentials, test_connection, and the
# version-lookup helpers are application-specific placeholders)
import boto3
import json
def lambda_handler(event, context):
secret_id = event["SecretId"]
step = event["Step"]
token = event["ClientRequestToken"]
client = boto3.client("secretsmanager")
if step == "createSecret":
# Generate new credentials
new_password = generate_password()
client.put_secret_value(
SecretId=secret_id,
ClientRequestToken=token,
SecretString=json.dumps({"password": new_password}),
VersionStages=["AWSPENDING"],
)
elif step == "setSecret":
# Apply new credentials to the target service
pending = get_secret_version(client, secret_id, "AWSPENDING", token)
apply_credentials(pending)
elif step == "testSecret":
# Verify new credentials work
pending = get_secret_version(client, secret_id, "AWSPENDING", token)
test_connection(pending)
elif step == "finishSecret":
# Mark AWSPENDING as AWSCURRENT
client.update_secret_version_stage(
SecretId=secret_id,
VersionStage="AWSCURRENT",
MoveToVersionId=token,
RemoveFromVersionId=get_current_version(client, secret_id),
)
```
### IAM Policy for Secret Access
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:production/api/*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
}
]
}
```
### Cross-Account Access
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::987654321098:role/shared-secret-reader"},
"Action": "secretsmanager:GetSecretValue",
"Resource": "*",
"Condition": {
"ForAnyValue:StringEquals": {
"secretsmanager:VersionStage": "AWSCURRENT"
}
}
}
]
}
```
## Azure Key Vault
### Access Patterns
```python
from azure.identity import DefaultAzureCredential, ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient
def get_secret(vault_url, secret_name, use_managed_identity=True):
"""Retrieve secret from Azure Key Vault."""
if use_managed_identity:
credential = ManagedIdentityCredential()
else:
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)
return client.get_secret(secret_name).value
def list_secrets(vault_url):
"""List all secret names (not values)."""
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)
return [s.name for s in client.list_properties_of_secrets()]
```
### RBAC vs Access Policies
**RBAC (recommended):**
- Uses Azure AD roles (`Key Vault Secrets User`, `Key Vault Secrets Officer`)
- Managed at subscription/resource group/vault level
- Audit via Azure AD activity logs
**Access Policies (legacy):**
- Per-vault configuration
- Object ID based
- No inheritance from resource group
```bash
# Assign RBAC role
az role assignment create \
--role "Key Vault Secrets User" \
--assignee <service-principal-id> \
--scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault>
```
### Certificate Management
Azure Key Vault has first-class certificate management with automatic renewal:
```bash
# Create certificate with auto-renewal
az keyvault certificate create \
--vault-name my-vault \
--name api-tls \
--policy @cert-policy.json
# cert-policy.json
{
"issuerParameters": {"name": "Self"},
"keyProperties": {"keyType": "RSA", "keySize": 2048},
"lifetimeActions": [
{"action": {"actionType": "AutoRenew"}, "trigger": {"daysBeforeExpiry": 30}}
],
"x509CertificateProperties": {
"subject": "CN=api.example.com",
"validityInMonths": 12
}
}
```
## GCP Secret Manager
### Access Patterns
```python
from google.cloud import secretmanager
def get_secret(project_id, secret_id, version="latest"):
"""Retrieve secret from GCP Secret Manager."""
client = secretmanager.SecretManagerServiceClient()
name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
response = client.access_secret_version(request={"name": name})
return response.payload.data.decode("UTF-8")
def create_secret(project_id, secret_id, secret_value):
"""Create a new secret with initial version."""
client = secretmanager.SecretManagerServiceClient()
parent = f"projects/{project_id}"
# Create the secret resource
secret = client.create_secret(
request={
"parent": parent,
"secret_id": secret_id,
"secret": {"replication": {"automatic": {}}},
}
)
# Add a version with the secret value
client.add_secret_version(
request={
"parent": secret.name,
"payload": {"data": secret_value.encode("UTF-8")},
}
)
return secret.name
```
### Workload Identity for GKE
Eliminate service account key files by binding Kubernetes service accounts to GCP IAM:
```bash
# Create IAM binding
gcloud iam service-accounts add-iam-policy-binding \
secret-accessor@my-project.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:my-project.svc.id.goog[namespace/ksa-name]"
# Annotate Kubernetes service account
kubectl annotate serviceaccount ksa-name \
--namespace namespace \
iam.gke.io/gcp-service-account=secret-accessor@my-project.iam.gserviceaccount.com
```
### IAM Policy
```bash
# Grant secret accessor role to a service account
gcloud secrets add-iam-policy-binding my-secret \
--member="serviceAccount:my-app@my-project.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
```
## Cross-Cloud Patterns
### Abstraction Layer
When operating multi-cloud, create a thin abstraction that normalizes secret access:
```python
# secret_client.py — cross-cloud abstraction
class SecretClient:
def __init__(self, provider, **kwargs):
if provider == "aws":
self._client = AWSSecretClient(**kwargs)
elif provider == "azure":
self._client = AzureSecretClient(**kwargs)
elif provider == "gcp":
self._client = GCPSecretClient(**kwargs)
elif provider == "vault":
self._client = VaultSecretClient(**kwargs)
else:
raise ValueError(f"Unknown provider: {provider}")
def get(self, key):
return self._client.get(key)
def set(self, key, value):
return self._client.set(key, value)
```
### Migration Strategy
When migrating between providers:
1. **Dual-write phase** — Write to both old and new store simultaneously
2. **Dual-read phase** — Read from new store, fallback to old
3. **Cut-over** — Read exclusively from new store
4. **Cleanup** — Remove secrets from old store after grace period
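The dual-read phase can be sketched as a thin wrapper; a minimal example under the assumption that both stores expose the `get(key)` interface of the abstraction above (client names are illustrative, not a specific SDK):

```python
class DualReadClient:
    """Read from the new store first, fall back to the old one."""

    def __init__(self, new_client, old_client, on_fallback=None):
        self.new = new_client
        self.old = old_client
        self.on_fallback = on_fallback  # hook to log/alert on fallback reads

    def get(self, key):
        try:
            return self.new.get(key)
        except KeyError:
            if self.on_fallback:
                self.on_fallback(key)  # surfaces secrets not yet migrated
            return self.old.get(key)
```

Wiring an alert into `on_fallback` gives a live inventory of secrets still pending migration, which tells you when it is safe to cut over.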
### Secret Synchronization
For hybrid setups (e.g., Vault as primary, cloud SM for specific workloads):
- Use Vault's cloud secret engines to generate cloud-native credentials dynamically
- Or use External Secrets Operator to sync from Vault into cloud-native stores
- Never manually copy secrets between stores — always automate
## Caching and Performance
### Client-Side Caching
All three cloud providers support caching SDKs:
- **AWS:** `aws-secretsmanager-caching-python` — caches with configurable TTL
- **Azure:** Built-in HTTP caching in SDK, or use Azure App Configuration
- **GCP:** No official caching library — implement in-process cache with TTL
### Caching Rules
1. Cache TTL should be shorter than rotation period (e.g., cache 5 min if rotating every 30 days)
2. Implement cache invalidation on secret version change events
3. Never cache secrets to disk — in-memory only
4. Log cache hits/misses for debugging rotation issues
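Since GCP has no official caching library, the rules above have to be implemented by hand; a minimal in-process sketch, assuming a `fetch(name)` callable that wraps your secret-manager client:

```python
import time

class SecretCache:
    """Minimal in-memory TTL cache for secret lookups (never persists to disk)."""

    def __init__(self, fetch, ttl_seconds=300):
        self.fetch = fetch       # callable: name -> secret value
        self.ttl = ttl_seconds   # keep well below the rotation period
        self._entries = {}       # name -> (value, fetched_at)

    def get(self, name):
        hit = self._entries.get(name)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]
        value = self.fetch(name)  # cache miss or expired entry: refetch
        self._entries[name] = (value, time.monotonic())
        return value

    def invalidate(self, name=None):
        """Drop one entry (or all) when a version-change event arrives."""
        if name is None:
            self._entries.clear()
        else:
            self._entries.pop(name, None)
```

Hook `invalidate` into your secret-version-change events (rule 2) so rotation takes effect before the TTL expires.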
## Compliance Mapping
| Requirement | AWS SM | Azure KV | GCP SM | Vault |
|------------|--------|----------|--------|-------|
| SOC 2 audit trail | CloudTrail | Monitor logs | Audit Logs | Audit device |
| HIPAA encryption | KMS (BAA) | HSM (BAA) | CMEK (BAA) | Auto-encrypt |
| PCI DSS key mgmt | KMS compliance | Premium HSM | CMEK | Transit engine |
| GDPR data residency | Region selection | Region selection | Region selection | Self-hosted |
| ISO 27001 | Certified | Certified | Certified | Self-certify |

# Emergency Procedures Reference
## Secret Leak Response Playbook
### Severity Classification
| Severity | Definition | Response Time | Example |
|----------|-----------|---------------|---------|
| **P0 — Critical** | Production credentials exposed publicly | Immediate (15 min) | Database password in public GitHub repo |
| **P1 — High** | Internal credentials exposed beyond intended scope | 1 hour | API key in build logs accessible to wider org |
| **P2 — Medium** | Non-production credentials exposed | 4 hours | Staging DB password in internal wiki |
| **P3 — Low** | Expired or limited-scope credential exposed | 24 hours | Rotated API key found in old commit history |
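The table can be encoded as a rough triage helper; this is only a sketch of the classification above, and real incidents need human judgment:

```python
def classify_exposure(production, public, expired=False):
    """Rough encoding of the severity table; returns P0..P3."""
    if expired:
        return "P3"  # expired or limited-scope credential
    if not production:
        return "P2"  # non-production credential
    # Production credential: public exposure is P0, internal overexposure is P1
    return "P0" if public else "P1"
```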
### P0/P1 Response Procedure
**Phase 1: Contain (0-15 minutes)**
1. **Identify the leaked secret**
- What credential was exposed? (type, scope, permissions)
- Where was it exposed? (repo, log, error page, third-party service)
- When was it first exposed? (commit timestamp, log timestamp)
- Is the exposure still active? (repo public? log accessible?)
2. **Revoke immediately**
- Database password: `ALTER ROLE app_user WITH PASSWORD 'new_password';`
- API key: Regenerate via provider console/API
- Vault token: `vault token revoke <token>`
- AWS access key: `aws iam delete-access-key --access-key-id <key>`
- Cloud service account: Delete and recreate key
- TLS certificate: Revoke via CA, generate new certificate
3. **Remove exposure**
- Public repo: Remove file, force-push to remove from history, request GitHub cache purge
- Build logs: Delete log artifacts, rotate CI/CD secrets
- Error page: Deploy fix to suppress secret in error output
- Third-party: Contact vendor for log purge if applicable
4. **Deploy new credentials**
- Update secret store with rotated credential
- Restart affected services to pick up new credential
- Verify services are healthy with new credential
**Phase 2: Assess (15-60 minutes)**
5. **Audit blast radius**
- Query Vault/cloud SM audit logs for the compromised credential
- Check for unauthorized usage during the exposure window
- Review network logs for suspicious connections from unknown IPs
- Check if the compromised credential grants access to other secrets (privilege escalation)
6. **Notify stakeholders**
- Security team (always)
- Service owners for affected systems
- Compliance team if regulated data was potentially accessed
- Legal if customer data may have been compromised
- Executive leadership for P0 incidents
**Phase 3: Recover (1-24 hours)**
7. **Rotate adjacent credentials**
- If the leaked credential could access other secrets, rotate those too
- If a Vault token leaked, check what policies it had — rotate everything accessible
8. **Harden against recurrence**
- Add pre-commit hook to detect secrets (e.g., `gitleaks`, `detect-secrets`)
- Review CI/CD pipeline for secret masking
- Audit who has access to the source of the leak
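To illustrate the kind of matching a pre-commit scanner performs, here is a toy sketch; the patterns are illustrative only, while real tools like `gitleaks` ship hundreds of tuned rules plus entropy checks:

```python
import re

# Illustrative patterns only — not an exhaustive or production ruleset.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"(?i)(password|secret|api[_-]?key)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
}

def scan(text):
    """Return (rule_name, line_number) pairs for suspicious lines."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((rule, lineno))
    return findings
```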
**Phase 4: Post-Mortem (24-72 hours)**
9. **Document incident**
- Timeline of events
- Root cause analysis
- Impact assessment
- Remediation actions taken
- Preventive measures added
### Response Communication Template
```
SECURITY INCIDENT — SECRET EXPOSURE
Severity: P0/P1
Time detected: YYYY-MM-DD HH:MM UTC
Secret type: [database password / API key / token / certificate]
Exposure vector: [public repo / build log / error output / other]
Status: [CONTAINED / INVESTIGATING / RESOLVED]
Immediate actions taken:
- [ ] Credential revoked at source
- [ ] Exposure removed
- [ ] New credential deployed
- [ ] Services verified healthy
- [ ] Audit log review in progress
Blast radius assessment: [PENDING / COMPLETE — no unauthorized access / COMPLETE — unauthorized access detected]
Next update: [time]
Incident commander: [name]
```
## Vault Seal/Unseal Procedures
### Understanding Seal Status
Vault uses a **seal** mechanism to protect the encryption key hierarchy. When sealed, Vault cannot decrypt any data or serve any requests.
```
Sealed State:
Vault process running → YES
API responding → YES (503 Sealed)
Serving secrets → NO
All active leases → FROZEN (not revoked)
Audit logging → NO
Unsealed State:
Vault process running → YES
API responding → YES (200 OK)
Serving secrets → YES
Active leases → RESUMING
Audit logging → YES
```
### When to Seal Vault (Emergency Only)
Seal Vault when:
- Active intrusion on Vault infrastructure is confirmed
- Vault server compromise is suspected (unauthorized root access)
- Encryption key material may have been extracted
- Regulatory/legal hold requires immediate data access prevention
**Do NOT seal for:**
- Routine maintenance (use graceful shutdown instead)
- Single-node issues in HA cluster (let standby take over)
- Suspected secret leak (revoke the secret, don't seal Vault)
### Seal Procedure
```bash
# Seal a single node
vault operator seal
# Seal all nodes (HA cluster)
# Seal each node individually — leader last
vault operator seal -address=https://vault-standby-1:8200
vault operator seal -address=https://vault-standby-2:8200
vault operator seal -address=https://vault-leader:8200
```
**Impact of sealing:**
- All active client connections dropped immediately
- All token and lease timers paused
- Applications lose secret access — prepare for cascading failures
- Monitoring will fire alerts for sealed state
### Unseal Procedure (Shamir Keys)
Requires a quorum of key holders (e.g., 3 of 5).
```bash
# Each key holder provides their unseal key
vault operator unseal <key-1>
vault operator unseal <key-2>
vault operator unseal <key-3>
# Vault unseals after reaching threshold
```
**Operational checklist after unseal:**
1. Verify health: `vault status` shows `Sealed: false`
2. Check audit devices: `vault audit list` — confirm all enabled
3. Check auth methods: `vault auth list`
4. Verify HA status: `vault operator raft list-peers`
5. Check lease count: monitor `vault.expire.num_leases`
6. Verify applications reconnecting (check application logs)
### Unseal Procedure (Auto-Unseal)
If using cloud KMS auto-unseal, Vault unseals automatically on restart:
```bash
# Restart Vault service
systemctl restart vault
# Verify unseal (should happen within seconds)
vault status
```
**If auto-unseal fails:**
- Check cloud KMS key permissions (IAM role may have been modified)
- Check network connectivity to cloud KMS endpoint
- Check KMS key status (not disabled, not scheduled for deletion)
- Check Vault logs: `journalctl -u vault -f`
## Mass Credential Rotation Procedure
When a broad compromise requires rotating many credentials simultaneously.
### Pre-Rotation Checklist
- [ ] Identify all credentials in scope
- [ ] Map credential dependencies (which services use which credentials)
- [ ] Determine rotation order (databases before applications)
- [ ] Prepare rollback plan for each credential
- [ ] Notify all service owners
- [ ] Schedule maintenance window if zero-downtime not possible
- [ ] Stage new credentials in secret store (but don't activate yet)
### Rotation Order
1. **Infrastructure credentials** — Database root passwords, cloud IAM admin keys
2. **Service credentials** — Application database users, API keys
3. **Integration credentials** — Third-party API keys, webhook secrets
4. **Human credentials** — Force password reset, revoke SSO sessions
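The dependency-mapping step from the checklist can be mechanized as a topological sort; a sketch using the standard library, with a hypothetical dependency map for illustration:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def rotation_order(depends_on):
    """Return credentials in safe rotation order.

    depends_on maps a credential to the set of credentials that must be
    rotated before it (e.g. an app DB user depends on the DB root).
    """
    return list(TopologicalSorter(depends_on).static_order())

# Hypothetical dependency map for illustration
deps = {
    "app-db-user": {"db-root"},
    "payment-api-key": set(),
    "webhook-secret": {"payment-api-key"},
}
order = rotation_order(deps)  # db-root sorts before app-db-user, etc.
```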
### Rollback Plan
For each credential, document:
- Previous value (store in sealed emergency envelope or HSM)
- How to revert (specific command or API call)
- Verification step (how to confirm old credential works)
- Maximum time to rollback (SLA)
## Vault Recovery Procedures
### Lost Unseal Keys
If unseal keys are lost and auto-unseal is not configured:
1. **If Vault is currently unsealed:** Enable auto-unseal immediately, then reseal/unseal with KMS
2. **If Vault is sealed:** Data is irrecoverable without keys. Restore from Raft snapshot backup
3. **Prevention:** Store unseal keys in separate, secure locations (HSMs, safety deposit boxes). Use auto-unseal for production.
### Raft Cluster Recovery
**Single node failure (cluster still has quorum):**
```bash
# Remove failed peer
vault operator raft remove-peer <failed-node-id>
# Add replacement node
# (new node joins via retry_join in config)
```
**Loss of quorum (majority of nodes failed):**
```bash
# On a surviving node: rewrite cluster membership with a peers.json
# recovery file in the raft data directory, then restart Vault
cat > /opt/vault/data/raft/peers.json <<'EOF'
[{"id": "vault-1", "address": "vault-1.internal:8201", "non_voter": false}]
EOF
systemctl restart vault
# If no node survives, restore from snapshot
vault operator raft snapshot restore /backups/latest.snap
```
### Root Token Recovery
If root token is lost (it should be revoked after initial setup):
```bash
# Generate new root token (requires unseal key quorum)
vault operator generate-root -init
# Each key holder provides their key
vault operator generate-root -nonce=<nonce> <unseal-key>
# After quorum, decode the encoded token
vault operator generate-root -decode=<encoded-token> -otp=<otp>
```
**Best practice:** Generate a root token only when needed, complete the task, then revoke it:
```bash
vault token revoke <root-token>
```
## Incident Severity Escalation Matrix
| Signal | Escalation |
|--------|-----------|
| Single secret exposed in internal log | P2 — Rotate secret, add log masking |
| Secret in public repository (no evidence of use) | P1 — Immediate rotation, history scrub |
| Secret in public repository (evidence of unauthorized use) | P0 — Full incident response, legal notification |
| Vault node compromised | P0 — Seal cluster, rotate all accessible secrets |
| Cloud KMS key compromised | P0 — Create new key, re-encrypt all secrets, rotate all credentials |
| Audit log gap detected | P1 — Investigate cause, assume worst case for gap period |
| Multiple failed auth attempts from unknown source | P2 — Block source, investigate, rotate targeted credentials |

# HashiCorp Vault Architecture & Patterns Reference
## Architecture Overview
Vault operates as a centralized secret management service with a client-server model. All secrets are encrypted at rest and in transit. The seal/unseal mechanism protects the master encryption key.
### Core Components
```
┌─────────────────────────────────────────────────┐
│ Vault Cluster │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Leader │ │ Standby │ │ Standby │ │
│ │ (active) │ │ (forward) │ │ (forward) │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
│ │ │ │ │
│ ┌─────┴───────────────┴───────────────┴─────┐ │
│ │ Raft Storage Backend │ │
│ └───────────────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Auth │ │ Secret │ │ Audit │ │
│ │ Methods │ │ Engines │ │ Devices │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
└─────────────────────────────────────────────────┘
```
### Storage Backend Selection
| Backend | HA Support | Operational Complexity | Recommendation |
|---------|-----------|----------------------|----------------|
| Integrated Raft | Yes | Low | **Default choice** — no external dependencies |
| Consul | Yes | Medium | Legacy — use Raft unless already running Consul |
| S3/GCS/Azure Blob | No | Low | Dev/test only — no HA |
| PostgreSQL/MySQL | No | Medium | Not recommended — no HA, added dependency |
## High Availability Setup
### Raft Cluster Configuration
Minimum 3 nodes for production (tolerates 1 failure). 5 nodes for critical workloads (tolerates 2 failures).
```hcl
# vault-config.hcl (per node)
storage "raft" {
path = "/opt/vault/data"
node_id = "vault-1"
retry_join {
leader_api_addr = "https://vault-2.internal:8200"
}
retry_join {
leader_api_addr = "https://vault-3.internal:8200"
}
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_cert_file = "/opt/vault/tls/vault.crt"
tls_key_file = "/opt/vault/tls/vault.key"
}
api_addr = "https://vault-1.internal:8200"
cluster_addr = "https://vault-1.internal:8201"
```
### Auto-Unseal with AWS KMS
Eliminates manual unseal key management. Vault encrypts its master key with the KMS key.
```hcl
seal "awskms" {
region = "us-east-1"
kms_key_id = "alias/vault-unseal"
}
```
**Requirements:**
- IAM role with `kms:Encrypt`, `kms:Decrypt`, `kms:DescribeKey` permissions
- KMS key must be in the same region or accessible cross-region
- KMS key should have restricted access — only Vault nodes
### Auto-Unseal with Azure Key Vault
```hcl
seal "azurekeyvault" {
tenant_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
vault_name = "vault-unseal-kv"
key_name = "vault-unseal-key"
}
```
### Auto-Unseal with GCP KMS
```hcl
seal "gcpckms" {
project = "my-project"
region = "global"
key_ring = "vault-keyring"
crypto_key = "vault-unseal-key"
}
```
## Namespaces (Enterprise)
Namespaces provide tenant isolation within a single Vault cluster. Each namespace has independent policies, auth methods, and secret engines.
```
root/
├── dev/ # Development environment
│ ├── auth/
│ └── secret/
├── staging/ # Staging environment
│ ├── auth/
│ └── secret/
└── production/ # Production environment
├── auth/
└── secret/
```
**OSS alternative:** Use path-based isolation with strict policies. Prefix all paths with environment name (e.g., `secret/data/production/...`).
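One way to keep path-prefix isolation consistent in OSS is to generate the per-environment policies from a template; a minimal sketch (the path layout matches the KV v2 prefixing described above, the helper name is illustrative):

```python
POLICY_TEMPLATE = '''\
path "secret/data/{env}/{service}/*" {{
  capabilities = [{caps}]
}}
'''

def render_policy(env, service, caps=("read",)):
    """Render an environment-scoped Vault policy for one service."""
    caps_str = ", ".join(f'"{c}"' for c in caps)
    return POLICY_TEMPLATE.format(env=env, service=service, caps=caps_str)
```

Generated policy text can then be written with `vault policy write <name> -`, keeping every environment's rules structurally identical.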
## Policy Patterns
### Templated Policies
Use identity-based templates for scalable policy management:
```hcl
# Allow entities to manage their own secrets
path "secret/data/{{identity.entity.name}}/*" {
capabilities = ["create", "read", "update", "delete"]
}
# Read shared config for the entity's group
path "secret/data/shared/{{identity.groups.names}}/*" {
capabilities = ["read"]
}
```
### Sentinel Policies (Enterprise)
Enforce governance rules beyond path-based access:
```python
# Require MFA for production secret writes
import "mfa"
main = rule {
request.path matches "secret/data/production/.*" and
request.operation in ["create", "update", "delete"] and
mfa.methods.totp.valid
}
```
### Policy Hierarchy
1. **Global deny** — Explicit deny on `sys/*`, `auth/token/create-orphan`
2. **Environment base** — Read access to environment-specific paths
3. **Service-specific** — Scoped to exact paths the service needs
4. **Admin override** — Requires MFA, time-limited, audit-heavy
## Secret Engine Configuration
### KV v2 (Versioned Key-Value)
```bash
# Enable with custom config
vault secrets enable -path=secret -version=2 kv
# Configure version retention
vault write secret/config max_versions=10 cas_required=true delete_version_after=90d
```
**Check-and-Set (CAS):** Prevents accidental overwrites. Client must supply the current version number to update.
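The CAS semantics can be sketched in plain Python; this is a toy model of the behavior, not the Vault API itself:

```python
class CasStore:
    """Toy model of KV v2 check-and-set: writes must name the current version."""

    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value, cas):
        if cas != self.version:  # stale writer: reject instead of overwriting
            raise ValueError(f"CAS mismatch: expected {self.version}, got {cas}")
        self.version += 1
        self.value = value
        return self.version
```

A writer that read version 1 but tries to write after another writer bumped the store to version 2 is rejected, which is exactly the lost-update race `cas_required=true` prevents.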
### Database Engine
```bash
# Enable and configure PostgreSQL
vault secrets enable database
vault write database/config/postgres \
plugin_name=postgresql-database-plugin \
connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/app?sslmode=require" \
allowed_roles="app-readonly,app-readwrite" \
username="vault_admin" \
password="INITIAL_PASSWORD"
# Rotate the root password (Vault manages it from now on)
vault write -f database/rotate-root/postgres
# Create a read-only role
vault write database/roles/app-readonly \
db_name=postgres \
creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
revocation_statements="DROP ROLE IF EXISTS \"{{name}}\";" \
default_ttl=1h \
max_ttl=24h
```
### PKI Engine (Certificate Authority)
```bash
# Enable PKI engine
vault secrets enable -path=pki pki
vault secrets tune -max-lease-ttl=87600h pki
# Generate root CA
vault write -field=certificate pki/root/generate/internal \
common_name="Example Root CA" \
ttl=87600h > root_ca.crt
# Enable intermediate CA
vault secrets enable -path=pki_int pki
vault secrets tune -max-lease-ttl=43800h pki_int
# Generate intermediate CSR
vault write -field=csr pki_int/intermediate/generate/internal \
common_name="Example Intermediate CA" > intermediate.csr
# Sign with root CA
vault write -field=certificate pki/root/sign-intermediate \
csr=@intermediate.csr format=pem_bundle ttl=43800h > intermediate.crt
# Set signed certificate
vault write pki_int/intermediate/set-signed certificate=@intermediate.crt
# Create role for leaf certificates
vault write pki_int/roles/web-server \
allowed_domains="example.com" \
allow_subdomains=true \
max_ttl=2160h
```
### Transit Engine (Encryption-as-a-Service)
```bash
vault secrets enable transit
# Create encryption key
vault write -f transit/keys/payment-data \
type=aes256-gcm96
# Encrypt data (echo -n avoids including a trailing newline in the plaintext)
vault write transit/encrypt/payment-data \
    plaintext=$(echo -n "sensitive-data" | base64)
# Decrypt data
vault write transit/decrypt/payment-data \
ciphertext="vault:v1:..."
# Rotate key (old versions still decrypt, new encrypts with latest)
vault write -f transit/keys/payment-data/rotate
# Rewrap ciphertext to latest key version
vault write transit/rewrap/payment-data \
ciphertext="vault:v1:..."
```
## Performance and Scaling
### Performance Replication (Enterprise)
Primary cluster replicates to secondary clusters in other regions. Secondaries handle read traffic locally.
### Performance Standbys (Enterprise)
Standby nodes serve read requests without forwarding to the leader, reducing leader load.
### Response Wrapping
Wrap sensitive responses in a single-use token — the recipient unwraps exactly once:
```bash
# Wrap a secret (TTL = 5 minutes; `vault kv get` adds the data/ prefix itself)
vault kv get -wrap-ttl=5m secret/production/db-creds
# Recipient unwraps
vault unwrap <wrapping_token>
```
### Batch Tokens
For high-throughput workloads (Lambda, serverless), use batch tokens instead of service tokens. Batch tokens are not persisted to storage, reducing I/O.
## Monitoring and Health
### Key Metrics
| Metric | Alert Threshold | Source |
|--------|----------------|--------|
| `vault.core.unsealed` | 0 (sealed) | Telemetry |
| `vault.expire.num_leases` | >10,000 | Telemetry |
| `vault.audit.log_response` | Error rate >1% | Telemetry |
| `vault.runtime.alloc_bytes` | >80% memory | Telemetry |
| `vault.raft.leader.lastContact` | >500ms | Telemetry |
| `vault.token.count` | >50,000 | Telemetry |
### Health Check Endpoint
```bash
# Returns 200 if initialized, unsealed, and active
curl -s https://vault.internal:8200/v1/sys/health
# Status codes:
# 200 — initialized, unsealed, active
# 429 — unsealed, standby
# 472 — disaster recovery secondary
# 473 — performance standby
# 501 — not initialized
# 503 — sealed
```
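For monitoring, the status codes above can be mapped to a label and a healthy/unhealthy verdict; a small sketch of that mapping:

```python
# Maps /v1/sys/health status codes (documented above) to a label and
# whether a monitoring check should treat the node as serviceable.
HEALTH_STATES = {
    200: ("active", True),
    429: ("standby", True),
    472: ("dr-secondary", True),
    473: ("performance-standby", True),
    501: ("uninitialized", False),
    503: ("sealed", False),
}

def interpret_health(status_code):
    """Return (label, healthy) for a /v1/sys/health response code."""
    return HEALTH_STATES.get(status_code, ("unknown", False))
```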
## Disaster Recovery
### Backup
```bash
# Raft snapshot (includes all data)
vault operator raft snapshot save backup-$(date +%Y%m%d).snap
# Schedule daily backups via cron
0 2 * * * /usr/local/bin/vault operator raft snapshot save /backups/vault-$(date +\%Y\%m\%d).snap
```
### Restore
```bash
# Restore from snapshot (causes brief outage)
vault operator raft snapshot restore backup-20260320.snap
```
### DR Replication (Enterprise)
Secondary cluster in standby. Promote on primary failure:
```bash
# On DR secondary: create a DR operation token (requires unseal key quorum)
vault operator generate-root -dr-token -init
# Each key holder then runs: vault operator generate-root -dr-token -nonce=<nonce> <unseal-key>
vault write sys/replication/dr/secondary/promote dr_operation_token=<token>
```

#!/usr/bin/env python3
"""Analyze Vault or cloud secret manager audit logs for anomalies.
Reads JSON-lines or JSON-array audit log files and flags unusual access
patterns including volume spikes, off-hours access, new source IPs,
and failed authentication attempts.
Usage:
python audit_log_analyzer.py --log-file vault-audit.log --threshold 5
python audit_log_analyzer.py --log-file audit.json --threshold 3 --json
Expected log entry format (JSON lines or JSON array):
{
"timestamp": "2026-03-20T14:32:00Z",
"type": "request",
"auth": {"accessor": "token-abc123", "entity_id": "eid-001", "display_name": "approle-payment-svc"},
"request": {"path": "secret/data/production/payment/api-keys", "operation": "read"},
"response": {"status_code": 200},
"remote_address": "10.0.1.15"
}
Fields are optional — the analyzer works with whatever is available.
"""
import argparse
import json
import sys
import textwrap
from collections import defaultdict
from datetime import datetime
def load_logs(path):
"""Load audit log entries from file. Supports JSON lines and JSON array."""
entries = []
try:
with open(path, "r") as f:
content = f.read().strip()
except FileNotFoundError:
print(f"ERROR: Log file not found: {path}", file=sys.stderr)
sys.exit(1)
if not content:
return entries
# Try JSON array first
if content.startswith("["):
try:
entries = json.loads(content)
return entries
except json.JSONDecodeError:
pass
# Try JSON lines
for i, line in enumerate(content.split("\n"), 1):
line = line.strip()
if not line:
continue
try:
entries.append(json.loads(line))
except json.JSONDecodeError:
print(f"WARNING: Skipping malformed line {i}", file=sys.stderr)
return entries
def extract_fields(entry):
"""Extract normalized fields from a log entry."""
timestamp_raw = entry.get("timestamp", entry.get("time", ""))
    ts = None
    if timestamp_raw:
        for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S"):
            try:
                ts = datetime.strptime(timestamp_raw, fmt)
                break
            except (ValueError, TypeError):
                continue
    if ts is None and timestamp_raw:
        # Fallback: ISO 8601 parse (fromisoformat rejects a bare "Z" before 3.11)
        try:
            ts = datetime.fromisoformat(timestamp_raw.replace("Z", "+00:00"))
        except (ValueError, TypeError):
            pass
auth = entry.get("auth", {})
request = entry.get("request", {})
response = entry.get("response", {})
return {
"timestamp": ts,
"hour": ts.hour if ts else None,
"identity": auth.get("display_name", auth.get("entity_id", "unknown")),
"path": request.get("path", entry.get("path", "unknown")),
"operation": request.get("operation", entry.get("operation", "unknown")),
"status_code": response.get("status_code", entry.get("status_code")),
"remote_address": entry.get("remote_address", entry.get("source_address", "unknown")),
"entry_type": entry.get("type", "unknown"),
}
def analyze(entries, threshold):
"""Run anomaly detection across all log entries."""
parsed = [extract_fields(e) for e in entries]
# Counters
access_by_identity = defaultdict(int)
access_by_path = defaultdict(int)
access_by_ip = defaultdict(set) # identity -> set of IPs
ip_to_identities = defaultdict(set) # IP -> set of identities
failed_by_source = defaultdict(int)
off_hours_access = []
path_by_identity = defaultdict(set) # identity -> set of paths
hourly_distribution = defaultdict(int)
for p in parsed:
identity = p["identity"]
path = p["path"]
ip = p["remote_address"]
status = p["status_code"]
hour = p["hour"]
access_by_identity[identity] += 1
access_by_path[path] += 1
access_by_ip[identity].add(ip)
ip_to_identities[ip].add(identity)
path_by_identity[identity].add(path)
if hour is not None:
hourly_distribution[hour] += 1
        # Failed access (4xx/5xx, or explicit 0 used by some log exporters)
        if status is not None and (status >= 400 or status == 0):
failed_by_source[f"{identity}@{ip}"] += 1
# Off-hours: before 6 AM or after 10 PM
if hour is not None and (hour < 6 or hour >= 22):
off_hours_access.append(p)
# Build anomalies
anomalies = []
# 1. Volume spikes — identities accessing secrets more than threshold * average
if access_by_identity:
avg_access = sum(access_by_identity.values()) / len(access_by_identity)
spike_threshold = max(threshold * avg_access, threshold)
for identity, count in access_by_identity.items():
if count >= spike_threshold:
anomalies.append({
"type": "volume_spike",
"severity": "HIGH",
"identity": identity,
"access_count": count,
"threshold": round(spike_threshold, 1),
"description": f"Identity '{identity}' made {count} accesses (threshold: {round(spike_threshold, 1)})",
})
# 2. Multi-IP access — single identity from many IPs
for identity, ips in access_by_ip.items():
if len(ips) >= threshold:
anomalies.append({
"type": "multi_ip_access",
"severity": "MEDIUM",
"identity": identity,
"ip_count": len(ips),
"ips": sorted(ips),
"description": f"Identity '{identity}' accessed from {len(ips)} different IPs",
})
# 3. Failed access attempts
for source, count in failed_by_source.items():
if count >= threshold:
anomalies.append({
"type": "failed_access",
"severity": "HIGH",
"source": source,
"failure_count": count,
"description": f"Source '{source}' had {count} failed access attempts",
})
# 4. Off-hours access
if off_hours_access:
off_hours_identities = defaultdict(int)
for p in off_hours_access:
off_hours_identities[p["identity"]] += 1
for identity, count in off_hours_identities.items():
if count >= max(threshold, 2):
anomalies.append({
"type": "off_hours_access",
"severity": "MEDIUM",
"identity": identity,
"access_count": count,
"description": f"Identity '{identity}' made {count} accesses outside business hours (before 6 AM / after 10 PM)",
})
# 5. Broad path access — single identity touching many paths
for identity, paths in path_by_identity.items():
if len(paths) >= threshold * 2:
anomalies.append({
"type": "broad_access",
"severity": "MEDIUM",
"identity": identity,
"path_count": len(paths),
"paths": sorted(paths)[:10],
"description": f"Identity '{identity}' accessed {len(paths)} distinct secret paths",
})
# Sort anomalies by severity
severity_order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
anomalies.sort(key=lambda x: severity_order.get(x["severity"], 4))
# Summary stats
summary = {
"total_entries": len(entries),
"parsed_entries": len(parsed),
"unique_identities": len(access_by_identity),
"unique_paths": len(access_by_path),
"unique_source_ips": len(ip_to_identities),
"total_failures": sum(failed_by_source.values()),
"off_hours_events": len(off_hours_access),
"anomalies_found": len(anomalies),
}
# Top accessed paths
top_paths = sorted(access_by_path.items(), key=lambda x: -x[1])[:10]
return {
"summary": summary,
"anomalies": anomalies,
"top_accessed_paths": [{"path": p, "count": c} for p, c in top_paths],
"hourly_distribution": dict(sorted(hourly_distribution.items())),
}
def print_human(result, threshold):
"""Print human-readable analysis report."""
summary = result["summary"]
anomalies = result["anomalies"]
print("=== Audit Log Analysis Report ===")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"Anomaly threshold: {threshold}")
print()
print("--- Summary ---")
print(f" Total log entries: {summary['total_entries']}")
print(f" Unique identities: {summary['unique_identities']}")
print(f" Unique secret paths: {summary['unique_paths']}")
print(f" Unique source IPs: {summary['unique_source_ips']}")
print(f" Total failures: {summary['total_failures']}")
print(f" Off-hours events: {summary['off_hours_events']}")
print(f" Anomalies detected: {summary['anomalies_found']}")
print()
if anomalies:
print("--- Anomalies ---")
for i, a in enumerate(anomalies, 1):
print(f" [{a['severity']}] {a['type']}: {a['description']}")
print()
else:
print("--- No anomalies detected ---")
print()
if result["top_accessed_paths"]:
print("--- Top Accessed Paths ---")
for item in result["top_accessed_paths"]:
print(f" {item['count']:5d} {item['path']}")
print()
if result["hourly_distribution"]:
print("--- Hourly Distribution ---")
max_count = max(result["hourly_distribution"].values()) if result["hourly_distribution"] else 1
for hour in range(24):
count = result["hourly_distribution"].get(hour, 0)
bar_len = int((count / max_count) * 40) if max_count > 0 else 0
marker = " *" if (hour < 6 or hour >= 22) else ""
print(f" {hour:02d}:00 {'#' * bar_len:40s} {count}{marker}")
print(" (* = off-hours)")
def main():
parser = argparse.ArgumentParser(
description="Analyze Vault/cloud secret manager audit logs for anomalies.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=textwrap.dedent("""\
The analyzer detects:
- Volume spikes (identity accessing secrets above threshold * average)
- Multi-IP access (single identity from many source IPs)
- Failed access attempts (repeated auth/access failures)
- Off-hours access (before 6 AM or after 10 PM)
- Broad path access (single identity accessing many distinct paths)
Log format: JSON lines or JSON array. Each entry should include
timestamp, auth info, request path/operation, response status,
and remote address. Missing fields are handled gracefully.
Examples:
%(prog)s --log-file vault-audit.log --threshold 5
%(prog)s --log-file audit.json --threshold 3 --json
"""),
)
parser.add_argument("--log-file", required=True, help="Path to audit log file (JSON lines or JSON array)")
parser.add_argument(
"--threshold",
type=int,
default=5,
help="Anomaly sensitivity threshold — lower = more sensitive (default: 5)",
)
parser.add_argument("--json", action="store_true", dest="json_output", help="Output as JSON")
args = parser.parse_args()
entries = load_logs(args.log_file)
if not entries:
print("No log entries found in file.", file=sys.stderr)
sys.exit(1)
result = analyze(entries, args.threshold)
result["log_file"] = args.log_file
result["threshold"] = args.threshold
result["analyzed_at"] = datetime.now().isoformat()
if args.json_output:
print(json.dumps(result, indent=2))
else:
print_human(result, args.threshold)
if __name__ == "__main__":
main()


@@ -0,0 +1,280 @@
#!/usr/bin/env python3
"""Create a rotation schedule from a secret inventory file.
Reads a JSON inventory of secrets and produces a rotation plan based on
the selected policy (30d, 60d, 90d) with urgency classification.
Usage:
python rotation_planner.py --inventory secrets.json --policy 30d
python rotation_planner.py --inventory secrets.json --policy 90d --json
Inventory file format (JSON):
[
{
"name": "prod-db-password",
"type": "database",
"store": "vault",
"last_rotated": "2026-01-15",
"owner": "platform-team",
"environment": "production"
},
...
]
"""
import argparse
import json
import sys
import textwrap
from datetime import datetime, timedelta
POLICY_DAYS = {
"30d": 30,
"60d": 60,
"90d": 90,
}
# Default rotation period by secret type if not overridden by policy
TYPE_DEFAULTS = {
"database": 30,
"api-key": 90,
"tls-certificate": 60,
"ssh-key": 90,
"service-token": 1,
"encryption-key": 90,
"oauth-secret": 90,
"password": 30,
}
URGENCY_THRESHOLDS = {
"critical": 0, # Already overdue
"high": 7, # Due within 7 days
"medium": 14, # Due within 14 days
"low": 30, # Due within 30 days
}
def load_inventory(path):
"""Load and validate secret inventory from JSON file."""
try:
with open(path, "r") as f:
data = json.load(f)
except FileNotFoundError:
print(f"ERROR: Inventory file not found: {path}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"ERROR: Invalid JSON in {path}: {e}", file=sys.stderr)
sys.exit(1)
if not isinstance(data, list):
print("ERROR: Inventory must be a JSON array of secret objects", file=sys.stderr)
sys.exit(1)
validated = []
for i, entry in enumerate(data):
if not isinstance(entry, dict):
print(f"WARNING: Skipping entry {i} — not an object", file=sys.stderr)
continue
name = entry.get("name", f"unnamed-{i}")
secret_type = entry.get("type", "unknown")
last_rotated = entry.get("last_rotated")
if not last_rotated:
print(f"WARNING: '{name}' has no last_rotated date — marking as overdue", file=sys.stderr)
last_rotated_dt = None
else:
try:
last_rotated_dt = datetime.strptime(last_rotated, "%Y-%m-%d")
except ValueError:
print(f"WARNING: '{name}' has invalid date '{last_rotated}' — marking as overdue", file=sys.stderr)
last_rotated_dt = None
validated.append({
"name": name,
"type": secret_type,
"store": entry.get("store", "unknown"),
"last_rotated": last_rotated_dt,
"owner": entry.get("owner", "unassigned"),
"environment": entry.get("environment", "unknown"),
})
return validated
def compute_schedule(inventory, policy_days):
"""Compute rotation schedule for each secret."""
now = datetime.now()
schedule = []
for secret in inventory:
# Determine rotation interval
type_default = TYPE_DEFAULTS.get(secret["type"], 90)
rotation_interval = min(policy_days, type_default)
if secret["last_rotated"] is None:
days_since = 999
next_rotation = now # Immediate
days_until = -999
else:
days_since = (now - secret["last_rotated"]).days
next_rotation = secret["last_rotated"] + timedelta(days=rotation_interval)
days_until = (next_rotation - now).days
# Classify urgency
if days_until <= URGENCY_THRESHOLDS["critical"]:
urgency = "CRITICAL"
elif days_until <= URGENCY_THRESHOLDS["high"]:
urgency = "HIGH"
elif days_until <= URGENCY_THRESHOLDS["medium"]:
urgency = "MEDIUM"
else:
urgency = "LOW"
schedule.append({
"name": secret["name"],
"type": secret["type"],
"store": secret["store"],
"owner": secret["owner"],
"environment": secret["environment"],
"last_rotated": secret["last_rotated"].strftime("%Y-%m-%d") if secret["last_rotated"] else "NEVER",
"rotation_interval_days": rotation_interval,
"next_rotation": next_rotation.strftime("%Y-%m-%d"),
"days_until_due": days_until,
"days_since_rotation": days_since,
"urgency": urgency,
})
# Sort by urgency (critical first), then by days until due
urgency_order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
schedule.sort(key=lambda x: (urgency_order.get(x["urgency"], 4), x["days_until_due"]))
return schedule
def build_summary(schedule):
"""Build summary statistics."""
total = len(schedule)
by_urgency = {}
by_type = {}
by_owner = {}
for entry in schedule:
urg = entry["urgency"]
by_urgency[urg] = by_urgency.get(urg, 0) + 1
t = entry["type"]
by_type[t] = by_type.get(t, 0) + 1
o = entry["owner"]
by_owner[o] = by_owner.get(o, 0) + 1
return {
"total_secrets": total,
"by_urgency": by_urgency,
"by_type": by_type,
"by_owner": by_owner,
"overdue_count": by_urgency.get("CRITICAL", 0),
"due_within_7d": by_urgency.get("HIGH", 0),
}
def print_human(schedule, summary, policy):
"""Print human-readable rotation plan."""
print(f"=== Secret Rotation Plan (Policy: {policy}) ===")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"Total secrets: {summary['total_secrets']}")
print()
print("--- Urgency Summary ---")
for urg in ["CRITICAL", "HIGH", "MEDIUM", "LOW"]:
count = summary["by_urgency"].get(urg, 0)
if count > 0:
print(f" {urg:10s} {count}")
print()
if not schedule:
print("No secrets in inventory.")
return
print("--- Rotation Schedule ---")
print(f" {'Name':30s} {'Type':15s} {'Urgency':10s} {'Last Rotated':12s} {'Next Due':12s} {'Owner'}")
print(f" {'-'*30} {'-'*15} {'-'*10} {'-'*12} {'-'*12} {'-'*15}")
for entry in schedule:
overdue_marker = " **OVERDUE**" if entry["urgency"] == "CRITICAL" else ""
print(
f" {entry['name']:30s} {entry['type']:15s} {entry['urgency']:10s} "
f"{entry['last_rotated']:12s} {entry['next_rotation']:12s} "
f"{entry['owner']}{overdue_marker}"
)
print()
print("--- Action Items ---")
critical = [e for e in schedule if e["urgency"] == "CRITICAL"]
high = [e for e in schedule if e["urgency"] == "HIGH"]
if critical:
print(f" IMMEDIATE: Rotate {len(critical)} overdue secret(s):")
for e in critical:
print(f" - {e['name']} ({e['type']}, owner: {e['owner']})")
if high:
print(f" THIS WEEK: Rotate {len(high)} secret(s) due within 7 days:")
for e in high:
print(f" - {e['name']} (due: {e['next_rotation']}, owner: {e['owner']})")
if not critical and not high:
print(" No urgent rotations needed.")
def main():
parser = argparse.ArgumentParser(
description="Create rotation schedule from a secret inventory file.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=textwrap.dedent("""\
Policies:
30d Aggressive — all secrets rotate within 30 days max
60d Standard — 60-day maximum rotation window
90d Relaxed — 90-day maximum rotation window
Note: Some secret types (e.g., database passwords) have shorter
built-in defaults that override the policy maximum.
Example inventory file (secrets.json):
[
{"name": "prod-db", "type": "database", "store": "vault",
"last_rotated": "2026-01-15", "owner": "platform-team",
"environment": "production"}
]
"""),
)
parser.add_argument("--inventory", required=True, help="Path to JSON inventory file")
parser.add_argument(
"--policy",
required=True,
choices=["30d", "60d", "90d"],
help="Rotation policy (maximum rotation interval)",
)
parser.add_argument("--json", action="store_true", dest="json_output", help="Output as JSON")
args = parser.parse_args()
policy_days = POLICY_DAYS[args.policy]
inventory = load_inventory(args.inventory)
schedule = compute_schedule(inventory, policy_days)
summary = build_summary(schedule)
result = {
"policy": args.policy,
"policy_days": policy_days,
"generated_at": datetime.now().isoformat(),
"summary": summary,
"schedule": schedule,
}
if args.json_output:
print(json.dumps(result, indent=2))
else:
print_human(schedule, summary, args.policy)
if __name__ == "__main__":
main()


@@ -0,0 +1,302 @@
#!/usr/bin/env python3
"""Generate Vault policy and auth configuration from application requirements.
Produces HCL policy files and auth method setup commands for HashiCorp Vault
based on application name, auth method, and required secret paths.
Usage:
python vault_config_generator.py --app-name payment-service --auth-method approle --secrets "db-creds,api-key,tls-cert"
python vault_config_generator.py --app-name api-gateway --auth-method kubernetes --secrets "db-creds" --namespace production --json
"""
import argparse
import json
import sys
import textwrap
from datetime import datetime
# Default TTLs by auth method
AUTH_METHOD_DEFAULTS = {
"approle": {
"token_ttl": "1h",
"token_max_ttl": "4h",
"secret_id_num_uses": 1,
"secret_id_ttl": "10m",
},
"kubernetes": {
"token_ttl": "1h",
"token_max_ttl": "4h",
},
"oidc": {
"token_ttl": "8h",
"token_max_ttl": "12h",
},
}
# Secret type templates
SECRET_TYPE_MAP = {
"db-creds": {
"engine": "database",
"path": "database/creds/{app}-readonly",
"capabilities": ["read"],
"description": "Dynamic database credentials",
},
"db-admin": {
"engine": "database",
"path": "database/creds/{app}-readwrite",
"capabilities": ["read"],
"description": "Dynamic database admin credentials",
},
"api-key": {
"engine": "kv-v2",
"path": "secret/data/{env}/{app}/api-keys",
"capabilities": ["read"],
"description": "Static API keys (KV v2)",
},
"tls-cert": {
"engine": "pki",
"path": "pki/issue/{app}-cert",
"capabilities": ["create", "update"],
"description": "TLS certificate issuance",
},
"encryption": {
"engine": "transit",
"path": "transit/encrypt/{app}-key",
"capabilities": ["update"],
"description": "Transit encryption operations",
},
"ssh-cert": {
"engine": "ssh",
"path": "ssh/sign/{app}-role",
"capabilities": ["create", "update"],
"description": "SSH certificate signing",
},
"config": {
"engine": "kv-v2",
"path": "secret/data/{env}/{app}/config",
"capabilities": ["read"],
"description": "Application configuration secrets",
},
}
def parse_secrets(secrets_str):
"""Parse comma-separated secret types into list."""
secrets = [s.strip() for s in secrets_str.split(",") if s.strip()]
valid = []
unknown = []
for s in secrets:
if s in SECRET_TYPE_MAP:
valid.append(s)
else:
unknown.append(s)
return valid, unknown
def generate_policy_hcl(app_name, secrets, environment="production"):
"""Generate HCL policy document."""
lines = [
f'# Vault policy for {app_name}',
f'# Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
f'# Environment: {environment}',
'',
]
for secret_type in secrets:
tmpl = SECRET_TYPE_MAP[secret_type]
path = tmpl["path"].format(app=app_name, env=environment)
caps = ", ".join(f'"{c}"' for c in tmpl["capabilities"])
lines.append(f'# {tmpl["description"]}')
lines.append(f'path "{path}" {{')
lines.append(f' capabilities = [{caps}]')
lines.append('}')
lines.append('')
# Always deny sys paths
lines.append('# Deny admin paths')
lines.append('path "sys/*" {')
lines.append(' capabilities = ["deny"]')
lines.append('}')
return "\n".join(lines)
def generate_auth_config(app_name, auth_method, policy_name, namespace=None):
"""Generate auth method setup commands."""
commands = []
defaults = AUTH_METHOD_DEFAULTS.get(auth_method, {})
if auth_method == "approle":
cmd = (
f"vault write auth/approle/role/{app_name} \\\n"
f" token_ttl={defaults['token_ttl']} \\\n"
f" token_max_ttl={defaults['token_max_ttl']} \\\n"
f" secret_id_num_uses={defaults['secret_id_num_uses']} \\\n"
f" secret_id_ttl={defaults['secret_id_ttl']} \\\n"
f" token_policies=\"{policy_name}\""
)
commands.append({"description": f"Create AppRole for {app_name}", "command": cmd})
commands.append({
"description": "Fetch RoleID",
"command": f"vault read auth/approle/role/{app_name}/role-id",
})
commands.append({
"description": "Generate SecretID (single-use)",
"command": f"vault write -f auth/approle/role/{app_name}/secret-id",
})
elif auth_method == "kubernetes":
ns = namespace or "default"
cmd = (
f"vault write auth/kubernetes/role/{app_name} \\\n"
f" bound_service_account_names={app_name} \\\n"
f" bound_service_account_namespaces={ns} \\\n"
f" policies={policy_name} \\\n"
f" ttl={defaults['token_ttl']}"
)
commands.append({"description": f"Create Kubernetes auth role for {app_name}", "command": cmd})
elif auth_method == "oidc":
cmd = (
f"vault write auth/oidc/role/{app_name} \\\n"
f" bound_audiences=\"vault\" \\\n"
f" allowed_redirect_uris=\"https://vault.example.com/ui/vault/auth/oidc/oidc/callback\" \\\n"
f" user_claim=\"email\" \\\n"
f" oidc_scopes=\"openid,profile,email\" \\\n"
f" policies=\"{policy_name}\" \\\n"
f" ttl={defaults['token_ttl']}"
)
commands.append({"description": f"Create OIDC role for {app_name}", "command": cmd})
return commands
def build_output(app_name, auth_method, secrets, environment, namespace):
"""Build complete configuration output."""
valid_secrets, unknown_secrets = parse_secrets(secrets)
if not valid_secrets:
return {
"error": "No valid secret types provided",
"unknown": unknown_secrets,
"available_types": list(SECRET_TYPE_MAP.keys()),
}
policy_name = f"{app_name}-policy"
policy_hcl = generate_policy_hcl(app_name, valid_secrets, environment)
auth_commands = generate_auth_config(app_name, auth_method, policy_name, namespace)
secret_details = []
for s in valid_secrets:
tmpl = SECRET_TYPE_MAP[s]
secret_details.append({
"type": s,
"engine": tmpl["engine"],
"path": tmpl["path"].format(app=app_name, env=environment),
"capabilities": tmpl["capabilities"],
"description": tmpl["description"],
})
result = {
"app_name": app_name,
"auth_method": auth_method,
"environment": environment,
"policy_name": policy_name,
"policy_hcl": policy_hcl,
"auth_commands": auth_commands,
"secrets": secret_details,
"generated_at": datetime.now().isoformat(),
}
if unknown_secrets:
result["warnings"] = [f"Unknown secret type '{u}' — skipped. Available: {list(SECRET_TYPE_MAP.keys())}" for u in unknown_secrets]
if namespace:
result["namespace"] = namespace
return result
def print_human(result):
"""Print human-readable output."""
if "error" in result:
print(f"ERROR: {result['error']}")
if result.get("unknown"):
print(f" Unknown types: {', '.join(result['unknown'])}")
print(f" Available types: {', '.join(result['available_types'])}")
sys.exit(1)
print(f"=== Vault Configuration for {result['app_name']} ===")
print(f"Auth Method: {result['auth_method']}")
print(f"Environment: {result['environment']}")
print(f"Policy Name: {result['policy_name']}")
print()
if result.get("warnings"):
for w in result["warnings"]:
print(f"WARNING: {w}")
print()
print("--- Policy HCL ---")
print(result["policy_hcl"])
print()
print(f"Write policy: vault policy write {result['policy_name']} {result['policy_name']}.hcl")
print()
print("--- Auth Method Setup ---")
for cmd_info in result["auth_commands"]:
print(f"# {cmd_info['description']}")
print(cmd_info["command"])
print()
print("--- Secret Paths ---")
for s in result["secrets"]:
caps = ", ".join(s["capabilities"])
print(f" {s['type']:15s} {s['path']:50s} [{caps}]")
def main():
parser = argparse.ArgumentParser(
description="Generate Vault policy and auth configuration from application requirements.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=textwrap.dedent("""\
Secret types:
db-creds Dynamic database credentials (read-only)
db-admin Dynamic database credentials (read-write)
api-key Static API keys in KV v2
tls-cert TLS certificate issuance via PKI
encryption Transit encryption-as-a-service
ssh-cert SSH certificate signing
config Application configuration secrets
Examples:
%(prog)s --app-name payment-svc --auth-method approle --secrets "db-creds,api-key"
%(prog)s --app-name api-gw --auth-method kubernetes --secrets "db-creds,config" --namespace prod --json
"""),
)
parser.add_argument("--app-name", required=True, help="Application or service name")
parser.add_argument(
"--auth-method",
required=True,
choices=["approle", "kubernetes", "oidc"],
help="Vault auth method to configure",
)
parser.add_argument("--secrets", required=True, help="Comma-separated secret types (e.g., db-creds,api-key,tls-cert)")
parser.add_argument("--environment", default="production", help="Target environment (default: production)")
parser.add_argument("--namespace", help="Kubernetes namespace (for kubernetes auth method)")
parser.add_argument("--json", action="store_true", dest="json_output", help="Output as JSON")
args = parser.parse_args()
result = build_output(args.app_name, args.auth_method, args.secrets, args.environment, args.namespace)
if args.json_output:
print(json.dumps(result, indent=2))
else:
print_human(result)
if __name__ == "__main__":
main()


@@ -0,0 +1,457 @@
---
name: "sql-database-assistant"
description: "Use when the user asks to write SQL queries, optimize database performance, generate migrations, explore database schemas, or work with ORMs like Prisma, Drizzle, TypeORM, or SQLAlchemy."
---
# SQL Database Assistant - POWERFUL Tier Skill
## Overview
The operational companion to database design. While **database-designer** focuses on schema architecture and **database-schema-designer** handles ERD modeling, this skill covers the day-to-day: writing queries, optimizing performance, generating migrations, and bridging the gap between application code and database engines.
### Core Capabilities
- **Natural Language to SQL** — translate requirements into correct, performant queries
- **Schema Exploration** — introspect live databases across PostgreSQL, MySQL, SQLite, SQL Server
- **Query Optimization** — EXPLAIN analysis, index recommendations, N+1 detection, rewrite patterns
- **Migration Generation** — up/down scripts, zero-downtime strategies, rollback plans
- **ORM Integration** — Prisma, Drizzle, TypeORM, SQLAlchemy patterns and escape hatches
- **Multi-Database Support** — dialect-aware SQL with compatibility guidance
### Tools
| Script | Purpose |
|--------|---------|
| `scripts/query_optimizer.py` | Static analysis of SQL queries for performance issues |
| `scripts/migration_generator.py` | Generate migration file templates from change descriptions |
| `scripts/schema_explorer.py` | Generate schema documentation from introspection queries |
---
## Natural Language to SQL
### Translation Patterns
When converting requirements to SQL, follow this sequence:
1. **Identify entities** — map nouns to tables
2. **Identify relationships** — map verbs to JOINs or subqueries
3. **Identify filters** — map adjectives/conditions to WHERE clauses
4. **Identify aggregations** — map "total", "average", "count" to GROUP BY
5. **Identify ordering** — map "top", "latest", "highest" to ORDER BY + LIMIT
### Common Query Templates
**Top-N per group (window function)**
```sql
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
FROM employees
) ranked WHERE rn <= 3;
```
**Running totals**
```sql
SELECT date, amount,
SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;
```
**Gap detection**
```sql
SELECT curr.id, curr.seq_num, prev.seq_num AS prev_seq
FROM records curr
LEFT JOIN records prev ON prev.seq_num = curr.seq_num - 1
WHERE prev.id IS NULL AND curr.seq_num > 1;
```
**UPSERT (PostgreSQL)**
```sql
INSERT INTO settings (key, value, updated_at)
VALUES ('theme', 'dark', NOW())
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value, updated_at = EXCLUDED.updated_at;
```
**UPSERT (MySQL)**
```sql
INSERT INTO settings (key_name, value, updated_at)
VALUES ('theme', 'dark', NOW())
ON DUPLICATE KEY UPDATE value = VALUES(value), updated_at = VALUES(updated_at);
```
> See references/query_patterns.md for JOINs, CTEs, window functions, JSON operations, and more.
---
## Schema Exploration
### Introspection Queries
**PostgreSQL — list tables and columns**
```sql
SELECT table_name, column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```
**PostgreSQL — foreign keys**
```sql
SELECT tc.table_name, kcu.column_name,
ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
```
**MySQL — table sizes**
```sql
SELECT table_name, table_rows,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_length DESC;
```
**SQLite — schema dump**
```sql
SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name;
```
**SQL Server — columns with types**
```sql
SELECT t.name AS table_name, c.name AS column_name,
ty.name AS data_type, c.max_length, c.is_nullable
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
ORDER BY t.name, c.column_id;
```
### Generating Documentation from Schema
Use `scripts/schema_explorer.py` to produce markdown or JSON documentation:
```bash
python scripts/schema_explorer.py --dialect postgres --tables all --format md
python scripts/schema_explorer.py --dialect mysql --tables users,orders --format json --json
```
---
## Query Optimization
### EXPLAIN Analysis Workflow
1. **Run EXPLAIN ANALYZE** (PostgreSQL) or **EXPLAIN FORMAT=JSON** (MySQL)
2. **Identify the costliest node** — Seq Scan on large tables, Nested Loop with high row estimates
3. **Check for missing indexes** — sequential scans on filtered columns
4. **Look for estimation errors** — planned vs actual rows divergence signals stale statistics
5. **Evaluate JOIN order** — ensure the smallest result set drives the join
### Index Recommendation Checklist
- Columns in WHERE clauses with high selectivity
- Columns in JOIN conditions (foreign keys)
- Columns in ORDER BY when combined with LIMIT
- Composite indexes matching multi-column WHERE predicates (most selective column first)
- Partial indexes for queries with constant filters (e.g., `WHERE status = 'active'`)
- Covering indexes to avoid table lookups for read-heavy queries
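As a quick sanity check of the checklist above, this SQLite sketch (table and column names are illustrative) shows the query plan flipping from a full scan to an index search once the filtered column is indexed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("pending" if i % 5 else "shipped", float(i)) for i in range(1000)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the plan node
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT id, total FROM orders WHERE status = 'pending'"
before = plan(query)  # full table scan: no index on status yet

conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
after = plan(query)   # index search via idx_orders_status

print(before, after)
```

Other engines expose the same signal under different names, e.g. `Seq Scan` vs `Index Scan` in PostgreSQL's `EXPLAIN`.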
### Query Rewriting Patterns
| Anti-Pattern | Rewrite |
|-------------|---------|
| `SELECT * FROM orders` | `SELECT id, status, total FROM orders` (explicit columns) |
| `WHERE YEAR(created_at) = 2025` | `WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'` (sargable) |
| Correlated subquery in SELECT | LEFT JOIN with aggregation |
| `NOT IN (SELECT ...)` with NULLs | `NOT EXISTS (SELECT 1 ...)` |
| `UNION` (dedup) when not needed | `UNION ALL` |
| `LIKE '%search%'` | Full-text search index (GIN/FULLTEXT) |
| `ORDER BY RAND()` | Application-side random sampling or `TABLESAMPLE` |
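The sargable rewrite in the table is easy to verify empirically. A SQLite sketch (schema is illustrative; PostgreSQL and MySQL behave the same way): wrapping the column in a function hides it from the index, while the range predicate can use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_events_created ON events (created_at)")
conn.executemany(
    "INSERT INTO events (created_at) VALUES (?)",
    [(f"202{i % 3}-06-15",) for i in range(300)],
)

def detail(sql):
    # Join all EXPLAIN QUERY PLAN node descriptions into one string
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column: the index cannot be used for the lookup
non_sargable = detail(
    "SELECT id FROM events WHERE strftime('%Y', created_at) = '2025'"
)
# Range predicate on the bare column: index search
sargable = detail(
    "SELECT id FROM events "
    "WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'"
)
print(non_sargable)
print(sargable)
```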
### N+1 Detection
**Symptoms:**
- Application loop that executes one query per parent row
- ORM lazy-loading related entities inside a loop
- Query log shows hundreds of identical SELECT patterns with different IDs
**Fixes:**
- Use eager loading (`include` in Prisma, `joinedload` in SQLAlchemy)
- Batch queries with `WHERE id IN (...)`
- Use DataLoader pattern for GraphQL resolvers
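A minimal sketch of the batching fix, using SQLite and plain DB-API calls (an ORM's eager loading generates the equivalent `IN` query under the hood):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors (id, name) VALUES (?, ?)",
                 [(i, f"author{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO posts (author_id, title) VALUES (?, ?)",
                 [(i, f"post by {i}") for i in range(1, 51)])

author_ids = [r[0] for r in conn.execute("SELECT id FROM authors")]

# N+1: one query per parent row
n_plus_one = 0
for aid in author_ids:
    conn.execute("SELECT title FROM posts WHERE author_id = ?", (aid,)).fetchall()
    n_plus_one += 1

# Batched: a single round trip with WHERE ... IN (...)
placeholders = ",".join("?" * len(author_ids))
rows = conn.execute(
    f"SELECT author_id, title FROM posts WHERE author_id IN ({placeholders})",
    author_ids,
).fetchall()

print(n_plus_one, len(rows))  # 50 queries vs 1 query returning 50 rows
```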
### Static Analysis Tool
```bash
python scripts/query_optimizer.py --query "SELECT * FROM orders WHERE status = 'pending'" --dialect postgres
python scripts/query_optimizer.py --query queries.sql --dialect mysql --json
```
> See references/optimization_guide.md for EXPLAIN plan reading, index types, and connection pooling.
---
## Migration Generation
### Zero-Downtime Migration Patterns
**Adding a column (safe)**
```sql
-- Up
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Down
ALTER TABLE users DROP COLUMN phone;
```
**Renaming a column (expand-contract)**
```sql
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Step 2: Backfill
UPDATE users SET full_name = name;
-- Step 3: Deploy app reading both columns
-- Step 4: Deploy app writing only new column
-- Step 5: Drop old column
ALTER TABLE users DROP COLUMN name;
```
**Adding a NOT NULL column (safe sequence)**
```sql
-- Step 1: Add nullable
ALTER TABLE orders ADD COLUMN region VARCHAR(50);
-- Step 2: Backfill with default
UPDATE orders SET region = 'unknown' WHERE region IS NULL;
-- Step 3: Add constraint
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
ALTER TABLE orders ALTER COLUMN region SET DEFAULT 'unknown';
```
**Index creation (non-blocking, PostgreSQL)**
```sql
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
```
### Data Backfill Strategies
- **Batch updates** — process in chunks of 1000-10000 rows to avoid lock contention
- **Background jobs** — run backfills asynchronously with progress tracking
- **Dual-write** — write to old and new columns during transition period
- **Validation queries** — verify row counts and data integrity after each batch
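The batch-update strategy above can be sketched as a keyset loop over the primary key, assuming a hypothetical `users.full_name` backfill like the expand-contract example; committing per batch keeps each lock window short, and a validation query confirms completion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"u{i}",) for i in range(2500)])

BATCH = 1000
last_id, batches = 0, 0
while True:
    cur = conn.execute(
        "UPDATE users SET full_name = name "
        "WHERE full_name IS NULL AND id > ? AND id <= ?",
        (last_id, last_id + BATCH),
    )
    conn.commit()          # one short transaction per batch
    if cur.rowcount == 0:  # no rows left past last_id
        break
    last_id += BATCH
    batches += 1

remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
print(batches, remaining)  # 3 batches, 0 rows left
```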
### Rollback Strategies
Every migration should ship a down script that reverses it. For changes that cannot be cleanly reversed (dropped columns, lossy type changes):
1. **Backup before execution**`pg_dump` the affected tables
2. **Feature flags** — application can switch between old/new schema reads
3. **Shadow tables** — keep a copy of the original table during migration window
### Migration Generator Tool
```bash
python scripts/migration_generator.py --change "add email_verified boolean to users" --dialect postgres --format sql
python scripts/migration_generator.py --change "rename column name to full_name in customers" --dialect mysql --format alembic --json
```
---
## Multi-Database Support
### Dialect Differences
| Feature | PostgreSQL | MySQL | SQLite | SQL Server |
|---------|-----------|-------|--------|------------|
| UPSERT | `ON CONFLICT DO UPDATE` | `ON DUPLICATE KEY UPDATE` | `ON CONFLICT DO UPDATE` | `MERGE` |
| Boolean | Native `BOOLEAN` | `TINYINT(1)` | `INTEGER` | `BIT` |
| Auto-increment | `SERIAL` / `GENERATED` | `AUTO_INCREMENT` | `INTEGER PRIMARY KEY` | `IDENTITY` |
| JSON | `JSONB` (indexed) | `JSON` | `TEXT` + JSON1 functions | `NVARCHAR(MAX)` |
| Array | Native `ARRAY` | Not supported | Not supported | Not supported |
| CTE (recursive) | Full support | 8.0+ | 3.8.3+ | Full support |
| Window functions | Full support | 8.0+ | 3.25.0+ | Full support |
| Full-text search | `tsvector` + GIN | `FULLTEXT` index | FTS5 extension | Full-text catalog |
| LIMIT/OFFSET | `LIMIT n OFFSET m` | `LIMIT n OFFSET m` | `LIMIT n OFFSET m` | `OFFSET m ROWS FETCH NEXT n ROWS ONLY` |
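A tiny adapter sketch for the pagination row above; the dialect keys are illustrative, not a real library's API:

```python
def paginate(dialect: str, limit: int, offset: int) -> str:
    """Return the dialect-appropriate pagination clause."""
    if dialect == "sqlserver":
        # SQL Server requires an ORDER BY clause before OFFSET ... FETCH
        return f"OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY"
    # PostgreSQL, MySQL, and SQLite share the same syntax
    return f"LIMIT {limit} OFFSET {offset}"

print(paginate("postgres", 10, 20))   # LIMIT 10 OFFSET 20
print(paginate("sqlserver", 10, 20))  # OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY
```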
### Compatibility Tips
- **Always use parameterized queries** — prevents SQL injection across all dialects
- **Avoid dialect-specific functions in shared code** — wrap in adapter layer
- **Test migrations on target engine** — `information_schema` varies between engines
- **Use ISO date format** — `'YYYY-MM-DD'` works everywhere
- **Quote identifiers** — use double quotes (SQL standard) or backticks (MySQL)
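The first tip is worth demonstrating. In this SQLite sketch (schema is illustrative), the bound parameter keeps a hostile input inert, while naive interpolation turns it into executable SQL and matches every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO accounts (email) VALUES ('alice@example.com')")

hostile = "' OR '1'='1"

# Safe: the driver binds the value as data; no email equals this literal
safe_rows = conn.execute(
    "SELECT id FROM accounts WHERE email = ?", (hostile,)
).fetchall()

# Unsafe: interpolation rewrites the predicate to ... = '' OR '1'='1'
unsafe_rows = conn.execute(
    f"SELECT id FROM accounts WHERE email = '{hostile}'"
).fetchall()

print(len(safe_rows), len(unsafe_rows))  # 0 vs 1
```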
---
## ORM Patterns
### Prisma
**Schema definition**
```prisma
model User {
id Int @id @default(autoincrement())
email String @unique
name String?
posts Post[]
createdAt DateTime @default(now())
}
model Post {
id Int @id @default(autoincrement())
title String
author User @relation(fields: [authorId], references: [id])
authorId Int
}
```
**Migrations**: `npx prisma migrate dev --name add_user_email`
**Query API**: `prisma.user.findMany({ where: { email: { contains: '@' } }, include: { posts: true } })`
**Raw SQL escape hatch**: ``prisma.$queryRaw`SELECT * FROM users WHERE id = ${userId}` ``
### Drizzle
**Schema-first definition**
```typescript
export const users = pgTable('users', {
id: serial('id').primaryKey(),
email: varchar('email', { length: 255 }).notNull().unique(),
name: text('name'),
createdAt: timestamp('created_at').defaultNow(),
});
```
**Query builder**: `db.select().from(users).where(eq(users.email, email))`
**Migrations**: `npx drizzle-kit generate:pg` then `npx drizzle-kit push:pg`
### TypeORM
**Entity decorators**
```typescript
@Entity()
export class User {
@PrimaryGeneratedColumn()
id: number;
@Column({ unique: true })
email: string;
@OneToMany(() => Post, post => post.author)
posts: Post[];
}
```
**Repository pattern**: `userRepo.find({ where: { email }, relations: ['posts'] })`
**Migrations**: `npx typeorm migration:generate -n AddUserEmail`
### SQLAlchemy
**Declarative models**
```python
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
email = Column(String(255), unique=True, nullable=False)
name = Column(String(255))
posts = relationship('Post', back_populates='author')
```
**Session management**: Always use `with Session() as session:` context manager
**Alembic migrations**: `alembic revision --autogenerate -m "add user email"`
> See references/orm_patterns.md for side-by-side comparisons and migration workflows per ORM.
---
## Data Integrity
### Constraint Strategy
- **Primary keys** — every table must have one; prefer surrogate keys (serial/UUID)
- **Foreign keys** — enforce referential integrity; define ON DELETE behavior explicitly
- **UNIQUE constraints** — for business-level uniqueness (email, slug, API key)
- **CHECK constraints** — validate ranges, enums, and business rules at the DB level
- **NOT NULL** — default to NOT NULL; make nullable only when genuinely optional
### Transaction Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use Case |
|-------|-----------|-------------------|-------------|----------|
| READ UNCOMMITTED | Yes | Yes | Yes | Never recommended |
| READ COMMITTED | No | Yes | Yes | Default for PostgreSQL, general OLTP |
| REPEATABLE READ | No | No | Yes (InnoDB: No) | Financial calculations |
| SERIALIZABLE | No | No | No | Critical consistency (billing, inventory) |
### Deadlock Prevention
1. **Consistent lock ordering** — always acquire locks in the same table/row order
2. **Short transactions** — minimize time between first lock and commit
3. **Advisory locks** — use `pg_advisory_lock()` for application-level coordination
4. **Retry logic** — catch deadlock errors and retry with exponential backoff
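A minimal sketch of point 4, retry with exponential backoff. `DeadlockError` here is a stand-in for your driver's real deadlock exception (e.g. psycopg2 raises `errors.DeadlockDetected`); the injectable `sleep` is only for testability.

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for a driver's deadlock exception."""

def retry_on_deadlock(fn, max_attempts=4, base_delay=0.05, sleep=time.sleep):
    """Call fn(); on deadlock, retry with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter keeps retries out of lock-step.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))

# Demo: a transaction that deadlocks twice, then commits.
attempts = {"n": 0}
def txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockError
    return "committed"

print(retry_on_deadlock(txn, sleep=lambda s: None))  # committed
```

In production, wrap the whole transaction (begin through commit) in `fn`, never a single statement, so the retry replays all of its reads and writes.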
---
## Backup & Restore
### PostgreSQL
```bash
# Full backup
pg_dump -Fc --no-owner dbname > backup.dump
# Restore
pg_restore -d dbname --clean --no-owner backup.dump
# Point-in-time recovery: configure WAL archiving + restore_command
```
### MySQL
```bash
# Full backup
mysqldump --single-transaction --routines --triggers dbname > backup.sql
# Restore
mysql dbname < backup.sql
# Binary log for PITR: mysqlbinlog --start-datetime="2025-01-01 00:00:00" binlog.000001
```
### SQLite
```bash
# Backup (safe with concurrent reads)
sqlite3 dbname ".backup backup.db"
```
### Backup Best Practices
- **Automate** — cron or systemd timer, never manual-only
- **Test restores** — untested backups are not backups
- **Offsite copies** — S3, GCS, or separate region
- **Retention policy** — daily for 7 days, weekly for 4 weeks, monthly for 12 months
- **Monitor backup size and duration** — sudden changes signal issues
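The retention policy above can be expressed as a small predicate your cleanup job runs against each backup's date. This sketch anchors weeklies to Mondays and monthlies to the 1st, which is one reasonable choice, not a standard:

```python
from datetime import date, timedelta

def keep_backup(backup_date: date, today: date) -> bool:
    """Daily for 7 days, Monday weeklies for 4 weeks, 1st-of-month
    monthlies for 12 months. Anything older is dropped."""
    age = (today - backup_date).days
    if age < 0:
        return False
    if age <= 7:
        return True                               # daily tier
    if age <= 28 and backup_date.weekday() == 0:  # weekly tier (Mondays)
        return True
    if age <= 365 and backup_date.day == 1:       # monthly tier
        return True
    return False

today = date(2025, 6, 15)
print(keep_backup(date(2025, 6, 10), today))  # True  (5 days old: daily tier)
print(keep_backup(date(2025, 5, 19), today))  # True  (a Monday, 27 days old: weekly tier)
print(keep_backup(date(2025, 5, 20), today))  # False (26 days old, not a Monday or a 1st)
```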
---
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| `SELECT *` | Transfers unnecessary data, breaks on schema changes | Explicit column list |
| Missing indexes on FK columns | Slow JOINs and cascading deletes | Add indexes on all foreign keys |
| N+1 queries | 1 + N round trips to database | Eager loading or batch queries |
| Implicit type coercion | `WHERE id = '123'` prevents index use | Match types in predicates |
| No connection pooling | Exhausts connections under load | PgBouncer, ProxySQL, or ORM pool |
| Unbounded queries | No LIMIT risks returning millions of rows | Always paginate |
| Storing money as FLOAT | Rounding errors | Use `DECIMAL(19,4)` or integer cents |
| God tables | One table with 50+ columns | Normalize or use vertical partitioning |
| Soft deletes everywhere | Complicates every query with `WHERE deleted_at IS NULL` | Archive tables or event sourcing |
| Raw string concatenation | SQL injection | Parameterized queries always |
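The last row is worth seeing fail. Using SQLite purely as an illustration, string concatenation lets a crafted input rewrite the WHERE clause, while a parameterized query treats the same input as an ordinary literal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

malicious = "nobody' OR '1'='1"

# Anti-pattern: concatenation -- the payload becomes part of the SQL.
unsafe = conn.execute(
    "SELECT name FROM users WHERE name = '" + malicious + "'"
).fetchall()

# Fix: parameterized query -- the payload is just a string value.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe), len(safe))  # 2 0
```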
---
## Cross-References
| Skill | Relationship |
|-------|-------------|
| **database-designer** | Schema architecture, normalization analysis, ERD generation |
| **database-schema-designer** | Visual ERD modeling, relationship mapping |
| **migration-architect** | Complex multi-step migration orchestration |
| **api-design-reviewer** | Ensuring API endpoints align with query patterns |
| **observability-platform** | Query performance monitoring, slow query alerts |

# Query Optimization Guide
How to read EXPLAIN plans, choose the right index types, understand query plan operators, and configure connection pooling.
---
## Reading EXPLAIN Plans
### PostgreSQL — EXPLAIN ANALYZE
```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE status = 'paid' ORDER BY created_at DESC LIMIT 20;
```
**Sample output:**
```
Limit (cost=0.43..12.87 rows=20 width=128) (actual time=0.052..0.089 rows=20 loops=1)
-> Index Scan Backward using idx_orders_status_created on orders (cost=0.43..4521.33 rows=7284 width=128) (actual time=0.051..0.085 rows=20 loops=1)
Index Cond: (status = 'paid')
Buffers: shared hit=4
Planning Time: 0.156 ms
Execution Time: 0.112 ms
```
**Key fields to check:**
| Field | What it tells you |
|-------|-------------------|
| `cost` | Estimated startup..total cost (arbitrary units) |
| `rows` | Estimated row count at that node |
| `actual time` | Real wall-clock time in milliseconds |
| `actual rows` | Real row count — compare against estimate |
| `Buffers: shared hit` | Pages read from cache (good) |
| `Buffers: shared read` | Pages read from disk (slow) |
| `loops` | How many times the node executed |
**Red flags:**
- `Seq Scan` on a large table with a WHERE clause — missing index
- `actual rows` >> `rows` (estimated) — stale statistics, run `ANALYZE`
- `Nested Loop` with high loop count — consider hash join or add index
- `Sort` with `external merge` — not enough `work_mem`, spilling to disk
- `Buffers: shared read` much higher than `shared hit` — cold cache or table too large for memory
### MySQL — EXPLAIN FORMAT=JSON
```sql
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE status = 'paid' ORDER BY created_at DESC LIMIT 20;
```
**Key fields:**
- `query_block.select_id` — identifies subqueries
- `table.access_type` — `ALL` (full scan), `ref` (index lookup), `range`, `index`, `const`
- `table.rows_examined_per_scan` — how many rows the engine reads
- `table.using_index` — covering index (no table lookup needed)
- `table.attached_condition` — the WHERE filter applied
**Access types ranked (best to worst):**
`system` > `const` > `eq_ref` > `ref` > `range` > `index` > `ALL`
---
## Index Types
### B-tree (default)
The workhorse index. Supports equality, range, prefix, and ORDER BY operations.
**Best for:** `=`, `<`, `>`, `<=`, `>=`, `BETWEEN`, `LIKE 'prefix%'`, `ORDER BY`, `MIN()`, `MAX()`
```sql
CREATE INDEX idx_orders_created ON orders (created_at);
```
**Composite B-tree:** Column order matters. The index is useful for queries that filter on a leftmost prefix of the indexed columns.
```sql
-- This index serves: WHERE status = ... AND created_at > ...
-- Also serves: WHERE status = ...
-- Does NOT serve: WHERE created_at > ... (without status)
CREATE INDEX idx_orders_status_created ON orders (status, created_at);
```
### Hash
Equality-only lookups. Faster than B-tree for exact matches but no range support.
**Best for:** `=` lookups on high-cardinality columns
```sql
-- PostgreSQL
CREATE INDEX idx_sessions_token ON sessions USING hash (token);
```
**Limitations:** No range queries, no ORDER BY, not WAL-logged before PostgreSQL 10.
### GIN (Generalized Inverted Index)
For multi-valued data: arrays, JSONB, full-text search vectors.
```sql
-- JSONB containment
CREATE INDEX idx_products_tags ON products USING gin (tags);
-- Query: SELECT * FROM products WHERE tags @> '["sale"]';
-- Full-text search
CREATE INDEX idx_articles_search ON articles USING gin (to_tsvector('english', title || ' ' || body));
```
### GiST (Generalized Search Tree)
For geometric, range, and proximity data.
```sql
-- Range type (e.g., date ranges)
CREATE INDEX idx_bookings_period ON bookings USING gist (during);
-- Query: SELECT * FROM bookings WHERE during && '[2025-01-01, 2025-01-31]';
-- PostGIS geometry
CREATE INDEX idx_locations_geom ON locations USING gist (geom);
```
### BRIN (Block Range Index)
Tiny index for naturally ordered data (e.g., time-series append-only tables).
```sql
CREATE INDEX idx_events_created ON events USING brin (created_at);
```
**Best for:** Large tables where the indexed column correlates with physical row order. Much smaller than B-tree but less precise.
### Partial Index
Index only rows matching a condition. Smaller and faster for targeted queries.
```sql
-- Only index active users (skip millions of inactive)
CREATE INDEX idx_users_active_email ON users (email) WHERE status = 'active';
```
### Covering Index (INCLUDE)
Store extra columns in the index to avoid table lookups (index-only scans).
```sql
-- PostgreSQL 11+
CREATE INDEX idx_orders_status ON orders (status) INCLUDE (total, created_at);
-- Query can be answered entirely from the index:
-- SELECT total, created_at FROM orders WHERE status = 'paid';
```
### Expression Index
Index the result of a function or expression.
```sql
CREATE INDEX idx_users_lower_email ON users (LOWER(email));
-- Query: SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
```
---
## Query Plan Operators
### Scan operators
| Operator | Description | Performance |
|----------|-------------|-------------|
| **Seq Scan** | Full table scan, reads every row | Slow on large tables |
| **Index Scan** | B-tree lookup + table fetch | Fast for selective queries |
| **Index Only Scan** | Reads only the index (covering) | Fastest for covered queries |
| **Bitmap Index Scan** | Builds a bitmap of matching pages | Good for medium selectivity |
| **Bitmap Heap Scan** | Fetches pages identified by bitmap | Pairs with bitmap index scan |
### Join operators
| Operator | Description | Best when |
|----------|-------------|-----------|
| **Nested Loop** | For each outer row, scan inner | Small outer set, indexed inner |
| **Hash Join** | Build hash table on inner, probe with outer | Medium-large sets, no index |
| **Merge Join** | Merge two sorted inputs | Both inputs already sorted |
### Other operators
| Operator | Description |
|----------|-------------|
| **Sort** | Sorts rows (may spill to disk if work_mem exceeded) |
| **Hash Aggregate** | GROUP BY using hash table |
| **Group Aggregate** | GROUP BY on pre-sorted input |
| **Limit** | Stops after N rows |
| **Materialize** | Caches subquery results in memory |
| **Gather / Gather Merge** | Collects results from parallel workers |
---
## Connection Pooling
### Why pool connections?
Each database connection consumes memory (5-10 MB in PostgreSQL). Without pooling:
- Application creates a new connection per request (slow: TCP + TLS + auth)
- Under load, connection count spikes past `max_connections`
- Database OOM or connection refused errors
### PgBouncer (PostgreSQL)
The standard external connection pooler for PostgreSQL.
**Modes:**
- **Session** — connection assigned for entire client session (safest, least efficient)
- **Transaction** — connection returned to pool after each transaction (recommended)
- **Statement** — connection returned after each statement (breaks multi-statement transactions)
```ini
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 300
```
**Sizing formula:**
```
default_pool_size = num_cpu_cores * 2 + effective_spindle_count
```
For SSDs, start with `num_cpu_cores * 2` (typically 4-16 connections is optimal).
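The formula reduces to a one-liner you can drop into deployment tooling; treat the result as a starting point to tune from measurements, not a final answer:

```python
import os

def pgbouncer_pool_size(cpu_cores=None, spindle_count=0):
    """Starting-point default_pool_size from the formula above.
    For SSDs, leave spindle_count at 0."""
    cores = cpu_cores or os.cpu_count() or 1
    return cores * 2 + spindle_count

print(pgbouncer_pool_size(cpu_cores=8))                   # 16
print(pgbouncer_pool_size(cpu_cores=4, spindle_count=2))  # 10
```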
### ProxySQL (MySQL)
```ini
mysql_servers = ({ address="127.0.0.1", port=3306, hostgroup=0, max_connections=100 })
mysql_query_rules = ({ rule_id=1, match_pattern="^SELECT.*FOR UPDATE", destination_hostgroup=0 })
```
### Application-Level Pooling
Most ORMs and drivers include built-in pooling:
| Platform | Pool Configuration |
|----------|--------------------|
| **node-postgres** | `new Pool({ max: 20, idleTimeoutMillis: 30000 })` |
| **SQLAlchemy** | `create_engine(url, pool_size=20, max_overflow=5)` |
| **HikariCP (Java)** | `maximumPoolSize=20, minimumIdle=5, idleTimeout=300000` |
| **Prisma** | `connection_limit=20` in connection string |
### Pool Sizing Guidelines
| Metric | Guideline |
|--------|-----------|
| **Minimum** | Number of always-active background workers |
| **Maximum** | 2-4x CPU cores for OLTP; lower for OLAP |
| **Idle timeout** | 30-300 seconds (reclaim unused connections) |
| **Connection timeout** | 3-10 seconds (fail fast under pressure) |
| **Queue size** | 2-5x pool max (buffer bursts before rejecting) |
**Warning:** More connections does not mean better performance. Beyond the optimal point (usually 20-50), contention on locks, CPU, and I/O causes throughput to decrease.
---
## Statistics and Maintenance
### PostgreSQL
```sql
-- Update statistics for the query planner
ANALYZE orders;
ANALYZE; -- All tables
-- Check table bloat and dead tuples
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
-- Identify unused indexes
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;
```
### MySQL
```sql
-- Update statistics
ANALYZE TABLE orders;
-- Check index usage
SELECT * FROM sys.schema_unused_indexes;
SELECT * FROM sys.schema_redundant_indexes;
-- Identify long-running queries
SELECT * FROM information_schema.processlist WHERE time > 10;
```
---
## Performance Checklist
Before deploying any query to production:
1. Run `EXPLAIN ANALYZE` and verify no unexpected sequential scans
2. Check that estimated rows are within 10x of actual rows
3. Verify index usage on all WHERE, JOIN, and ORDER BY columns
4. Ensure LIMIT is present for user-facing list queries
5. Confirm parameterized queries (no string concatenation)
6. Test with production-like data volume (not just 10 rows)
7. Monitor query time in application metrics after deployment
8. Set up slow query log alerting (> 100ms for OLTP, > 5s for reports)
---
## Quick Reference: When to Use Which Index
| Query Pattern | Index Type |
|--------------|-----------|
| `WHERE col = value` | B-tree or Hash |
| `WHERE col > value` | B-tree |
| `WHERE col LIKE 'prefix%'` | B-tree |
| `WHERE col LIKE '%substring%'` | GIN (full-text) or trigram |
| `WHERE jsonb_col @> '{...}'` | GIN |
| `WHERE array_col && ARRAY[...]` | GIN |
| `WHERE range_col && '[a,b]'` | GiST |
| `WHERE ST_DWithin(geom, ...)` | GiST |
| `WHERE col = value` (append-only) | BRIN |
| `WHERE col = value AND status = 'active'` | Partial B-tree |
| `SELECT a, b WHERE c = value` | Covering (INCLUDE) |

# ORM Patterns Reference
Side-by-side comparison of Prisma, Drizzle, TypeORM, and SQLAlchemy patterns for common database operations.
---
## Schema Definition
### Prisma (schema.prisma)
```prisma
model User {
id Int @id @default(autoincrement())
email String @unique
name String?
role Role @default(USER)
posts Post[]
profile Profile?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@index([email])
@@map("users")
}
model Post {
id Int @id @default(autoincrement())
title String
body String?
published Boolean @default(false)
author User @relation(fields: [authorId], references: [id], onDelete: Cascade)
authorId Int
tags Tag[]
createdAt DateTime @default(now())
@@index([authorId])
@@index([published, createdAt])
@@map("posts")
}
enum Role {
USER
ADMIN
MODERATOR
}
```
### Drizzle (schema.ts)
```typescript
import { pgTable, serial, varchar, text, boolean, timestamp, integer, pgEnum, index } from 'drizzle-orm/pg-core';
export const roleEnum = pgEnum('role', ['USER', 'ADMIN', 'MODERATOR']);
export const users = pgTable('users', {
id: serial('id').primaryKey(),
email: varchar('email', { length: 255 }).notNull().unique(),
name: varchar('name', { length: 255 }),
role: roleEnum('role').default('USER').notNull(),
createdAt: timestamp('created_at').defaultNow().notNull(),
updatedAt: timestamp('updated_at').defaultNow().notNull(),
});
export const posts = pgTable('posts', {
id: serial('id').primaryKey(),
title: varchar('title', { length: 255 }).notNull(),
body: text('body'),
published: boolean('published').default(false).notNull(),
authorId: integer('author_id').notNull().references(() => users.id, { onDelete: 'cascade' }),
createdAt: timestamp('created_at').defaultNow().notNull(),
}, (table) => ({
authorIdx: index('idx_posts_author').on(table.authorId),
publishedIdx: index('idx_posts_published').on(table.published, table.createdAt),
}));
```
### TypeORM (entities)
```typescript
import { Entity, PrimaryGeneratedColumn, Column, ManyToOne, OneToMany, CreateDateColumn, UpdateDateColumn, Index } from 'typeorm';
export enum Role { USER = 'USER', ADMIN = 'ADMIN', MODERATOR = 'MODERATOR' }
@Entity('users')
export class User {
@PrimaryGeneratedColumn()
id: number;
@Column({ unique: true })
@Index()
email: string;
@Column({ nullable: true })
name: string;
@Column({ type: 'enum', enum: Role, default: Role.USER })
role: Role;
@OneToMany(() => Post, post => post.author)
posts: Post[];
@CreateDateColumn()
createdAt: Date;
@UpdateDateColumn()
updatedAt: Date;
}
@Entity('posts')
@Index(['published', 'createdAt'])
export class Post {
@PrimaryGeneratedColumn()
id: number;
@Column()
title: string;
@Column({ nullable: true, type: 'text' })
body: string;
@Column({ default: false })
published: boolean;
@ManyToOne(() => User, user => user.posts, { onDelete: 'CASCADE' })
author: User;
@Column()
authorId: number;
@CreateDateColumn()
createdAt: Date;
}
```
### SQLAlchemy (models.py)
```python
import enum
from datetime import datetime
from sqlalchemy import Column, Integer, String, Text, Boolean, DateTime, Enum, ForeignKey, Index
from sqlalchemy.orm import relationship, DeclarativeBase
class Base(DeclarativeBase):
pass
class Role(enum.Enum):
USER = "USER"
ADMIN = "ADMIN"
MODERATOR = "MODERATOR"
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True, autoincrement=True)
email = Column(String(255), unique=True, nullable=False, index=True)
name = Column(String(255), nullable=True)
role = Column(Enum(Role), default=Role.USER, nullable=False)
posts = relationship('Post', back_populates='author', cascade='all, delete-orphan')
created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow, nullable=False)
class Post(Base):
__tablename__ = 'posts'
__table_args__ = (
Index('idx_posts_published', 'published', 'created_at'),
)
id = Column(Integer, primary_key=True, autoincrement=True)
title = Column(String(255), nullable=False)
body = Column(Text, nullable=True)
published = Column(Boolean, default=False, nullable=False)
author_id = Column(Integer, ForeignKey('users.id', ondelete='CASCADE'), nullable=False, index=True)
author = relationship('User', back_populates='posts')
created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
```
---
## CRUD Operations
### Create
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.create({ data: { email, name } })` |
| **Drizzle** | `await db.insert(users).values({ email, name }).returning()` |
| **TypeORM** | `await userRepo.save(userRepo.create({ email, name }))` |
| **SQLAlchemy** | `session.add(User(email=email, name=name)); session.commit()` |
### Read (with filter)
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.findMany({ where: { role: 'ADMIN' }, orderBy: { createdAt: 'desc' } })` |
| **Drizzle** | `await db.select().from(users).where(eq(users.role, 'ADMIN')).orderBy(desc(users.createdAt))` |
| **TypeORM** | `await userRepo.find({ where: { role: Role.ADMIN }, order: { createdAt: 'DESC' } })` |
| **SQLAlchemy** | `session.query(User).filter(User.role == Role.ADMIN).order_by(User.created_at.desc()).all()` |
### Update
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.update({ where: { id }, data: { name } })` |
| **Drizzle** | `await db.update(users).set({ name }).where(eq(users.id, id))` |
| **TypeORM** | `await userRepo.update(id, { name })` |
| **SQLAlchemy** | `session.query(User).filter(User.id == id).update({User.name: name}); session.commit()` |
### Delete
| ORM | Pattern |
|-----|---------|
| **Prisma** | `await prisma.user.delete({ where: { id } })` |
| **Drizzle** | `await db.delete(users).where(eq(users.id, id))` |
| **TypeORM** | `await userRepo.delete(id)` |
| **SQLAlchemy** | `session.query(User).filter(User.id == id).delete(); session.commit()` |
---
## Relations and Eager Loading
### Prisma — include / select
```typescript
// Eager load posts with user
const user = await prisma.user.findUnique({
where: { id: 1 },
include: { posts: { where: { published: true }, orderBy: { createdAt: 'desc' } } },
});
// Nested create
await prisma.user.create({
data: {
email: 'new@example.com',
posts: { create: [{ title: 'First post' }] },
},
});
```
### Drizzle — relational queries
```typescript
const result = await db.query.users.findFirst({
where: eq(users.id, 1),
with: { posts: { where: eq(posts.published, true), orderBy: [desc(posts.createdAt)] } },
});
```
### TypeORM — relations / query builder
```typescript
// FindOptions
const user = await userRepo.findOne({ where: { id: 1 }, relations: ['posts'] });
// QueryBuilder for complex joins
const result = await userRepo.createQueryBuilder('u')
.leftJoinAndSelect('u.posts', 'p', 'p.published = :pub', { pub: true })
.where('u.id = :id', { id: 1 })
.getOne();
```
### SQLAlchemy — joinedload / selectinload
```python
from sqlalchemy.orm import joinedload, selectinload
# Eager load in one JOIN query
user = session.query(User).options(joinedload(User.posts)).filter(User.id == 1).first()
# Eager load in a separate IN query (better for collections)
users = session.query(User).options(selectinload(User.posts)).all()
```
---
## Raw SQL Escape Hatches
Every ORM should provide a way to execute raw SQL for complex queries:
| ORM | Pattern |
|-----|---------|
| **Prisma** | `` prisma.$queryRaw`SELECT * FROM users WHERE id = ${id}` `` |
| **Drizzle** | `` db.execute(sql`SELECT * FROM users WHERE id = ${id}`) `` |
| **TypeORM** | `dataSource.query('SELECT * FROM users WHERE id = $1', [id])` |
| **SQLAlchemy** | `session.execute(text('SELECT * FROM users WHERE id = :id'), {'id': id})` |
Always use parameterized queries in raw SQL to prevent injection.
---
## Transaction Patterns
### Prisma
```typescript
await prisma.$transaction(async (tx) => {
const user = await tx.user.create({ data: { email } });
await tx.post.create({ data: { title: 'Welcome', authorId: user.id } });
});
```
### Drizzle
```typescript
await db.transaction(async (tx) => {
const [user] = await tx.insert(users).values({ email }).returning();
await tx.insert(posts).values({ title: 'Welcome', authorId: user.id });
});
```
### TypeORM
```typescript
await dataSource.transaction(async (manager) => {
const user = await manager.save(User, { email });
await manager.save(Post, { title: 'Welcome', authorId: user.id });
});
```
### SQLAlchemy
```python
with Session() as session:
try:
user = User(email=email)
session.add(user)
session.flush() # Get user.id without committing
session.add(Post(title='Welcome', author_id=user.id))
session.commit()
except Exception:
session.rollback()
raise
```
---
## Migration Workflows
### Prisma
```bash
# Generate migration from schema changes
npx prisma migrate dev --name add_posts_table
# Apply in production
npx prisma migrate deploy
# Reset database (dev only)
npx prisma migrate reset
# Generate client after schema change
npx prisma generate
```
**Files:** `prisma/migrations/<timestamp>_<name>/migration.sql`
### Drizzle
```bash
# Generate migration SQL from schema diff
npx drizzle-kit generate:pg
# Push schema directly (dev only, no migration files)
npx drizzle-kit push:pg
# Apply migrations
npx drizzle-kit migrate
```
**Files:** `drizzle/<timestamp>_<name>.sql`
### TypeORM
```bash
# Auto-generate migration from entity changes
npx typeorm migration:generate -d data-source.ts -n AddPostsTable
# Create empty migration
npx typeorm migration:create -n CustomMigration
# Run pending migrations
npx typeorm migration:run -d data-source.ts
# Revert last migration
npx typeorm migration:revert -d data-source.ts
```
**Files:** `src/migrations/<timestamp>-<Name>.ts`
### SQLAlchemy (Alembic)
```bash
# Initialize Alembic
alembic init alembic
# Auto-generate migration from model changes
alembic revision --autogenerate -m "add posts table"
# Apply all pending
alembic upgrade head
# Revert one step
alembic downgrade -1
# Show current state
alembic current
```
**Files:** `alembic/versions/<hash>_<slug>.py`
---
## N+1 Prevention Cheat Sheet
| ORM | Lazy (N+1 risk) | Eager (fixed) |
|-----|-----------------|---------------|
| **Prisma** | Not accessing `include` | `include: { posts: true }` |
| **Drizzle** | Separate queries | `with: { posts: true }` |
| **TypeORM** | `@ManyToOne(() => ..., { lazy: true })` | `relations: ['posts']` or `leftJoinAndSelect` |
| **SQLAlchemy** | Default `lazy='select'` | `joinedload()` or `selectinload()` |
**Rule of thumb:** If you access a relation inside a loop, you have an N+1 problem. Always load relations before the loop.
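The difference shows up directly in the query count. A minimal sketch using raw SQLite (the batched IN query is what the eager-loading strategies above generate under the hood):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
INSERT INTO users VALUES (1,'alice'),(2,'bob'),(3,'carol');
INSERT INTO posts VALUES (1,1,'a'),(2,1,'b'),(3,2,'c');
""")

queries = 0
def run(sql, args=()):
    global queries
    queries += 1            # count round trips
    return conn.execute(sql, args).fetchall()

# N+1: one query for the users, then one per user inside the loop.
users = run("SELECT id, name FROM users")
for uid, _name in users:
    run("SELECT title FROM posts WHERE user_id = ?", (uid,))
n_plus_1 = queries          # 1 + 3 = 4

# Eager: one batched IN query, grouped in application code.
queries = 0
users = run("SELECT id, name FROM users")
ids = [u[0] for u in users]
placeholders = ",".join("?" * len(ids))
run(f"SELECT user_id, title FROM posts WHERE user_id IN ({placeholders})", ids)
print(n_plus_1, queries)  # 4 2
```

With 3 users the gap is small; with 10,000 it is 10,001 round trips versus 2.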
---
## Connection Pooling
### Prisma
```
# In .env or connection string
DATABASE_URL="postgresql://user:pass@host/db?connection_limit=20&pool_timeout=10"
```
### Drizzle (with node-postgres)
```typescript
import { Pool } from 'pg';
const pool = new Pool({ max: 20, idleTimeoutMillis: 30000, connectionTimeoutMillis: 5000 });
const db = drizzle(pool);
```
### TypeORM
```typescript
const dataSource = new DataSource({
type: 'postgres',
extra: { max: 20, idleTimeoutMillis: 30000 },
});
```
### SQLAlchemy
```python
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host/db', pool_size=20, max_overflow=5, pool_timeout=30)
```
---
## Best Practices Summary
1. **Always use migrations** — never modify production schemas by hand
2. **Eager load relations** — prevent N+1 in every list/collection query
3. **Use transactions** — group related writes to maintain consistency
4. **Parameterize raw SQL** — never concatenate user input into queries
5. **Connection pooling** — configure pool size matching your workload
6. **Index foreign keys** — ORMs often skip this; add manually if needed
7. **Review generated SQL** — enable query logging in development to catch inefficiencies
8. **Type-safe queries** — leverage TypeScript/Python typing for compile-time checks
9. **Separate read/write models** — use views or read replicas for heavy reporting queries
10. **Test migrations both ways** — always verify that down migrations actually reverse up migrations

# SQL Query Patterns Reference
Common query patterns for everyday database operations. All examples use PostgreSQL syntax with dialect notes where they differ.
---
## JOIN Patterns
### INNER JOIN — matching rows in both tables
```sql
SELECT u.name, o.id AS order_id, o.total
FROM users u
INNER JOIN orders o ON o.user_id = u.id
WHERE o.status = 'paid';
```
### LEFT JOIN — all rows from left, matching from right
```sql
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name;
```
Returns users even if they have zero orders.
### Self JOIN — comparing rows within the same table
```sql
-- Find employees who earn more than their manager
SELECT e.name AS employee, m.name AS manager, e.salary, m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```
### CROSS JOIN — every combination (Cartesian product)
```sql
-- Generate a calendar grid
SELECT d.date, s.shift_name
FROM dates d
CROSS JOIN shifts s;
```
Use intentionally. Accidental Cartesian joins are a performance killer.
### LATERAL JOIN (PostgreSQL) — correlated subquery as a table
```sql
-- Top 3 orders per user
SELECT u.name, top_orders.*
FROM users u
CROSS JOIN LATERAL (
SELECT id, total FROM orders
WHERE user_id = u.id
ORDER BY total DESC LIMIT 3
) top_orders;
```
MySQL equivalent: use a subquery with `ROW_NUMBER()`.
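That `ROW_NUMBER()` rewrite can be run as-is in SQLite (3.25+, bundled with recent Python), shown here as a runnable illustration of top-3 orders per user:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO orders (user_id, total) VALUES
  (1, 10),(1, 50),(1, 30),(1, 70),(2, 20),(2, 5);
""")

rows = conn.execute("""
SELECT user_id, id, total FROM (
  SELECT id, user_id, total,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY total DESC) AS rn
  FROM orders
) ranked
WHERE rn <= 3
ORDER BY user_id, total DESC
""").fetchall()
for row in rows:
    print(row)   # user 1 keeps 70, 50, 30; the 10 order is ranked 4th and dropped
```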
---
## Common Table Expressions (CTEs)
### Basic CTE — readable subquery
```sql
WITH active_users AS (
SELECT id, name, email
FROM users
WHERE last_login > CURRENT_DATE - INTERVAL '30 days'
)
SELECT au.name, COUNT(o.id) AS recent_orders
FROM active_users au
JOIN orders o ON o.user_id = au.id
GROUP BY au.name;
```
### Multiple CTEs — chaining transformations
```sql
WITH monthly_revenue AS (
SELECT DATE_TRUNC('month', created_at) AS month, SUM(total) AS revenue
FROM orders WHERE status = 'paid'
GROUP BY 1
),
growth AS (
SELECT month, revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) / LAG(revenue) OVER (ORDER BY month) * 100, 1) AS growth_pct
FROM monthly_revenue
)
SELECT * FROM growth ORDER BY month;
```
### Recursive CTE — hierarchical data
```sql
-- Organization tree
WITH RECURSIVE org_tree AS (
-- Base case: top-level managers
SELECT id, name, manager_id, 0 AS depth
FROM employees WHERE manager_id IS NULL
UNION ALL
-- Recursive case: subordinates
SELECT e.id, e.name, e.manager_id, ot.depth + 1
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth, name;
```
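The org-tree CTE above runs unchanged on SQLite, which makes it easy to experiment with locally before pointing it at PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1,'ceo',NULL),(2,'vp_eng',1),(3,'vp_sales',1),(4,'engineer',2);
""")

rows = conn.execute("""
WITH RECURSIVE org_tree AS (
  SELECT id, name, manager_id, 0 AS depth
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT name, depth FROM org_tree ORDER BY depth, name
""").fetchall()
print(rows)  # [('ceo', 0), ('vp_eng', 1), ('vp_sales', 1), ('engineer', 2)]
```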
### Recursive CTE — path traversal
```sql
-- Category breadcrumb
WITH RECURSIVE breadcrumb AS (
SELECT id, name, parent_id, name::TEXT AS path
FROM categories WHERE id = 42
UNION ALL
SELECT c.id, c.name, c.parent_id, c.name || ' > ' || b.path
FROM categories c
JOIN breadcrumb b ON c.id = b.parent_id
)
SELECT path FROM breadcrumb WHERE parent_id IS NULL;
```
---
## Window Functions
### ROW_NUMBER — assign unique rank per partition
```sql
SELECT *, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```
### RANK and DENSE_RANK — handle ties
```sql
-- RANK: 1, 2, 2, 4 (skips after tie)
-- DENSE_RANK: 1, 2, 2, 3 (no skip)
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;
```
### Running total and moving average
```sql
SELECT date, amount,
SUM(amount) OVER (ORDER BY date) AS running_total,
AVG(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d
FROM daily_revenue;
```
### LAG / LEAD — access adjacent rows
```sql
SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_day,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;
```
### NTILE — divide into buckets
```sql
-- Split customers into quartiles by total spend
SELECT customer_id, total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_summary;
```
### FIRST_VALUE / LAST_VALUE
```sql
SELECT department_id, name, salary,
FIRST_VALUE(name) OVER (PARTITION BY department_id ORDER BY salary DESC) AS highest_paid
FROM employees;
```
---
## Subquery Patterns
### EXISTS — correlated existence check
```sql
-- Users who have placed at least one order
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
```
### NOT EXISTS — safer than NOT IN for NULLs
```sql
-- Users who have never ordered
SELECT u.* FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
```
### Scalar subquery — single value
```sql
SELECT name, salary,
salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;
```
### Derived table — subquery in FROM
```sql
SELECT dept, avg_salary
FROM (
SELECT department_id AS dept, AVG(salary) AS avg_salary
FROM employees GROUP BY department_id
) dept_avg
WHERE avg_salary > 100000;
```
---
## Aggregation Patterns
### GROUP BY with HAVING
```sql
-- Departments with more than 10 employees
SELECT department_id, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 10;
```
### GROUPING SETS — multiple grouping levels
```sql
SELECT region, product_category, SUM(revenue)
FROM sales
GROUP BY GROUPING SETS (
(region, product_category),
(region),
(product_category),
()
);
```
### ROLLUP — hierarchical subtotals
```sql
SELECT region, city, SUM(revenue)
FROM sales
GROUP BY ROLLUP (region, city);
-- Produces: (region, city), (region), ()
```
### CUBE — all combinations
```sql
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY CUBE (region, product);
```
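With ROLLUP and CUBE, a NULL in the output can mean either a subtotal row or a genuinely NULL grouping value. The standard GROUPING() function distinguishes them:

```sql
SELECT region, product, SUM(revenue),
       GROUPING(region) AS is_region_rollup,   -- 1 on rows where region was rolled up
       GROUPING(product) AS is_product_rollup
FROM sales
GROUP BY CUBE (region, product);
```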
### FILTER clause (PostgreSQL) — conditional aggregation
```sql
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = 'paid') AS paid,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled,
SUM(total) FILTER (WHERE status = 'paid') AS paid_revenue
FROM orders;
```
MySQL/SQL Server equivalent: `SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END)`.
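Spelled out, the portable CASE form of the query above looks like this (here `total` inside CASE is the order-amount column, as in the FILTER version):

```sql
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid,
    SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled,
    SUM(CASE WHEN status = 'paid' THEN total ELSE 0 END) AS paid_revenue
FROM orders;
```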
---
## UPSERT Patterns
### PostgreSQL — ON CONFLICT
```sql
INSERT INTO user_settings (user_id, key, value, updated_at)
VALUES (1, 'theme', 'dark', NOW())
ON CONFLICT (user_id, key)
DO UPDATE SET value = EXCLUDED.value, updated_at = EXCLUDED.updated_at;
```
### MySQL — ON DUPLICATE KEY
```sql
INSERT INTO user_settings (user_id, key_name, value, updated_at)
VALUES (1, 'theme', 'dark', NOW())
ON DUPLICATE KEY UPDATE value = VALUES(value), updated_at = VALUES(updated_at);
```
### SQL Server — MERGE
```sql
MERGE INTO user_settings AS target
USING (VALUES (1, 'theme', 'dark')) AS source (user_id, key_name, value)
ON target.user_id = source.user_id AND target.key_name = source.key_name
WHEN MATCHED THEN UPDATE SET value = source.value, updated_at = GETDATE()
WHEN NOT MATCHED THEN INSERT (user_id, key_name, value, updated_at)
VALUES (source.user_id, source.key_name, source.value, GETDATE());
```
---
## JSON Operations
### PostgreSQL JSONB
```sql
-- Extract field
SELECT data->>'name' AS name FROM products WHERE data->>'category' = 'electronics';
-- Array contains
SELECT * FROM products WHERE data->'tags' ? 'sale';
-- Update nested field
UPDATE products SET data = jsonb_set(data, '{price}', '29.99') WHERE id = 1;
-- Aggregate into JSON array
SELECT jsonb_agg(jsonb_build_object('id', id, 'name', name)) FROM users;
```
### MySQL JSON
```sql
-- Extract field
SELECT JSON_EXTRACT(data, '$.name') AS name FROM products;
-- Shorthand: SELECT data->>"$.name"
-- Search in array
SELECT * FROM products WHERE JSON_CONTAINS(data->"$.tags", '"sale"');
-- Update
UPDATE products SET data = JSON_SET(data, '$.price', 29.99) WHERE id = 1;
```
---
## Pagination Patterns
### Offset pagination (simple but slow for deep pages)
```sql
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 40;
```
### Keyset pagination (fast, requires ordered unique column)
```sql
-- Page after the last seen id
SELECT * FROM products WHERE id > :last_seen_id ORDER BY id LIMIT 20;
```
### Keyset with composite sort
```sql
SELECT * FROM products
WHERE (created_at, id) < (:last_created_at, :last_id)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```
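To see keyset pagination end to end, here is a minimal runnable sketch using Python's built-in sqlite3 module (the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(i, f"product-{i}") for i in range(1, 101)],
)

def fetch_page(last_seen_id, page_size=20):
    # Keyset: seek past the last seen key instead of OFFSET-scanning.
    return conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()

page1 = fetch_page(0)
page2 = fetch_page(page1[-1][0])  # resume after the last row of page 1
```

Each page query uses the primary-key index to seek directly to the start of the page, so page 500 costs the same as page 1.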
---
## Bulk Operations
### Batch INSERT
```sql
INSERT INTO events (type, payload, created_at) VALUES
('click', '{"page": "/home"}', NOW()),
('view', '{"page": "/pricing"}', NOW()),
('click', '{"page": "/signup"}', NOW());
```
### Batch UPDATE with VALUES (PostgreSQL)
```sql
UPDATE products AS p SET price = v.price
FROM (VALUES (1, 29.99), (2, 49.99), (3, 9.99)) AS v(id, price)
WHERE p.id = v.id;
```
### DELETE with subquery
```sql
DELETE FROM sessions
WHERE user_id IN (SELECT id FROM users WHERE deleted_at IS NOT NULL);
```
### COPY (PostgreSQL bulk load)
```sql
COPY products (name, price, category) FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);
```
---
## Utility Patterns
### Generate series (PostgreSQL)
```sql
-- Fill date gaps
SELECT d::date FROM generate_series('2025-01-01'::date, '2025-12-31', '1 day') d;
```
### Deduplicate rows (PostgreSQL)
```sql
DELETE FROM events a USING events b
WHERE a.id > b.id AND a.user_id = b.user_id AND a.event_type = b.event_type
AND a.created_at = b.created_at;
```
### Pivot (manual)
```sql
SELECT user_id,
SUM(CASE WHEN month = 1 THEN revenue END) AS jan,
SUM(CASE WHEN month = 2 THEN revenue END) AS feb,
SUM(CASE WHEN month = 3 THEN revenue END) AS mar
FROM monthly_revenue
GROUP BY user_id;
```
### Conditional INSERT (skip if exists)
```sql
INSERT INTO tags (name) SELECT 'new-tag'
WHERE NOT EXISTS (SELECT 1 FROM tags WHERE name = 'new-tag');
```
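When a unique constraint exists on the column, PostgreSQL and SQLite can express the same intent more concisely:

```sql
-- Requires a unique constraint on name; MySQL equivalent: INSERT IGNORE
INSERT INTO tags (name) VALUES ('new-tag')
ON CONFLICT (name) DO NOTHING;
```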


@@ -0,0 +1,442 @@
#!/usr/bin/env python3
"""
Migration Generator
Generates database migration file templates (up/down) from natural-language
schema change descriptions.
Supported operations:
- Add column, drop column, rename column
- Add table, drop table, rename table
- Add index, drop index
- Add constraint, drop constraint
- Change column type
Usage:
python migration_generator.py --change "add email_verified boolean to users" --dialect postgres
python migration_generator.py --change "rename column name to full_name in customers" --format alembic
python migration_generator.py --change "add index on orders(status, created_at)" --output 001_add_index.sql
python migration_generator.py --change "create table reviews with id, user_id, rating, body" --json
"""
import argparse
import json
import os
import re
import sys
import textwrap
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List, Optional, Tuple
@dataclass
class Migration:
"""A generated migration with up and down scripts."""
description: str
dialect: str
format: str
up: str
down: str
warnings: List[str]
def to_dict(self):
return asdict(self)
# ---------------------------------------------------------------------------
# Change parsers — extract structured intent from natural language
# ---------------------------------------------------------------------------
def parse_add_column(desc: str) -> Optional[dict]:
"""Parse: add <column> <type> to <table>"""
m = re.match(
r'add\s+(?:column\s+)?(\w+)\s+(\w[\w(),.]*)\s+(?:to|on)\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "add_column", "column": m.group(1), "type": m.group(2), "table": m.group(3)}
return None
def parse_drop_column(desc: str) -> Optional[dict]:
"""Parse: drop/remove <column> from <table>"""
m = re.match(
r'(?:drop|remove)\s+(?:column\s+)?(\w+)\s+from\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "drop_column", "column": m.group(1), "table": m.group(2)}
return None
def parse_rename_column(desc: str) -> Optional[dict]:
"""Parse: rename column <old> to <new> in <table>"""
m = re.match(
r'rename\s+column\s+(\w+)\s+to\s+(\w+)\s+in\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "rename_column", "old": m.group(1), "new": m.group(2), "table": m.group(3)}
return None
def parse_add_table(desc: str) -> Optional[dict]:
"""Parse: create table <name> with <col1>, <col2>, ..."""
m = re.match(
r'create\s+table\s+(\w+)\s+with\s+(.+)',
desc, re.IGNORECASE,
)
if m:
cols = [c.strip() for c in m.group(2).split(",")]
return {"op": "add_table", "table": m.group(1), "columns": cols}
return None
def parse_drop_table(desc: str) -> Optional[dict]:
"""Parse: drop table <name>"""
m = re.match(r'drop\s+table\s+(\w+)', desc, re.IGNORECASE)
if m:
return {"op": "drop_table", "table": m.group(1)}
return None
def parse_add_index(desc: str) -> Optional[dict]:
"""Parse: add index on <table>(<col1>, <col2>)"""
m = re.match(
r'add\s+(?:unique\s+)?index\s+(?:on\s+)?(\w+)\s*\(([^)]+)\)',
desc, re.IGNORECASE,
)
if m:
unique = "unique" in desc.lower()
cols = [c.strip() for c in m.group(2).split(",")]
return {"op": "add_index", "table": m.group(1), "columns": cols, "unique": unique}
return None
def parse_change_type(desc: str) -> Optional[dict]:
"""Parse: change <column> type to <type> in <table>"""
m = re.match(
r'change\s+(?:column\s+)?(\w+)\s+type\s+to\s+(\w[\w(),.]*)\s+in\s+(\w+)',
desc, re.IGNORECASE,
)
if m:
return {"op": "change_type", "column": m.group(1), "new_type": m.group(2), "table": m.group(3)}
return None
PARSERS = [
parse_add_column,
parse_drop_column,
parse_rename_column,
parse_add_table,
parse_drop_table,
parse_add_index,
parse_change_type,
]
def parse_change(desc: str) -> Optional[dict]:
for parser in PARSERS:
result = parser(desc)
if result:
return result
return None
# ---------------------------------------------------------------------------
# SQL generators per dialect
# ---------------------------------------------------------------------------
TYPE_MAP = {
"boolean": {"postgres": "BOOLEAN", "mysql": "TINYINT(1)", "sqlite": "INTEGER", "sqlserver": "BIT"},
"text": {"postgres": "TEXT", "mysql": "TEXT", "sqlite": "TEXT", "sqlserver": "NVARCHAR(MAX)"},
"integer": {"postgres": "INTEGER", "mysql": "INT", "sqlite": "INTEGER", "sqlserver": "INT"},
"int": {"postgres": "INTEGER", "mysql": "INT", "sqlite": "INTEGER", "sqlserver": "INT"},
"serial": {"postgres": "SERIAL", "mysql": "INT AUTO_INCREMENT", "sqlite": "INTEGER", "sqlserver": "INT IDENTITY(1,1)"},
"varchar": {"postgres": "VARCHAR(255)", "mysql": "VARCHAR(255)", "sqlite": "TEXT", "sqlserver": "NVARCHAR(255)"},
"timestamp": {"postgres": "TIMESTAMP", "mysql": "DATETIME", "sqlite": "TEXT", "sqlserver": "DATETIME2"},
"uuid": {"postgres": "UUID", "mysql": "CHAR(36)", "sqlite": "TEXT", "sqlserver": "UNIQUEIDENTIFIER"},
"json": {"postgres": "JSONB", "mysql": "JSON", "sqlite": "TEXT", "sqlserver": "NVARCHAR(MAX)"},
"decimal": {"postgres": "DECIMAL(19,4)", "mysql": "DECIMAL(19,4)", "sqlite": "REAL", "sqlserver": "DECIMAL(19,4)"},
"float": {"postgres": "DOUBLE PRECISION", "mysql": "DOUBLE", "sqlite": "REAL", "sqlserver": "FLOAT"},
}
def map_type(type_name: str, dialect: str) -> str:
"""Map a generic type name to a dialect-specific type."""
key = type_name.lower().rstrip("()")
if key in TYPE_MAP and dialect in TYPE_MAP[key]:
return TYPE_MAP[key][dialect]
return type_name.upper()
def gen_add_column(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
col_type = map_type(change["type"], dialect)
table = change["table"]
col = change["column"]
up = f"ALTER TABLE {table} ADD COLUMN {col} {col_type};"
down = f"ALTER TABLE {table} DROP COLUMN {col};"
return up, down, []
def gen_drop_column(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
col = change["column"]
up = f"ALTER TABLE {table} DROP COLUMN {col};"
down = f"-- WARNING: Cannot fully reverse DROP COLUMN. Provide the original type.\nALTER TABLE {table} ADD COLUMN {col} TEXT;"
return up, down, ["Down migration uses TEXT as placeholder. Replace with the original column type."]
def gen_rename_column(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
old, new = change["old"], change["new"]
warnings = []
if dialect == "postgres":
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
elif dialect == "mysql":
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
elif dialect == "sqlite":
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
warnings.append("SQLite RENAME COLUMN requires version 3.25.0+.")
elif dialect == "sqlserver":
up = f"EXEC sp_rename '{table}.{old}', '{new}', 'COLUMN';"
down = f"EXEC sp_rename '{table}.{new}', '{old}', 'COLUMN';"
else:
up = f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"
down = f"ALTER TABLE {table} RENAME COLUMN {new} TO {old};"
return up, down, warnings
def gen_add_table(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
cols = change["columns"]
col_defs = []
has_id = False
for col in cols:
col = col.strip()
if col.lower() == "id":
has_id = True
if dialect == "postgres":
col_defs.append(" id SERIAL PRIMARY KEY")
elif dialect == "mysql":
col_defs.append(" id INT AUTO_INCREMENT PRIMARY KEY")
elif dialect == "sqlite":
col_defs.append(" id INTEGER PRIMARY KEY AUTOINCREMENT")
elif dialect == "sqlserver":
col_defs.append(" id INT IDENTITY(1,1) PRIMARY KEY")
else:
# Check if type is specified (e.g., "rating int")
parts = col.split()
if len(parts) >= 2:
col_defs.append(f" {parts[0]} {map_type(parts[1], dialect)}")
else:
col_defs.append(f" {col} TEXT")
cols_sql = ",\n".join(col_defs)
up = f"CREATE TABLE {table} (\n{cols_sql}\n);"
down = f"DROP TABLE {table};"
warnings = []
if not has_id:
warnings.append("Table has no explicit primary key. Consider adding an 'id' column.")
return up, down, warnings
def gen_drop_table(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
up = f"DROP TABLE {table};"
down = f"-- WARNING: Cannot reverse DROP TABLE without original DDL.\nCREATE TABLE {table} (id INTEGER PRIMARY KEY);"
return up, down, ["Down migration is a placeholder. Replace with the original CREATE TABLE statement."]
def gen_add_index(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
cols = change["columns"]
unique = "UNIQUE " if change.get("unique") else ""
idx_name = f"idx_{table}_{'_'.join(cols)}"
if dialect == "postgres":
up = f"CREATE {unique}INDEX CONCURRENTLY {idx_name} ON {table} ({', '.join(cols)});"
else:
up = f"CREATE {unique}INDEX {idx_name} ON {table} ({', '.join(cols)});"
    if dialect in ("mysql", "sqlserver"):
        # Both require the table name: DROP INDEX <name> ON <table>
        down = f"DROP INDEX {idx_name} ON {table};"
    else:
        down = f"DROP INDEX {idx_name};"
warnings = []
if dialect == "postgres":
warnings.append("CONCURRENTLY cannot run inside a transaction. Run outside migration transaction.")
return up, down, warnings
def gen_change_type(change: dict, dialect: str) -> Tuple[str, str, List[str]]:
table = change["table"]
col = change["column"]
new_type = map_type(change["new_type"], dialect)
warnings = ["Down migration uses TEXT as placeholder. Replace with the original column type."]
if dialect == "postgres":
up = f"ALTER TABLE {table} ALTER COLUMN {col} TYPE {new_type};"
down = f"ALTER TABLE {table} ALTER COLUMN {col} TYPE TEXT;"
elif dialect == "mysql":
up = f"ALTER TABLE {table} MODIFY COLUMN {col} {new_type};"
down = f"ALTER TABLE {table} MODIFY COLUMN {col} TEXT;"
elif dialect == "sqlserver":
up = f"ALTER TABLE {table} ALTER COLUMN {col} {new_type};"
down = f"ALTER TABLE {table} ALTER COLUMN {col} NVARCHAR(MAX);"
else:
up = f"-- SQLite does not support ALTER COLUMN. Recreate the table."
down = f"-- SQLite does not support ALTER COLUMN. Recreate the table."
warnings.append("SQLite requires table recreation for type changes.")
return up, down, warnings
GENERATORS = {
"add_column": gen_add_column,
"drop_column": gen_drop_column,
"rename_column": gen_rename_column,
"add_table": gen_add_table,
"drop_table": gen_drop_table,
"add_index": gen_add_index,
"change_type": gen_change_type,
}
# ---------------------------------------------------------------------------
# Format wrappers
# ---------------------------------------------------------------------------
def wrap_sql(up: str, down: str, description: str) -> Tuple[str, str]:
"""Wrap as plain SQL migration files."""
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
header = f"-- Migration: {description}\n-- Generated: {datetime.now().isoformat()}\n\n"
return header + "-- Up\n" + up, header + "-- Down\n" + down
def wrap_prisma(up: str, down: str, description: str) -> Tuple[str, str]:
"""Format as Prisma migration SQL (Prisma uses raw SQL in migration.sql)."""
header = f"-- Migration: {description}\n-- Format: Prisma (migration.sql)\n\n"
return header + up, header + "-- Rollback\n" + down
def wrap_alembic(up: str, down: str, description: str) -> Tuple[str, str]:
"""Format as Alembic Python migration."""
slug = re.sub(r'\W+', '_', description.lower())[:40]
revision = datetime.now().strftime("%Y%m%d%H%M")
template = textwrap.dedent(f'''\
"""
{description}
Revision ID: {revision}
"""
from alembic import op
import sqlalchemy as sa
revision = '{revision}'
down_revision = None # Set to previous revision
def upgrade():
op.execute("""
{textwrap.indent(up, " ")}
""")
def downgrade():
op.execute("""
{textwrap.indent(down, " ")}
""")
''')
return template, ""
FORMATTERS = {
"sql": wrap_sql,
"prisma": wrap_prisma,
"alembic": wrap_alembic,
}
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Generate database migration templates from change descriptions.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Supported change descriptions:
"add email_verified boolean to users"
"drop column legacy_flag from accounts"
"rename column name to full_name in customers"
"create table reviews with id, user_id, rating int, body text"
"drop table temp_imports"
"add index on orders(status, created_at)"
"add unique index on users(email)"
"change email type to varchar in users"
Examples:
%(prog)s --change "add phone varchar to users" --dialect postgres
%(prog)s --change "create table reviews with id, user_id, rating int, body" --format prisma
%(prog)s --change "add index on orders(status)" --output migrations/001.sql --json
""",
)
parser.add_argument("--change", required=True, help="Natural-language description of the schema change")
parser.add_argument("--dialect", choices=["postgres", "mysql", "sqlite", "sqlserver"],
default="postgres", help="Target database dialect (default: postgres)")
parser.add_argument("--format", choices=["sql", "prisma", "alembic"], default="sql",
dest="fmt", help="Output format (default: sql)")
parser.add_argument("--output", help="Write migration to file instead of stdout")
parser.add_argument("--json", action="store_true", dest="json_output", help="Output as JSON")
args = parser.parse_args()
change = parse_change(args.change)
if not change:
print(f"Error: Could not parse change description: '{args.change}'", file=sys.stderr)
print("Run with --help to see supported patterns.", file=sys.stderr)
sys.exit(1)
gen_fn = GENERATORS.get(change["op"])
if not gen_fn:
print(f"Error: No generator for operation '{change['op']}'", file=sys.stderr)
sys.exit(1)
up, down, warnings = gen_fn(change, args.dialect)
fmt_fn = FORMATTERS[args.fmt]
up_formatted, down_formatted = fmt_fn(up, down, args.change)
migration = Migration(
description=args.change,
dialect=args.dialect,
format=args.fmt,
up=up_formatted,
down=down_formatted,
warnings=warnings,
)
if args.json_output:
print(json.dumps(migration.to_dict(), indent=2))
else:
if args.output:
with open(args.output, "w") as f:
f.write(migration.up)
print(f"Migration written to {args.output}")
if migration.down:
                base, ext = os.path.splitext(args.output)
                down_path = f"{base}_down{ext or '.sql'}"
with open(down_path, "w") as f:
f.write(migration.down)
print(f"Rollback written to {down_path}")
else:
print(migration.up)
if migration.down:
print("\n" + "=" * 40 + " ROLLBACK " + "=" * 40 + "\n")
print(migration.down)
if warnings:
print("\nWarnings:")
for w in warnings:
print(f" - {w}")
if __name__ == "__main__":
main()


@@ -0,0 +1,348 @@
#!/usr/bin/env python3
"""
SQL Query Optimizer — Static Analysis
Analyzes SQL queries for common performance issues:
- SELECT * usage
- Missing WHERE clauses on UPDATE/DELETE
- Cartesian joins (missing JOIN conditions)
- Subqueries in SELECT list
- Missing LIMIT on unbounded SELECTs
- Function calls on indexed columns (non-sargable)
- LIKE with leading wildcard
- ORDER BY RAND()
- UNION instead of UNION ALL
- NOT IN with subquery (NULL-unsafe)
Usage:
python query_optimizer.py --query "SELECT * FROM users"
python query_optimizer.py --query queries.sql --dialect postgres
python query_optimizer.py --query "SELECT * FROM orders" --json
"""
import argparse
import json
import os
import re
import sys
from dataclasses import dataclass, asdict
from typing import List, Optional
@dataclass
class Issue:
"""A single optimization issue found in a query."""
severity: str # critical, warning, info
rule: str
message: str
suggestion: str
line: Optional[int] = None
@dataclass
class QueryAnalysis:
"""Analysis result for one SQL query."""
query: str
issues: List[Issue]
score: int # 0-100, higher is better
def to_dict(self):
return {
"query": self.query[:200] + ("..." if len(self.query) > 200 else ""),
"issues": [asdict(i) for i in self.issues],
"issue_count": len(self.issues),
"score": self.score,
}
# ---------------------------------------------------------------------------
# Rule checkers
# ---------------------------------------------------------------------------
def check_select_star(sql: str) -> Optional[Issue]:
"""Detect SELECT * usage."""
if re.search(r'\bSELECT\s+\*\s', sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="select-star",
message="SELECT * transfers unnecessary data and breaks on schema changes.",
suggestion="List only the columns you need: SELECT col1, col2, ...",
)
return None
def check_missing_where(sql: str) -> Optional[Issue]:
"""Detect UPDATE/DELETE without WHERE."""
upper = sql.upper().strip()
for keyword in ("UPDATE", "DELETE"):
if upper.startswith(keyword) and "WHERE" not in upper:
return Issue(
severity="critical",
rule="missing-where",
message=f"{keyword} without WHERE affects every row in the table.",
suggestion=f"Add a WHERE clause to restrict the {keyword} scope.",
)
return None
def check_cartesian_join(sql: str) -> Optional[Issue]:
"""Detect comma-separated tables without explicit JOIN or WHERE join condition."""
upper = sql.upper()
if "SELECT" not in upper:
return None
from_match = re.search(r'\bFROM\s+(.+?)(?:\bWHERE\b|\bGROUP\b|\bORDER\b|\bLIMIT\b|\bHAVING\b|;|$)',
sql, re.IGNORECASE | re.DOTALL)
if not from_match:
return None
from_clause = from_match.group(1)
# Skip if explicit JOINs are used
if re.search(r'\bJOIN\b', from_clause, re.IGNORECASE):
return None
# Count comma-separated tables
tables = [t.strip() for t in from_clause.split(",") if t.strip()]
if len(tables) > 1 and "WHERE" not in upper:
return Issue(
severity="critical",
rule="cartesian-join",
message="Multiple tables in FROM without JOIN or WHERE creates a cartesian product.",
suggestion="Use explicit JOIN syntax with ON conditions.",
)
return None
def check_subquery_in_select(sql: str) -> Optional[Issue]:
"""Detect correlated subqueries in SELECT list."""
select_match = re.search(r'\bSELECT\b(.+?)\bFROM\b', sql, re.IGNORECASE | re.DOTALL)
if select_match:
select_clause = select_match.group(1)
if re.search(r'\(\s*SELECT\b', select_clause, re.IGNORECASE):
return Issue(
severity="warning",
rule="subquery-in-select",
message="Subquery in SELECT list executes once per row (correlated subquery).",
suggestion="Rewrite as a LEFT JOIN with aggregation.",
)
return None
def check_missing_limit(sql: str) -> Optional[Issue]:
"""Detect unbounded SELECT without LIMIT."""
upper = sql.upper().strip()
if not upper.startswith("SELECT"):
return None
# Skip if it's a subquery or aggregate-only
if re.search(r'\bCOUNT\s*\(', upper) and "GROUP BY" not in upper:
return None
if "LIMIT" not in upper and "FETCH" not in upper and "TOP " not in upper:
return Issue(
severity="info",
rule="missing-limit",
message="SELECT without LIMIT may return unbounded rows.",
suggestion="Add LIMIT to prevent returning excessive data.",
)
return None
def check_function_on_column(sql: str) -> Optional[Issue]:
"""Detect function calls on columns in WHERE (non-sargable)."""
where_match = re.search(r'\bWHERE\b(.+?)(?:\bGROUP\b|\bORDER\b|\bLIMIT\b|\bHAVING\b|;|$)',
sql, re.IGNORECASE | re.DOTALL)
if not where_match:
return None
where_clause = where_match.group(1)
non_sargable = re.search(
r'\b(YEAR|MONTH|DAY|DATE|UPPER|LOWER|TRIM|CAST|COALESCE|IFNULL|NVL)\s*\(',
where_clause, re.IGNORECASE
)
if non_sargable:
func = non_sargable.group(1).upper()
return Issue(
severity="warning",
rule="non-sargable",
message=f"Function {func}() on column in WHERE prevents index usage.",
suggestion="Rewrite to compare the raw column against transformed constants.",
)
return None
def check_leading_wildcard(sql: str) -> Optional[Issue]:
"""Detect LIKE '%...' patterns."""
if re.search(r"LIKE\s+'%", sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="leading-wildcard",
message="LIKE with leading wildcard prevents index usage.",
suggestion="Use full-text search (GIN index, FULLTEXT, FTS5) for substring matching.",
)
return None
def check_order_by_rand(sql: str) -> Optional[Issue]:
"""Detect ORDER BY RAND() / RANDOM()."""
if re.search(r'ORDER\s+BY\s+(RAND|RANDOM)\s*\(\)', sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="order-by-rand",
message="ORDER BY RAND() scans and sorts the entire table.",
suggestion="Use application-side random sampling or TABLESAMPLE.",
)
return None
def check_union_vs_union_all(sql: str) -> Optional[Issue]:
"""Detect UNION without ALL (unnecessary dedup)."""
if re.search(r'\bUNION\b(?!\s+ALL\b)', sql, re.IGNORECASE):
return Issue(
severity="info",
rule="union-without-all",
message="UNION performs deduplication sort; use UNION ALL if duplicates are acceptable.",
suggestion="Replace UNION with UNION ALL unless you specifically need deduplication.",
)
return None
def check_not_in_subquery(sql: str) -> Optional[Issue]:
"""Detect NOT IN (SELECT ...) which is NULL-unsafe."""
if re.search(r'\bNOT\s+IN\s*\(\s*SELECT\b', sql, re.IGNORECASE):
return Issue(
severity="warning",
rule="not-in-subquery",
message="NOT IN with subquery returns no rows if any subquery result is NULL.",
suggestion="Use NOT EXISTS (SELECT 1 ...) instead.",
)
return None
ALL_CHECKS = [
check_select_star,
check_missing_where,
check_cartesian_join,
check_subquery_in_select,
check_missing_limit,
check_function_on_column,
check_leading_wildcard,
check_order_by_rand,
check_union_vs_union_all,
check_not_in_subquery,
]
# ---------------------------------------------------------------------------
# Analysis engine
# ---------------------------------------------------------------------------
def analyze_query(sql: str, dialect: str = "postgres") -> QueryAnalysis:
"""Run all checks against a single SQL query."""
issues: List[Issue] = []
for check_fn in ALL_CHECKS:
issue = check_fn(sql)
if issue:
issues.append(issue)
# Score: start at 100, deduct per severity
score = 100
for issue in issues:
if issue.severity == "critical":
score -= 25
elif issue.severity == "warning":
score -= 10
else:
score -= 5
score = max(0, score)
return QueryAnalysis(query=sql.strip(), issues=issues, score=score)
def split_queries(text: str) -> List[str]:
"""Split SQL text into individual statements."""
queries = []
for stmt in text.split(";"):
stmt = stmt.strip()
if stmt and len(stmt) > 5:
queries.append(stmt + ";")
return queries
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
SEVERITY_ICONS = {"critical": "[CRITICAL]", "warning": "[WARNING]", "info": "[INFO]"}
def format_text(analyses: List[QueryAnalysis]) -> str:
"""Format analysis results as human-readable text."""
lines = []
for i, analysis in enumerate(analyses, 1):
lines.append(f"{'='*60}")
lines.append(f"Query {i} (Score: {analysis.score}/100)")
lines.append(f" {analysis.query[:120]}{'...' if len(analysis.query) > 120 else ''}")
lines.append("")
if not analysis.issues:
lines.append(" No issues detected.")
for issue in analysis.issues:
icon = SEVERITY_ICONS.get(issue.severity, "")
lines.append(f" {icon} {issue.rule}: {issue.message}")
lines.append(f" -> {issue.suggestion}")
lines.append("")
return "\n".join(lines)
def format_json(analyses: List[QueryAnalysis]) -> str:
"""Format analysis results as JSON."""
return json.dumps(
{"analyses": [a.to_dict() for a in analyses], "total_queries": len(analyses)},
indent=2,
)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Analyze SQL queries for common performance issues.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --query "SELECT * FROM users"
%(prog)s --query queries.sql --dialect mysql
%(prog)s --query "DELETE FROM orders" --json
""",
)
parser.add_argument(
"--query", required=True,
help="SQL query string or path to a .sql file",
)
parser.add_argument(
"--dialect", choices=["postgres", "mysql", "sqlite", "sqlserver"],
default="postgres", help="SQL dialect (default: postgres)",
)
parser.add_argument(
"--json", action="store_true", dest="json_output",
help="Output results as JSON",
)
args = parser.parse_args()
# Determine if query is a file path or inline SQL
sql_text = args.query
if os.path.isfile(args.query):
with open(args.query, "r") as f:
sql_text = f.read()
queries = split_queries(sql_text)
if not queries:
# Treat the whole input as a single query
queries = [sql_text.strip()]
analyses = [analyze_query(q, args.dialect) for q in queries]
if args.json_output:
print(format_json(analyses))
else:
print(format_text(analyses))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,315 @@
#!/usr/bin/env python3
"""
Schema Explorer
Generates schema documentation from database introspection queries.
Outputs the introspection SQL and sample documentation templates
for PostgreSQL, MySQL, SQLite, and SQL Server.
Since this tool runs without a live database connection, it generates:
1. The introspection queries you need to run
2. Documentation templates from the results
3. Sample schema docs for common table patterns
Usage:
python schema_explorer.py --dialect postgres --tables all --format md
python schema_explorer.py --dialect mysql --tables users,orders --format json
python schema_explorer.py --dialect sqlite --tables all --json
"""
import argparse
import json
import sys
import textwrap
from dataclasses import dataclass, asdict
from typing import List, Optional, Dict
# ---------------------------------------------------------------------------
# Introspection query templates per dialect
# ---------------------------------------------------------------------------
INTROSPECTION_QUERIES: Dict[str, Dict[str, str]] = {
"postgres": {
"tables": textwrap.dedent("""\
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
ORDER BY table_name;"""),
"columns": textwrap.dedent("""\
SELECT table_name, column_name, data_type, character_maximum_length,
is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public' {table_filter}
ORDER BY table_name, ordinal_position;"""),
"primary_keys": textwrap.dedent("""\
SELECT tc.table_name, kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'PRIMARY KEY' AND tc.table_schema = 'public'
ORDER BY tc.table_name;"""),
"foreign_keys": textwrap.dedent("""\
SELECT tc.table_name, kcu.column_name,
ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
ORDER BY tc.table_name;"""),
"indexes": textwrap.dedent("""\
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;"""),
"table_sizes": textwrap.dedent("""\
SELECT relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS data_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;"""),
},
"mysql": {
"tables": textwrap.dedent("""\
SELECT table_name
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_type = 'BASE TABLE'
ORDER BY table_name;"""),
"columns": textwrap.dedent("""\
SELECT table_name, column_name, column_type, is_nullable,
column_default, column_key, extra
FROM information_schema.columns
WHERE table_schema = DATABASE() {table_filter}
ORDER BY table_name, ordinal_position;"""),
"foreign_keys": textwrap.dedent("""\
SELECT table_name, column_name, referenced_table_name, referenced_column_name
FROM information_schema.key_column_usage
WHERE table_schema = DATABASE() AND referenced_table_name IS NOT NULL
ORDER BY table_name;"""),
"indexes": textwrap.dedent("""\
SELECT table_name, index_name, non_unique, column_name, seq_in_index
FROM information_schema.statistics
WHERE table_schema = DATABASE()
ORDER BY table_name, index_name, seq_in_index;"""),
"table_sizes": textwrap.dedent("""\
SELECT table_name, table_rows,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_length DESC;"""),
},
"sqlite": {
"tables": textwrap.dedent("""\
SELECT name FROM sqlite_master
WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
ORDER BY name;"""),
"columns": textwrap.dedent("""\
-- Run for each table:
PRAGMA table_info({table_name});"""),
"foreign_keys": textwrap.dedent("""\
-- Run for each table:
PRAGMA foreign_key_list({table_name});"""),
"indexes": textwrap.dedent("""\
SELECT name, tbl_name, sql FROM sqlite_master
WHERE type = 'index'
ORDER BY tbl_name, name;"""),
"schema_dump": textwrap.dedent("""\
SELECT name, sql FROM sqlite_master
WHERE type = 'table'
ORDER BY name;"""),
},
"sqlserver": {
"tables": textwrap.dedent("""\
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME;"""),
"columns": textwrap.dedent("""\
SELECT t.name AS table_name, c.name AS column_name,
ty.name AS data_type, c.max_length, c.precision, c.scale,
c.is_nullable, dc.definition AS default_value
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
LEFT JOIN sys.default_constraints dc ON c.default_object_id = dc.object_id
{table_filter}
ORDER BY t.name, c.column_id;"""),
"foreign_keys": textwrap.dedent("""\
SELECT fk.name AS fk_name,
tp.name AS parent_table, cp.name AS parent_column,
tr.name AS referenced_table, cr.name AS referenced_column
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc ON fk.object_id = fkc.constraint_object_id
JOIN sys.tables tp ON fkc.parent_object_id = tp.object_id
JOIN sys.columns cp ON fkc.parent_object_id = cp.object_id AND fkc.parent_column_id = cp.column_id
JOIN sys.tables tr ON fkc.referenced_object_id = tr.object_id
JOIN sys.columns cr ON fkc.referenced_object_id = cr.object_id AND fkc.referenced_column_id = cr.column_id
ORDER BY tp.name;"""),
"indexes": textwrap.dedent("""\
SELECT t.name AS table_name, i.name AS index_name,
i.type_desc, i.is_unique, c.name AS column_name,
ic.key_ordinal
FROM sys.indexes i
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.tables t ON i.object_id = t.object_id
WHERE i.name IS NOT NULL
ORDER BY t.name, i.name, ic.key_ordinal;"""),
},
}
# ---------------------------------------------------------------------------
# Documentation generators
# ---------------------------------------------------------------------------
SAMPLE_TABLES = {
"users": {
"columns": [
{"name": "id", "type": "SERIAL / INT", "nullable": "NO", "default": "auto", "notes": "Primary key"},
{"name": "email", "type": "VARCHAR(255)", "nullable": "NO", "default": "-", "notes": "Unique, indexed"},
{"name": "name", "type": "VARCHAR(255)", "nullable": "YES", "default": "NULL", "notes": "Display name"},
{"name": "password_hash", "type": "VARCHAR(255)", "nullable": "NO", "default": "-", "notes": "bcrypt hash"},
{"name": "created_at", "type": "TIMESTAMP", "nullable": "NO", "default": "NOW()", "notes": ""},
{"name": "updated_at", "type": "TIMESTAMP", "nullable": "NO", "default": "NOW()", "notes": ""},
],
"indexes": ["PRIMARY KEY (id)", "UNIQUE INDEX (email)"],
"foreign_keys": [],
},
"orders": {
"columns": [
{"name": "id", "type": "SERIAL / INT", "nullable": "NO", "default": "auto", "notes": "Primary key"},
{"name": "user_id", "type": "INTEGER", "nullable": "NO", "default": "-", "notes": "FK -> users.id"},
{"name": "status", "type": "VARCHAR(50)", "nullable": "NO", "default": "'pending'", "notes": "pending/paid/shipped/cancelled"},
{"name": "total", "type": "DECIMAL(19,4)", "nullable": "NO", "default": "0", "notes": "Order total in cents"},
{"name": "created_at", "type": "TIMESTAMP", "nullable": "NO", "default": "NOW()", "notes": ""},
],
"indexes": ["PRIMARY KEY (id)", "INDEX (user_id)", "INDEX (status, created_at)"],
"foreign_keys": ["user_id -> users.id ON DELETE CASCADE"],
},
}
def generate_md(dialect: str, tables: List[str]) -> str:
"""Generate markdown schema documentation."""
lines = [f"# Database Schema Documentation ({dialect.upper()})\n"]
    lines.append("Generated by sql-database-assistant schema_explorer.\n")
# Introspection queries section
lines.append("## Introspection Queries\n")
lines.append("Run these queries against your database to extract schema information:\n")
queries = INTROSPECTION_QUERIES.get(dialect, {})
for qname, qsql in queries.items():
        table_filter = ""
        if "all" not in tables:
            tlist = ", ".join(f"'{t}'" for t in tables)
            # sqlserver's sys.* columns query has no WHERE clause and exposes the
            # table name as t.name; the information_schema queries append to an
            # existing WHERE on table_name.
            if dialect == "sqlserver":
                table_filter = f"WHERE t.name IN ({tlist})"
            else:
                table_filter = f"AND table_name IN ({tlist})"
        qsql = qsql.replace("{table_filter}", table_filter)
qsql = qsql.replace("{table_name}", tables[0] if tables and tables[0] != "all" else "TABLE_NAME")
lines.append(f"### {qname.replace('_', ' ').title()}\n")
lines.append(f"```sql\n{qsql}\n```\n")
# Sample documentation
lines.append("## Sample Table Documentation\n")
lines.append("Below is an example of the documentation format produced from query results:\n")
show_tables = tables if "all" not in tables else list(SAMPLE_TABLES.keys())
for tname in show_tables:
sample = SAMPLE_TABLES.get(tname)
if not sample:
lines.append(f"### {tname}\n")
lines.append("_No sample data available. Run introspection queries above._\n")
continue
lines.append(f"### {tname}\n")
lines.append("| Column | Type | Nullable | Default | Notes |")
lines.append("|--------|------|----------|---------|-------|")
for col in sample["columns"]:
lines.append(f"| {col['name']} | {col['type']} | {col['nullable']} | {col['default']} | {col['notes']} |")
lines.append("")
        if sample["indexes"]:
            lines.append("**Indexes:** " + ", ".join(sample["indexes"]))
            lines.append("")
        if sample["foreign_keys"]:
            lines.append("**Foreign Keys:** " + ", ".join(sample["foreign_keys"]))
        lines.append("")
return "\n".join(lines)
def generate_json_output(dialect: str, tables: List[str]) -> dict:
"""Generate JSON schema documentation."""
queries = INTROSPECTION_QUERIES.get(dialect, {})
processed = {}
for qname, qsql in queries.items():
        table_filter = ""
        if "all" not in tables:
            tlist = ", ".join(f"'{t}'" for t in tables)
            # sqlserver's sys.* columns query has no WHERE clause and filters on
            # t.name; the information_schema queries append to an existing WHERE.
            if dialect == "sqlserver":
                table_filter = f"WHERE t.name IN ({tlist})"
            else:
                table_filter = f"AND table_name IN ({tlist})"
processed[qname] = qsql.replace("{table_filter}", table_filter).replace(
"{table_name}", tables[0] if tables and tables[0] != "all" else "TABLE_NAME"
)
show_tables = tables if "all" not in tables else list(SAMPLE_TABLES.keys())
sample_docs = {}
for tname in show_tables:
sample = SAMPLE_TABLES.get(tname)
if sample:
sample_docs[tname] = sample
return {
"dialect": dialect,
"requested_tables": tables,
"introspection_queries": processed,
"sample_documentation": sample_docs,
"instructions": "Run the introspection queries against your database, then use the results to populate documentation in the sample format shown.",
}
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Generate schema documentation from database introspection.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --dialect postgres --tables all --format md
%(prog)s --dialect mysql --tables users,orders --format json
%(prog)s --dialect sqlite --tables all --json
""",
)
parser.add_argument(
"--dialect", required=True, choices=["postgres", "mysql", "sqlite", "sqlserver"],
help="Target database dialect",
)
parser.add_argument(
"--tables", default="all",
help="Comma-separated table names or 'all' (default: all)",
)
parser.add_argument(
"--format", choices=["md", "json"], default="md", dest="fmt",
help="Output format (default: md)",
)
parser.add_argument(
"--json", action="store_true", dest="json_output",
help="Output as JSON (overrides --format)",
)
args = parser.parse_args()
tables = [t.strip() for t in args.tables.split(",")]
if args.json_output or args.fmt == "json":
result = generate_json_output(args.dialect, tables)
print(json.dumps(result, indent=2))
else:
print(generate_md(args.dialect, tables))
if __name__ == "__main__":
main()


@@ -126,6 +126,7 @@ nav:
- "Code Reviewer": skills/engineering-team/code-reviewer.md
- "Email Template Builder": skills/engineering-team/email-template-builder.md
- "Incident Commander": skills/engineering-team/incident-commander.md
- "GCP Cloud Architect": skills/engineering-team/gcp-cloud-architect.md
- "Google Workspace CLI": skills/engineering-team/google-workspace-cli.md
- "Microsoft 365 Tenant Manager": skills/engineering-team/ms365-tenant-manager.md
- Playwright Pro:
@@ -196,9 +197,11 @@ nav:
- "RAG Architect": skills/engineering/rag-architect.md
- "Release Manager": skills/engineering/release-manager.md
- "Runbook Generator": skills/engineering/runbook-generator.md
- "Secrets Vault Manager": skills/engineering/secrets-vault-manager.md
- "Skill Security Auditor": skills/engineering/skill-security-auditor.md
- "Skill Tester": skills/engineering/skill-tester.md
- "Spec-Driven Workflow": skills/engineering/spec-driven-workflow.md
- "SQL Database Assistant": skills/engineering/sql-database-assistant.md
- "Tech Debt Tracker": skills/engineering/tech-debt-tracker.md
- "Terraform Patterns": skills/engineering/terraform-patterns.md
- "Helm Chart Builder": skills/engineering/helm-chart-builder.md
@@ -333,6 +336,7 @@ nav:
- "Quality Manager - ISO 13485": skills/ra-qm-team/quality-manager-qms-iso13485.md
- "Regulatory Affairs Head": skills/ra-qm-team/regulatory-affairs-head.md
- "Risk Management Specialist": skills/ra-qm-team/risk-management-specialist.md
- "SOC 2 Compliance": skills/ra-qm-team/soc2-compliance.md
- Business & Growth:
- Overview: skills/business-growth/index.md
- "Contract & Proposal Writer": skills/business-growth/contract-and-proposal-writer.md


@@ -0,0 +1,417 @@
---
name: "soc2-compliance"
description: "Use when the user asks to prepare for SOC 2 audits, map Trust Service Criteria, build control matrices, collect audit evidence, perform gap analysis, or assess SOC 2 Type I vs Type II readiness."
---
# SOC 2 Compliance
SOC 2 Type I and Type II compliance preparation for SaaS companies. Covers Trust Service Criteria mapping, control matrix generation, evidence collection, gap analysis, and audit readiness assessment.
## Table of Contents
- [Overview](#overview)
- [Trust Service Criteria](#trust-service-criteria)
- [Control Matrix Generation](#control-matrix-generation)
- [Gap Analysis Workflow](#gap-analysis-workflow)
- [Evidence Collection](#evidence-collection)
- [Audit Readiness Checklist](#audit-readiness-checklist)
- [Vendor Management](#vendor-management)
- [Continuous Compliance](#continuous-compliance)
- [Anti-Patterns](#anti-patterns)
- [Tools](#tools)
- [References](#references)
- [Cross-References](#cross-references)
---
## Overview
### What Is SOC 2?
SOC 2 (System and Organization Controls 2) is an auditing framework developed by the AICPA that evaluates how a service organization manages customer data. It applies to any technology company that stores, processes, or transmits customer information — primarily SaaS, cloud infrastructure, and managed service providers.
### Type I vs Type II
| Aspect | Type I | Type II |
|--------|--------|---------|
| **Scope** | Design of controls at a point in time | Design AND operating effectiveness over a period |
| **Duration** | Snapshot (single date) | Observation window (3-12 months, typically 6) |
| **Evidence** | Control descriptions, policies | Control descriptions + operating evidence (logs, tickets, screenshots) |
| **Cost** | $20K-$50K (audit fees) | $30K-$100K+ (audit fees) |
| **Timeline** | 1-2 months (audit phase) | 6-12 months (observation + audit) |
| **Best For** | First-time compliance, rapid market need | Mature organizations, enterprise customers |
### Who Needs SOC 2?
- **SaaS companies** selling to enterprise customers
- **Cloud infrastructure providers** handling customer workloads
- **Data processors** managing PII, PHI, or financial data
- **Managed service providers** with access to client systems
- **Any vendor** whose customers require third-party assurance
### Typical Journey
```
Gap Assessment → Remediation → Type I Audit → Observation Period → Type II Audit → Annual Renewal
(4-8 wk) (8-16 wk) (4-6 wk) (6-12 mo) (4-6 wk) (ongoing)
```
---
## Trust Service Criteria
SOC 2 is organized around five Trust Service Criteria (TSC) categories. **Security** is required for every SOC 2 report; the remaining four are optional and selected based on business need.
### Security (Common Criteria CC1-CC9) — Required
The foundation of every SOC 2 report. Maps to COSO 2013 principles.
| Criteria | Domain | Key Controls |
|----------|--------|-------------|
| **CC1** | Control Environment | Integrity/ethics, board oversight, org structure, competence, accountability |
| **CC2** | Communication & Information | Internal/external communication, information quality |
| **CC3** | Risk Assessment | Risk identification, fraud risk, change impact analysis |
| **CC4** | Monitoring Activities | Ongoing monitoring, deficiency evaluation, corrective actions |
| **CC5** | Control Activities | Policies/procedures, technology controls, deployment through policies |
| **CC6** | Logical & Physical Access | Access provisioning, authentication, encryption, physical restrictions |
| **CC7** | System Operations | Vulnerability management, anomaly detection, incident response |
| **CC8** | Change Management | Change authorization, testing, approval, emergency changes |
| **CC9** | Risk Mitigation | Vendor/business partner risk management |
### Availability (A1) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **A1.1** | Capacity management | Infrastructure scaling, resource monitoring, capacity planning |
| **A1.2** | Recovery operations | Backup procedures, disaster recovery, BCP testing |
| **A1.3** | Recovery testing | DR drills, failover testing, RTO/RPO validation |
**Select when:** Customers depend on your uptime; you have SLAs; downtime causes direct business impact.
### Confidentiality (C1) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **C1.1** | Identification | Data classification policy, confidential data inventory |
| **C1.2** | Protection | Encryption at rest and in transit, DLP, access restrictions |
| **C1.3** | Disposal | Secure deletion procedures, media sanitization, retention enforcement |
**Select when:** You handle trade secrets, proprietary data, or contractually confidential information.
### Processing Integrity (PI1) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **PI1.1** | Accuracy | Input validation, processing checks, output verification |
| **PI1.2** | Completeness | Transaction monitoring, reconciliation, error handling |
| **PI1.3** | Timeliness | SLA monitoring, processing delay alerts, batch job monitoring |
| **PI1.4** | Authorization | Processing authorization controls, segregation of duties |
**Select when:** Data accuracy is critical (financial processing, healthcare records, analytics platforms).
### Privacy (P1-P8) — Optional
| Criteria | Focus | Key Controls |
|----------|-------|-------------|
| **P1** | Notice | Privacy policy, data collection notice, purpose limitation |
| **P2** | Choice & Consent | Opt-in/opt-out, consent management, preference tracking |
| **P3** | Collection | Minimal collection, lawful basis, purpose specification |
| **P4** | Use, Retention, Disposal | Purpose limitation, retention schedules, secure disposal |
| **P5** | Access | Data subject access requests, correction rights |
| **P6** | Disclosure & Notification | Third-party sharing, breach notification |
| **P7** | Quality | Data accuracy verification, correction mechanisms |
| **P8** | Monitoring & Enforcement | Privacy program monitoring, complaint handling |
**Select when:** You process PII and customers expect privacy assurance (complements GDPR compliance).
---
## Control Matrix Generation
A control matrix maps each TSC criterion to specific controls, owners, evidence, and testing procedures.
### Matrix Structure
| Field | Description |
|-------|-------------|
| **Control ID** | Unique identifier (e.g., SEC-001, AVL-003) |
| **TSC Mapping** | Which criteria the control addresses (e.g., CC6.1, A1.2) |
| **Control Description** | What the control does |
| **Control Type** | Preventive, Detective, or Corrective |
| **Owner** | Responsible person/team |
| **Frequency** | Continuous, Daily, Weekly, Monthly, Quarterly, Annual |
| **Evidence Type** | Screenshot, Log, Policy, Config, Ticket |
| **Testing Procedure** | How the auditor verifies the control |
### Control Naming Convention
```
{CATEGORY}-{NUMBER}
SEC-001 through SEC-NNN → Security
AVL-001 through AVL-NNN → Availability
CON-001 through CON-NNN → Confidentiality
PRI-001 through PRI-NNN → Processing Integrity
PRV-001 through PRV-NNN → Privacy
```
### Workflow
1. Select applicable TSC categories based on business needs
2. Run `control_matrix_builder.py` to generate the baseline matrix
3. Customize controls to match your actual environment
4. Assign owners and evidence requirements
5. Validate coverage — every selected TSC criterion must have at least one control
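As a sketch, a baseline row from step 2 can be represented as a plain record, and step 5's coverage check reduces to a set difference. The control text and field names below are illustrative assumptions, not the actual output of `control_matrix_builder.py`:

```python
# Illustrative baseline row in the matrix structure above; the control text
# and field names are examples, not actual control_matrix_builder.py output.
BASELINE_CONTROLS = [
    {
        "control_id": "SEC-001",
        "tsc_mapping": ["CC6.1", "CC6.2"],
        "description": "Production access requires an approved ticket and MFA.",
        "control_type": "Preventive",
        "owner": "Platform Engineering",
        "frequency": "Continuous",
        "evidence_type": ["Ticket", "Config"],
        "testing_procedure": "Sample provisioning tickets; verify approval and MFA.",
    },
]

def validate_coverage(controls, selected_criteria):
    """Step 5: every selected TSC criterion needs at least one mapped control."""
    covered = {c for ctrl in controls for c in ctrl["tsc_mapping"]}
    return sorted(set(selected_criteria) - covered)  # uncovered criteria
```

For example, `validate_coverage(BASELINE_CONTROLS, ["CC6.1", "CC7.1"])` returns `["CC7.1"]`, flagging CC7.1 as a criterion with no control.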
---
## Gap Analysis Workflow
### Phase 1: Current State Assessment
1. **Document existing controls** — inventory all security policies, procedures, and technical controls
2. **Map to TSC** — align existing controls to Trust Service Criteria
3. **Collect evidence samples** — gather proof that controls exist and operate
4. **Interview control owners** — verify understanding and execution
### Phase 2: Gap Identification
Run `gap_analyzer.py` against your current controls to identify:
- **Missing controls** — TSC criteria with no corresponding control
- **Partially implemented** — Control exists but lacks evidence or consistency
- **Design gaps** — Control designed but does not adequately address the criteria
- **Operating gaps** (Type II only) — Control designed correctly but not operating effectively
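A minimal sketch of how these four categories could be assigned programmatically; the field names (`addresses_criterion`, `has_evidence`, `operating_effectively`) are assumptions, and `gap_analyzer.py` may use different ones:

```python
def classify_gap(control, audit_type="type1"):
    """Map one TSC criterion's control to a gap category from the list above."""
    if control is None:
        return "missing"                 # no control addresses the criterion
    if not control.get("addresses_criterion", True):
        return "design_gap"              # designed, but does not cover the criterion
    if not control.get("has_evidence", False):
        return "partially_implemented"   # exists, but lacks evidence or consistency
    if audit_type == "type2" and not control.get("operating_effectively", True):
        return "operating_gap"           # Type II only: not operating consistently
    return "no_gap"
```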
### Phase 3: Remediation Planning
For each gap, define:
| Field | Description |
|-------|-------------|
| Gap ID | Reference identifier |
| TSC Criteria | Affected criteria |
| Gap Description | What is missing or insufficient |
| Remediation Action | Specific steps to close the gap |
| Owner | Person responsible for remediation |
| Priority | Critical / High / Medium / Low |
| Target Date | Completion deadline |
| Dependencies | Other gaps or projects that must complete first |
### Phase 4: Timeline Planning
| Priority | Target Remediation |
|----------|--------------------|
| Critical | 2-4 weeks |
| High | 4-8 weeks |
| Medium | 8-12 weeks |
| Low | 12-16 weeks |
---
## Evidence Collection
### Evidence Types by Control Category
| Control Area | Primary Evidence | Secondary Evidence |
|--------------|-----------------|-------------------|
| Access Management | User access reviews, provisioning tickets | Role matrix, access logs |
| Change Management | Change tickets, approval records | Deployment logs, test results |
| Incident Response | Incident tickets, postmortems | Runbooks, escalation records |
| Vulnerability Management | Scan reports, patch records | Remediation timelines |
| Encryption | Configuration screenshots, certificate inventory | Key rotation logs |
| Backup & Recovery | Backup logs, DR test results | Recovery time measurements |
| Monitoring | Alert configurations, dashboard screenshots | On-call schedules, escalation records |
| Policy Management | Signed policies, version history | Training completion records |
| Vendor Management | Vendor assessments, SOC 2 reports | Contract reviews, risk registers |
### Automation Opportunities
| Area | Automation Approach |
|------|-------------------|
| Access reviews | Integrate IAM with ticketing (automatic quarterly review triggers) |
| Configuration evidence | Infrastructure-as-code snapshots, compliance-as-code tools |
| Vulnerability scans | Scheduled scanning with auto-generated reports |
| Change management | Git-based audit trail (commits, PRs, approvals) |
| Uptime monitoring | Automated SLA dashboards with historical data |
| Backup verification | Automated restore tests with success/failure logging |
### Continuous Monitoring
Move from point-in-time evidence collection to continuous compliance:
1. **Automated evidence gathering** — scripts that pull evidence on schedule
2. **Control dashboards** — real-time visibility into control status
3. **Alert-based monitoring** — notify when a control drifts out of compliance
4. **Evidence repository** — centralized, timestamped evidence storage
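Points 1 and 4 combined can be sketched as a small helper that writes each artifact with a UTC timestamp and a content hash; the repository layout shown is an assumption, not a prescribed format:

```python
# Sketch of automated, timestamped evidence capture into a central repository.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def store_evidence(repo: Path, control_id: str, payload: bytes, source: str) -> Path:
    """Write one evidence artifact plus a sidecar metadata file."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(payload).hexdigest()
    dest = repo / control_id / f"{ts}-{digest[:12]}.bin"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    meta = {"control_id": control_id, "collected_at": ts,
            "sha256": digest, "source": source}
    dest.with_suffix(".json").write_text(json.dumps(meta, indent=2))
    return dest
```

Run on a schedule (cron, CI job), this yields an auditor-friendly trail where every artifact carries its collection time and integrity hash.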
---
## Audit Readiness Checklist
### Pre-Audit Preparation (4-6 Weeks Before)
- [ ] All controls documented with descriptions, owners, and frequencies
- [ ] Evidence collected for the entire observation period (Type II)
- [ ] Control matrix reviewed and gaps remediated
- [ ] Policies signed and distributed within the last 12 months
- [ ] Access reviews completed within the required frequency
- [ ] Vulnerability scans current (no critical/high unpatched > SLA)
- [ ] Incident response plan tested within the last 12 months
- [ ] Vendor risk assessments current for all subservice organizations
- [ ] DR/BCP tested and documented within the last 12 months
- [ ] Employee security training completed for all staff
### Readiness Scoring
| Score | Rating | Meaning |
|-------|--------|---------|
| 90-100% | Audit Ready | Proceed with confidence |
| 75-89% | Minor Gaps | Address before scheduling audit |
| 50-74% | Significant Gaps | Remediation required |
| < 50% | Not Ready | Major program build-out needed |
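The rating thresholds above translate directly into a small lookup:

```python
def readiness_rating(score_pct: float) -> str:
    """Map a readiness score (0-100) to the rating bands in the table above."""
    if score_pct >= 90:
        return "Audit Ready"
    if score_pct >= 75:
        return "Minor Gaps"
    if score_pct >= 50:
        return "Significant Gaps"
    return "Not Ready"
```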
### Common Audit Findings
| Finding | Root Cause | Prevention |
|---------|-----------|-----------|
| Incomplete access reviews | Manual process, no reminders | Automate quarterly review triggers |
| Missing change approvals | Emergency changes bypass process | Define emergency change procedure with post-hoc approval |
| Stale vulnerability scans | Scanner misconfigured | Automated weekly scans with alerting |
| Policy not acknowledged | No tracking mechanism | Annual e-signature workflow |
| Missing vendor assessments | No vendor inventory | Maintain vendor register with review schedule |
---
## Vendor Management
### Third-Party Risk Assessment
Every vendor that accesses, stores, or processes customer data must be assessed:
1. **Vendor inventory** — maintain a register of all service providers
2. **Risk classification** — categorize vendors by data access level
3. **Due diligence** — collect SOC 2 reports, security questionnaires, certifications
4. **Contractual protections** — ensure DPAs, security requirements, breach notification clauses
5. **Ongoing monitoring** — annual reassessment, continuous news monitoring
### Vendor Risk Tiers
| Tier | Data Access | Assessment Frequency | Requirements |
|------|-------------|---------------------|-------------|
| Critical | Processes/stores customer data | Annual + continuous monitoring | SOC 2 Type II, penetration test, security review |
| High | Accesses customer environment | Annual | SOC 2 Type II or equivalent, questionnaire |
| Medium | Indirect access, support tools | Annual questionnaire | Security certifications, questionnaire |
| Low | No data access | Biennial questionnaire | Basic security questionnaire |
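The tier assignment can be sketched as a cascade over data-access flags; the input flags are an assumption about how access level is recorded in your vendor register:

```python
def vendor_tier(stores_customer_data: bool, accesses_environment: bool,
                indirect_access: bool) -> str:
    """Assign the highest applicable tier from the table above."""
    if stores_customer_data:
        return "Critical"
    if accesses_environment:
        return "High"
    if indirect_access:
        return "Medium"
    return "Low"
```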
### Subservice Organizations
When your SOC 2 report relies on controls at a subservice organization (e.g., AWS, GCP, Azure):
- **Inclusive method** — your report covers the subservice org's controls (requires their cooperation)
- **Carve-out method** — your report excludes their controls but references their SOC 2 report
- Most companies use **carve-out** and include complementary user entity controls (CUECs)
---
## Continuous Compliance
### From Point-in-Time to Continuous
| Aspect | Point-in-Time | Continuous |
|--------|---------------|-----------|
| Evidence collection | Manual, before audit | Automated, ongoing |
| Control monitoring | Periodic review | Real-time dashboards |
| Drift detection | Found during audit | Alert-based, immediate |
| Remediation | Reactive | Proactive |
| Audit preparation | 4-8 week scramble | Always ready |
### Implementation Steps
1. **Automate evidence gathering** — cron jobs, API integrations, IaC snapshots
2. **Build control dashboards** — aggregate control status into a single view
3. **Configure drift alerts** — notify when controls fall out of compliance
4. **Establish review cadence** — weekly control owner check-ins, monthly steering
5. **Maintain evidence repository** — centralized, timestamped, auditor-accessible
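Step 3 can be sketched as a staleness check: flag any control whose newest evidence is older than its collection frequency allows. The field names and day thresholds are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Maximum evidence age per collection frequency (days, illustrative).
FREQUENCY_DAYS = {"Daily": 1, "Weekly": 7, "Monthly": 31, "Quarterly": 92}

def drifted_controls(controls, now=None):
    """Return IDs of controls whose latest evidence exceeds its allowed age."""
    now = now or datetime.now()
    alerts = []
    for c in controls:
        max_age = timedelta(days=FREQUENCY_DAYS.get(c["frequency"], 366))
        if now - c["last_evidence_at"] > max_age:
            alerts.append(c["control_id"])
    return alerts
```

Wired to a notifier, this turns drift detection from an audit-time surprise into an immediate alert.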
### Annual Re-Assessment Cycle
| Quarter | Activities |
|---------|-----------|
| Q1 | Annual risk assessment, policy refresh, vendor reassessment launch |
| Q2 | Internal control testing, remediation of findings |
| Q3 | Pre-audit readiness review, evidence completeness check |
| Q4 | External audit, management assertion, report distribution |
---
## Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|--------------|-------------|----------------|
| Point-in-time compliance | Controls degrade between audits; gaps found during audit | Implement continuous monitoring and automated evidence |
| Manual evidence collection | Time-consuming, inconsistent, error-prone | Automate with scripts, IaC, and compliance platforms |
| Missing vendor assessments | Auditors flag incomplete vendor due diligence | Maintain vendor register with risk-tiered assessment schedule |
| Copy-paste policies | Generic policies don't match actual operations | Tailor policies to your actual environment and technology stack |
| Security theater | Controls exist on paper but aren't followed | Verify operating effectiveness; build controls into workflows |
| Skipping Type I | Jumping to Type II without foundational readiness | Start with Type I to validate control design before observation |
| Over-scoping TSC | Including all 5 categories when only Security is needed | Select categories based on actual customer/business requirements |
| Treating audit as a project | Compliance degrades after the report is issued | Build compliance into daily operations and engineering culture |
---
## Tools
### Control Matrix Builder
Generates a SOC 2 control matrix from selected TSC categories.
```bash
# Generate full security matrix in markdown
python scripts/control_matrix_builder.py --categories security --format md
# Generate matrix for multiple categories as JSON
python scripts/control_matrix_builder.py --categories security,availability,confidentiality --format json
# All categories, CSV output
python scripts/control_matrix_builder.py --categories security,availability,confidentiality,processing-integrity,privacy --format csv
```
### Evidence Tracker
Tracks evidence collection status per control.
```bash
# Check evidence status from a control matrix
python scripts/evidence_tracker.py --matrix controls.json --status
# JSON output for integration
python scripts/evidence_tracker.py --matrix controls.json --status --json
```
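A `controls.json` input in roughly the shape the tracker consumes might look like the following; the exact field names are an assumption, so check the script for the authoritative schema:

```json
{
  "controls": [
    {
      "control_id": "SEC-001",
      "tsc_mapping": ["CC6.1"],
      "owner": "Platform Engineering",
      "frequency": "Quarterly",
      "evidence": [
        {
          "type": "Ticket",
          "path": "evidence/SEC-001/2025-Q1-access-review.pdf",
          "collected_at": "2025-03-31"
        }
      ]
    }
  ]
}
```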
### Gap Analyzer
Analyzes current controls against SOC 2 requirements and identifies gaps.
```bash
# Type I gap analysis
python scripts/gap_analyzer.py --controls current_controls.json --type type1
# Type II gap analysis (includes operating effectiveness)
python scripts/gap_analyzer.py --controls current_controls.json --type type2 --json
```
---
## References
- [Trust Service Criteria Reference](references/trust_service_criteria.md) — All 5 TSC categories with sub-criteria, control objectives, and evidence examples
- [Evidence Collection Guide](references/evidence_collection_guide.md) — Evidence types per control, automation tools, documentation requirements
- [Type I vs Type II Comparison](references/type1_vs_type2.md) — Detailed comparison, timeline, cost analysis, and upgrade path
---
## Cross-References
- **[gdpr-dsgvo-expert](../gdpr-dsgvo-expert/SKILL.md)** — SOC 2 Privacy criteria overlaps significantly with GDPR requirements; use together when processing EU personal data
- **[information-security-manager-iso27001](../information-security-manager-iso27001/SKILL.md)** — ISO 27001 Annex A controls map closely to SOC 2 Security criteria; organizations pursuing both can share evidence
- **[isms-audit-expert](../isms-audit-expert/SKILL.md)** — Audit methodology and finding management patterns transfer directly to SOC 2 audit preparation


@@ -0,0 +1,227 @@
# SOC 2 Evidence Collection Guide
Practical guide for collecting, organizing, and maintaining audit evidence for SOC 2 Type I and Type II engagements. Covers evidence types, automation strategies, and documentation requirements.
---
## Evidence Fundamentals
### What Auditors Look For
1. **Existence** — The control is documented and exists
2. **Design effectiveness** — The control is designed to address the TSC criterion (Type I + Type II)
3. **Operating effectiveness** — The control operates consistently over the observation period (Type II only)
### Evidence Quality Criteria
| Criterion | Description |
|-----------|-------------|
| **Relevant** | Directly demonstrates the control's operation |
| **Reliable** | Generated by systems or independent parties (not self-reported) |
| **Timely** | Falls within the audit/observation period |
| **Sufficient** | Enough samples to demonstrate consistency |
| **Complete** | Covers the full population or a representative sample |
### Evidence Types
| Type | Description | Examples |
|------|-------------|---------|
| **Inquiry** | Verbal or written descriptions from personnel | Interview notes, written responses |
| **Observation** | Auditor witnesses control in operation | Process walkthroughs, live demonstrations |
| **Inspection** | Review of documents, records, or configurations | Policy documents, system screenshots, logs |
| **Re-performance** | Auditor re-executes the control to verify results | Access review validation, configuration checks |
---
## Evidence by Control Area
### Access Management
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Access provisioning | Provisioning policy, role matrix | Sample provisioning tickets with approvals (full period) |
| Access removal | Termination checklist, deprovisioning SOP | Sample termination events with access removal timestamps |
| Access reviews | Review policy, review template | Completed quarterly access review reports with sign-offs |
| MFA enforcement | MFA policy, configuration screenshot | MFA enrollment report showing 100% coverage |
| Privileged access | Privileged access policy, admin list | Quarterly privileged access reviews, admin activity logs |
### Change Management
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Change authorization | Change management policy, workflow description | Sample change tickets with approvals, peer reviews |
| Testing requirements | Testing policy, test plan template | Test results for sampled changes, QA sign-offs |
| Emergency changes | Emergency change procedure | Emergency change tickets with post-hoc approvals |
| Deployment process | CI/CD documentation, deployment runbook | Deployment logs, rollback records |
| Code review | Code review policy | Pull request histories showing reviewer approvals |
### Incident Response
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| IR plan | Incident response plan document | Plan review/update records, version history |
| IR testing | Tabletop exercise schedule | Tabletop exercise reports, lessons learned |
| Incident handling | Triage procedures, classification criteria | Incident tickets with timestamps, escalation records |
| Postmortems | Postmortem template, review process | Completed postmortem documents, follow-up actions |
| Communication | Communication plan, stakeholder list | Notification records, status page updates |
### Vulnerability Management
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Scanning | Scanning schedule, tool configuration | Scan reports covering the full period (weekly/monthly) |
| Remediation SLAs | Remediation policy with SLA definitions | Remediation tracking showing SLA compliance rates |
| Patch management | Patching policy, schedule | Patch records, before/after scan comparisons |
| Penetration testing | Pentest policy, scope definition | Pentest reports (annual), remediation records |
### Encryption and Data Protection
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Encryption at rest | Encryption policy, configuration docs | Configuration screenshots, encryption audit reports |
| Encryption in transit | TLS policy, minimum version requirements | TLS scan results, certificate inventory |
| Key management | Key management policy, rotation schedule | Key rotation logs, access records for key stores |
| DLP | DLP policy, tool configuration | DLP alert logs, incident records, exception approvals |
### Backup and Recovery
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Backup procedures | Backup policy, schedule, retention rules | Backup success/failure logs (daily), retention compliance |
| DR planning | DR plan, recovery procedures | DR plan review records, update history |
| DR testing | DR test schedule, test plan | DR test reports with RTO/RPO measurements |
| BCP | BCP document, communication tree | BCP review records, test results |
### Monitoring and Logging
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| SIEM/logging | Logging policy, SIEM configuration | Log retention evidence, alert samples, dashboard screenshots |
| Alert management | Alert rules, escalation procedures | Alert trigger samples, response records |
| Uptime monitoring | Monitoring tool configuration, SLA definitions | Uptime reports covering the full period |
| Anomaly detection | Detection rules, baseline configuration | Detection events, investigation records |
### Policy and Governance
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Security policies | Policy library, version control | Policy acknowledgment records, annual review evidence |
| Security training | Training program description, content | Training completion records (all employees) |
| Risk assessment | Risk assessment methodology | Annual risk assessment report, risk register updates |
| Board oversight | Committee charter, reporting schedule | Board meeting minutes, security reports to leadership |
### Vendor Management
| Control | Type I Evidence | Type II Evidence |
|---------|----------------|-----------------|
| Vendor inventory | Vendor register, classification criteria | Current vendor register with risk tiers |
| Vendor assessment | Assessment questionnaire, criteria | Completed assessments, vendor SOC reports collected |
| Contractual controls | DPA template, security requirements | Signed DPAs, contract review records |
| Ongoing monitoring | Monitoring schedule, reassessment triggers | Reassessment records, monitoring reports |
---
## Evidence Automation
### Automated Evidence Sources
| Evidence | Automation Approach | Tools |
|----------|-------------------|-------|
| Access reviews | Scheduled IAM exports, automated review workflows | Okta, Azure AD, AWS IAM + Jira/ServiceNow |
| Configuration compliance | Infrastructure-as-code, policy-as-code scanning | Terraform, OPA, AWS Config, Azure Policy |
| Vulnerability scans | Scheduled scanning with report auto-generation | Nessus, Qualys, Snyk, Dependabot |
| Change management | Git-based audit trails (commits, PRs, approvals) | GitHub, GitLab, Bitbucket |
| Uptime monitoring | Continuous synthetic monitoring with SLA dashboards | Datadog, New Relic, PagerDuty, Pingdom |
| Backup verification | Automated backup validation and restore tests | AWS Backup, Veeam, custom scripts |
| Training completion | LMS with automated tracking and reminders | KnowBe4, Curricula, custom LMS |
| Policy acknowledgment | Digital signature workflows with tracking | DocuSign, HelloSign, internal tools |
### Evidence Collection Script Pattern
```
1. Define evidence requirements per control
2. Map each requirement to a data source (API, log, screenshot)
3. Schedule automated collection (daily/weekly/monthly)
4. Store evidence with timestamps in a central repository
5. Generate collection status dashboard
6. Alert on missing or overdue evidence
```
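The six steps above can be sketched as a small collector. This is a minimal illustration, not a real integration: the control IDs, source names, and cadences below are hypothetical examples.

```python
# Minimal sketch of the evidence-collection pattern. Control IDs,
# sources, and cadences are hypothetical placeholders.
from datetime import datetime, timezone

# Steps 1-2: evidence requirements mapped to a data source and cadence
REQUIREMENTS = {
    "CC6.4-access-review": {"source": "iam_export", "cadence": "quarterly"},
    "CC7.1-vuln-scan": {"source": "scanner_api", "cadence": "weekly"},
    "A1.2-backup-log": {"source": "backup_api", "cadence": "daily"},
}

def collect(control_id: str, payload: bytes, repo: dict) -> None:
    """Steps 3-4: store one evidence item with a UTC timestamp."""
    stamp = datetime.now(timezone.utc).isoformat()
    repo.setdefault(control_id, []).append({"collected_at": stamp, "data": payload})

def status_dashboard(repo: dict) -> dict:
    """Steps 5-6: report which controls have evidence and which are missing."""
    return {cid: ("ok" if repo.get(cid) else "MISSING") for cid in REQUIREMENTS}

repo: dict = {}
collect("CC6.4-access-review", b"q1-review.pdf", repo)
print(status_dashboard(repo))
```

In practice the `collect` calls are driven by a scheduler and the `MISSING` entries feed an alerting channel rather than a print statement.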
### Evidence Repository Structure
```
evidence/
├── {year}-{audit-period}/
│ ├── access-management/
│ │ ├── quarterly-access-review-Q1.pdf
│ │ ├── quarterly-access-review-Q2.pdf
│ │ ├── mfa-enrollment-report-2025-03.png
│ │ └── provisioning-samples/
│ ├── change-management/
│ │ ├── change-ticket-samples/
│ │ └── deployment-logs/
│ ├── incident-response/
│ │ ├── ir-plan-v3.2.pdf
│ │ ├── tabletop-exercise-2025-06.pdf
│ │ └── incident-tickets/
│ ├── vulnerability-management/
│ │ ├── scan-reports/
│ │ └── pentest-report-2025.pdf
│ ├── policies/
│ │ ├── information-security-policy-v4.pdf
│ │ └── acknowledgment-records/
│ └── vendor-management/
│ ├── vendor-register.csv
│ └── vendor-assessments/
```
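Standing up this tree for a new audit period is easy to script. A minimal sketch, assuming a subset of the areas shown above; extend `AREAS` to match your control set.

```python
# Sketch: scaffold the evidence repository for a new audit period.
# The area list mirrors the tree above and is intentionally partial.
from pathlib import Path

AREAS = [
    "access-management/provisioning-samples",
    "change-management/change-ticket-samples",
    "change-management/deployment-logs",
    "incident-response/incident-tickets",
    "vulnerability-management/scan-reports",
    "policies/acknowledgment-records",
    "vendor-management/vendor-assessments",
]

def scaffold(root: str, period: str) -> Path:
    """Create {root}/{period}/{area} for every area; idempotent."""
    base = Path(root) / period
    for area in AREAS:
        (base / area).mkdir(parents=True, exist_ok=True)
    return base

base = scaffold("evidence", "2025-annual")
print(sorted(p.name for p in base.iterdir()))
```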
---
## Sampling Methodology
Auditors use sampling to test operating effectiveness. Understanding the methodology helps you prepare the right volume of evidence.
### Sample Sizes by Control Frequency
| Control Frequency | Population Size (per period) | Typical Sample Size |
|-------------------|------------------------------|-------------------|
| Annual | 1 | 1 (all items) |
| Quarterly | 4 | 2-4 |
| Monthly | 6-12 | 2-5 |
| Weekly | 26-52 | 5-15 |
| Daily | 180-365 | 20-40 |
| Continuous/per-event | Varies | 25-60 |
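A simple lookup over the table above helps size your evidence preparation. The ranges are the table's typical values, not an auditor mandate; your audit firm's sampling methodology governs the actual counts.

```python
# Sketch: plan evidence volume from the sample-size table above.
# Ranges are typical values, not a guarantee of what auditors will pull.
SAMPLE_SIZES = {
    "annual": (1, 1),
    "quarterly": (2, 4),
    "monthly": (2, 5),
    "weekly": (5, 15),
    "daily": (20, 40),
    "continuous": (25, 60),
}

def planned_sample(frequency: str, automated: bool = False) -> int:
    """Plan for the top of the range; a validated automated control needs 1."""
    if automated:
        return 1
    low, high = SAMPLE_SIZES[frequency.lower()]
    return high

print(planned_sample("weekly"))                  # plan for 15 items
print(planned_sample("daily", automated=True))   # 1
```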
### Key Sampling Rules
1. **Higher frequency = larger sample** — more occurrences mean more samples needed
2. **Automated controls** — typically only 1 sample needed if the system is validated
3. **Exceptions must be explained** — any deviation in a sample requires documentation
4. **Population completeness** — you must provide the full population for the auditor to select from
---
## Type I vs Type II Evidence Differences
| Aspect | Type I | Type II |
|--------|--------|---------|
| **Time scope** | Single point in time | Entire observation period (3-12 months) |
| **Volume** | Lower — policies and configurations | Higher — ongoing logs, tickets, reports |
| **Focus** | "Is the control designed properly?" | "Did the control operate effectively?" |
| **Exceptions** | N/A | Must document and explain every exception |
| **Owner sign-off** | Policy approval records | Ongoing review sign-offs throughout the period |
---
## Common Evidence Pitfalls
| Pitfall | Impact | Prevention |
|---------|--------|-----------|
| Screenshots without timestamps | Auditor cannot verify timing | Always include system clock or date stamps |
| Policies without version control | Cannot prove current vs outdated | Use document management with version tracking |
| Access reviews without sign-off | Cannot prove review was completed | Require digital approval/sign-off on every review |
| Gaps in monitoring data | Suggests control was not operating | Ensure logging continuity; document any outages |
| Evidence from wrong period | Does not cover the observation window | Verify date ranges before submission |
| Redacted evidence without explanation | Auditor may question completeness | Provide redaction rationale and methodology |
| Self-generated evidence only | Lower reliability in auditor's assessment | Include system-generated and third-party evidence |
| Missing exception documentation | Auditor flags as control failure | Document every exception with root cause and remediation |
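Several of these pitfalls can be caught with a pre-submission check. A minimal sketch covering missing timestamps, wrong-period evidence, and missing sign-offs; the field names are illustrative, not a standard schema.

```python
# Sketch: flag common evidence pitfalls before submitting to the auditor.
# Field names ("timestamp", "sign_off") are hypothetical.
from datetime import date

def check_evidence(item: dict, period_start: date, period_end: date) -> list:
    """Return a list of pitfall descriptions; empty means no issues found."""
    issues = []
    ts = item.get("timestamp")
    if ts is None:
        issues.append("missing timestamp")
    elif not (period_start <= ts <= period_end):
        issues.append("outside observation period")
    if item.get("sign_off") is None:
        issues.append("no sign-off recorded")
    return issues

item = {"timestamp": date(2024, 12, 1), "sign_off": None}
print(check_evidence(item, date(2025, 1, 1), date(2025, 6, 30)))
```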
# SOC 2 Trust Service Criteria Reference
Comprehensive reference for all five AICPA Trust Service Criteria (TSC) categories. Each criterion includes its objective, sub-criteria, typical controls, and evidence examples.
---
## 1. Security (Common Criteria) — Required
The Security category is mandatory for every SOC 2 engagement. Its nine common criteria groups (CC1-CC9) incorporate the 17 COSO 2013 internal control principles in CC1-CC5 and add supplemental criteria for access, operations, change management, and risk mitigation in CC6-CC9.
### CC1 — Control Environment
Establishes the foundation for all other components of internal control.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC1.1 | Demonstrate commitment to integrity and ethical values | Code of conduct, ethics hotline, background checks | Signed code of conduct, hotline reports, screening records |
| CC1.2 | Board exercises oversight of internal control | Independent board/committee, regular reporting | Board meeting minutes, committee charters, oversight reports |
| CC1.3 | Management establishes structure and reporting lines | Organizational charts, role definitions, RACI matrices | Org charts, job descriptions, authority matrices |
| CC1.4 | Commitment to attract, develop, and retain competent individuals | Training programs, competency assessments, career development | Training completion records, skills assessments, HR policies |
| CC1.5 | Hold individuals accountable for internal control responsibilities | Performance evaluations, disciplinary procedures | Performance review records, accountability documentation |
### CC2 — Communication and Information
Ensures relevant, quality information flows internally and externally.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC2.1 | Obtain and generate relevant quality information | Data classification, information quality standards | Classification policy, data quality reports |
| CC2.2 | Internally communicate information and responsibilities | Internal newsletters, policy distribution, security awareness | Communication logs, training materials, acknowledgment records |
| CC2.3 | Communicate with external parties | Customer notifications, vendor communications, incident notices | External communication policy, notification records, status pages |
### CC3 — Risk Assessment
Identifies and assesses risks that may prevent achievement of objectives.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC3.1 | Specify objectives to identify and assess risks | Risk management framework, risk appetite statement | Risk methodology document, risk appetite approval |
| CC3.2 | Identify and analyze risks | Risk assessments, threat modeling, vulnerability analysis | Risk register, threat models, assessment reports |
| CC3.3 | Consider potential for fraud | Fraud risk assessment, segregation of duties | Fraud risk report, SoD matrix, anti-fraud controls |
| CC3.4 | Identify and assess changes impacting internal control | Change impact analysis, environmental scanning | Change assessments, business impact analyses |
### CC4 — Monitoring Activities
Ongoing evaluations to verify internal controls are present and functioning.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC4.1 | Select and perform ongoing and separate evaluations | Continuous monitoring, internal audits, control testing | Monitoring dashboards, audit reports, testing results |
| CC4.2 | Evaluate and communicate deficiencies | Deficiency tracking, remediation management, management reporting | Deficiency logs, remediation plans, management reports |
### CC5 — Control Activities
Policies and procedures that ensure management directives are carried out.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC5.1 | Select and develop control activities that mitigate risks | Risk-based control selection, control design documentation | Control matrix, risk treatment plans |
| CC5.2 | Select and develop technology controls | IT general controls, automated controls, technology governance | ITGC documentation, technology policies, automated control configs |
| CC5.3 | Deploy control activities through policies and procedures | Policy library, procedure documentation, acknowledgment tracking | Policy repository, version history, signed acknowledgments |
### CC6 — Logical and Physical Access Controls
Restrict logical and physical access to information assets.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC6.1 | Logical access security over protected assets | IAM platform, SSO, MFA enforcement | IAM configuration, SSO settings, MFA enrollment reports |
| CC6.2 | Access provisioning based on role and need | Role-based access, provisioning workflows, approval chains | Provisioning tickets, role matrix, approval records |
| CC6.3 | Access removal on termination or role change | Offboarding checklists, automated deprovisioning | Deprovisioning tickets, termination checklists, access removal logs |
| CC6.4 | Periodic access reviews | Quarterly user access reviews, entitlement validation | Access review reports, entitlement listings, sign-off records |
| CC6.5 | Physical access restrictions | Badge systems, visitor management, secure areas | Badge access logs, visitor logs, physical access policies |
| CC6.6 | Encryption of data in transit and at rest | TLS enforcement, disk encryption, key management | TLS configuration, encryption settings, key rotation records |
| CC6.7 | Data transmission and movement restrictions | DLP tools, network segmentation, firewall rules | DLP configuration, network diagrams, firewall rule sets |
| CC6.8 | Prevention/detection of unauthorized software | Endpoint protection, application whitelisting, malware scanning | EDR configuration, whitelist policies, scan reports |
### CC7 — System Operations
Detect and mitigate security events and anomalies.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC7.1 | Vulnerability identification and management | Vulnerability scanning, patch management, remediation SLAs | Scan reports, patch records, SLA compliance metrics |
| CC7.2 | Monitor for anomalies and security events | SIEM, IDS/IPS, behavioral analytics | SIEM dashboards, alert rules, detection logs |
| CC7.3 | Security event evaluation and classification | Incident classification criteria, triage procedures | Classification matrix, triage logs, escalation records |
| CC7.4 | Incident response execution | Incident response plan, response team, communication procedures | IR plan, incident tickets, communication records |
| CC7.5 | Incident recovery and lessons learned | Recovery procedures, post-incident reviews, plan updates | Recovery records, postmortem reports, plan revision history |
### CC8 — Change Management
Authorize, design, develop, test, and implement changes to infrastructure and software.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC8.1 | Change authorization, testing, and approval | Change management process, approval workflows, testing requirements | Change tickets, approval records, test results, deployment logs |
### CC9 — Risk Mitigation
Manage risks associated with business disruption, vendors, and partners.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| CC9.1 | Vendor and business partner risk management | Vendor assessment program, third-party risk management | Vendor risk assessments, vendor register, vendor SOC reports |
| CC9.2 | Risk mitigation through transfer mechanisms | Cyber insurance, contractual protections | Insurance certificates, contract provisions |
---
## 2. Availability (A1) — Optional
Addresses system uptime, performance, and recoverability commitments.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| A1.1 | Capacity and performance management | Auto-scaling, resource monitoring, capacity planning | Capacity dashboards, scaling policies, resource utilization trends |
| A1.2 | Recovery operations | Backup procedures, DR planning, BCP documentation | Backup logs, DR plan, BCP documentation, recovery procedures |
| A1.3 | Recovery testing | DR drills, failover tests, RTO/RPO validation | DR test reports, failover results, RTO/RPO measurements |
### When to Include Availability
- Your customers depend on your service uptime
- You have SLAs with financial penalties for downtime
- Your service is in the critical path of customer operations
- You provide infrastructure or platform services
### Key Metrics
| Metric | Description | Typical Target |
|--------|-------------|----------------|
| RTO | Recovery Time Objective — max acceptable downtime | 1-4 hours |
| RPO | Recovery Point Objective — max acceptable data loss | 1-24 hours |
| SLA | Service Level Agreement — uptime commitment | 99.9%-99.99% |
| MTTR | Mean Time to Recovery — average recovery duration | < 1 hour |
---
## 3. Confidentiality (C1) — Optional
Protects information designated as confidential throughout its lifecycle.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| C1.1 | Identification of confidential information | Data classification scheme, confidential data inventory | Classification policy, data inventory, labeling standards |
| C1.2 | Protection of confidential information | Encryption, access restrictions, DLP, secure transmission | Encryption configs, ACLs, DLP rules, secure transfer logs |
| C1.3 | Disposal of confidential information | Secure deletion, media sanitization, retention enforcement | Disposal procedures, sanitization certificates, deletion logs |
### When to Include Confidentiality
- You handle trade secrets or proprietary business information
- Contracts require confidentiality assurance
- You process data classified above "public" in your classification scheme
- Customers share confidential data for processing
### Data Classification Levels
| Level | Description | Handling Requirements |
|-------|-------------|----------------------|
| Public | No restrictions | No special controls |
| Internal | Business use only | Access controls, basic encryption |
| Confidential | Restricted access | Strong encryption, DLP, access reviews |
| Highly Confidential | Strictly controlled | Strongest encryption, MFA, audit logging, need-to-know |
---
## 4. Processing Integrity (PI1) — Optional
Ensures system processing is complete, valid, accurate, timely, and authorized.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| PI1.1 | Processing accuracy | Input validation, data integrity checks, output verification | Validation rules, integrity check logs, reconciliation reports |
| PI1.2 | Processing completeness | Transaction monitoring, completeness checks, reconciliation | Transaction logs, batch processing reports, reconciliation records |
| PI1.3 | Processing timeliness | SLA monitoring, batch job scheduling, processing alerts | SLA reports, job schedules, processing time metrics |
| PI1.4 | Processing authorization | Authorization controls, segregation of duties, approval workflows | Authorization matrix, SoD analysis, approval records |
### When to Include Processing Integrity
- You perform financial calculations or transactions
- Data accuracy is critical to customer operations
- You provide analytics or reporting that drives business decisions
- Regulatory requirements demand processing accuracy (e.g., healthcare, finance)
### Validation Checkpoints
| Stage | Validation | Method |
|-------|-----------|--------|
| Input | Data format, range, completeness | Automated validation rules |
| Processing | Calculation accuracy, transformation correctness | Unit tests, reconciliation |
| Output | Report accuracy, data completeness | Cross-checks, manual review, checksums |
| Transfer | Transmission integrity, completeness | Hash verification, acknowledgment protocols |
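The input and transfer checkpoints above can be illustrated with a small validator. A sketch under assumed conventions: the required fields, range bounds, and currency set are hypothetical examples of validation rules, not a prescribed schema.

```python
# Sketch of two PI1 checkpoints: automated input validation (format,
# range, completeness) and hash verification for transfers.
# Field names and bounds are hypothetical.
import hashlib

REQUIRED = {"invoice_id", "amount", "currency"}

def validate_input(record: dict) -> list:
    """Return validation errors for one input record; empty means valid."""
    errors = []
    missing = REQUIRED - record.keys()                 # completeness check
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    amount = record.get("amount")
    if amount is not None and not (0 < amount < 1_000_000):  # range check
        errors.append("amount out of range")
    if record.get("currency") not in {"USD", "EUR"}:         # format check
        errors.append("unknown currency")
    return errors

def transfer_hash(payload: bytes) -> str:
    """Transfer checkpoint: sender and receiver compare this digest."""
    return hashlib.sha256(payload).hexdigest()

print(validate_input({"invoice_id": "INV-1", "amount": 250, "currency": "USD"}))
```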
---
## 5. Privacy (P1-P8) — Optional
Governs the collection, use, retention, disclosure, and disposal of personal information. Closely aligns with GDPR, CCPA, and other privacy regulations.
| Criterion | Objective | Typical Controls | Evidence |
|-----------|-----------|-----------------|----------|
| P1.1 | Notice — inform data subjects about data practices | Privacy policy, collection notices, purpose statements | Published privacy policy, collection banners, purpose documentation |
| P2.1 | Choice and consent — provide opt-in/opt-out mechanisms | Consent management, preference centers, granular consent | Consent records, preference logs, opt-out mechanisms |
| P3.1 | Collection — collect only necessary personal information | Data minimization, lawful basis documentation, purpose specification | Collection audits, lawful basis records, data flow diagrams |
| P4.1 | Use, retention, and disposal — limit use and enforce retention | Purpose limitation, retention schedules, automated deletion | Use restriction controls, retention policies, deletion logs |
| P4.2 | Disposal — secure disposal when no longer needed | Secure deletion, media sanitization | Disposal certificates, sanitization records |
| P5.1 | Access — provide data subjects access to their data | DSAR processing, data portability, access portals | DSAR logs, response timelines, export capabilities |
| P5.2 | Correction — allow data subjects to correct their data | Correction request processing, data update mechanisms | Correction logs, update records |
| P6.1 | Disclosure — control third-party data sharing | Data sharing agreements, third-party inventory, DPAs | DPAs, sharing agreements, third-party register |
| P6.2 | Notification — notify of breaches affecting personal data | Breach notification procedures, regulatory reporting | Breach response plan, notification records, reporting logs |
| P7.1 | Quality — maintain accurate personal information | Data quality checks, accuracy verification, correction mechanisms | Quality reports, accuracy audits, correction records |
| P8.1 | Monitoring — monitor privacy program effectiveness | Privacy audits, compliance reviews, complaint tracking | Audit reports, compliance dashboards, complaint logs |
### When to Include Privacy
- You process personal information (PII) of end users or customers
- You operate in jurisdictions with privacy regulations (GDPR, CCPA, LGPD)
- Customers request privacy assurance as part of vendor assessment
- Your service involves health, financial, or other sensitive personal data
### Privacy Criteria Overlap with GDPR
| SOC 2 Privacy | GDPR Article | Alignment |
|---------------|-------------|-----------|
| P1 (Notice) | Art. 13-14 | Direct — transparency requirements |
| P2 (Consent) | Art. 6-7 | Direct — lawful basis and consent |
| P3 (Collection) | Art. 5(1)(b-c) | Direct — purpose limitation, minimization |
| P4 (Retention) | Art. 5(1)(e) | Direct — storage limitation |
| P5 (Access) | Art. 15-16 | Direct — data subject rights |
| P6 (Disclosure) | Art. 33-34 | Direct — breach notification |
| P7 (Quality) | Art. 5(1)(d) | Direct — accuracy principle |
| P8 (Monitoring) | Art. 5(2) | Direct — accountability principle |
---
## TSC Selection Guide
| Question | If Yes, Include |
|----------|----------------|
| Do you store/process customer data? | Security (required) |
| Do customers depend on your uptime? | Availability |
| Do you handle confidential business data? | Confidentiality |
| Is data accuracy critical to your service? | Processing Integrity |
| Do you process personal information? | Privacy |
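The selection questions above reduce to a simple rule set. A sketch with hypothetical shorthand keys for the five questions:

```python
# Sketch: map the TSC selection questions to a criteria list.
# Dictionary keys are hypothetical shorthand for the questions above.
def select_tsc(answers: dict) -> list:
    selection = ["Security"]  # always required
    if answers.get("uptime_dependent"):
        selection.append("Availability")
    if answers.get("confidential_data"):
        selection.append("Confidentiality")
    if answers.get("accuracy_critical"):
        selection.append("Processing Integrity")
    if answers.get("personal_data"):
        selection.append("Privacy")
    return selection

# A typical SaaS platform from the combinations table below:
print(select_tsc({"uptime_dependent": True}))
```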
### Common Combinations
| Company Type | Typical TSC Selection |
|-------------|----------------------|
| SaaS platform | Security + Availability |
| Data analytics | Security + Processing Integrity + Confidentiality |
| Healthcare SaaS | Security + Availability + Privacy + Confidentiality |
| Financial services | Security + Availability + Processing Integrity + Confidentiality |
| Infrastructure/PaaS | Security + Availability |
| HR/Payroll SaaS | Security + Availability + Privacy |
---
## Mapping to Other Frameworks
| SOC 2 Criteria | ISO 27001 | NIST CSF | HIPAA | PCI DSS |
|---------------|-----------|----------|-------|---------|
| CC1 (Control Environment) | A.5 (Policies) | ID.GV | Administrative Safeguards | Req 12 |
| CC2 (Communication) | A.5.1 (Policies) | ID.GV | Administrative Safeguards | Req 12 |
| CC3 (Risk Assessment) | A.8.2 (Risk) | ID.RA | Risk Analysis | Req 12.2 |
| CC4 (Monitoring) | A.8.34 (Monitoring) | DE.CM | Audit Controls | Req 10 |
| CC5 (Control Activities) | A.5-A.8 | PR | All Safeguards | Multiple |
| CC6 (Logical/Physical Access) | A.5.15, A.7 | PR.AC | Access Controls | Req 7-9 |
| CC7 (System Operations) | A.8.8, A.8.15 | DE, RS | Technical Safeguards | Req 5-6, 11 |
| CC8 (Change Management) | A.8.32 | PR.IP | Change Management | Req 6.4 |
| CC9 (Risk Mitigation) | A.5.19-5.22 | ID.SC | Business Associate Agreements | Req 12.8 |
| A1 (Availability) | A.8.13-14 | PR.IP | Contingency Plan | Req 12.10 |
| C1 (Confidentiality) | A.5.13-14, A.8.10-12 | PR.DS | Access Controls | Req 3-4 |
| PI1 (Processing Integrity) | A.8.24-25 | PR.DS | Integrity Controls | Req 6.5 |
| P1-P8 (Privacy) | A.5.34 (Privacy) | PR.PT | Privacy Rule | N/A |
# SOC 2 Type I vs Type II Comparison
Detailed guide for understanding the differences between SOC 2 Type I and Type II reports, selecting the right starting point, planning timelines, and managing the upgrade path.
---
## Overview
| Dimension | Type I | Type II |
|-----------|--------|---------|
| **Full Name** | SOC 2 Type I Report | SOC 2 Type II Report |
| **What It Tests** | Design of controls at a specific point in time | Design AND operating effectiveness over a period |
| **Observation Period** | None — single date | 3-12 months (6 months typical) |
| **Auditor Opinion** | "Controls are suitably designed as of [date]" | "Controls are suitably designed and operating effectively for the period [start] to [end]" |
| **Evidence Volume** | Lower — policies, configs, descriptions | Higher — ongoing logs, tickets, samples across the period |
| **Timeline to Complete** | 1-3 months (prep + audit) | 6-15 months (prep + observation + audit) |
| **Audit Fee Range** | $20K-$50K | $30K-$100K+ |
| **Internal Cost** | $50K-$150K (implementation + audit) | $100K-$300K+ (implementation + monitoring + audit) |
| **Market Perception** | "They have controls" | "Their controls actually work" |
| **Validity** | Snapshot that goes stale quickly | Covers a defined period; renewed annually |
---
## When to Start with Type I
Type I is the right starting point when:
1. **First SOC 2 engagement** — You need to validate control design before investing in a full observation period
2. **Rapid market need** — A customer or deal requires SOC 2 assurance within 3 months
3. **Building the program** — Your compliance program is new and you want a structured assessment
4. **Budget constraints** — Type I costs significantly less and helps justify future Type II investment
5. **Control maturity is low** — You are still implementing controls and need a milestone before Type II
### Type I Limitations
- **Short shelf life** — Enterprise customers often ask "When is your Type II coming?"
- **No operating proof** — Does not demonstrate that controls work consistently
- **Annual deals may require Type II** — Many procurement teams mandate Type II for contracts above a threshold
- **Repeated cost** — If you plan to go Type II anyway, Type I is an additional expense
---
## When to Go Directly to Type II
Skip Type I and go directly to Type II when:
1. **Controls are already mature** — You have been operating security controls for 6+ months
2. **Customer requirements** — Your target customers explicitly require Type II
3. **Competitive pressure** — Competitors already have Type II reports
4. **Existing framework** — You already have ISO 27001 or similar, and controls are mapped
5. **Budget allows it** — You can absorb the longer timeline and higher cost
---
## Timeline Comparison
### Type I Timeline (Typical: 3-4 Months)
```
Month 1-2: Gap Assessment + Remediation
├── Assess current controls against TSC
├── Implement missing controls
├── Document policies and procedures
└── Assign control owners
Month 3: Audit Execution
├── Auditor reviews control descriptions
├── Auditor inspects configurations and policies
├── Management provides representation letter
└── Report issued
```
### Type II Timeline (Typical: 9-15 Months)
```
Month 1-3: Gap Assessment + Remediation
├── Assess current controls against TSC
├── Implement missing controls
├── Document policies and procedures
├── Set up evidence collection processes
└── Assign control owners
Month 4-9: Observation Period (6 months minimum)
├── Controls operate normally
├── Evidence is collected continuously
├── Periodic internal reviews
├── Address any control failures
└── Maintain documentation
Month 10-12: Audit Execution
├── Auditor tests operating effectiveness
├── Auditor samples evidence across the period
├── Exceptions documented and evaluated
├── Management provides representation letter
└── Report issued
```
### Accelerated Type II (Bridge from Type I)
```
Month 1-3: Type I Audit
├── Complete Type I assessment
├── Receive Type I report
└── Begin observation period immediately
Month 4-9: Observation Period
├── Controls operate with evidence collection
├── Address any Type I findings
└── Prepare for Type II testing
Month 10-12: Type II Audit
├── Auditor tests operating effectiveness
└── Type II report issued
```
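Given a kickoff date, the milestone dates for this accelerated path follow directly from the month offsets above. A sketch using plain month arithmetic (first of the month, no calendar library); the offsets mirror the timeline, but real schedules shift with auditor availability.

```python
# Sketch: project milestone dates for the accelerated Type I -> Type II
# path. Offsets come from the timeline above; dates land on the 1st.
from datetime import date

MILESTONES = [  # (offset in months from kickoff, milestone)
    (0, "Type I audit begins"),
    (3, "Type I report issued; observation period starts"),
    (9, "Observation period ends"),
    (12, "Type II report issued"),
]

def add_months(d: date, months: int) -> date:
    """Advance d by whole months, clamped to the 1st of the month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

def schedule(kickoff: date) -> list:
    return [(add_months(kickoff, m), label) for m, label in MILESTONES]

for when, label in schedule(date(2025, 1, 1)):
    print(when.isoformat(), label)
```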
---
## Cost Breakdown
### Type I Costs
| Cost Category | Range | Notes |
|--------------|-------|-------|
| Readiness assessment | $5K-$15K | Optional, but recommended for first-timers |
| Gap remediation | $10K-$50K | Depends on current maturity |
| Audit firm fees | $20K-$50K | Varies by scope, firm, and company size |
| Internal labor | $20K-$60K | Staff time for preparation and audit support |
| Tooling | $0-$20K | Compliance platforms, evidence management |
| **Total** | **$55K-$195K** | |
### Type II Costs
| Cost Category | Range | Notes |
|--------------|-------|-------|
| Readiness assessment | $5K-$15K | If not already done for Type I |
| Gap remediation | $15K-$75K | More thorough than Type I |
| Observation period monitoring | $10K-$30K | Internal effort for evidence collection |
| Audit firm fees | $30K-$100K+ | Larger scope, more testing |
| Internal labor | $40K-$120K | Ongoing effort across the observation period |
| Tooling | $5K-$40K | Compliance platforms, automation tools |
| **Total** | **$105K-$380K** | |
### Annual Renewal Costs (Type II)
| Cost Category | Range |
|--------------|-------|
| Audit firm fees | $25K-$80K |
| Internal labor | $30K-$80K |
| Tooling renewal | $5K-$30K |
| Remediation (if findings) | $5K-$30K |
| **Total** | **$65K-$220K** |
---
## Upgrade Path: Type I to Type II
### Step 1: Receive Type I Report
Review the Type I report for:
- Any exceptions or findings
- Auditor recommendations
- Control gaps identified during testing
- Areas where design could be strengthened
### Step 2: Address Type I Findings
- Remediate any exceptions before starting the observation period
- Strengthen control design based on auditor feedback
- Document all changes and their effective dates
### Step 3: Begin Observation Period
- Start the clock on your observation period (minimum 3 months; 6 recommended)
- Implement evidence collection automation
- Assign control owners and review cadences
- Document any control changes during the period
### Step 4: Maintain During Observation
- Conduct monthly internal control reviews
- Track and remediate any control failures
- Keep evidence organized and timestamped
- Prepare for auditor walkthroughs
### Step 5: Type II Audit
- Auditor tests a sample of evidence across the observation period
- Auditor evaluates operating effectiveness
- Exceptions are documented with management responses
- Type II report issued
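The milestones in the steps above reduce to simple date arithmetic. A minimal sketch (the 6-month default and monthly review cadence follow the recommendations above; the function and dictionary keys are illustrative, and a month is approximated as 30 days):

```python
from datetime import date, timedelta

def observation_milestones(start: date, months: int = 6) -> dict:
    """Key dates for a Type II observation period starting at `start`.

    Approximates a month as 30 days for scheduling purposes.
    """
    end = start + timedelta(days=30 * months)
    return {
        "observation_start": start,
        "observation_end": end,
        # Monthly internal control reviews during the period (Step 4)
        "internal_reviews": [start + timedelta(days=30 * m)
                             for m in range(1, months + 1)],
        # Practical shelf life: ~12 months from period end
        "report_stale_after": end + timedelta(days=365),
    }

m = observation_milestones(date(2025, 1, 1))
print(m["observation_end"])  # 2025-06-30
```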
---
## What Auditors Test Differently
### Type I Testing
| Test | What the Auditor Does |
|------|----------------------|
| Inquiry | Asks control owners to describe how controls work |
| Inspection | Reviews policies, configurations, and documentation |
| Observation | May watch a control being executed (single instance) |
### Type II Additional Testing
| Test | What the Auditor Does |
|------|----------------------|
| Re-performance | Re-executes the control to verify it works correctly |
| Sampling | Selects samples from the full observation period |
| Walkthroughs | Traces a transaction end-to-end through all controls |
| Exception testing | Investigates any deviations found in samples |
| Consistency checks | Verifies controls operated the same way throughout the period |
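For Type II, sampling means the auditor pulls items from across the whole observation period rather than inspecting a single point in time. A rough sketch of frequency-based sample selection (the sample sizes here are illustrative placeholders, not AICPA guidance — actual sizes come from the audit firm's methodology):

```python
import random

# Illustrative sample sizes by control frequency.
SAMPLE_SIZES = {"Continuous": 25, "Daily": 25, "Weekly": 10,
                "Monthly": 3, "Quarterly": 2, "Annual": 1}

def select_samples(evidence_items: list, frequency: str, seed: int = 0) -> list:
    """Pick a frequency-based random sample spanning the full period."""
    n = min(SAMPLE_SIZES.get(frequency, 5), len(evidence_items))
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return sorted(rng.sample(sorted(evidence_items), n))

# e.g. one backup log per day over one month of the observation period
logs = [f"backup-2025-01-{d:02d}" for d in range(1, 32)]
print(select_samples(logs, "Daily"))
```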
---
## Report Distribution and Use
### Who Receives the Report
SOC 2 reports are **restricted-use documents** under AICPA standards:
- Your organization (the service organization)
- Your auditor
- User entities (customers) and their auditors
- Prospective customers under NDA
### Report Shelf Life
| Report Type | Practical Validity | Market Expectation |
|-------------|-------------------|-------------------|
| Type I | 6-12 months | Replace with Type II within 12 months |
| Type II | 12 months from period end | Renew annually; gap > 3 months raises concerns |
### Bridge Letters
If there is a gap between your report period end and a customer's request date, you may issue a **bridge letter** (also called a gap letter) stating:
- No material changes to the system since the report period
- No known control failures since the report period
- Management's assertion that controls continue to operate effectively
---
## Decision Framework
```
START
├─ Do you have existing controls operating for 6+ months?
│ ├─ YES → Do customers require Type II specifically?
│ │ ├─ YES → Go directly to Type II
│ │ └─ NO → Type I first (lower risk, validates design)
│ └─ NO → Type I first (build foundation)
├─ Is there an urgent deal requiring SOC 2 in < 4 months?
│ ├─ YES → Type I (fastest path to a report)
│ └─ NO → Evaluate maturity and go Type I or Type II
└─ Budget available for full Type II program?
├─ YES → Consider direct Type II if controls are mature
└─ NO → Type I first, budget Type II for next fiscal year
```
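The decision tree above can be encoded directly. A minimal sketch (the input flags and recommendation strings paraphrase the tree; this is not an official decision procedure):

```python
def recommend_report_type(
    controls_mature_6mo: bool,
    customers_require_type2: bool,
    urgent_deal_under_4mo: bool,
    budget_for_type2: bool,
) -> str:
    """Encode the decision framework above. Returns 'type1' or 'type2'."""
    if urgent_deal_under_4mo:
        return "type1"  # fastest path to a report
    if controls_mature_6mo and customers_require_type2 and budget_for_type2:
        return "type2"  # controls are mature: go directly to Type II
    return "type1"      # build foundation / validate design first

print(recommend_report_type(True, True, False, True))   # type2
print(recommend_report_type(False, True, True, True))   # type1
```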
---
## Common Mistakes in the Upgrade Path
| Mistake | Consequence | Prevention |
|---------|------------|-----------|
| Starting observation before fixing Type I findings | Findings carry into Type II as exceptions | Remediate all Type I findings first |
| Choosing a 3-month observation period | Less convincing to customers; some reject < 6 months | Default to 6-month minimum observation |
| Changing auditors between Type I and Type II | New auditor must re-learn your environment; potential scope changes | Use the same firm for continuity |
| Not collecting evidence from day one of observation | Missing evidence for early-period controls | Start automated collection before observation begins |
| Treating the observation period as passive | Control failures go undetected until audit | Conduct monthly internal reviews during observation |
| Letting the Type I report expire before Type II is ready | Gap in coverage erodes customer confidence | Plan Type II timeline to overlap with Type I validity |

@@ -0,0 +1,679 @@
#!/usr/bin/env python3
"""
SOC 2 Control Matrix Builder
Generates a SOC 2 control matrix from selected Trust Service Criteria categories.
Outputs in markdown, JSON, or CSV format.
Usage:
python control_matrix_builder.py --categories security --format md
python control_matrix_builder.py --categories security,availability --format json
python control_matrix_builder.py --categories security,availability,confidentiality,processing-integrity,privacy --format csv
"""
import argparse
import csv
import io
import json
import sys
from typing import Dict, List, Any
# Trust Service Criteria control definitions
TSC_CONTROLS: Dict[str, Dict[str, Any]] = {
"security": {
"name": "Security (Common Criteria)",
"controls": [
{
"id": "SEC-001",
"tsc": "CC1.1",
"description": "Management demonstrates commitment to integrity and ethical values",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Code of conduct, ethics policy, signed acknowledgments",
},
{
"id": "SEC-002",
"tsc": "CC1.2",
"description": "Board of directors demonstrates independence and exercises oversight",
"type": "Preventive",
"frequency": "Quarterly",
"evidence": "Board meeting minutes, oversight committee charters",
},
{
"id": "SEC-003",
"tsc": "CC1.3",
"description": "Management establishes organizational structure, reporting lines, and authorities",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Org charts, RACI matrices, role descriptions",
},
{
"id": "SEC-004",
"tsc": "CC1.4",
"description": "Organization demonstrates commitment to attract, develop, and retain competent individuals",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Training records, competency assessments, HR policies",
},
{
"id": "SEC-005",
"tsc": "CC1.5",
"description": "Organization holds individuals accountable for internal control responsibilities",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Performance reviews, disciplinary policy, accountability matrix",
},
{
"id": "SEC-006",
"tsc": "CC2.1",
"description": "Organization obtains and generates relevant quality information to support internal control",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Information classification policy, data flow diagrams",
},
{
"id": "SEC-007",
"tsc": "CC2.2",
"description": "Organization internally communicates objectives and responsibilities for internal control",
"type": "Preventive",
"frequency": "Quarterly",
"evidence": "Internal communications, policy distribution records, training materials",
},
{
"id": "SEC-008",
"tsc": "CC2.3",
"description": "Organization communicates with external parties regarding matters affecting internal control",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Customer notifications, external communication policy, incident notices",
},
{
"id": "SEC-009",
"tsc": "CC3.1",
"description": "Organization specifies objectives to identify and assess risks",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Risk assessment methodology, risk register, risk appetite statement",
},
{
"id": "SEC-010",
"tsc": "CC3.2",
"description": "Organization identifies and analyzes risks to achievement of objectives",
"type": "Detective",
"frequency": "Annual",
"evidence": "Risk assessment report, threat modeling documentation",
},
{
"id": "SEC-011",
"tsc": "CC3.3",
"description": "Organization considers potential for fraud in assessing risks",
"type": "Detective",
"frequency": "Annual",
"evidence": "Fraud risk assessment, anti-fraud controls documentation",
},
{
"id": "SEC-012",
"tsc": "CC3.4",
"description": "Organization identifies and assesses changes that could impact internal control",
"type": "Detective",
"frequency": "Quarterly",
"evidence": "Change impact assessments, environmental scan reports",
},
{
"id": "SEC-013",
"tsc": "CC4.1",
"description": "Organization selects and performs ongoing and separate monitoring evaluations",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Monitoring dashboards, automated alert configurations, review logs",
},
{
"id": "SEC-014",
"tsc": "CC4.2",
"description": "Organization evaluates and communicates internal control deficiencies",
"type": "Corrective",
"frequency": "Quarterly",
"evidence": "Deficiency tracking log, management reports, remediation plans",
},
{
"id": "SEC-015",
"tsc": "CC5.1",
"description": "Organization selects and develops control activities that mitigate risks",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Control matrix, risk treatment plans, control design documentation",
},
{
"id": "SEC-016",
"tsc": "CC5.2",
"description": "Organization selects and develops general control activities over technology",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "IT general controls documentation, technology policies",
},
{
"id": "SEC-017",
"tsc": "CC5.3",
"description": "Organization deploys control activities through policies and procedures",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Policy library, procedure documents, acknowledgment records",
},
{
"id": "SEC-018",
"tsc": "CC6.1",
"description": "Logical access security controls over protected information assets",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Access control policy, IAM configuration, SSO/MFA settings",
},
{
"id": "SEC-019",
"tsc": "CC6.2",
"description": "User access provisioning based on role and business need",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Provisioning tickets, role matrix, access request approvals",
},
{
"id": "SEC-020",
"tsc": "CC6.3",
"description": "User access removal upon termination or role change",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Deprovisioning tickets, termination checklists, access removal logs",
},
{
"id": "SEC-021",
"tsc": "CC6.4",
"description": "Periodic access reviews to validate appropriateness",
"type": "Detective",
"frequency": "Quarterly",
"evidence": "Access review reports, user entitlement listings, review sign-offs",
},
{
"id": "SEC-022",
"tsc": "CC6.5",
"description": "Physical access restrictions to facilities and protected assets",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Badge access logs, visitor logs, physical security configuration",
},
{
"id": "SEC-023",
"tsc": "CC6.6",
"description": "Encryption of data in transit and at rest",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "TLS configuration, encryption settings, certificate inventory",
},
{
"id": "SEC-024",
"tsc": "CC6.7",
"description": "Restrictions on data transmission and movement",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "DLP configuration, network segmentation, firewall rules",
},
{
"id": "SEC-025",
"tsc": "CC6.8",
"description": "Controls to prevent or detect unauthorized software",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Endpoint protection config, software whitelist, malware scan reports",
},
{
"id": "SEC-026",
"tsc": "CC7.1",
"description": "Vulnerability identification and management",
"type": "Detective",
"frequency": "Weekly",
"evidence": "Vulnerability scan reports, remediation SLAs, patch records",
},
{
"id": "SEC-027",
"tsc": "CC7.2",
"description": "Monitoring for anomalies and security events",
"type": "Detective",
"frequency": "Continuous",
"evidence": "SIEM configuration, alert rules, monitoring dashboards",
},
{
"id": "SEC-028",
"tsc": "CC7.3",
"description": "Security event evaluation and incident classification",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Incident classification criteria, triage procedures, event logs",
},
{
"id": "SEC-029",
"tsc": "CC7.4",
"description": "Incident response execution and recovery",
"type": "Corrective",
"frequency": "Continuous",
"evidence": "Incident response plan, incident tickets, postmortem reports",
},
{
"id": "SEC-030",
"tsc": "CC7.5",
"description": "Incident recovery and lessons learned",
"type": "Corrective",
"frequency": "Continuous",
"evidence": "Recovery records, lessons learned documentation, plan updates",
},
{
"id": "SEC-031",
"tsc": "CC8.1",
"description": "Change management authorization and testing",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Change tickets, approval records, test results, deployment logs",
},
{
"id": "SEC-032",
"tsc": "CC9.1",
"description": "Vendor and business partner risk management",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Vendor risk assessments, vendor register, SOC 2 reports from vendors",
},
{
"id": "SEC-033",
"tsc": "CC9.2",
"description": "Risk mitigation through insurance and other transfer mechanisms",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Insurance policies, risk transfer documentation",
},
],
},
"availability": {
"name": "Availability",
"controls": [
{
"id": "AVL-001",
"tsc": "A1.1",
"description": "Capacity management and infrastructure scaling",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Capacity monitoring dashboards, scaling policies, resource utilization reports",
},
{
"id": "AVL-002",
"tsc": "A1.1",
"description": "System performance monitoring and SLA tracking",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Uptime reports, SLA dashboards, performance metrics",
},
{
"id": "AVL-003",
"tsc": "A1.2",
"description": "Data backup procedures and verification",
"type": "Preventive",
"frequency": "Daily",
"evidence": "Backup logs, backup success/failure reports, retention configuration",
},
{
"id": "AVL-004",
"tsc": "A1.2",
"description": "Disaster recovery planning and documentation",
"type": "Preventive",
"frequency": "Annual",
"evidence": "DR plan, BCP documentation, recovery procedures",
},
{
"id": "AVL-005",
"tsc": "A1.2",
"description": "Business continuity management and communication",
"type": "Preventive",
"frequency": "Annual",
"evidence": "BCP plan, communication tree, emergency contacts",
},
{
"id": "AVL-006",
"tsc": "A1.3",
"description": "Disaster recovery testing and validation",
"type": "Detective",
"frequency": "Annual",
"evidence": "DR test results, RTO/RPO measurements, test reports",
},
{
"id": "AVL-007",
"tsc": "A1.3",
"description": "Failover testing and redundancy validation",
"type": "Detective",
"frequency": "Quarterly",
"evidence": "Failover test records, redundancy configuration, test results",
},
],
},
"confidentiality": {
"name": "Confidentiality",
"controls": [
{
"id": "CON-001",
"tsc": "C1.1",
"description": "Data classification and labeling policy",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Data classification policy, labeling standards, data inventory",
},
{
"id": "CON-002",
"tsc": "C1.1",
"description": "Confidential data inventory and mapping",
"type": "Detective",
"frequency": "Quarterly",
"evidence": "Data inventory, data flow diagrams, system classification",
},
{
"id": "CON-003",
"tsc": "C1.2",
"description": "Encryption of confidential data at rest and in transit",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Encryption configuration, TLS settings, key management procedures",
},
{
"id": "CON-004",
"tsc": "C1.2",
"description": "Access restrictions to confidential information",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Access control lists, need-to-know policy, access review records",
},
{
"id": "CON-005",
"tsc": "C1.2",
"description": "Data loss prevention controls",
"type": "Detective",
"frequency": "Continuous",
"evidence": "DLP configuration, DLP alerts/incidents, exception approvals",
},
{
"id": "CON-006",
"tsc": "C1.3",
"description": "Secure data disposal and media sanitization",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Disposal procedures, sanitization certificates, destruction logs",
},
{
"id": "CON-007",
"tsc": "C1.3",
"description": "Data retention enforcement and schedule compliance",
"type": "Preventive",
"frequency": "Quarterly",
"evidence": "Retention schedule, deletion logs, retention compliance reports",
},
],
},
"processing-integrity": {
"name": "Processing Integrity",
"controls": [
{
"id": "PRI-001",
"tsc": "PI1.1",
"description": "Input validation and data accuracy controls",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Validation rules, input sanitization config, error handling logs",
},
{
"id": "PRI-002",
"tsc": "PI1.1",
"description": "Output verification and data integrity checks",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Reconciliation reports, checksum verification, output validation logs",
},
{
"id": "PRI-003",
"tsc": "PI1.2",
"description": "Transaction completeness monitoring",
"type": "Detective",
"frequency": "Continuous",
"evidence": "Transaction logs, reconciliation reports, completeness dashboards",
},
{
"id": "PRI-004",
"tsc": "PI1.2",
"description": "Error handling and exception management",
"type": "Corrective",
"frequency": "Continuous",
"evidence": "Error logs, exception handling procedures, retry mechanisms",
},
{
"id": "PRI-005",
"tsc": "PI1.3",
"description": "Processing timeliness and SLA monitoring",
"type": "Detective",
"frequency": "Continuous",
"evidence": "SLA reports, processing time metrics, batch job monitoring",
},
{
"id": "PRI-006",
"tsc": "PI1.4",
"description": "Processing authorization and segregation of duties",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Authorization matrix, SoD controls, approval workflows",
},
],
},
"privacy": {
"name": "Privacy",
"controls": [
{
"id": "PRV-001",
"tsc": "P1.1",
"description": "Privacy notice publication and data collection transparency",
"type": "Preventive",
"frequency": "Annual",
"evidence": "Privacy policy, data collection notices, purpose statements",
},
{
"id": "PRV-002",
"tsc": "P2.1",
"description": "Consent management and preference tracking",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Consent records, opt-in/opt-out mechanisms, preference center",
},
{
"id": "PRV-003",
"tsc": "P3.1",
"description": "Data minimization and lawful collection",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Data collection audit, purpose limitation documentation, lawful basis records",
},
{
"id": "PRV-004",
"tsc": "P4.1",
"description": "Purpose limitation and use restrictions",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Data use policy, purpose limitation controls, access restrictions",
},
{
"id": "PRV-005",
"tsc": "P4.2",
"description": "Data retention schedules and disposal procedures",
"type": "Preventive",
"frequency": "Quarterly",
"evidence": "Retention schedule, deletion logs, disposal certificates",
},
{
"id": "PRV-006",
"tsc": "P5.1",
"description": "Data subject access request (DSAR) processing",
"type": "Corrective",
"frequency": "Continuous",
"evidence": "DSAR log, response records, processing timelines",
},
{
"id": "PRV-007",
"tsc": "P5.2",
"description": "Data correction and rectification rights",
"type": "Corrective",
"frequency": "Continuous",
"evidence": "Correction request records, data update logs",
},
{
"id": "PRV-008",
"tsc": "P6.1",
"description": "Third-party data sharing controls and notifications",
"type": "Preventive",
"frequency": "Continuous",
"evidence": "Data sharing agreements, third-party inventory, DPAs",
},
{
"id": "PRV-009",
"tsc": "P6.2",
"description": "Breach notification procedures",
"type": "Corrective",
"frequency": "Continuous",
"evidence": "Breach response plan, notification templates, incident records",
},
{
"id": "PRV-010",
"tsc": "P7.1",
"description": "Data quality and accuracy verification",
"type": "Detective",
"frequency": "Quarterly",
"evidence": "Data quality reports, accuracy checks, correction logs",
},
{
"id": "PRV-011",
"tsc": "P8.1",
"description": "Privacy program monitoring and compliance reviews",
"type": "Detective",
"frequency": "Quarterly",
"evidence": "Privacy audits, compliance dashboards, complaint tracking",
},
],
},
}
VALID_CATEGORIES = list(TSC_CONTROLS.keys())
def build_matrix(categories: List[str]) -> List[Dict[str, str]]:
"""Build a control matrix for the selected TSC categories."""
matrix = []
for cat in categories:
if cat not in TSC_CONTROLS:
continue
cat_data = TSC_CONTROLS[cat]
for ctrl in cat_data["controls"]:
matrix.append(
{
"control_id": ctrl["id"],
"tsc_criteria": ctrl["tsc"],
"category": cat_data["name"],
"description": ctrl["description"],
"control_type": ctrl["type"],
"frequency": ctrl["frequency"],
"evidence_required": ctrl["evidence"],
"owner": "TBD",
"status": "Not Started",
}
)
return matrix
def format_markdown(matrix: List[Dict[str, str]]) -> str:
"""Format control matrix as markdown table."""
lines = ["# SOC 2 Control Matrix", ""]
lines.append(
"| Control ID | TSC | Category | Description | Type | Frequency | Evidence | Owner | Status |"
)
lines.append(
"|------------|-----|----------|-------------|------|-----------|----------|-------|--------|"
)
for row in matrix:
lines.append(
"| {control_id} | {tsc_criteria} | {category} | {description} | {control_type} | {frequency} | {evidence_required} | {owner} | {status} |".format(
**row
)
)
lines.append("")
lines.append(f"**Total Controls:** {len(matrix)}")
return "\n".join(lines)
def format_csv(matrix: List[Dict[str, str]]) -> str:
"""Format control matrix as CSV."""
output = io.StringIO()
if not matrix:
return ""
writer = csv.DictWriter(output, fieldnames=matrix[0].keys())
writer.writeheader()
writer.writerows(matrix)
return output.getvalue()
def format_json(matrix: List[Dict[str, str]]) -> str:
"""Format control matrix as JSON."""
return json.dumps({"controls": matrix, "total": len(matrix)}, indent=2)
def main():
parser = argparse.ArgumentParser(
description="SOC 2 Control Matrix Builder — generates control matrices from selected Trust Service Criteria categories."
)
parser.add_argument(
"--categories",
type=str,
required=True,
help=f"Comma-separated TSC categories: {','.join(VALID_CATEGORIES)}",
)
parser.add_argument(
"--format",
type=str,
choices=["md", "json", "csv"],
default="md",
help="Output format (default: md)",
)
parser.add_argument(
"--json",
action="store_true",
help="Shorthand for --format json",
)
args = parser.parse_args()
# Parse categories
categories = [c.strip().lower() for c in args.categories.split(",")]
invalid = [c for c in categories if c not in VALID_CATEGORIES]
if invalid:
print(
f"Error: Invalid categories: {', '.join(invalid)}. Valid options: {', '.join(VALID_CATEGORIES)}",
file=sys.stderr,
)
sys.exit(1)
# Build matrix
matrix = build_matrix(categories)
if not matrix:
print("No controls found for the selected categories.", file=sys.stderr)
sys.exit(1)
# Output
fmt = "json" if args.json else args.format
if fmt == "md":
print(format_markdown(matrix))
elif fmt == "json":
print(format_json(matrix))
elif fmt == "csv":
print(format_csv(matrix))
if __name__ == "__main__":
main()

@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""
SOC 2 Evidence Tracker
Tracks evidence collection status per control in a SOC 2 control matrix.
Reads a JSON control matrix (from control_matrix_builder.py) and reports
collection completeness, overdue items, and readiness scoring.
Usage:
python evidence_tracker.py --matrix controls.json --status
python evidence_tracker.py --matrix controls.json --status --json
"""
import argparse
import json
import sys
from datetime import datetime
from typing import Dict, List, Any
# Evidence status classifications
EVIDENCE_STATUSES = {
"collected": "Evidence gathered and verified",
"pending": "Evidence identified but not yet collected",
"overdue": "Evidence past its collection deadline",
"not_started": "No evidence collection initiated",
"not_applicable": "Control not applicable to the environment",
}
# Expected evidence fields for a well-formed control entry
REQUIRED_FIELDS = ["control_id", "tsc_criteria", "description", "evidence_required"]
def load_matrix(filepath: str) -> List[Dict[str, Any]]:
"""Load a control matrix from a JSON file."""
try:
with open(filepath, "r") as f:
data = json.load(f)
except FileNotFoundError:
print(f"Error: File not found: {filepath}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON in {filepath}: {e}", file=sys.stderr)
sys.exit(1)
# Accept both {"controls": [...]} and plain [...]
if isinstance(data, dict) and "controls" in data:
controls = data["controls"]
elif isinstance(data, list):
controls = data
else:
print(
"Error: Expected JSON with 'controls' array or a plain array.",
file=sys.stderr,
)
sys.exit(1)
return controls
def classify_evidence_status(control: Dict[str, Any]) -> str:
"""Classify the evidence collection status for a control."""
status = control.get("status", "Not Started").lower().strip()
evidence_date = control.get("evidence_date", "")
if status in ("not_applicable", "n/a", "not applicable"):
return "not_applicable"
if status in ("collected", "complete", "done"):
return "collected"
if status in ("pending", "in progress", "in_progress"):
# Check if overdue
if evidence_date:
try:
due = datetime.strptime(evidence_date, "%Y-%m-%d")
if due < datetime.now():
return "overdue"
except ValueError:
pass
return "pending"
if status in ("overdue", "late"):
return "overdue"
return "not_started"
def generate_status_report(controls: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Generate an evidence collection status report."""
total = len(controls)
status_counts = {s: 0 for s in EVIDENCE_STATUSES}
by_category: Dict[str, Dict[str, int]] = {}
issues: List[Dict[str, str]] = []
for ctrl in controls:
status = classify_evidence_status(ctrl)
status_counts[status] = status_counts.get(status, 0) + 1
category = ctrl.get("category", "Unknown")
if category not in by_category:
by_category[category] = {s: 0 for s in EVIDENCE_STATUSES}
by_category[category][status] += 1
# Flag issues
if status == "overdue":
issues.append(
{
"control_id": ctrl.get("control_id", "N/A"),
"tsc_criteria": ctrl.get("tsc_criteria", "N/A"),
"description": ctrl.get("description", "N/A"),
"issue": "Evidence collection overdue",
"evidence_date": ctrl.get("evidence_date", "N/A"),
}
)
elif status == "not_started":
issues.append(
{
"control_id": ctrl.get("control_id", "N/A"),
"tsc_criteria": ctrl.get("tsc_criteria", "N/A"),
"description": ctrl.get("description", "N/A"),
"issue": "Evidence collection not started",
}
)
# Check for missing required fields
missing = [f for f in REQUIRED_FIELDS if f not in ctrl or not ctrl[f]]
if missing:
issues.append(
{
"control_id": ctrl.get("control_id", "N/A"),
"issue": f"Missing fields: {', '.join(missing)}",
}
)
# Calculate readiness score
applicable = total - status_counts.get("not_applicable", 0)
collected = status_counts.get("collected", 0)
readiness_pct = round((collected / applicable * 100), 1) if applicable > 0 else 0.0
if readiness_pct >= 90:
readiness_rating = "Audit Ready"
elif readiness_pct >= 75:
readiness_rating = "Minor Gaps"
elif readiness_pct >= 50:
readiness_rating = "Significant Gaps"
else:
readiness_rating = "Not Ready"
return {
"summary": {
"total_controls": total,
"status_breakdown": status_counts,
"readiness_score": readiness_pct,
"readiness_rating": readiness_rating,
"report_date": datetime.now().strftime("%Y-%m-%d"),
},
"by_category": by_category,
"issues": issues,
}
def format_status_text(report: Dict[str, Any]) -> str:
"""Format the status report as human-readable text."""
lines = ["=" * 60, "SOC 2 Evidence Collection Status Report", "=" * 60, ""]
summary = report["summary"]
lines.append(f"Report Date: {summary['report_date']}")
lines.append(f"Total Controls: {summary['total_controls']}")
lines.append(
f"Readiness Score: {summary['readiness_score']}% ({summary['readiness_rating']})"
)
lines.append("")
# Status breakdown
lines.append("--- Status Breakdown ---")
for status, count in summary["status_breakdown"].items():
label = EVIDENCE_STATUSES.get(status, status)
lines.append(f" {status:15s}: {count:3d} ({label})")
lines.append("")
# By category
lines.append("--- By Category ---")
for cat, statuses in report["by_category"].items():
cat_total = sum(statuses.values())
cat_collected = statuses.get("collected", 0)
cat_pct = round(cat_collected / cat_total * 100, 1) if cat_total > 0 else 0
lines.append(f" {cat}: {cat_collected}/{cat_total} collected ({cat_pct}%)")
lines.append("")
# Issues
if report["issues"]:
lines.append(f"--- Issues ({len(report['issues'])}) ---")
for issue in report["issues"]:
ctrl_id = issue.get("control_id", "N/A")
desc = issue.get("issue", "Unknown issue")
lines.append(f" [{ctrl_id}] {desc}")
else:
lines.append("--- No Issues Found ---")
lines.append("")
return "\n".join(lines)
def main():
parser = argparse.ArgumentParser(
description="SOC 2 Evidence Tracker — tracks evidence collection status per control."
)
parser.add_argument(
"--matrix",
type=str,
required=True,
help="Path to JSON control matrix file (from control_matrix_builder.py)",
)
parser.add_argument(
"--status",
action="store_true",
help="Generate evidence collection status report",
)
parser.add_argument(
"--json",
action="store_true",
help="Output in JSON format",
)
args = parser.parse_args()
if not args.status:
parser.print_help()
print("\nError: --status flag is required.", file=sys.stderr)
sys.exit(1)
controls = load_matrix(args.matrix)
report = generate_status_report(controls)
if args.json:
print(json.dumps(report, indent=2))
else:
print(format_status_text(report))
if __name__ == "__main__":
main()

@@ -0,0 +1,479 @@
#!/usr/bin/env python3
"""
SOC 2 Gap Analyzer
Analyzes current controls against SOC 2 Trust Service Criteria requirements
and identifies gaps. Supports both Type I (design) and Type II (design +
operating effectiveness) analysis.
Usage:
python gap_analyzer.py --controls current_controls.json --type type1
python gap_analyzer.py --controls current_controls.json --type type2 --json
"""
import argparse
import json
import sys
from datetime import datetime
from typing import Dict, List, Any, Tuple
# Minimum required TSC criteria coverage per category
REQUIRED_TSC = {
"security": {
"CC1.1": "Integrity and ethical values",
"CC1.2": "Board oversight",
"CC1.3": "Organizational structure",
"CC1.4": "Competence commitment",
"CC1.5": "Accountability",
"CC2.1": "Information quality",
"CC2.2": "Internal communication",
"CC2.3": "External communication",
"CC3.1": "Risk objectives",
"CC3.2": "Risk identification",
"CC3.3": "Fraud risk consideration",
"CC3.4": "Change risk assessment",
"CC4.1": "Monitoring evaluations",
"CC4.2": "Deficiency communication",
"CC5.1": "Control activities selection",
"CC5.2": "Technology controls",
"CC5.3": "Policy deployment",
"CC6.1": "Logical access security",
"CC6.2": "Access provisioning",
"CC6.3": "Access removal",
"CC6.4": "Access review",
"CC6.5": "Physical access",
"CC6.6": "Encryption",
"CC6.7": "Data transmission restrictions",
"CC6.8": "Unauthorized software prevention",
"CC7.1": "Vulnerability management",
"CC7.2": "Anomaly monitoring",
"CC7.3": "Event evaluation",
"CC7.4": "Incident response",
"CC7.5": "Incident recovery",
"CC8.1": "Change management",
"CC9.1": "Vendor risk management",
"CC9.2": "Risk mitigation/transfer",
},
"availability": {
"A1.1": "Capacity and performance management",
"A1.2": "Backup and recovery",
"A1.3": "Recovery testing",
},
"confidentiality": {
"C1.1": "Confidential data identification",
"C1.2": "Confidential data protection",
"C1.3": "Confidential data disposal",
},
"processing-integrity": {
"PI1.1": "Processing accuracy",
"PI1.2": "Processing completeness",
"PI1.3": "Processing timeliness",
"PI1.4": "Processing authorization",
},
"privacy": {
"P1.1": "Privacy notice",
"P2.1": "Choice and consent",
"P3.1": "Data collection",
"P4.1": "Use and retention",
"P4.2": "Disposal",
"P5.1": "Access rights",
"P5.2": "Correction rights",
"P6.1": "Disclosure controls",
"P6.2": "Breach notification",
"P7.1": "Data quality",
"P8.1": "Privacy monitoring",
},
}
# Type II additional checks
TYPE2_CHECKS = [
{
"check": "evidence_period",
"description": "Evidence covers the full observation period",
"severity": "critical",
},
{
"check": "operating_consistency",
"description": "Control operated consistently throughout the period",
"severity": "critical",
},
{
"check": "exception_handling",
"description": "Exceptions are documented and addressed",
"severity": "high",
},
{
"check": "owner_accountability",
"description": "Control owners documented and accountable",
"severity": "medium",
},
{
"check": "evidence_timestamps",
"description": "Evidence has timestamps within the observation period",
"severity": "high",
},
{
"check": "frequency_adherence",
"description": "Control executed at the specified frequency",
"severity": "critical",
},
]
def load_controls(filepath: str) -> List[Dict[str, Any]]:
"""Load current controls from a JSON file."""
try:
with open(filepath, "r") as f:
data = json.load(f)
except FileNotFoundError:
print(f"Error: File not found: {filepath}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON in {filepath}: {e}", file=sys.stderr)
sys.exit(1)
if isinstance(data, dict) and "controls" in data:
return data["controls"]
elif isinstance(data, list):
return data
else:
print(
"Error: Expected JSON with 'controls' array or a plain array.",
file=sys.stderr,
)
sys.exit(1)
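# Illustrative input shape accepted by load_controls. The field names below
# match the keys read elsewhere in this script (control_id, tsc_criteria,
# description, status, owner, frequency, evidence_date); the values are
# hypothetical. A bare top-level array of the same objects is also accepted.
#
#   {
#     "controls": [
#       {
#         "control_id": "CTRL-001",
#         "tsc_criteria": "CC6.1",
#         "description": "SSO enforced for all production access",
#         "status": "Collected",
#         "owner": "security-team",
#         "frequency": "Continuous",
#         "evidence_date": "2026-01-15"
#       }
#     ]
#   }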
def detect_categories(controls: List[Dict[str, Any]]) -> List[str]:
"""Detect which TSC categories are represented in the controls."""
tsc_values = set()
for ctrl in controls:
tsc = ctrl.get("tsc_criteria", "")
if tsc:
tsc_values.add(tsc)
categories = set()
for cat, criteria in REQUIRED_TSC.items():
for tsc_id in criteria:
if tsc_id in tsc_values:
categories.add(cat)
break
# Always include security as it's required
categories.add("security")
return sorted(categories)
def analyze_coverage(
controls: List[Dict[str, Any]], categories: List[str]
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
"""Analyze TSC coverage and identify gaps."""
# Map existing controls by TSC criteria
covered_tsc = {}
for ctrl in controls:
tsc = ctrl.get("tsc_criteria", "")
if tsc:
if tsc not in covered_tsc:
covered_tsc[tsc] = []
covered_tsc[tsc].append(ctrl)
gaps = []
partial = []
covered = []
for cat in categories:
if cat not in REQUIRED_TSC:
continue
for tsc_id, tsc_desc in REQUIRED_TSC[cat].items():
if tsc_id not in covered_tsc:
gaps.append(
{
"tsc_criteria": tsc_id,
"description": tsc_desc,
"category": cat,
"gap_type": "missing",
"severity": "critical" if cat == "security" else "high",
"remediation": f"Implement control(s) addressing {tsc_id}: {tsc_desc}",
}
)
else:
ctrls = covered_tsc[tsc_id]
# Check for partial implementation
has_issues = False
for ctrl in ctrls:
status = ctrl.get("status", "").lower()
if status in ("not started", "not_started", ""):
has_issues = True
owner = ctrl.get("owner", "TBD")
if owner in ("TBD", "", "N/A"):
has_issues = True
if has_issues:
partial.append(
{
"tsc_criteria": tsc_id,
"description": tsc_desc,
"category": cat,
"gap_type": "partial",
"severity": "medium",
"controls": [c.get("control_id", "N/A") for c in ctrls],
"remediation": f"Complete implementation and assign owners for {tsc_id} controls",
}
)
else:
covered.append(
{
"tsc_criteria": tsc_id,
"description": tsc_desc,
"category": cat,
"controls": [c.get("control_id", "N/A") for c in ctrls],
}
)
return gaps, partial, covered
def analyze_type2_gaps(controls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Additional gap analysis for Type II operating effectiveness."""
type2_gaps = []
for ctrl in controls:
ctrl_id = ctrl.get("control_id", "N/A")
issues = []
# Check for evidence date coverage
evidence_date = ctrl.get("evidence_date", "")
if not evidence_date:
issues.append(
{
"check": "evidence_period",
"severity": "critical",
"detail": "No evidence date recorded",
}
)
# Check owner assignment
owner = ctrl.get("owner", "TBD")
if owner in ("TBD", "", "N/A"):
issues.append(
{
"check": "owner_accountability",
"severity": "medium",
"detail": "No control owner assigned",
}
)
# Check status for operating evidence
status = ctrl.get("status", "").lower()
if status not in ("collected", "complete", "done"):
issues.append(
{
"check": "operating_consistency",
"severity": "critical",
"detail": f"Control status is '{ctrl.get('status', 'Not Started')}' — operating evidence needed",
}
)
# Check frequency is defined
frequency = ctrl.get("frequency", "")
if not frequency:
issues.append(
{
"check": "frequency_adherence",
"severity": "critical",
"detail": "No control frequency defined",
}
)
if issues:
type2_gaps.append(
{
"control_id": ctrl_id,
"tsc_criteria": ctrl.get("tsc_criteria", "N/A"),
"description": ctrl.get("description", "N/A"),
"issues": issues,
}
)
return type2_gaps
def build_report(
controls: List[Dict[str, Any]],
audit_type: str,
categories: List[str],
gaps: List[Dict],
partial: List[Dict],
covered: List[Dict],
type2_gaps: List[Dict],
) -> Dict[str, Any]:
"""Build the complete gap analysis report."""
total_criteria = sum(
len(REQUIRED_TSC[c]) for c in categories if c in REQUIRED_TSC
)
covered_count = len(covered)
gap_count = len(gaps)
partial_count = len(partial)
    # Coverage counts only fully covered criteria; partial implementations
    # are reported separately and excluded from this percentage.
    coverage_pct = (
        round(covered_count / total_criteria * 100, 1) if total_criteria > 0 else 0
    )
critical_gaps = len([g for g in gaps if g.get("severity") == "critical"])
if coverage_pct >= 90 and critical_gaps == 0:
readiness = "Ready"
elif coverage_pct >= 75:
readiness = "Near Ready — address gaps before audit"
elif coverage_pct >= 50:
readiness = "Significant work needed"
else:
readiness = "Not ready — major build-out required"
report = {
"report_metadata": {
"audit_type": audit_type,
"categories_assessed": categories,
"report_date": datetime.now().strftime("%Y-%m-%d"),
"total_controls_assessed": len(controls),
},
"coverage_summary": {
"total_criteria": total_criteria,
"covered": covered_count,
"partially_covered": partial_count,
"missing": gap_count,
"coverage_percentage": coverage_pct,
"critical_gaps": critical_gaps,
"readiness_assessment": readiness,
},
"gaps": gaps,
"partial_implementations": partial,
"covered_criteria": covered,
}
if audit_type == "type2":
type2_issue_count = sum(len(g["issues"]) for g in type2_gaps)
report["type2_operating_gaps"] = {
"controls_with_issues": len(type2_gaps),
"total_issues": type2_issue_count,
"details": type2_gaps,
}
return report
def format_text_report(report: Dict[str, Any]) -> str:
"""Format the gap analysis report as human-readable text."""
lines = [
"=" * 65,
"SOC 2 Gap Analysis Report",
"=" * 65,
"",
]
meta = report["report_metadata"]
lines.append(f"Audit Type: {meta['audit_type'].upper()}")
lines.append(f"Report Date: {meta['report_date']}")
lines.append(f"Categories: {', '.join(meta['categories_assessed'])}")
lines.append(f"Controls: {meta['total_controls_assessed']}")
lines.append("")
# Coverage summary
cov = report["coverage_summary"]
lines.append("--- Coverage Summary ---")
lines.append(f" Total TSC Criteria: {cov['total_criteria']}")
lines.append(f" Fully Covered: {cov['covered']}")
lines.append(f" Partially Covered: {cov['partially_covered']}")
lines.append(f" Missing: {cov['missing']}")
lines.append(f" Coverage: {cov['coverage_percentage']}%")
lines.append(f" Critical Gaps: {cov['critical_gaps']}")
lines.append(f" Readiness: {cov['readiness_assessment']}")
lines.append("")
# Gaps
gaps = report.get("gaps", [])
if gaps:
lines.append(f"--- Missing Controls ({len(gaps)}) ---")
for g in gaps:
sev = g["severity"].upper()
lines.append(
f" [{sev}] {g['tsc_criteria']}: {g['description']}"
)
lines.append(f" Remediation: {g['remediation']}")
lines.append("")
# Partial
partial = report.get("partial_implementations", [])
if partial:
lines.append(f"--- Partial Implementations ({len(partial)}) ---")
for p in partial:
ctrls = ", ".join(p.get("controls", []))
lines.append(
f" [{p['severity'].upper()}] {p['tsc_criteria']}: {p['description']}"
)
lines.append(f" Controls: {ctrls}")
lines.append(f" Remediation: {p['remediation']}")
lines.append("")
# Type II operating gaps
if "type2_operating_gaps" in report:
t2 = report["type2_operating_gaps"]
lines.append(
f"--- Type II Operating Gaps ({t2['controls_with_issues']} controls, {t2['total_issues']} issues) ---"
)
for detail in t2["details"]:
lines.append(f" [{detail['control_id']}] {detail['description']}")
for issue in detail["issues"]:
lines.append(
f" - [{issue['severity'].upper()}] {issue['check']}: {issue['detail']}"
)
lines.append("")
return "\n".join(lines)
def main():
parser = argparse.ArgumentParser(
description="SOC 2 Gap Analyzer — identifies gaps between current controls and SOC 2 requirements."
)
parser.add_argument(
"--controls",
type=str,
required=True,
help="Path to JSON file with current controls (from control_matrix_builder.py or custom)",
)
parser.add_argument(
"--type",
type=str,
choices=["type1", "type2"],
default="type1",
help="Audit type: type1 (design only) or type2 (design + operating effectiveness)",
)
parser.add_argument(
"--json",
action="store_true",
help="Output in JSON format",
)
args = parser.parse_args()
controls = load_controls(args.controls)
categories = detect_categories(controls)
gaps, partial, covered = analyze_coverage(controls, categories)
type2_gaps = []
if args.type == "type2":
type2_gaps = analyze_type2_gaps(controls)
report = build_report(
controls, args.type, categories, gaps, partial, covered, type2_gaps
)
if args.json:
print(json.dumps(report, indent=2))
else:
print(format_text_report(report))
if __name__ == "__main__":
main()