Merge pull request #408 from alirezarezvani/feature/sprint-improvements

improve(engineering): enhance 5 existing skills — tdd-guide, env-secrets-manager, senior-secops, database-designer, senior-devops
2026-03-25 14:22:04 +01:00
parent ea2b33ab52 67e2bfabfa
commit c1b2aacb74
5 changed files with 784 additions and 0 deletions
--- a/engineering-team/senior-devops/SKILL.md
+++ b/engineering-team/senior-devops/SKILL.md
@@ -270,6 +270,54 @@ kubectl get pods -n production -l app=myapp
 curl -sf https://app.example.com/healthz || echo "ROLLBACK FAILED — escalate"
 ```

+## Multi-Cloud Cross-References
+
+Use these companion skills for cloud-specific deep dives:
+
+| Skill | Cloud | Use When |
+|-------|-------|----------|
+| **aws-solution-architect** | AWS | ECS/EKS, Lambda, VPC design, cost optimization |
+| **azure-cloud-architect** | Azure | AKS, App Service, Virtual Networks, Azure DevOps |
+| **gcp-cloud-architect** | GCP | GKE, Cloud Run, VPC, Cloud Build *(coming soon)* |
+
+**Multi-cloud vs single-cloud decision:**
+- **Single-cloud** (default) — lower operational complexity, deeper managed-service integration, better cost leverage with committed-use discounts
+- **Multi-cloud** — required when mandated by compliance/data residency, acquiring companies on different clouds, or needing best-of-breed services across providers (e.g., AWS for compute + GCP for ML)
+- **Hybrid** — on-prem + cloud; use when regulated workloads must stay on-prem while burst/non-sensitive workloads run in the cloud
+
+> Start single-cloud. Add a second cloud only when there is a concrete business or compliance driver — not for theoretical redundancy.
+
+---
+
+## Cloud-Agnostic IaC
+
+### Terraform / OpenTofu (Default Choice)
+
+Terraform (or its open-source fork OpenTofu) is the recommended IaC tool for most teams:
+- Single language (HCL) across AWS, Azure, GCP, and 3,000+ providers
+- State management with remote backends (S3, GCS, Azure Blob)
+- Plan-before-apply workflow prevents drift surprises
+- Cross-reference **terraform-patterns** for module structure, state isolation, and CI/CD integration
+
+### Pulumi (Programming Language IaC)
+
+Choose Pulumi when the team strongly prefers TypeScript, Python, Go, or C# over HCL:
+- Full programming language — loops, conditionals, unit tests native
+- Same cloud provider coverage as Terraform
+- Easier onboarding for dev teams that resist learning HCL
+
+### When to Use Cloud-Native IaC
+
+| Tool | Use When |
+|------|----------|
+| **CloudFormation** | AWS-only shop; need native AWS support (StackSets, Service Catalog) |
+| **Bicep** | Azure-only shop; simpler syntax than ARM templates |
+| **Cloud Deployment Manager** | GCP-only; rare — most GCP teams prefer Terraform |
+
+> **Rule of thumb:** Use Terraform/OpenTofu unless you are 100% committed to a single cloud AND the cloud-native tool offers a feature Terraform cannot replicate (e.g., AWS Service Catalog integration).
+
+---
+
 ## Troubleshooting

 Check the comprehensive troubleshooting section in `references/deployment_strategies.md`.
--- a/engineering-team/senior-secops/SKILL.md
+++ b/engineering-team/senior-secops/SKILL.md
@@ -413,6 +413,89 @@ app.use((req, res, next) => {

 ---

+## OWASP Top 10 Quick-Check
+
+Rapid 15-minute assessment against the OWASP Top 10 (2021 edition): run through each category and note pass/fail. For deep-dive testing, hand off to the **security-pen-testing** skill.
+
+| # | Category | One-Line Check |
+|---|----------|----------------|
+| A01 | Broken Access Control | Verify role checks on every endpoint; test horizontal privilege escalation |
+| A02 | Cryptographic Failures | Confirm TLS 1.2+ everywhere; no secrets in logs or source |
+| A03 | Injection | Run parameterized query audit; check ORM raw-query usage |
+| A04 | Insecure Design | Review threat model exists for critical flows |
+| A05 | Security Misconfiguration | Check default credentials removed; error pages generic |
+| A06 | Vulnerable Components | Run `vulnerability_assessor.py`; zero critical/high CVEs |
+| A07 | Auth Failures | Verify MFA on admin; brute-force protection active |
+| A08 | Software & Data Integrity | Confirm CI/CD pipeline signs artifacts; no unsigned deps |
+| A09 | Logging & Monitoring | Validate audit logs capture auth events; alerts configured |
+| A10 | SSRF | Test internal URL filters; block metadata endpoints (169.254.169.254) |
+
+> **Deep dive needed?** Hand off to `security-pen-testing` for full OWASP Testing Guide coverage.
+
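The A10 check in the table (block metadata endpoints such as 169.254.169.254) can be illustrated with a small literal-IP filter. This is a minimal sketch using only the Python standard library; the function name `is_blocked_target` is hypothetical, and it deliberately does not resolve hostnames, which a production filter would also need to do.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_target(url: str) -> bool:
    """Return True when the URL's host is a literal IP in an internal range."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable URL: fail closed
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostname, not an IP literal. A real filter must resolve it and
        # re-check every resolved address (omitted in this sketch).
        return False
    # Link-local covers the cloud metadata address 169.254.169.254.
    return addr.is_loopback or addr.is_link_local or addr.is_private
```

A filter like this is only one layer: DNS rebinding and redirects can still reach internal addresses unless the resolved IP is checked at connection time.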
+---
+
+## Secret Scanning Tools
+
+Choose the right scanner for each stage of your workflow:
+
+| Tool | Best For | Language | Pre-commit | CI/CD | Custom Rules |
+|------|----------|----------|:----------:|:-----:|:------------:|
+| **gitleaks** | CI pipelines, full-repo scans | Go | Yes | Yes | TOML regexes |
+| **detect-secrets** | Pre-commit hooks, incremental | Python | Yes | Partial | Plugin-based |
+| **TruffleHog** | Deep history scans, entropy | Go | Yes | Yes | Regex + entropy |
+
+**Recommended setup:** Use `detect-secrets` as a pre-commit hook (catches secrets before they enter history) and `gitleaks` in CI (catches anything that slips through).
+
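+```yaml
+# .pre-commit-config.yaml: detect-secrets pre-commit hook (goes under the top-level `repos:` key)
+- repo: https://github.com/Yelp/detect-secrets
+  rev: v1.5.0
+  hooks:
+    - id: detect-secrets
+      args: ['--baseline', '.secrets.baseline']
+
+# GitHub Actions workflow step: gitleaks in CI (check out with fetch-depth: 0 to scan full history)
+- name: gitleaks
+  uses: gitleaks/gitleaks-action@v2
+  env:
+    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+    GITLEAKS_LICENSE: ${{ secrets.GITLEAKS_LICENSE }}  # only needed for organization accounts
+```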
+
+---
+
+## Supply Chain Security
+
+Protect against dependency and artifact tampering with SBOM generation, artifact signing, and SLSA compliance.
+
+**SBOM Generation:**
+- **syft** — generates SBOMs from container images or source dirs (SPDX, CycloneDX formats)
+- **cyclonedx-cli** — CycloneDX-native tooling; merge multiple SBOMs for mono-repos
+
+```bash
+# Generate SBOM from container image (the older `syft packages` subcommand is deprecated)
+syft ghcr.io/org/app:latest -o cyclonedx-json > sbom.json
+```
+
+**Artifact Signing (Sigstore/cosign):**
+```bash
+# Sign a container image (keyless via OIDC)
+cosign sign ghcr.io/org/app:latest
+# Verify signature
+cosign verify ghcr.io/org/app:latest --certificate-identity=ci@org.com --certificate-oidc-issuer=https://token.actions.githubusercontent.com
+```
+
+**SLSA Levels Overview:**
+
+| Level | Requirement | What It Proves |
+|-------|-------------|----------------|
+| 1 | Build process documented | Provenance exists |
+| 2 | Hosted build service, signed provenance | Tamper-resistant provenance |
+| 3 | Hardened build platform, non-falsifiable provenance | Tamper-proof build |
+| 4 | Two-party review, hermetic builds (SLSA v0.1 only; the v1.0 Build track stops at Level 3) | Maximum supply-chain assurance |
+
+> **Cross-references:** `security-pen-testing` (vulnerability exploitation testing), `dependency-auditor` (license and CVE audit for dependencies).
+
+---
+
 ## Reference Documentation

 | Document | Description |
--- a/engineering-team/tdd-guide/SKILL.md
+++ b/engineering-team/tdd-guide/SKILL.md
@@ -148,6 +148,254 @@ Additional scripts: `framework_adapter.py` (convert between frameworks), `metric

 ---

+## Spec-First Workflow
+
+TDD is most effective when driven by a written spec. The flow:
+
+1. **Write or receive a spec** — stored in `specs/<feature>.md`
+2. **Extract acceptance criteria** — each criterion becomes one or more test cases
+3. **Write failing tests (RED)** — one test per acceptance criterion
+4. **Implement minimal code (GREEN)** — satisfy each test in order
+5. **Refactor** — clean up while all tests stay green
+
+### Spec Directory Convention
+
+```
+project/
+├── specs/
+│   ├── user-auth.md          # Feature spec with acceptance criteria
+│   ├── payment-processing.md
+│   └── notification-system.md
+├── tests/
+│   ├── test_user_auth.py     # Tests derived from specs/user-auth.md
+│   ├── test_payments.py
+│   └── test_notifications.py
+└── src/
+```
+
+### Extracting Tests from Specs
+
+Each acceptance criterion in a spec maps to at least one test:
+
+| Spec Criterion | Test Case |
+|---------------|-----------|
+| "User can log in with valid credentials" | `test_login_valid_credentials_returns_token` |
+| "Invalid password returns 401" | `test_login_invalid_password_returns_401` |
+| "Account locks after 5 failed attempts" | `test_login_locks_after_five_failures` |
+
+**Tip:** Number your acceptance criteria in the spec. Reference the number in the test docstring or a leading comment for traceability (`# AC-3: Account locks after 5 failed attempts`).
+
+> **Cross-reference:** See `engineering/spec-driven-workflow` for the full spec methodology, including spec templates and review checklists.
+
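To make the criterion-to-test mapping concrete, here is a minimal sketch in Python. `LoginService` and `AccountLocked` are hypothetical stand-ins that exist only so the tests have something to run against, and the AC numbers assume the three criteria in the table are numbered in order.

```python
MAX_FAILED_ATTEMPTS = 5  # AC-3 threshold, taken from the spec table


class AccountLocked(Exception):
    """Raised once an account has reached the failed-attempt limit."""


class LoginService:
    """Hypothetical in-memory login service used only to host the tests."""

    def __init__(self, users: dict[str, str]) -> None:
        self._users = users
        self._failures: dict[str, int] = {}

    def login(self, username: str, password: str) -> int:
        """Return an HTTP-style status code: 200 on success, 401 on failure."""
        if self._failures.get(username, 0) >= MAX_FAILED_ATTEMPTS:
            raise AccountLocked(username)
        if self._users.get(username) == password:
            self._failures[username] = 0
            return 200
        self._failures[username] = self._failures.get(username, 0) + 1
        return 401


def test_login_invalid_password_returns_401():
    """AC-2: Invalid password returns 401."""
    service = LoginService({"ada": "correct-horse"})
    assert service.login("ada", "wrong") == 401


def test_login_locks_after_five_failures():
    """AC-3: Account locks after 5 failed attempts."""
    service = LoginService({"ada": "correct-horse"})
    for _ in range(MAX_FAILED_ATTEMPTS):
        service.login("ada", "wrong")
    try:
        service.login("ada", "correct-horse")
    except AccountLocked:
        return
    raise AssertionError("expected AccountLocked after five failures")
```

Keeping the AC number in the docstring means a failing test name and its docstring together point straight back at the line of the spec that was violated.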
+---
+
+## Red-Green-Refactor Examples Per Language
+
+### TypeScript / Jest
+
+```typescript
+// test/cart.test.ts
+describe("Cart", () => {
+  describe("addItem", () => {
+    it("should add a new item to an empty cart", () => {
+      const cart = new Cart();
+      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
+
+      expect(cart.items).toHaveLength(1);
+      expect(cart.items[0].id).toBe("sku-1");
+    });
+
+    it("should increment quantity when adding an existing item", () => {
+      const cart = new Cart();
+      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
+      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 2 });
+
+      expect(cart.items).toHaveLength(1);
+      expect(cart.items[0].qty).toBe(3);
+    });
+
+    it("should throw when quantity is zero or negative", () => {
+      const cart = new Cart();
+      expect(() =>
+        cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 0 })
+      ).toThrow("Quantity must be positive");
+    });
+  });
+});
+```
+
+### Python / Pytest (Advanced Patterns)
+
+```python
+# tests/conftest.py — shared fixtures
+import pytest
+from app.db import create_engine, Session
+
+@pytest.fixture(scope="session")
+def db_engine():
+    engine = create_engine("sqlite:///:memory:")
+    yield engine
+    engine.dispose()
+
+@pytest.fixture
+def db_session(db_engine):
+    session = Session(bind=db_engine)
+    yield session
+    session.rollback()
+    session.close()
+
+# tests/test_pricing.py — parametrize for multiple cases
+import pytest
+from app.pricing import calculate_discount
+
+@pytest.mark.parametrize("subtotal, expected_discount", [
+    (50.0, 0.0),       # Below threshold — no discount
+    (100.0, 5.0),      # 5% tier
+    (250.0, 25.0),     # 10% tier
+    (500.0, 75.0),     # 15% tier
+])
+def test_calculate_discount(subtotal, expected_discount):
+    assert calculate_discount(subtotal) == pytest.approx(expected_discount)
+```
+
+### Go — Table-Driven Tests
+
+```go
+// cart_test.go
+package cart
+
+import "testing"
+
+func TestApplyDiscount(t *testing.T) {
+    tests := []struct {
+        name     string
+        subtotal float64
+        want     float64
+    }{
+        {"no discount below threshold", 50.0, 0.0},
+        {"5 percent tier", 100.0, 5.0},
+        {"10 percent tier", 250.0, 25.0},
+        {"15 percent tier", 500.0, 75.0},
+        {"zero subtotal", 0.0, 0.0},
+    }
+
+    for _, tt := range tests {
+        t.Run(tt.name, func(t *testing.T) {
+            got := ApplyDiscount(tt.subtotal)
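+            // Compare floats with a small tolerance instead of exact equality.
+            if diff := got - tt.want; diff > 1e-9 || diff < -1e-9 {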
+                t.Errorf("ApplyDiscount(%v) = %v, want %v", tt.subtotal, got, tt.want)
+            }
+        })
+    }
+}
+```
+
+---
+
+## Bounded Autonomy Rules
+
+When generating tests autonomously, follow these rules to decide when to stop and ask the user:
+
+### Stop and Ask When
+
+- **Ambiguous requirements** — the spec or user story has conflicting or unclear acceptance criteria
+- **Missing edge cases** — you cannot determine boundary values without domain knowledge (e.g., max allowed transaction amount)
+- **Test count exceeds 50** — large test suites need human review before committing; present a summary and ask which areas to prioritize
+- **External dependencies unclear** — the feature relies on third-party APIs or services with undocumented behavior
+- **Security-sensitive logic** — authentication, authorization, encryption, or payment flows require human sign-off on test scenarios
+
+### Continue Autonomously When
+
+- **Clear spec with numbered acceptance criteria** — each criterion maps directly to tests
+- **Straightforward CRUD operations** — create, read, update, delete with well-defined models
+- **Well-defined API contracts** — OpenAPI spec or typed interfaces available
+- **Pure functions** — deterministic input/output with no side effects
+- **Existing test patterns** — the codebase already has similar tests to follow
+
+---
+
+## Property-Based Testing
+
+Property-based testing generates random inputs to verify invariants instead of relying on hand-picked examples. Use it when the input space is large and the expected behavior can be described as a property.
+
+### Python — Hypothesis
+
+```python
+from hypothesis import given, strategies as st
+from app.serializers import serialize, deserialize
+
+@given(st.text())
+def test_roundtrip_serialization(data):
+    """Serialization followed by deserialization returns the original."""
+    assert deserialize(serialize(data)) == data
+
+@given(st.integers(), st.integers())
+def test_addition_is_commutative(a, b):
+    assert a + b == b + a
+```
+
+### TypeScript — fast-check
+
+```typescript
+import fc from "fast-check";
+import { encode, decode } from "./codec";
+
+test("encode/decode roundtrip", () => {
+  fc.assert(
+    fc.property(fc.string(), (input) => {
+      expect(decode(encode(input))).toBe(input);
+    })
+  );
+});
+```
+
+### When to Use Property-Based Over Example-Based
+
+| Use Property-Based | Example |
+|-------------------|---------|
+| Data transformations | Serialize/deserialize roundtrips |
+| Mathematical properties | Commutativity, associativity, idempotency |
+| Encoding/decoding | Base64, URL encoding, compression |
+| Sorting and filtering | Output is sorted, length preserved |
+| Parser correctness | Valid input always parses without error |
+
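The "output is sorted, length preserved" row can be sketched without any third-party library. The helper below is a hand-rolled stand-in for what Hypothesis or fast-check do (it has no shrinking and no smart generators); `check_property` and `random_int_list` are hypothetical names introduced for illustration.

```python
import random


def check_property(prop, gen, runs: int = 200, seed: int = 0) -> None:
    """Call prop on `runs` generated inputs; raise on the first counterexample."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = gen(rng)
        if not prop(value):
            raise AssertionError(f"property failed for input: {value!r}")


def random_int_list(rng: random.Random) -> list[int]:
    """Generate a list of 0 to 20 integers in [-1000, 1000]."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]


def sorted_is_ordered_and_same_length(xs: list[int]) -> bool:
    """Property: sorting yields a non-decreasing list of the same length."""
    out = sorted(xs)
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return ordered and len(out) == len(xs)


check_property(sorted_is_ordered_and_same_length, random_int_list)
```

The property says nothing about which specific lists are tried; that is the point. A real library adds input shrinking, which reduces a failing case to a minimal counterexample.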
+---
+
+## Mutation Testing
+
+Mutation testing modifies your production code (creates "mutants") and checks whether your tests catch the changes. If a mutant survives (tests still pass), your tests have a gap that coverage alone cannot reveal.
+
+### Tools
+
+| Language | Tool | Command |
+|----------|------|---------|
+| TypeScript/JavaScript | **Stryker** | `npx stryker run` |
+| Python | **mutmut** | `mutmut run --paths-to-mutate=src/` (2.x CLI flag; 3.x reads `paths_to_mutate` from project config) |
+| Java | **PIT** | `mvn org.pitest:pitest-maven:mutationCoverage` |
+
+### Why Mutation Testing Matters
+
+- **100% line coverage != good tests** — coverage tells you code was executed, not that it was verified
+- **Catches weak assertions** — tests that run code but assert nothing meaningful
+- **Finds missing boundary tests** — mutants that change `<` to `<=` expose off-by-one gaps
+- **Quantifiable quality metric** — mutation score (% mutants killed) is a stronger signal than coverage %
+
+**Recommendation:** Run mutation testing on critical paths (auth, payments, data processing) even if overall coverage is high. Target 85%+ mutation score on P0 modules.
+
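The boundary-mutant point can be shown by hand, without a mutation tool. In this sketch the mutant is written manually to mimic the `>=` to `>` change a tool might generate; all function names are hypothetical.

```python
def is_adult(age: int) -> bool:
    """Production code under test."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: `>=` changed to `>`, as a mutation tool might do."""
    return age > 18


def weak_suite(fn) -> bool:
    """Covers every line of fn but never probes the boundary value."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary case, which distinguishes `>=` from `>`."""
    return weak_suite(fn) and fn(18) is True


def mutant_survives(suite) -> bool:
    """A mutant survives when the suite passes on both original and mutant."""
    return suite(is_adult) and suite(is_adult_mutant)
```

Both suites reach 100% line coverage of `is_adult`, yet only the one with the `fn(18)` assertion kills the mutant.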
+---
+
+## Cross-References
+
+| Skill | Relationship |
+|-------|-------------|
+| `engineering/spec-driven-workflow` | Spec → acceptance criteria → test extraction pipeline |
+| `engineering-team/focused-fix` | Phase 5 (Verify) uses TDD to confirm the fix with a regression test |
+| `engineering-team/senior-qa` | Broader QA strategy; TDD is one layer in the test pyramid |
+| `engineering-team/code-reviewer` | Review generated tests for assertion quality and coverage completeness |
+| `engineering-team/senior-fullstack` | Project scaffolders include testing infrastructure compatible with TDD workflows |
+
+---
+
 ## Limitations

 | Scope | Details |
--- a/engineering/database-designer/SKILL.md
+++ b/engineering/database-designer/SKILL.md
@@ -59,6 +59,229 @@ A comprehensive database design skill that provides expert-level analysis, optim
 4. **Validate inputs**: Prevent SQL injection attacks
 5. **Regular security updates**: Keep database software current

+## Query Generation Patterns
+
+### SELECT with JOINs
+
+```sql
+-- INNER JOIN: only matching rows
+SELECT o.id, c.name, o.total
+FROM orders o
+INNER JOIN customers c ON c.id = o.customer_id;
+
+-- LEFT JOIN: all left rows, NULLs for non-matches
+SELECT c.name, COUNT(o.id) AS order_count
+FROM customers c
+LEFT JOIN orders o ON o.customer_id = c.id
+GROUP BY c.id, c.name;  -- group by the key so customers sharing a name stay separate
+
+-- Self-join: hierarchical data (employees/managers)
+SELECT e.name AS employee, m.name AS manager
+FROM employees e
+LEFT JOIN employees m ON m.id = e.manager_id;
+```
+
+### Common Table Expressions (CTEs)
+
+```sql
+-- Recursive CTE for org chart
+WITH RECURSIVE org AS (
+  SELECT id, name, manager_id, 1 AS depth
+  FROM employees WHERE manager_id IS NULL
+  UNION ALL
+  SELECT e.id, e.name, e.manager_id, o.depth + 1
+  FROM employees e INNER JOIN org o ON o.id = e.manager_id
+)
+SELECT * FROM org ORDER BY depth, name;
+```
+
+### Window Functions
+
+```sql
+-- ROW_NUMBER for pagination / dedup
+SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
+FROM orders;
+
+-- RANK leaves gaps after ties; DENSE_RANK does not
+SELECT name, score,
+  RANK() OVER (ORDER BY score DESC) AS score_rank,
+  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_score_rank
+FROM leaderboard;
+
+-- LAG/LEAD for comparing adjacent rows
+SELECT date, revenue,
+  revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change
+FROM daily_sales;
+```
+
+### Aggregation Patterns
+
+```sql
+-- FILTER clause (PostgreSQL) for conditional aggregation
+SELECT
+  COUNT(*) AS total,
+  COUNT(*) FILTER (WHERE status = 'active') AS active,
+  AVG(amount) FILTER (WHERE amount > 0) AS avg_positive
+FROM accounts;
+
+-- GROUPING SETS for multi-level rollups
+SELECT region, product, SUM(revenue)
+FROM sales
+GROUP BY GROUPING SETS ((region, product), (region), ());
+```
+
+---
+
+## Migration Patterns
+
+### Up/Down Migration Scripts
+
+Every migration must have a reversible counterpart. Name files with a timestamp prefix for ordering:
+
+```
+migrations/
+├── 20260101_000001_create_users.up.sql
+├── 20260101_000001_create_users.down.sql
+├── 20260115_000002_add_users_email_index.up.sql
+└── 20260115_000002_add_users_email_index.down.sql
+```
+
+### Zero-Downtime Migrations (Expand/Contract)
+
+Use the expand-contract pattern to avoid locking or breaking running code:
+
+1. **Expand** — add the new column/table (nullable, with default)
+2. **Migrate data** — backfill in batches; dual-write from application
+3. **Transition** — application reads from new column; stop writing to old
+4. **Contract** — drop old column in a follow-up migration
+
+### Data Backfill Strategies
+
+```sql
+-- Batch update to avoid long-running locks
+UPDATE users SET email_normalized = LOWER(email)
+WHERE id IN (SELECT id FROM users WHERE email_normalized IS NULL LIMIT 5000);
+-- Repeat in a loop until 0 rows affected
+```
+
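The "repeat in a loop until 0 rows affected" step might look like the following sketch, run here against an in-memory SQLite database so it is self-contained. The table layout and the function name are assumptions. The extra `email IS NOT NULL` filter is an addition: without it, rows with a NULL email would never be updated and the loop would not terminate.

```python
import sqlite3


def backfill_email_normalized(conn: sqlite3.Connection, batch_size: int = 5000) -> int:
    """Backfill in batches, committing after each one; return total rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users SET email_normalized = LOWER(email)
            WHERE id IN (
                SELECT id FROM users
                WHERE email_normalized IS NULL AND email IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # end the transaction so locks are released between batches
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Demo on an in-memory SQLite database (table layout is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_normalized TEXT)"
)
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"User{i}@Example.COM",) for i in range(12)],
)
updated = backfill_email_normalized(conn, batch_size=5)
```

On a production database a short sleep between batches is a common addition, so replicas and other writers can keep up.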
+### Rollback Procedures
+
+- Always test the `down.sql` in staging before deploying `up.sql` to production
+- Keep rollback window short — if the contract step has run, rollback requires a new forward migration
+- For irreversible changes (dropping columns with data), take a logical backup first
+
+---
+
+## Performance Optimization
+
+### Indexing Strategies
+
+| Index Type | Use Case | Example |
+|------------|----------|---------|
+| **B-tree** (default) | Equality, range, ORDER BY | `CREATE INDEX idx_users_email ON users(email);` |
+| **GIN** | Full-text search, JSONB, arrays | `CREATE INDEX idx_docs_body ON docs USING gin(to_tsvector('english', body));` |
+| **GiST** | Geometry, range types, nearest-neighbor | `CREATE INDEX idx_locations ON places USING gist(coords);` |
+| **Partial** | Subset of rows (reduce size) | `CREATE INDEX idx_active ON users(email) WHERE active = true;` |
+| **Covering** | Index-only scans | `CREATE INDEX idx_cov ON orders(customer_id) INCLUDE (total, created_at);` |
+
+### EXPLAIN Plan Reading
+
+```sql
+EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;
+```
+
+Key signals to watch:
+- **Seq Scan** on large tables — missing index
+- **Nested Loop** with high row estimates — consider hash/merge join or add index
+- **Buffers shared read** much higher than **hit** — working set exceeds memory
+
+### N+1 Query Detection
+
+Symptoms: application issues one query per row (e.g., fetching related records in a loop).
+
+Fixes:
+- Use `JOIN` or subquery to fetch in one round-trip
+- ORM eager loading (`select_related` / `includes` / `with`)
+- DataLoader pattern for GraphQL resolvers
+
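The difference between the N+1 shape and the single-`JOIN` fix is easiest to see by counting statements. The sketch below uses SQLite and a hypothetical `CountingConnection` wrapper; the `customers`/`orders` schema mirrors the earlier query examples but is otherwise assumed.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many statements are sent to the database."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queries = 0

    def execute(self, sql: str, params: tuple = ()):
        self.queries += 1
        return self.conn.execute(sql, params)


def totals_n_plus_one(db: CountingConnection) -> dict:
    """One query for customers, then one more per customer (the N+1 shape)."""
    result = {}
    for cid, name in db.execute("SELECT id, name FROM customers").fetchall():
        row = db.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        result[name] = row[0]
    return result


def totals_single_join(db: CountingConnection) -> dict:
    """Same result in one round-trip using LEFT JOIN + GROUP BY."""
    rows = db.execute(
        """
        SELECT c.name, COALESCE(SUM(o.total), 0)
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        """
    ).fetchall()
    return dict(rows)


raw = sqlite3.connect(":memory:")
raw.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """
)
```

With three customers the loop version issues four statements and the join version issues one; the gap grows linearly with the number of parent rows.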
+### Connection Pooling
+
+| Tool | Protocol | Best For |
+|------|----------|----------|
+| **PgBouncer** | PostgreSQL | Transaction/statement pooling, low overhead |
+| **ProxySQL** | MySQL | Query routing, read/write splitting |
+| **Built-in pool** (HikariCP, SQLAlchemy pool) | Any | Application-level pooling |
+
+**Rule of thumb:** Set pool size to `(2 * CPU cores) + disk spindles`. For cloud SSDs, start with `2 * vCPUs` and tune.
+
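The rule of thumb translates directly into a helper. This is only a starting-point calculation under the formula stated above; the function name is hypothetical, and the result should still be tuned against real load.

```python
import os


def suggested_pool_size(cpu_cores: int, disk_spindles: int = 0) -> int:
    """Starting point from the rule of thumb: (2 * cores) + spindles."""
    if cpu_cores < 1 or disk_spindles < 0:
        raise ValueError("cpu_cores must be >= 1 and disk_spindles >= 0")
    return 2 * cpu_cores + disk_spindles


# Cloud SSD case: no spindles, so the formula reduces to 2 * vCPUs.
# Note: this reads the local machine's core count; the formula is meant
# for the cores of the database server.
local_guess = suggested_pool_size(os.cpu_count() or 1)
```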
+### Read Replicas and Query Routing
+
+- Route read-only `SELECT` queries that tolerate slightly stale data to replicas; send writes and read-your-own-writes queries to the primary
+- Account for replication lag (typically <1s for async, 0 for sync)
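+- On a PostgreSQL replica, compare `pg_last_wal_replay_lsn()` with the primary's `pg_current_wal_lsn()` (or check `now() - pg_last_xact_replay_timestamp()`) to detect lag before reading critical data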
+
+---
+
+## Multi-Database Decision Matrix
+
+| Criteria | PostgreSQL | MySQL | SQLite | SQL Server |
+|----------|-----------|-------|--------|------------|
+| **Best for** | Complex queries, JSONB, extensions | Web apps, read-heavy workloads | Embedded, dev/test, edge | Enterprise .NET stacks |
+| **JSON support** | Excellent (JSONB + GIN) | Good (JSON type) | Basic (built-in JSON functions) | Good (OPENJSON) |
+| **Replication** | Streaming, logical | Group replication, InnoDB cluster | N/A | Always On AG |
+| **Licensing** | Open source (PostgreSQL License) | Open source (GPL) / commercial | Public domain | Commercial |
+| **Max practical size** | Multi-TB | Multi-TB | ~1 TB (single-writer) | Multi-TB |
+
+**When to choose:**
+- **PostgreSQL** — default choice for new projects; best extensibility and standards compliance
+- **MySQL** — existing MySQL ecosystem; simple read-heavy web applications
+- **SQLite** — mobile apps, CLI tools, unit test databases, IoT/edge
+- **SQL Server** — mandated by enterprise policy; deep .NET/Azure integration
+
+### NoSQL Considerations
+
+| Database | Model | Use When |
+|----------|-------|----------|
+| **MongoDB** | Document | Schema flexibility, rapid prototyping, content management |
+| **Redis** | Key-value / cache | Session store, rate limiting, leaderboards, pub/sub |
+| **DynamoDB** | Key-value / document | Serverless AWS apps, single-digit-ms latency at any scale |
+
+> Use SQL as default. Reach for NoSQL only when the access pattern clearly benefits from it.
+
+---
+
+## Sharding & Replication
+
+### Horizontal vs Vertical Partitioning
+
+- **Vertical partitioning**: Split columns across tables (e.g., separate BLOB columns). Reduces I/O for narrow queries.
+- **Horizontal partitioning (sharding)**: Split rows across databases/servers. Required when a single node cannot hold the dataset or handle the throughput.
+
+### Sharding Strategies
+
+| Strategy | How It Works | Pros | Cons |
+|----------|-------------|------|------|
+| **Hash** | `shard = hash(key) % N` | Even distribution | Resharding is expensive |
+| **Range** | Shard by date or ID range | Simple, good for time-series | Hot spots on latest shard |
+| **Geographic** | Shard by user region | Data locality, compliance | Cross-region queries are hard |
+
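The "resharding is expensive" drawback of hash sharding can be measured directly. This sketch uses `hashlib` rather than Python's built-in `hash()`, because the built-in is salted per process for strings and would give unstable shard assignments; the key format and function names are assumptions.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Stable hash-based shard assignment: hash(key) % N."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys: list[str], old_n: int, new_n: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_n=4, new_n=5)
```

Going from 4 to 5 shards relocates roughly four out of five keys under plain modulo hashing, which is why schemes such as consistent hashing are often considered when the shard count is expected to change.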
+### Replication Patterns
+
+| Pattern | Consistency | Latency | Use Case |
+|---------|------------|---------|----------|
+| **Synchronous** | Strong | Higher write latency | Financial transactions |
+| **Asynchronous** | Eventual | Low write latency | Read-heavy web apps |
+| **Semi-synchronous** | At-least-one replica confirmed | Moderate | Balance of safety and speed |
+
+---
+
+## Cross-References
+
+- **sql-database-assistant** — query writing, optimization, and debugging for day-to-day SQL work
+- **database-schema-designer** — ERD modeling, normalization analysis, and schema generation
+- **migration-architect** — large-scale migration planning across database engines or major schema overhauls
+- **senior-backend** — application-layer patterns (connection pooling, ORM best practices)
+- **senior-devops** — infrastructure provisioning for database clusters and replicas
+
+---
+
 ## Conclusion

 Effective database design requires balancing multiple competing concerns: performance, scalability, maintainability, and business requirements. This skill provides the tools and knowledge to make informed decisions throughout the database lifecycle, from initial schema design through production optimization and evolution.
--- a/engineering/env-secrets-manager/SKILL.md
+++ b/engineering/env-secrets-manager/SKILL.md
@@ -76,3 +76,185 @@ python3 scripts/env_auditor.py /path/to/repo --json
 2. Keep dev env files local and gitignored.
 3. Enforce detection in CI before merge.
 4. Re-test application paths immediately after credential rotation.
+
+---
+
+## Cloud Secret Store Integration
+
+Production applications should never read secrets from `.env` files or environment variables baked into container images. Use a dedicated secret store instead.
+
+### Provider Comparison
+
+| Provider | Best For | Key Feature |
+|----------|----------|-------------|
+| **HashiCorp Vault** | Multi-cloud / hybrid | Dynamic secrets, policy engine, pluggable backends |
+| **AWS Secrets Manager** | AWS-native workloads | Native Lambda/ECS/EKS integration, automatic RDS rotation |
+| **Azure Key Vault** | Azure-native workloads | Managed HSM, Azure AD RBAC, certificate management |
+| **GCP Secret Manager** | GCP-native workloads | IAM-based access, automatic replication, versioning |
+
+### Selection Guidance
+
+- **Single cloud provider** — use the cloud-native secret manager. It integrates tightly with IAM, reduces operational overhead, and costs less than self-hosting.
+- **Multi-cloud or hybrid** — use HashiCorp Vault. It provides a uniform API across environments and supports dynamic secret generation (database credentials, cloud IAM keys) that expire automatically.
+- **Kubernetes-heavy** — combine External Secrets Operator with any backend above to sync secrets into K8s `Secret` objects without hardcoding.
+
+### Application Access Patterns
+
+1. **SDK/API pull** — application fetches secret at startup or on-demand via provider SDK.
+2. **Sidecar injection** — a sidecar container (e.g., Vault Agent) writes secrets to a shared volume or injects them as environment variables.
+3. **Init container** — a Kubernetes init container fetches secrets before the main container starts.
+4. **CSI driver** — secrets mount as a filesystem volume via the Secrets Store CSI Driver.
+
+> **Cross-reference:** See `engineering/secrets-vault-manager` for production vault infrastructure patterns, HA deployment, and disaster recovery procedures.
+
+---
+
+## Secret Rotation Workflow
+
+Stale secrets are a liability. Rotation ensures that even if a credential leaks, its useful lifetime is bounded.
+
+### Phase 1: Detection
+
+- Track secret creation and expiry dates in your secret store metadata.
+- Set alerts at 30, 14, and 7 days before expiry.
+- Use `scripts/env_auditor.py` to flag secrets with no recorded rotation date.
+
+### Phase 2: Rotation
+
+1. **Generate** a new credential (API key, database password, certificate).
+2. **Deploy** the new credential to all consumers (apps, services, pipelines) in parallel.
+3. **Verify** each consumer can authenticate using the new credential.
+4. **Revoke** the old credential only after all consumers are confirmed healthy.
+5. **Update** metadata with the new rotation timestamp and next rotation date.
+
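The ordering of the rotation steps, in particular revoking only after verification, can be sketched with in-memory stand-ins. `FakeProvider` and `FakeConsumer` are hypothetical classes, not any real secret store API, and step 5 (metadata update) is left out for brevity.

```python
import secrets


class FakeProvider:
    """Hypothetical stand-in for a secret store that tracks valid credentials."""

    def __init__(self, initial: str) -> None:
        self.valid = {initial}

    def generate(self) -> str:
        new = secrets.token_urlsafe(32)
        self.valid.add(new)
        return new

    def revoke(self, credential: str) -> None:
        self.valid.discard(credential)


class FakeConsumer:
    """Hypothetical app or pipeline holding one credential."""

    def __init__(self, provider: FakeProvider, credential: str) -> None:
        self.provider = provider
        self.credential = credential

    def can_authenticate(self) -> bool:
        return self.credential in self.provider.valid


def rotate(provider: FakeProvider, consumers: list, old: str) -> str:
    """Generate, deploy, verify, then revoke; return the new credential."""
    new = provider.generate()  # 1. Generate (old and new are both valid now)
    for consumer in consumers:  # 2. Deploy to every consumer
        consumer.credential = new
    if not all(c.can_authenticate() for c in consumers):  # 3. Verify
        raise RuntimeError("verification failed; old credential left active")
    provider.revoke(old)  # 4. Revoke only after all consumers are healthy
    return new
```

Because both credentials are valid between steps 1 and 4, a failed verification leaves the old credential usable and the rotation can be retried or rolled back.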
+### Phase 3: Automation
+
+- **AWS Secrets Manager** — use built-in Lambda-based rotation for RDS, Redshift, and DocumentDB.
+- **HashiCorp Vault** — configure dynamic secrets with TTLs; credentials are generated on-demand and auto-expire.
+- **Azure Key Vault** — use Event Grid notifications to trigger rotation functions.
+- **GCP Secret Manager** — use Pub/Sub notifications tied to Cloud Functions for rotation logic.
+
+### Emergency Rotation Checklist
+
+When a secret is confirmed leaked:
+
+1. **Immediately revoke** the compromised credential at the provider level.
+2. Generate and deploy a replacement credential to all consumers.
+3. Audit access logs for unauthorized usage during the exposure window.
+4. Scan git history, CI logs, and artifact registries for the leaked value.
+5. File an incident report documenting scope, timeline, and remediation steps.
+6. Review and tighten detection controls to prevent recurrence.
+
+---
+
+## CI/CD Secret Injection
+
+Secrets in CI/CD pipelines require careful handling to avoid exposure in logs, artifacts, or pull request contexts.
+
+### GitHub Actions
+
+- Use **repository secrets** or **environment secrets** via `${{ secrets.SECRET_NAME }}`.
+- Prefer **OIDC federation** (`aws-actions/configure-aws-credentials` with `role-to-assume`) over long-lived access keys.
+- Environment secrets with required reviewers add approval gates for production deployments.
+- GitHub automatically masks secrets in logs, but avoid `echo` or `toJSON()` on secret values.
+
+### GitLab CI
+
+- Store secrets as **CI/CD variables** with the `masked` and `protected` flags enabled.
+- Use **HashiCorp Vault integration** (`secrets:vault`) for dynamic secret injection without storing values in GitLab.
+- Scope variables to specific environments (`production`, `staging`) to enforce least privilege.
+
+### Universal Patterns
+
+- **Never echo or print** secret values in pipeline output, even for debugging.
+- **Use short-lived tokens** (OIDC, STS AssumeRole) instead of static credentials wherever possible.
+- **Restrict PR access** — do not expose secrets to pipelines triggered by forks or untrusted branches.
+- **Rotate CI secrets** on the same schedule as application secrets; pipeline credentials are attack vectors too.
+- **Audit pipeline logs** periodically for accidental secret exposure that masking may have missed.
+
+---
+
+## Pre-Commit Secret Detection
+
+Catching secrets before they reach version control is the most cost-effective defense. Two leading tools cover this space.
+
+### gitleaks
+
+```toml
+# .gitleaks.toml — minimal configuration
+[extend]
+useDefault = true
+
+[[rules]]
+id = "custom-internal-token"
+description = "Internal service token pattern"
+regex = '''INTERNAL_TOKEN_[A-Za-z0-9]{32}'''
+secretGroup = 0
+```
+
+- Install: `brew install gitleaks` or download from GitHub releases.
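+- Pre-commit hook: `gitleaks git --pre-commit --staged` (gitleaks v8.19+; older releases use `gitleaks protect --staged`)
+- Full-repository scan: `gitleaks detect --source . --report-path gitleaks-report.json` (`detect` is deprecated in v8.19+ in favor of `gitleaks git`)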
+- Manage false positives in `.gitleaksignore` (one fingerprint per line).
+
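Before committing a custom rule, it helps to check the regex against known-good and known-bad samples. The sketch below exercises the same pattern with Python's `re` module; gitleaks itself uses Go's regexp engine, but for a simple character class and counted repetition like this one the two should behave the same. The sample strings are made up.

```python
import re

# Same pattern as the custom rule in the .gitleaks.toml example.
INTERNAL_TOKEN = re.compile(r"INTERNAL_TOKEN_[A-Za-z0-9]{32}")

prefix = "export KEY=INTERNAL_TOKEN_"
should_match = prefix + "a1B2" * 8          # 32 alphanumerics after the prefix
too_short = prefix + "a1B2" * 7             # only 28 alphanumerics
has_symbol = prefix + "a1B2" * 7 + "a1-2"   # '-' interrupts the run at 30

matches = [bool(INTERNAL_TOKEN.search(s)) for s in (should_match, too_short, has_symbol)]
```

Keeping a handful of such samples next to the rule makes it cheaper to tighten the pattern later instead of adding broad ignore entries.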
+### detect-secrets
+
+```bash
+# Generate baseline
+detect-secrets scan --all-files > .secrets.baseline
+
+# Pre-commit hook (via pre-commit framework)
+# .pre-commit-config.yaml
+repos:
+  - repo: https://github.com/Yelp/detect-secrets
+    rev: v1.5.0
+    hooks:
+      - id: detect-secrets
+        args: ['--baseline', '.secrets.baseline']
+```
+
+- Supports **custom plugins** for organization-specific patterns.
+- Audit workflow: `detect-secrets audit .secrets.baseline` interactively marks true/false positives.
+
+### False Positive Management
+
+- Maintain `.gitleaksignore` or `.secrets.baseline` in version control so the whole team shares exclusions.
+- Review false positive lists during security audits — patterns may mask real leaks over time.
+- Prefer tightening regex patterns over broadly ignoring files.
+
+---
+
+## Audit Logging
+
+Knowing who accessed which secret and when is critical for incident investigation and compliance.
+
+### Cloud-Native Audit Trails
+
+| Provider | Service | What It Captures |
+|----------|---------|-----------------|
+| **AWS** | CloudTrail | Every `GetSecretValue`, `DescribeSecret`, `RotateSecret` API call |
+| **Azure** | Activity Log + Diagnostic Logs | Key Vault access events, including caller identity and IP |
+| **GCP** | Cloud Audit Logs | Data access logs for Secret Manager with principal and timestamp |
+| **Vault** | Audit Backend | Full request/response logging (file, syslog, or socket backend) |
+
+### Alerting Strategy
+
+- Alert on **access from unknown IP ranges** or service accounts outside the expected set.
+- Alert on **bulk secret reads** (more than N secrets accessed within a time window).
+- Alert on **access outside deployment windows** when no CI/CD pipeline is running.
+- Feed audit logs into your SIEM (Splunk, Datadog, Elastic) for correlation with other security events.
+- Review audit logs quarterly as part of access recertification.
+
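The bulk-read alert (more than N secrets within a time window) can be sketched as a per-principal sliding window. `BulkReadDetector` is a hypothetical class; it assumes audit events arrive in timestamp order and counts distinct secret IDs, which is one reasonable reading of "bulk".

```python
from collections import defaultdict, deque


class BulkReadDetector:
    """Flags a principal that reads more than `max_reads` distinct secrets
    within `window_seconds`."""

    def __init__(self, max_reads: int, window_seconds: float) -> None:
        self.max_reads = max_reads
        self.window = window_seconds
        self._events = defaultdict(deque)  # principal -> deque of (ts, secret_id)

    def record(self, principal: str, secret_id: str, ts: float) -> bool:
        """Record one read event; return True when the threshold is exceeded."""
        events = self._events[principal]
        events.append((ts, secret_id))
        # Drop events that have fallen out of the window.
        while events and events[0][0] <= ts - self.window:
            events.popleft()
        distinct = {sid for _, sid in events}
        return len(distinct) > self.max_reads
```

In practice the same logic would usually live in a SIEM rule; the sketch is mainly useful for agreeing on the exact threshold semantics before writing that rule.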
+---
+
+## Cross-References
+
+This skill covers env hygiene and secret detection. For deeper coverage of related domains, see:
+
+| Skill | Path | Relationship |
+|-------|------|-------------|
+| **Secrets Vault Manager** | `engineering/secrets-vault-manager` | Production vault infrastructure, HA deployment, DR |
+| **Senior SecOps** | `engineering-team/senior-secops` | Security operations perspective, incident response |
+| **CI/CD Pipeline Builder** | `engineering/ci-cd-pipeline-builder` | Pipeline architecture, secret injection patterns |
+| **Infrastructure as Code** | `engineering/infrastructure-as-code` | Terraform/Pulumi secret backend configuration |
+| **Container Orchestration** | `engineering/container-orchestration` | Kubernetes secret mounting, sealed secrets |