From 67e2bfabfa1da1e2bde62a80d38688c2301d5394 Mon Sep 17 00:00:00 2001 From: Reza Rezvani Date: Wed, 25 Mar 2026 13:49:25 +0100 Subject: [PATCH] improve(engineering): enhance tdd-guide, env-secrets-manager, senior-secops, database-designer, senior-devops MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit tdd-guide (164 → 412 lines): - Spec-first workflow, per-language examples (TS/Python/Go) - Bounded autonomy rules, property-based testing, mutation testing env-secrets-manager (78 → 260 lines): - Cloud secret store integration (Vault, AWS SM, Azure KV, GCP SM) - Secret rotation workflow, CI/CD injection, pre-commit detection, audit logging senior-secops (422 → 505 lines): - OWASP Top 10 quick-check, secret scanning tools comparison - Supply chain security (SBOM, Sigstore, SLSA levels) database-designer (66 → 289 lines): - Query patterns (JOINs, CTEs, window functions), migration patterns - Performance optimization (indexing, EXPLAIN, N+1, connection pooling) - Multi-DB decision matrix, sharding & replication senior-devops (275 → 323 lines): - Multi-cloud cross-references (AWS, Azure, GCP architects) - Cloud-agnostic IaC section (Terraform/OpenTofu, Pulumi) Co-Authored-By: Claude Opus 4.6 (1M context) --- engineering-team/senior-devops/SKILL.md | 48 +++++ engineering-team/senior-secops/SKILL.md | 83 ++++++++ engineering-team/tdd-guide/SKILL.md | 248 +++++++++++++++++++++++ engineering/database-designer/SKILL.md | 223 ++++++++++++++++++++ engineering/env-secrets-manager/SKILL.md | 182 +++++++++++++++++ 5 files changed, 784 insertions(+) diff --git a/engineering-team/senior-devops/SKILL.md b/engineering-team/senior-devops/SKILL.md index 67f2e64..3a3887a 100644 --- a/engineering-team/senior-devops/SKILL.md +++ b/engineering-team/senior-devops/SKILL.md @@ -270,6 +270,54 @@ kubectl get pods -n production -l app=myapp curl -sf https://app.example.com/healthz || echo "ROLLBACK FAILED — escalate" ``` +## Multi-Cloud Cross-References + 
+Use these companion skills for cloud-specific deep dives: + +| Skill | Cloud | Use When | +|-------|-------|----------| +| **aws-solution-architect** | AWS | ECS/EKS, Lambda, VPC design, cost optimization | +| **azure-cloud-architect** | Azure | AKS, App Service, Virtual Networks, Azure DevOps | +| **gcp-cloud-architect** | GCP | GKE, Cloud Run, VPC, Cloud Build *(coming soon)* | + +**Multi-cloud vs single-cloud decision:** +- **Single-cloud** (default) — lower operational complexity, deeper managed-service integration, better cost leverage with committed-use discounts +- **Multi-cloud** — required when mandated by compliance/data residency, acquiring companies on different clouds, or needing best-of-breed services across providers (e.g., AWS for compute + GCP for ML) +- **Hybrid** — on-prem + cloud; use when regulated workloads must stay on-prem while burst/non-sensitive workloads run in the cloud + +> Start single-cloud. Add a second cloud only when there is a concrete business or compliance driver — not for theoretical redundancy. 
+ +--- + +## Cloud-Agnostic IaC + +### Terraform / OpenTofu (Default Choice) + +Terraform (or its open-source fork OpenTofu) is the recommended IaC tool for most teams: +- Single language (HCL) across AWS, Azure, GCP, and 3,000+ providers +- State management with remote backends (S3, GCS, Azure Blob) +- Plan-before-apply workflow prevents drift surprises +- Cross-reference **terraform-patterns** for module structure, state isolation, and CI/CD integration + +### Pulumi (Programming Language IaC) + +Choose Pulumi when the team strongly prefers TypeScript, Python, Go, or C# over HCL: +- Full programming language — loops, conditionals, unit tests native +- Same cloud provider coverage as Terraform +- Easier onboarding for dev teams that resist learning HCL + +### When to Use Cloud-Native IaC + +| Tool | Use When | +|------|----------| +| **CloudFormation** | AWS-only shop; need native AWS support (StackSets, Service Catalog) | +| **Bicep** | Azure-only shop; simpler syntax than ARM templates | +| **Cloud Deployment Manager** | GCP-only; rare — most GCP teams prefer Terraform | + +> **Rule of thumb:** Use Terraform/OpenTofu unless you are 100% committed to a single cloud AND the cloud-native tool offers a feature Terraform cannot replicate (e.g., AWS Service Catalog integration). + +--- + ## Troubleshooting Check the comprehensive troubleshooting section in `references/deployment_strategies.md`. diff --git a/engineering-team/senior-secops/SKILL.md b/engineering-team/senior-secops/SKILL.md index da43793..5e48e38 100644 --- a/engineering-team/senior-secops/SKILL.md +++ b/engineering-team/senior-secops/SKILL.md @@ -413,6 +413,89 @@ app.use((req, res, next) => { --- +## OWASP Top 10 Quick-Check + +Rapid 15-minute assessment — run through each category and note pass/fail. For deep-dive testing, hand off to the **security-pen-testing** skill. 
+ +| # | Category | One-Line Check | +|---|----------|----------------| +| A01 | Broken Access Control | Verify role checks on every endpoint; test horizontal privilege escalation | +| A02 | Cryptographic Failures | Confirm TLS 1.2+ everywhere; no secrets in logs or source | +| A03 | Injection | Run parameterized query audit; check ORM raw-query usage | +| A04 | Insecure Design | Review threat model exists for critical flows | +| A05 | Security Misconfiguration | Check default credentials removed; error pages generic | +| A06 | Vulnerable Components | Run `vulnerability_assessor.py`; zero critical/high CVEs | +| A07 | Auth Failures | Verify MFA on admin; brute-force protection active | +| A08 | Software & Data Integrity | Confirm CI/CD pipeline signs artifacts; no unsigned deps | +| A09 | Logging & Monitoring | Validate audit logs capture auth events; alerts configured | +| A10 | SSRF | Test internal URL filters; block metadata endpoints (169.254.169.254) | + +> **Deep dive needed?** Hand off to `security-pen-testing` for full OWASP Testing Guide coverage. + +--- + +## Secret Scanning Tools + +Choose the right scanner for each stage of your workflow: + +| Tool | Best For | Language | Pre-commit | CI/CD | Custom Rules | +|------|----------|----------|:----------:|:-----:|:------------:| +| **gitleaks** | CI pipelines, full-repo scans | Go | Yes | Yes | TOML regexes | +| **detect-secrets** | Pre-commit hooks, incremental | Python | Yes | Partial | Plugin-based | +| **truffleHog** | Deep history scans, entropy | Go | No | Yes | Regex + entropy | + +**Recommended setup:** Use `detect-secrets` as a pre-commit hook (catches secrets before they enter history) and `gitleaks` in CI (catches anything that slips through). 
+
+```yaml
+# detect-secrets pre-commit hook (.pre-commit-config.yaml)
+repos:
+  - repo: https://github.com/Yelp/detect-secrets
+    rev: v1.5.0
+    hooks:
+      - id: detect-secrets
+        args: ['--baseline', '.secrets.baseline']
+
+# gitleaks in GitHub Actions (one step inside a job's `steps:` list)
+- name: gitleaks
+  uses: gitleaks/gitleaks-action@v2
+  env:
+    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+    GITLEAKS_LICENSE: ${{ secrets.GITLEAKS_LICENSE }}  # only needed for organization accounts
+```
+
+---
+
+## Supply Chain Security
+
+Protect against dependency and artifact tampering with SBOM generation, artifact signing, and SLSA compliance.
+
+**SBOM Generation:**
+- **syft** — generates SBOMs from container images or source dirs (SPDX, CycloneDX formats)
+- **cyclonedx-cli** — CycloneDX-native tooling; merge multiple SBOMs for mono-repos
+
+```bash
+# Generate SBOM from container image
+# (older syft releases spell this `syft packages <image>`, newer ones `syft scan <image>`)
+syft ghcr.io/org/app:latest -o cyclonedx-json > sbom.json
+```
+
+**Artifact Signing (Sigstore/cosign):**
+```bash
+# Sign a container image (keyless via OIDC); prefer signing by digest over a mutable tag
+cosign sign ghcr.io/org/app:latest
+# Verify signature; the identity must match the signer's certificate
+# (for GitHub Actions keyless signing this is the workflow URL, not an email address)
+cosign verify ghcr.io/org/app:latest \
+  --certificate-identity=ci@org.com \
+  --certificate-oidc-issuer=https://token.actions.githubusercontent.com
+```
+
+**SLSA Levels Overview (v0.1 numbering):**
+| Level | Requirement | What It Proves |
+|-------|-------------|----------------|
+| 1 | Build process documented | Provenance exists |
+| 2 | Hosted build service, signed provenance | Tamper-resistant provenance |
+| 3 | Hardened build platform, non-falsifiable provenance | Tamper-proof build |
+| 4 | Two-party review, hermetic builds | Maximum supply-chain assurance |
+
+> **Note:** SLSA v1.0 reorganized these into a Build track with levels L1 to L3; the level 4 requirements above are not part of v1.0.
+
+> **Cross-references:** `security-pen-testing` (vulnerability exploitation testing), `dependency-auditor` (license and CVE audit for dependencies).
+ +--- + ## Reference Documentation | Document | Description | diff --git a/engineering-team/tdd-guide/SKILL.md b/engineering-team/tdd-guide/SKILL.md index 0987a7a..ea55066 100644 --- a/engineering-team/tdd-guide/SKILL.md +++ b/engineering-team/tdd-guide/SKILL.md @@ -148,6 +148,254 @@ Additional scripts: `framework_adapter.py` (convert between frameworks), `metric --- +## Spec-First Workflow + +TDD is most effective when driven by a written spec. The flow: + +1. **Write or receive a spec** — stored in `specs/.md` +2. **Extract acceptance criteria** — each criterion becomes one or more test cases +3. **Write failing tests (RED)** — one test per acceptance criterion +4. **Implement minimal code (GREEN)** — satisfy each test in order +5. **Refactor** — clean up while all tests stay green + +### Spec Directory Convention + +``` +project/ +├── specs/ +│ ├── user-auth.md # Feature spec with acceptance criteria +│ ├── payment-processing.md +│ └── notification-system.md +├── tests/ +│ ├── test_user_auth.py # Tests derived from specs/user-auth.md +│ ├── test_payments.py +│ └── test_notifications.py +└── src/ +``` + +### Extracting Tests from Specs + +Each acceptance criterion in a spec maps to at least one test: + +| Spec Criterion | Test Case | +|---------------|-----------| +| "User can log in with valid credentials" | `test_login_valid_credentials_returns_token` | +| "Invalid password returns 401" | `test_login_invalid_password_returns_401` | +| "Account locks after 5 failed attempts" | `test_login_locks_after_five_failures` | + +**Tip:** Number your acceptance criteria in the spec. Reference the number in the test docstring for traceability (`# AC-3: Account locks after 5 failed attempts`). + +> **Cross-reference:** See `engineering/spec-driven-workflow` for the full spec methodology, including spec templates and review checklists. 
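The criterion-to-test mapping in the Spec-First Workflow table can be sketched in pytest style as follows. This is only an illustration: the `login` function, its `(status, token)` return shape, the `423` lock status, and the in-memory `USERS` store are hypothetical stand-ins, not part of this skill or of any real spec.

```python
"""Tests derived from specs/user-auth.md, against a hypothetical in-memory login."""
from __future__ import annotations

MAX_FAILURES = 5
USERS = {"alice": "correct-horse"}  # hypothetical credential store
_failures: dict[str, int] = {}      # username -> consecutive failed attempts


def login(username: str, password: str) -> tuple[int, str | None]:
    """Return (status_code, token); 423 is used here to signal a locked account."""
    if _failures.get(username, 0) >= MAX_FAILURES:
        return 423, None
    if USERS.get(username) != password:
        _failures[username] = _failures.get(username, 0) + 1
        return 401, None
    _failures[username] = 0
    return 200, f"token-for-{username}"


def test_login_valid_credentials_returns_token():
    """AC-1: User can log in with valid credentials."""
    _failures.clear()
    status, token = login("alice", "correct-horse")
    assert status == 200 and token


def test_login_invalid_password_returns_401():
    """AC-2: Invalid password returns 401."""
    _failures.clear()
    assert login("alice", "wrong") == (401, None)


def test_login_locks_after_five_failures():
    """AC-3: Account locks after 5 failed attempts."""
    _failures.clear()
    for _ in range(MAX_FAILURES):
        login("alice", "wrong")
    # Even the correct password is rejected once the account is locked.
    assert login("alice", "correct-horse") == (423, None)
```

Keeping the `AC-n` tag in each docstring means a failing test points straight back to the numbered criterion in the spec.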
+
+---
+
+## Red-Green-Refactor Examples Per Language
+
+### TypeScript / Jest
+
+```typescript
+// test/cart.test.ts
+import { Cart } from "../src/cart"; // hypothetical path; adjust to your project layout
+
+describe("Cart", () => {
+  describe("addItem", () => {
+    it("should add a new item to an empty cart", () => {
+      const cart = new Cart();
+      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
+
+      expect(cart.items).toHaveLength(1);
+      expect(cart.items[0].id).toBe("sku-1");
+    });
+
+    it("should increment quantity when adding an existing item", () => {
+      const cart = new Cart();
+      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
+      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 2 });
+
+      expect(cart.items).toHaveLength(1);
+      expect(cart.items[0].qty).toBe(3);
+    });
+
+    it("should throw when quantity is zero or negative", () => {
+      const cart = new Cart();
+      expect(() =>
+        cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 0 })
+      ).toThrow("Quantity must be positive");
+    });
+  });
+});
+```
+
+### Python / Pytest (Advanced Patterns)
+
+```python
+# tests/conftest.py — shared fixtures
+import pytest
+from app.db import create_engine, Session
+
+@pytest.fixture(scope="session")
+def db_engine():
+    engine = create_engine("sqlite:///:memory:")
+    yield engine
+    engine.dispose()
+
+@pytest.fixture
+def db_session(db_engine):
+    session = Session(bind=db_engine)
+    yield session
+    session.rollback()
+    session.close()
+
+# tests/test_pricing.py — parametrize for multiple cases
+import pytest
+from app.pricing import calculate_discount
+
+@pytest.mark.parametrize("subtotal, expected_discount", [
+    (50.0, 0.0),     # Below threshold — no discount
+    (100.0, 5.0),    # 5% tier
+    (250.0, 25.0),   # 10% tier
+    (500.0, 75.0),   # 15% tier
+])
+def test_calculate_discount(subtotal, expected_discount):
+    assert calculate_discount(subtotal) == pytest.approx(expected_discount)
+```
+
+### Go — Table-Driven Tests
+
+```go
+// cart_test.go
+package cart
+
+import (
+	"math"
+	"testing"
+)
+
+func TestApplyDiscount(t *testing.T) {
+	tests := []struct {
+		name     string
+		subtotal float64
+		want     float64
+	}{
+		{"no discount below threshold", 50.0, 0.0},
+		{"5 percent tier", 100.0, 5.0},
+		{"10 percent tier", 250.0, 25.0},
+		{"15 percent tier", 500.0, 75.0},
+		{"zero subtotal", 0.0, 0.0},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			got := ApplyDiscount(tt.subtotal)
+			// Compare with a small tolerance; exact float equality is brittle.
+			if math.Abs(got-tt.want) > 1e-9 {
+				t.Errorf("ApplyDiscount(%v) = %v, want %v", tt.subtotal, got, tt.want)
+			}
+		})
+	}
+}
+```
+
+---
+
+## Bounded Autonomy Rules
+
+When generating tests autonomously, follow these rules to decide when to stop and ask the user:
+
+### Stop and Ask When
+
+- **Ambiguous requirements** — the spec or user story has conflicting or unclear acceptance criteria
+- **Missing edge cases** — you cannot determine boundary values without domain knowledge (e.g., max allowed transaction amount)
+- **Test count exceeds 50** — large test suites need human review before committing; present a summary and ask which areas to prioritize
+- **External dependencies unclear** — the feature relies on third-party APIs or services with undocumented behavior
+- **Security-sensitive logic** — authentication, authorization, encryption, or payment flows require human sign-off on test scenarios
+
+### Continue Autonomously When
+
+- **Clear spec with numbered acceptance criteria** — each criterion maps directly to tests
+- **Straightforward CRUD operations** — create, read, update, delete with well-defined models
+- **Well-defined API contracts** — OpenAPI spec or typed interfaces available
+- **Pure functions** — deterministic input/output with no side effects
+- **Existing test patterns** — the codebase already has similar tests to follow
+
+---
+
+## Property-Based Testing
+
+Property-based testing generates random inputs to verify invariants instead of relying on hand-picked examples. Use it when the input space is large and the expected behavior can be described as a property.
+
+### Python — Hypothesis
+
+```python
+from hypothesis import given, strategies as st
+from app.serializers import serialize, deserialize
+
+@given(st.text())
+def test_roundtrip_serialization(data):
+    """Serialization followed by deserialization returns the original."""
+    assert deserialize(serialize(data)) == data
+
+@given(st.integers(), st.integers())
+def test_addition_is_commutative(a, b):
+    assert a + b == b + a
+```
+
+### TypeScript — fast-check
+
+```typescript
+import fc from "fast-check";
+import { encode, decode } from "./codec";
+
+test("encode/decode roundtrip", () => {
+  fc.assert(
+    fc.property(fc.string(), (input) => {
+      expect(decode(encode(input))).toBe(input);
+    })
+  );
+});
+```
+
+### When to Use Property-Based Over Example-Based
+
+| Use Property-Based For | Example Property |
+|-------------------|---------|
+| Data transformations | Serialize/deserialize roundtrips |
+| Mathematical properties | Commutativity, associativity, idempotency |
+| Encoding/decoding | Base64, URL encoding, compression |
+| Sorting and filtering | Sorted output is ordered and keeps its length; filtered output is a subset of the input |
+| Parser correctness | Valid input always parses without error |
+
+---
+
+## Mutation Testing
+
+Mutation testing modifies your production code (creates "mutants") and checks whether your tests catch the changes. If a mutant survives (tests still pass), your tests have a gap that coverage alone cannot reveal.
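To make the surviving-mutant idea concrete, here is a hand-rolled sketch that uses no mutation framework. `is_adult`, its mutant, and both suite helpers are hypothetical names invented for this illustration; real tools generate and run the mutants for you.

```python
def is_adult(age: int) -> bool:
    """Original production code."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: `>=` changed to `>`, a typical boundary mutation."""
    return age > 18


def weak_suite(fn) -> bool:
    """Executes every line of fn but never probes the boundary value."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary cases around age == 18."""
    return weak_suite(fn) and fn(18) is True and fn(17) is False


def mutant_survives(suite) -> bool:
    """A mutant survives when the suite passes on the original and on the mutant."""
    return suite(is_adult) and suite(is_adult_mutant)


print("weak suite, mutant survives:", mutant_survives(weak_suite))
print("strong suite, mutant survives:", mutant_survives(strong_suite))
```

Both suites reach 100% line coverage of `is_adult`, yet only the one with the boundary assertion kills the mutant.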
+ +### Tools + +| Language | Tool | Command | +|----------|------|---------| +| TypeScript/JavaScript | **Stryker** | `npx stryker run` | +| Python | **mutmut** | `mutmut run --paths-to-mutate=src/` | +| Java | **PIT** | `mvn org.pitest:pitest-maven:mutationCoverage` | + +### Why Mutation Testing Matters + +- **100% line coverage != good tests** — coverage tells you code was executed, not that it was verified +- **Catches weak assertions** — tests that run code but assert nothing meaningful +- **Finds missing boundary tests** — mutants that change `<` to `<=` expose off-by-one gaps +- **Quantifiable quality metric** — mutation score (% mutants killed) is a stronger signal than coverage % + +**Recommendation:** Run mutation testing on critical paths (auth, payments, data processing) even if overall coverage is high. Target 85%+ mutation score on P0 modules. + +--- + +## Cross-References + +| Skill | Relationship | +|-------|-------------| +| `engineering/spec-driven-workflow` | Spec → acceptance criteria → test extraction pipeline | +| `engineering-team/focused-fix` | Phase 5 (Verify) uses TDD to confirm the fix with a regression test | +| `engineering-team/senior-qa` | Broader QA strategy; TDD is one layer in the test pyramid | +| `engineering-team/code-reviewer` | Review generated tests for assertion quality and coverage completeness | +| `engineering-team/senior-fullstack` | Project scaffolders include testing infrastructure compatible with TDD workflows | + +--- + ## Limitations | Scope | Details | diff --git a/engineering/database-designer/SKILL.md b/engineering/database-designer/SKILL.md index 11da21b..9fa36ca 100644 --- a/engineering/database-designer/SKILL.md +++ b/engineering/database-designer/SKILL.md @@ -59,6 +59,229 @@ A comprehensive database design skill that provides expert-level analysis, optim 4. **Validate inputs**: Prevent SQL injection attacks 5. 
**Regular security updates**: Keep database software current

+## Query Generation Patterns
+
+### SELECT with JOINs
+
+```sql
+-- INNER JOIN: only matching rows
+SELECT o.id, c.name, o.total
+FROM orders o
+INNER JOIN customers c ON c.id = o.customer_id;
+
+-- LEFT JOIN: all left rows, NULLs for non-matches
+-- Group by the key as well as the name so customers sharing a name are not merged
+SELECT c.name, COUNT(o.id) AS order_count
+FROM customers c
+LEFT JOIN orders o ON o.customer_id = c.id
+GROUP BY c.id, c.name;
+
+-- Self-join: hierarchical data (employees/managers)
+SELECT e.name AS employee, m.name AS manager
+FROM employees e
+LEFT JOIN employees m ON m.id = e.manager_id;
+```
+
+### Common Table Expressions (CTEs)
+
+```sql
+-- Recursive CTE for org chart
+WITH RECURSIVE org AS (
+  SELECT id, name, manager_id, 1 AS depth
+  FROM employees WHERE manager_id IS NULL
+  UNION ALL
+  SELECT e.id, e.name, e.manager_id, o.depth + 1
+  FROM employees e INNER JOIN org o ON o.id = e.manager_id
+)
+SELECT * FROM org ORDER BY depth, name;
+```
+
+### Window Functions
+
+```sql
+-- ROW_NUMBER for pagination / dedup
+SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
+FROM orders;
+
+-- RANK leaves gaps after ties, DENSE_RANK does not
+-- (aliases avoid the bare word `rank`, which is reserved in MySQL 8)
+SELECT name, score,
+  RANK() OVER (ORDER BY score DESC) AS score_rank,
+  DENSE_RANK() OVER (ORDER BY score DESC) AS score_dense_rank
+FROM leaderboard;
+
+-- LAG/LEAD for comparing adjacent rows
+SELECT date, revenue,
+  revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change
+FROM daily_sales;
+```
+
+### Aggregation Patterns
+
+```sql
+-- FILTER clause (PostgreSQL) for conditional aggregation
+SELECT
+  COUNT(*) AS total,
+  COUNT(*) FILTER (WHERE status = 'active') AS active,
+  AVG(amount) FILTER (WHERE amount > 0) AS avg_positive
+FROM accounts;
+
+-- GROUPING SETS for multi-level rollups
+SELECT region, product, SUM(revenue)
+FROM sales
+GROUP BY GROUPING SETS ((region, product), (region), ());
+```
+
+---
+
+## Migration Patterns
+
+### Up/Down Migration Scripts
+
+Every `up` migration needs a matching `down` script that reverses it. Name files with a timestamp prefix for ordering:
+
+```
+migrations/
+├── 20260101_000001_create_users.up.sql
+├── 20260101_000001_create_users.down.sql
+├── 20260115_000002_add_users_email_index.up.sql
+└── 20260115_000002_add_users_email_index.down.sql
+```
+
+### Zero-Downtime Migrations (Expand/Contract)
+
+Use the expand-contract pattern to avoid locking or breaking running code:
+
+1. **Expand** — add the new column/table (nullable, with default)
+2. **Migrate data** — backfill in batches; dual-write from application
+3. **Transition** — application reads from new column; stop writing to old
+4. **Contract** — drop old column in a follow-up migration
+
+### Data Backfill Strategies
+
+```sql
+-- Batch update to avoid long-running locks
+-- (PostgreSQL syntax; MySQL does not allow LIMIT inside an IN subquery)
+UPDATE users SET email_normalized = LOWER(email)
+WHERE id IN (SELECT id FROM users WHERE email_normalized IS NULL LIMIT 5000);
+-- Repeat in a loop until 0 rows affected
+```
+
+### Rollback Procedures
+
+- Always test the `down.sql` in staging before deploying `up.sql` to production
+- Keep rollback window short — if the contract step has run, rollback requires a new forward migration
+- For irreversible changes (dropping columns with data), take a logical backup first
+
+---
+
+## Performance Optimization
+
+### Indexing Strategies
+
+| Index Type | Use Case | Example |
+|------------|----------|---------|
+| **B-tree** (default) | Equality, range, ORDER BY | `CREATE INDEX idx_users_email ON users(email);` |
+| **GIN** | Full-text search, JSONB, arrays | `CREATE INDEX idx_docs_body ON docs USING gin(to_tsvector('english', body));` |
+| **GiST** | Geometry, range types, nearest-neighbor | `CREATE INDEX idx_locations ON places USING gist(coords);` |
+| **Partial** | Subset of rows (reduce size) | `CREATE INDEX idx_active ON users(email) WHERE active = true;` |
+| **Covering** | Index-only scans | `CREATE INDEX idx_cov ON orders(customer_id) INCLUDE (total, created_at);` |
+
+### EXPLAIN Plan Reading
+
+```sql
+EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;
+```
+
+`ANALYZE` actually executes the statement, so run data-modifying statements inside a transaction and roll back.
+
+Key signals to watch:
+- **Seq Scan** on large tables — missing index
+- **Nested Loop** with high row estimates — consider hash/merge join or add index
+- **Buffers shared read** much higher than **hit** — working set exceeds memory
+
+### N+1 Query Detection
+
+Symptoms: application issues one query per row (e.g., fetching related records in a loop).
+
+Fixes:
+- Use `JOIN` or subquery to fetch in one round-trip
+- ORM eager loading (`select_related` / `includes` / `with`)
+- DataLoader pattern for GraphQL resolvers
+
+### Connection Pooling
+
+| Tool | Protocol | Best For |
+|------|----------|----------|
+| **PgBouncer** | PostgreSQL | Transaction/statement pooling, low overhead |
+| **ProxySQL** | MySQL | Query routing, read/write splitting |
+| **Built-in pool** (HikariCP, SQLAlchemy pool) | Any | Application-level pooling |
+
+**Rule of thumb:** Set pool size to `(2 * CPU cores) + disk spindles`. For cloud SSDs, start with `2 * vCPUs` and tune.
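The pool-size rule of thumb above works out as follows. This is a small sketch, not a tuning tool: `recommended_pool_size` is a hypothetical helper, and the core count is assumed to be that of the database server rather than the application host.

```python
def recommended_pool_size(cpu_cores: int, disk_spindles: int = 0) -> int:
    """Starting pool size: (2 * CPU cores) + disk spindles."""
    if cpu_cores < 1 or disk_spindles < 0:
        raise ValueError("cpu_cores must be >= 1 and disk_spindles >= 0")
    return 2 * cpu_cores + disk_spindles


# 8-core server with 2 spinning disks: 2 * 8 + 2 = 18
print(recommended_pool_size(8, 2))

# 4-vCPU cloud instance on SSD (no spindles): 2 * 4 = 8
print(recommended_pool_size(4))
```

Treat the result as a starting point and adjust it against observed connection wait times and database CPU saturation.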
+
+### Read Replicas and Query Routing
+
+- Route read-only `SELECT` queries to replicas; send writes, and reads that must see their own writes, to the primary
+- Account for replication lag (typically <1s for async; with synchronous replication, replica reads can still be stale unless `synchronous_commit = remote_apply`)
+- Compare `pg_last_wal_replay_lsn()` on the replica with `pg_current_wal_lsn()` on the primary to detect lag before reading critical data
+
+---
+
+## Multi-Database Decision Matrix
+
+| Criteria | PostgreSQL | MySQL | SQLite | SQL Server |
+|----------|-----------|-------|--------|------------|
+| **Best for** | Complex queries, JSONB, extensions | Web apps, read-heavy workloads | Embedded, dev/test, edge | Enterprise .NET stacks |
+| **JSON support** | Excellent (JSONB + GIN) | Good (JSON type) | Basic (JSON functions, no native type) | Good (OPENJSON) |
+| **Replication** | Streaming, logical | Group replication, InnoDB cluster | N/A | Always On AG |
+| **Licensing** | Open source (PostgreSQL License) | Open source (GPL) / commercial | Public domain | Commercial |
+| **Max practical size** | Multi-TB | Multi-TB | ~1 TB (single-writer) | Multi-TB |
+
+**When to choose:**
+- **PostgreSQL** — default choice for new projects; best extensibility and standards compliance
+- **MySQL** — existing MySQL ecosystem; simple read-heavy web applications
+- **SQLite** — mobile apps, CLI tools, unit test databases, IoT/edge
+- **SQL Server** — mandated by enterprise policy; deep .NET/Azure integration
+
+### NoSQL Considerations
+
+| Database | Model | Use When |
+|----------|-------|----------|
+| **MongoDB** | Document | Schema flexibility, rapid prototyping, content management |
+| **Redis** | Key-value / cache | Session store, rate limiting, leaderboards, pub/sub |
+| **DynamoDB** | Key-value / document | Serverless AWS apps, single-digit-ms latency at any scale |
+
+> Use SQL as default. Reach for NoSQL only when the access pattern clearly benefits from it.
+
+---
+
+## Sharding & Replication
+
+### Horizontal vs Vertical Partitioning
+
+- **Vertical partitioning**: Split columns across tables (e.g., separate BLOB columns). Reduces I/O for narrow queries.
+- **Horizontal partitioning (sharding)**: Split rows across databases/servers. Required when a single node cannot hold the dataset or handle the throughput. + +### Sharding Strategies + +| Strategy | How It Works | Pros | Cons | +|----------|-------------|------|------| +| **Hash** | `shard = hash(key) % N` | Even distribution | Resharding is expensive | +| **Range** | Shard by date or ID range | Simple, good for time-series | Hot spots on latest shard | +| **Geographic** | Shard by user region | Data locality, compliance | Cross-region queries are hard | + +### Replication Patterns + +| Pattern | Consistency | Latency | Use Case | +|---------|------------|---------|----------| +| **Synchronous** | Strong | Higher write latency | Financial transactions | +| **Asynchronous** | Eventual | Low write latency | Read-heavy web apps | +| **Semi-synchronous** | At-least-one replica confirmed | Moderate | Balance of safety and speed | + +--- + +## Cross-References + +- **sql-database-assistant** — query writing, optimization, and debugging for day-to-day SQL work +- **database-schema-designer** — ERD modeling, normalization analysis, and schema generation +- **migration-architect** — large-scale migration planning across database engines or major schema overhauls +- **senior-backend** — application-layer patterns (connection pooling, ORM best practices) +- **senior-devops** — infrastructure provisioning for database clusters and replicas + +--- + ## Conclusion Effective database design requires balancing multiple competing concerns: performance, scalability, maintainability, and business requirements. This skill provides the tools and knowledge to make informed decisions throughout the database lifecycle, from initial schema design through production optimization and evolution. 
diff --git a/engineering/env-secrets-manager/SKILL.md b/engineering/env-secrets-manager/SKILL.md index b8b9522..21217a4 100644 --- a/engineering/env-secrets-manager/SKILL.md +++ b/engineering/env-secrets-manager/SKILL.md @@ -76,3 +76,185 @@ python3 scripts/env_auditor.py /path/to/repo --json 2. Keep dev env files local and gitignored. 3. Enforce detection in CI before merge. 4. Re-test application paths immediately after credential rotation. + +--- + +## Cloud Secret Store Integration + +Production applications should never read secrets from `.env` files or environment variables baked into container images. Use a dedicated secret store instead. + +### Provider Comparison + +| Provider | Best For | Key Feature | +|----------|----------|-------------| +| **HashiCorp Vault** | Multi-cloud / hybrid | Dynamic secrets, policy engine, pluggable backends | +| **AWS Secrets Manager** | AWS-native workloads | Native Lambda/ECS/EKS integration, automatic RDS rotation | +| **Azure Key Vault** | Azure-native workloads | Managed HSM, Azure AD RBAC, certificate management | +| **GCP Secret Manager** | GCP-native workloads | IAM-based access, automatic replication, versioning | + +### Selection Guidance + +- **Single cloud provider** — use the cloud-native secret manager. It integrates tightly with IAM, reduces operational overhead, and costs less than self-hosting. +- **Multi-cloud or hybrid** — use HashiCorp Vault. It provides a uniform API across environments and supports dynamic secret generation (database credentials, cloud IAM keys) that expire automatically. +- **Kubernetes-heavy** — combine External Secrets Operator with any backend above to sync secrets into K8s `Secret` objects without hardcoding. + +### Application Access Patterns + +1. **SDK/API pull** — application fetches secret at startup or on-demand via provider SDK. +2. **Sidecar injection** — a sidecar container (e.g., Vault Agent) writes secrets to a shared volume or injects them as environment variables. +3. 
**Init container** — a Kubernetes init container fetches secrets before the main container starts. +4. **CSI driver** — secrets mount as a filesystem volume via the Secrets Store CSI Driver. + +> **Cross-reference:** See `engineering/secrets-vault-manager` for production vault infrastructure patterns, HA deployment, and disaster recovery procedures. + +--- + +## Secret Rotation Workflow + +Stale secrets are a liability. Rotation ensures that even if a credential leaks, its useful lifetime is bounded. + +### Phase 1: Detection + +- Track secret creation and expiry dates in your secret store metadata. +- Set alerts at 30, 14, and 7 days before expiry. +- Use `scripts/env_auditor.py` to flag secrets with no recorded rotation date. + +### Phase 2: Rotation + +1. **Generate** a new credential (API key, database password, certificate). +2. **Deploy** the new credential to all consumers (apps, services, pipelines) in parallel. +3. **Verify** each consumer can authenticate using the new credential. +4. **Revoke** the old credential only after all consumers are confirmed healthy. +5. **Update** metadata with the new rotation timestamp and next rotation date. + +### Phase 3: Automation + +- **AWS Secrets Manager** — use built-in Lambda-based rotation for RDS, Redshift, and DocumentDB. +- **HashiCorp Vault** — configure dynamic secrets with TTLs; credentials are generated on-demand and auto-expire. +- **Azure Key Vault** — use Event Grid notifications to trigger rotation functions. +- **GCP Secret Manager** — use Pub/Sub notifications tied to Cloud Functions for rotation logic. + +### Emergency Rotation Checklist + +When a secret is confirmed leaked: + +1. **Immediately revoke** the compromised credential at the provider level. +2. Generate and deploy a replacement credential to all consumers. +3. Audit access logs for unauthorized usage during the exposure window. +4. Scan git history, CI logs, and artifact registries for the leaked value. +5. 
File an incident report documenting scope, timeline, and remediation steps. +6. Review and tighten detection controls to prevent recurrence. + +--- + +## CI/CD Secret Injection + +Secrets in CI/CD pipelines require careful handling to avoid exposure in logs, artifacts, or pull request contexts. + +### GitHub Actions + +- Use **repository secrets** or **environment secrets** via `${{ secrets.SECRET_NAME }}`. +- Prefer **OIDC federation** (`aws-actions/configure-aws-credentials` with `role-to-assume`) over long-lived access keys. +- Environment secrets with required reviewers add approval gates for production deployments. +- GitHub automatically masks secrets in logs, but avoid `echo` or `toJSON()` on secret values. + +### GitLab CI + +- Store secrets as **CI/CD variables** with the `masked` and `protected` flags enabled. +- Use **HashiCorp Vault integration** (`secrets:vault`) for dynamic secret injection without storing values in GitLab. +- Scope variables to specific environments (`production`, `staging`) to enforce least privilege. + +### Universal Patterns + +- **Never echo or print** secret values in pipeline output, even for debugging. +- **Use short-lived tokens** (OIDC, STS AssumeRole) instead of static credentials wherever possible. +- **Restrict PR access** — do not expose secrets to pipelines triggered by forks or untrusted branches. +- **Rotate CI secrets** on the same schedule as application secrets; pipeline credentials are attack vectors too. +- **Audit pipeline logs** periodically for accidental secret exposure that masking may have missed. + +--- + +## Pre-Commit Secret Detection + +Catching secrets before they reach version control is the most cost-effective defense. Two leading tools cover this space. 
+
+### gitleaks
+
+```toml
+# .gitleaks.toml — minimal configuration
+[extend]
+useDefault = true
+
+[[rules]]
+id = "custom-internal-token"
+description = "Internal service token pattern"
+regex = '''INTERNAL_TOKEN_[A-Za-z0-9]{32}'''
+secretGroup = 0
+```
+
+- Install: `brew install gitleaks` or download from GitHub releases.
+- Pre-commit hook: `gitleaks git --pre-commit --staged` (gitleaks 8.19+; older releases use `gitleaks protect --staged`)
+- Full-repo scan: `gitleaks detect --source . --report-path gitleaks-report.json` (`detect` is deprecated in 8.19+ in favor of `gitleaks git`)
+- Manage false positives in `.gitleaksignore` (one fingerprint per line).
+
+### detect-secrets
+
+```yaml
+# Generate the baseline first (shell):
+#   detect-secrets scan --all-files > .secrets.baseline
+
+# Pre-commit hook via the pre-commit framework (.pre-commit-config.yaml)
+repos:
+  - repo: https://github.com/Yelp/detect-secrets
+    rev: v1.5.0
+    hooks:
+      - id: detect-secrets
+        args: ['--baseline', '.secrets.baseline']
+```
+
+- Supports **custom plugins** for organization-specific patterns.
+- Audit workflow: `detect-secrets audit .secrets.baseline` interactively marks true/false positives.
+
+### False Positive Management
+
+- Maintain `.gitleaksignore` or `.secrets.baseline` in version control so the whole team shares exclusions.
+- Review false positive lists during security audits — patterns may mask real leaks over time.
+- Prefer tightening regex patterns over broadly ignoring files.
+
+---
+
+## Audit Logging
+
+Knowing who accessed which secret and when is critical for incident investigation and compliance.
+ +### Cloud-Native Audit Trails + +| Provider | Service | What It Captures | +|----------|---------|-----------------| +| **AWS** | CloudTrail | Every `GetSecretValue`, `DescribeSecret`, `RotateSecret` API call | +| **Azure** | Activity Log + Diagnostic Logs | Key Vault access events, including caller identity and IP | +| **GCP** | Cloud Audit Logs | Data access logs for Secret Manager with principal and timestamp | +| **Vault** | Audit Backend | Full request/response logging (file, syslog, or socket backend) | + +### Alerting Strategy + +- Alert on **access from unknown IP ranges** or service accounts outside the expected set. +- Alert on **bulk secret reads** (more than N secrets accessed within a time window). +- Alert on **access outside deployment windows** when no CI/CD pipeline is running. +- Feed audit logs into your SIEM (Splunk, Datadog, Elastic) for correlation with other security events. +- Review audit logs quarterly as part of access recertification. + +--- + +## Cross-References + +This skill covers env hygiene and secret detection. For deeper coverage of related domains, see: + +| Skill | Path | Relationship | +|-------|------|-------------| +| **Secrets Vault Manager** | `engineering/secrets-vault-manager` | Production vault infrastructure, HA deployment, DR | +| **Senior SecOps** | `engineering/senior-secops` | Security operations perspective, incident response | +| **CI/CD Pipeline Builder** | `engineering/ci-cd-pipeline-builder` | Pipeline architecture, secret injection patterns | +| **Infrastructure as Code** | `engineering/infrastructure-as-code` | Terraform/Pulumi secret backend configuration | +| **Container Orchestration** | `engineering/container-orchestration` | Kubernetes secret mounting, sealed secrets |
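Returning to the Alerting Strategy section, the bulk-read rule (more than N secrets accessed within a time window) can be sketched as a per-principal sliding window. `BulkReadDetector`, its parameters, and the `svc-deploy` principal are hypothetical; a real deployment would feed this from the audit log stream of your secret store, and timestamps are assumed to arrive in non-decreasing order per principal.

```python
from collections import defaultdict, deque


class BulkReadDetector:
    """Flags a principal that performs more than `max_reads` secret reads within `window_seconds`."""

    def __init__(self, max_reads: int, window_seconds: float) -> None:
        self.max_reads = max_reads
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # principal -> timestamps of recent reads

    def record(self, principal: str, timestamp: float) -> bool:
        """Record one secret read; return True if the principal is over the threshold."""
        events = self._events[principal]
        events.append(timestamp)
        # Drop reads that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_reads


detector = BulkReadDetector(max_reads=3, window_seconds=60)
# Four reads within 30 seconds trip the alert; a read at t=100 starts a fresh window.
alerts = [detector.record("svc-deploy", t) for t in (0, 10, 20, 30, 100)]
print(alerts)
```

The threshold and window are deliberately constructor arguments, since a sensible N depends on how many secrets a normal deployment reads.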