Merge pull request #408 from alirezarezvani/feature/sprint-improvements

improve(engineering): enhance 5 existing skills — tdd-guide, env-secrets-manager, senior-secops, database-designer, senior-devops
This commit is contained in:
Alireza Rezvani
2026-03-25 14:22:04 +01:00
committed by GitHub
5 changed files with 784 additions and 0 deletions

View File

@@ -270,6 +270,54 @@ kubectl get pods -n production -l app=myapp
curl -sf https://app.example.com/healthz || echo "ROLLBACK FAILED — escalate"
```
## Multi-Cloud Cross-References
Use these companion skills for cloud-specific deep dives:
| Skill | Cloud | Use When |
|-------|-------|----------|
| **aws-solution-architect** | AWS | ECS/EKS, Lambda, VPC design, cost optimization |
| **azure-cloud-architect** | Azure | AKS, App Service, Virtual Networks, Azure DevOps |
| **gcp-cloud-architect** | GCP | GKE, Cloud Run, VPC, Cloud Build *(coming soon)* |
**Multi-cloud vs single-cloud decision:**
- **Single-cloud** (default) — lower operational complexity, deeper managed-service integration, better cost leverage with committed-use discounts
- **Multi-cloud** — required when mandated by compliance/data residency, acquiring companies on different clouds, or needing best-of-breed services across providers (e.g., AWS for compute + GCP for ML)
- **Hybrid** — on-prem + cloud; use when regulated workloads must stay on-prem while burst/non-sensitive workloads run in the cloud
> Start single-cloud. Add a second cloud only when there is a concrete business or compliance driver — not for theoretical redundancy.
---
## Cloud-Agnostic IaC
### Terraform / OpenTofu (Default Choice)
Terraform (or its open-source fork OpenTofu) is the recommended IaC tool for most teams:
- Single language (HCL) across AWS, Azure, GCP, and 3,000+ providers
- State management with remote backends (S3, GCS, Azure Blob)
- Plan-before-apply workflow prevents drift surprises
- Cross-reference **terraform-patterns** for module structure, state isolation, and CI/CD integration
### Pulumi (Programming Language IaC)
Choose Pulumi when the team strongly prefers TypeScript, Python, Go, or C# over HCL:
- Full programming language — loops, conditionals, unit tests native
- Same cloud provider coverage as Terraform
- Easier onboarding for dev teams that resist learning HCL
### When to Use Cloud-Native IaC
| Tool | Use When |
|------|----------|
| **CloudFormation** | AWS-only shop; need native AWS support (StackSets, Service Catalog) |
| **Bicep** | Azure-only shop; simpler syntax than ARM templates |
| **Cloud Deployment Manager** | GCP-only; rare — most GCP teams prefer Terraform |
> **Rule of thumb:** Use Terraform/OpenTofu unless you are 100% committed to a single cloud AND the cloud-native tool offers a feature Terraform cannot replicate (e.g., AWS Service Catalog integration).
---
## Troubleshooting
Check the comprehensive troubleshooting section in `references/deployment_strategies.md`.

View File

@@ -413,6 +413,89 @@ app.use((req, res, next) => {
---
## OWASP Top 10 Quick-Check
Rapid 15-minute assessment — run through each category and note pass/fail. For deep-dive testing, hand off to the **security-pen-testing** skill.
| # | Category | One-Line Check |
|---|----------|----------------|
| A01 | Broken Access Control | Verify role checks on every endpoint; test horizontal privilege escalation |
| A02 | Cryptographic Failures | Confirm TLS 1.2+ everywhere; no secrets in logs or source |
| A03 | Injection | Run parameterized query audit; check ORM raw-query usage |
| A04 | Insecure Design | Review threat model exists for critical flows |
| A05 | Security Misconfiguration | Check default credentials removed; error pages generic |
| A06 | Vulnerable Components | Run `vulnerability_assessor.py`; zero critical/high CVEs |
| A07 | Auth Failures | Verify MFA on admin; brute-force protection active |
| A08 | Software & Data Integrity | Confirm CI/CD pipeline signs artifacts; no unsigned deps |
| A09 | Logging & Monitoring | Validate audit logs capture auth events; alerts configured |
| A10 | SSRF | Test internal URL filters; block metadata endpoints (169.254.169.254) |
> **Deep dive needed?** Hand off to `security-pen-testing` for full OWASP Testing Guide coverage.
---
## Secret Scanning Tools
Choose the right scanner for each stage of your workflow:
| Tool | Best For | Language | Pre-commit | CI/CD | Custom Rules |
|------|----------|----------|:----------:|:-----:|:------------:|
| **gitleaks** | CI pipelines, full-repo scans | Go | Yes | Yes | TOML regexes |
| **detect-secrets** | Pre-commit hooks, incremental | Python | Yes | Partial | Plugin-based |
| **truffleHog** | Deep history scans, entropy | Go | No | Yes | Regex + entropy |
**Recommended setup:** Use `detect-secrets` as a pre-commit hook (catches secrets before they enter history) and `gitleaks` in CI (catches anything that slips through).
```bash
# detect-secrets pre-commit hook (.pre-commit-config.yaml)
- repo: https://github.com/Yelp/detect-secrets
rev: v1.4.0
hooks:
- id: detect-secrets
args: ['--baseline', '.secrets.baseline']
# gitleaks in GitHub Actions
- name: gitleaks
uses: gitleaks/gitleaks-action@v2
env:
GITLEAKS_LICENSE: ${{ secrets.GITLEAKS_LICENSE }}
```
---
## Supply Chain Security
Protect against dependency and artifact tampering with SBOM generation, artifact signing, and SLSA compliance.
**SBOM Generation:**
- **syft** — generates SBOMs from container images or source dirs (SPDX, CycloneDX formats)
- **cyclonedx-cli** — CycloneDX-native tooling; merge multiple SBOMs for mono-repos
```bash
# Generate SBOM from container image
syft packages ghcr.io/org/app:latest -o cyclonedx-json > sbom.json
```
**Artifact Signing (Sigstore/cosign):**
```bash
# Sign a container image (keyless via OIDC)
cosign sign ghcr.io/org/app:latest
# Verify signature
cosign verify ghcr.io/org/app:latest --certificate-identity=ci@org.com --certificate-oidc-issuer=https://token.actions.githubusercontent.com
```
**SLSA Levels Overview:**
| Level | Requirement | What It Proves |
|-------|-------------|----------------|
| 1 | Build process documented | Provenance exists |
| 2 | Hosted build service, signed provenance | Tamper-resistant provenance |
| 3 | Hardened build platform, non-falsifiable provenance | Tamper-proof build |
| 4 | Two-party review, hermetic builds | Maximum supply-chain assurance |
> **Cross-references:** `security-pen-testing` (vulnerability exploitation testing), `dependency-auditor` (license and CVE audit for dependencies).
---
## Reference Documentation
| Document | Description |

View File

@@ -148,6 +148,254 @@ Additional scripts: `framework_adapter.py` (convert between frameworks), `metric
---
## Spec-First Workflow
TDD is most effective when driven by a written spec. The flow:
1. **Write or receive a spec** — stored in `specs/<feature>.md`
2. **Extract acceptance criteria** — each criterion becomes one or more test cases
3. **Write failing tests (RED)** — one test per acceptance criterion
4. **Implement minimal code (GREEN)** — satisfy each test in order
5. **Refactor** — clean up while all tests stay green
### Spec Directory Convention
```
project/
├── specs/
│ ├── user-auth.md # Feature spec with acceptance criteria
│ ├── payment-processing.md
│ └── notification-system.md
├── tests/
│ ├── test_user_auth.py # Tests derived from specs/user-auth.md
│ ├── test_payments.py
│ └── test_notifications.py
└── src/
```
### Extracting Tests from Specs
Each acceptance criterion in a spec maps to at least one test:
| Spec Criterion | Test Case |
|---------------|-----------|
| "User can log in with valid credentials" | `test_login_valid_credentials_returns_token` |
| "Invalid password returns 401" | `test_login_invalid_password_returns_401` |
| "Account locks after 5 failed attempts" | `test_login_locks_after_five_failures` |
**Tip:** Number your acceptance criteria in the spec. Reference the number in the test docstring for traceability (`# AC-3: Account locks after 5 failed attempts`).
> **Cross-reference:** See `engineering/spec-driven-workflow` for the full spec methodology, including spec templates and review checklists.
---
## Red-Green-Refactor Examples Per Language
### TypeScript / Jest
```typescript
// test/cart.test.ts
describe("Cart", () => {
describe("addItem", () => {
it("should add a new item to an empty cart", () => {
const cart = new Cart();
cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
expect(cart.items).toHaveLength(1);
expect(cart.items[0].id).toBe("sku-1");
});
it("should increment quantity when adding an existing item", () => {
const cart = new Cart();
cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 2 });
expect(cart.items).toHaveLength(1);
expect(cart.items[0].qty).toBe(3);
});
it("should throw when quantity is zero or negative", () => {
const cart = new Cart();
expect(() =>
cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 0 })
).toThrow("Quantity must be positive");
});
});
});
```
### Python / Pytest (Advanced Patterns)
```python
# tests/conftest.py — shared fixtures
import pytest
from app.db import create_engine, Session
@pytest.fixture(scope="session")
def db_engine():
engine = create_engine("sqlite:///:memory:")
yield engine
engine.dispose()
@pytest.fixture
def db_session(db_engine):
session = Session(bind=db_engine)
yield session
session.rollback()
session.close()
# tests/test_pricing.py — parametrize for multiple cases
import pytest
from app.pricing import calculate_discount
@pytest.mark.parametrize("subtotal, expected_discount", [
(50.0, 0.0), # Below threshold — no discount
(100.0, 5.0), # 5% tier
(250.0, 25.0), # 10% tier
(500.0, 75.0), # 15% tier
])
def test_calculate_discount(subtotal, expected_discount):
assert calculate_discount(subtotal) == pytest.approx(expected_discount)
```
### Go — Table-Driven Tests
```go
// cart_test.go
package cart
import "testing"
func TestApplyDiscount(t *testing.T) {
tests := []struct {
name string
subtotal float64
want float64
}{
{"no discount below threshold", 50.0, 0.0},
{"5 percent tier", 100.0, 5.0},
{"10 percent tier", 250.0, 25.0},
{"15 percent tier", 500.0, 75.0},
{"zero subtotal", 0.0, 0.0},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := ApplyDiscount(tt.subtotal)
if got != tt.want {
t.Errorf("ApplyDiscount(%v) = %v, want %v", tt.subtotal, got, tt.want)
}
})
}
}
```
---
## Bounded Autonomy Rules
When generating tests autonomously, follow these rules to decide when to stop and ask the user:
### Stop and Ask When
- **Ambiguous requirements** — the spec or user story has conflicting or unclear acceptance criteria
- **Missing edge cases** — you cannot determine boundary values without domain knowledge (e.g., max allowed transaction amount)
- **Test count exceeds 50** — large test suites need human review before committing; present a summary and ask which areas to prioritize
- **External dependencies unclear** — the feature relies on third-party APIs or services with undocumented behavior
- **Security-sensitive logic** — authentication, authorization, encryption, or payment flows require human sign-off on test scenarios
### Continue Autonomously When
- **Clear spec with numbered acceptance criteria** — each criterion maps directly to tests
- **Straightforward CRUD operations** — create, read, update, delete with well-defined models
- **Well-defined API contracts** — OpenAPI spec or typed interfaces available
- **Pure functions** — deterministic input/output with no side effects
- **Existing test patterns** — the codebase already has similar tests to follow
---
## Property-Based Testing
Property-based testing generates random inputs to verify invariants instead of relying on hand-picked examples. Use it when the input space is large and the expected behavior can be described as a property.
### Python — Hypothesis
```python
from hypothesis import given, strategies as st
from app.serializers import serialize, deserialize
@given(st.text())
def test_roundtrip_serialization(data):
"""Serialization followed by deserialization returns the original."""
assert deserialize(serialize(data)) == data
@given(st.integers(), st.integers())
def test_addition_is_commutative(a, b):
assert a + b == b + a
```
### TypeScript — fast-check
```typescript
import fc from "fast-check";
import { encode, decode } from "./codec";
test("encode/decode roundtrip", () => {
fc.assert(
fc.property(fc.string(), (input) => {
expect(decode(encode(input))).toBe(input);
})
);
});
```
### When to Use Property-Based Over Example-Based
| Use Property-Based | Example |
|-------------------|---------|
| Data transformations | Serialize/deserialize roundtrips |
| Mathematical properties | Commutativity, associativity, idempotency |
| Encoding/decoding | Base64, URL encoding, compression |
| Sorting and filtering | Output is sorted, length preserved |
| Parser correctness | Valid input always parses without error |
---
## Mutation Testing
Mutation testing modifies your production code (creates "mutants") and checks whether your tests catch the changes. If a mutant survives (tests still pass), your tests have a gap that coverage alone cannot reveal.
### Tools
| Language | Tool | Command |
|----------|------|---------|
| TypeScript/JavaScript | **Stryker** | `npx stryker run` |
| Python | **mutmut** | `mutmut run --paths-to-mutate=src/` |
| Java | **PIT** | `mvn org.pitest:pitest-maven:mutationCoverage` |
### Why Mutation Testing Matters
- **100% line coverage != good tests** — coverage tells you code was executed, not that it was verified
- **Catches weak assertions** — tests that run code but assert nothing meaningful
- **Finds missing boundary tests** — mutants that change `<` to `<=` expose off-by-one gaps
- **Quantifiable quality metric** — mutation score (% mutants killed) is a stronger signal than coverage %
**Recommendation:** Run mutation testing on critical paths (auth, payments, data processing) even if overall coverage is high. Target 85%+ mutation score on P0 modules.
---
## Cross-References
| Skill | Relationship |
|-------|-------------|
| `engineering/spec-driven-workflow` | Spec → acceptance criteria → test extraction pipeline |
| `engineering-team/focused-fix` | Phase 5 (Verify) uses TDD to confirm the fix with a regression test |
| `engineering-team/senior-qa` | Broader QA strategy; TDD is one layer in the test pyramid |
| `engineering-team/code-reviewer` | Review generated tests for assertion quality and coverage completeness |
| `engineering-team/senior-fullstack` | Project scaffolders include testing infrastructure compatible with TDD workflows |
---
## Limitations
| Scope | Details |

View File

@@ -59,6 +59,229 @@ A comprehensive database design skill that provides expert-level analysis, optim
4. **Validate inputs**: Prevent SQL injection attacks
5. **Regular security updates**: Keep database software current
## Query Generation Patterns
### SELECT with JOINs
```sql
-- INNER JOIN: only matching rows
SELECT o.id, c.name, o.total
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id;
-- LEFT JOIN: all left rows, NULLs for non-matches
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
-- Self-join: hierarchical data (employees/managers)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.id = e.manager_id;
```
### Common Table Expressions (CTEs)
```sql
-- Recursive CTE for org chart
WITH RECURSIVE org AS (
SELECT id, name, manager_id, 1 AS depth
FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.id, e.name, e.manager_id, o.depth + 1
FROM employees e INNER JOIN org o ON o.id = e.manager_id
)
SELECT * FROM org ORDER BY depth, name;
```
### Window Functions
```sql
-- ROW_NUMBER for pagination / dedup
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
FROM orders;
-- RANK with gaps, DENSE_RANK without gaps
SELECT name, score, RANK() OVER (ORDER BY score DESC) AS rank FROM leaderboard;
-- LAG/LEAD for comparing adjacent rows
SELECT date, revenue,
revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change
FROM daily_sales;
```
### Aggregation Patterns
```sql
-- FILTER clause (PostgreSQL) for conditional aggregation
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = 'active') AS active,
AVG(amount) FILTER (WHERE amount > 0) AS avg_positive
FROM accounts;
-- GROUPING SETS for multi-level rollups
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY GROUPING SETS ((region, product), (region), ());
```
---
## Migration Patterns
### Up/Down Migration Scripts
Every migration must have a reversible counterpart. Name files with a timestamp prefix for ordering:
```
migrations/
├── 20260101_000001_create_users.up.sql
├── 20260101_000001_create_users.down.sql
├── 20260115_000002_add_users_email_index.up.sql
└── 20260115_000002_add_users_email_index.down.sql
```
### Zero-Downtime Migrations (Expand/Contract)
Use the expand-contract pattern to avoid locking or breaking running code:
1. **Expand** — add the new column/table (nullable, with default)
2. **Migrate data** — backfill in batches; dual-write from application
3. **Transition** — application reads from new column; stop writing to old
4. **Contract** — drop old column in a follow-up migration
### Data Backfill Strategies
```sql
-- Batch update to avoid long-running locks
UPDATE users SET email_normalized = LOWER(email)
WHERE id IN (SELECT id FROM users WHERE email_normalized IS NULL LIMIT 5000);
-- Repeat in a loop until 0 rows affected
```
### Rollback Procedures
- Always test the `down.sql` in staging before deploying `up.sql` to production
- Keep rollback window short — if the contract step has run, rollback requires a new forward migration
- For irreversible changes (dropping columns with data), take a logical backup first
---
## Performance Optimization
### Indexing Strategies
| Index Type | Use Case | Example |
|------------|----------|---------|
| **B-tree** (default) | Equality, range, ORDER BY | `CREATE INDEX idx_users_email ON users(email);` |
| **GIN** | Full-text search, JSONB, arrays | `CREATE INDEX idx_docs_body ON docs USING gin(to_tsvector('english', body));` |
| **GiST** | Geometry, range types, nearest-neighbor | `CREATE INDEX idx_locations ON places USING gist(coords);` |
| **Partial** | Subset of rows (reduce size) | `CREATE INDEX idx_active ON users(email) WHERE active = true;` |
| **Covering** | Index-only scans | `CREATE INDEX idx_cov ON orders(customer_id) INCLUDE (total, created_at);` |
### EXPLAIN Plan Reading
```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;
```
Key signals to watch:
- **Seq Scan** on large tables — missing index
- **Nested Loop** with high row estimates — consider hash/merge join or add index
- **Buffers shared read** much higher than **hit** — working set exceeds memory
### N+1 Query Detection
Symptoms: application issues one query per row (e.g., fetching related records in a loop).
Fixes:
- Use `JOIN` or subquery to fetch in one round-trip
- ORM eager loading (`select_related` / `includes` / `with`)
- DataLoader pattern for GraphQL resolvers
### Connection Pooling
| Tool | Protocol | Best For |
|------|----------|----------|
| **PgBouncer** | PostgreSQL | Transaction/statement pooling, low overhead |
| **ProxySQL** | MySQL | Query routing, read/write splitting |
| **Built-in pool** (HikariCP, SQLAlchemy pool) | Any | Application-level pooling |
**Rule of thumb:** Set pool size to `(2 * CPU cores) + disk spindles`. For cloud SSDs, start with `2 * vCPUs` and tune.
### Read Replicas and Query Routing
- Route all `SELECT` queries to replicas; writes to primary
- Account for replication lag (typically <1s for async, 0 for sync)
- Use `pg_last_wal_replay_lsn()` to detect lag before reading critical data
---
## Multi-Database Decision Matrix
| Criteria | PostgreSQL | MySQL | SQLite | SQL Server |
|----------|-----------|-------|--------|------------|
| **Best for** | Complex queries, JSONB, extensions | Web apps, read-heavy workloads | Embedded, dev/test, edge | Enterprise .NET stacks |
| **JSON support** | Excellent (JSONB + GIN) | Good (JSON type) | Minimal | Good (OPENJSON) |
| **Replication** | Streaming, logical | Group replication, InnoDB cluster | N/A | Always On AG |
| **Licensing** | Open source (PostgreSQL License) | Open source (GPL) / commercial | Public domain | Commercial |
| **Max practical size** | Multi-TB | Multi-TB | ~1 TB (single-writer) | Multi-TB |
**When to choose:**
- **PostgreSQL** — default choice for new projects; best extensibility and standards compliance
- **MySQL** — existing MySQL ecosystem; simple read-heavy web applications
- **SQLite** — mobile apps, CLI tools, unit test databases, IoT/edge
- **SQL Server** — mandated by enterprise policy; deep .NET/Azure integration
### NoSQL Considerations
| Database | Model | Use When |
|----------|-------|----------|
| **MongoDB** | Document | Schema flexibility, rapid prototyping, content management |
| **Redis** | Key-value / cache | Session store, rate limiting, leaderboards, pub/sub |
| **DynamoDB** | Wide-column | Serverless AWS apps, single-digit-ms latency at any scale |
> Use SQL as default. Reach for NoSQL only when the access pattern clearly benefits from it.
---
## Sharding & Replication
### Horizontal vs Vertical Partitioning
- **Vertical partitioning**: Split columns across tables (e.g., separate BLOB columns). Reduces I/O for narrow queries.
- **Horizontal partitioning (sharding)**: Split rows across databases/servers. Required when a single node cannot hold the dataset or handle the throughput.
### Sharding Strategies
| Strategy | How It Works | Pros | Cons |
|----------|-------------|------|------|
| **Hash** | `shard = hash(key) % N` | Even distribution | Resharding is expensive |
| **Range** | Shard by date or ID range | Simple, good for time-series | Hot spots on latest shard |
| **Geographic** | Shard by user region | Data locality, compliance | Cross-region queries are hard |
### Replication Patterns
| Pattern | Consistency | Latency | Use Case |
|---------|------------|---------|----------|
| **Synchronous** | Strong | Higher write latency | Financial transactions |
| **Asynchronous** | Eventual | Low write latency | Read-heavy web apps |
| **Semi-synchronous** | At-least-one replica confirmed | Moderate | Balance of safety and speed |
---
## Cross-References
- **sql-database-assistant** — query writing, optimization, and debugging for day-to-day SQL work
- **database-schema-designer** — ERD modeling, normalization analysis, and schema generation
- **migration-architect** — large-scale migration planning across database engines or major schema overhauls
- **senior-backend** — application-layer patterns (connection pooling, ORM best practices)
- **senior-devops** — infrastructure provisioning for database clusters and replicas
---
## Conclusion
Effective database design requires balancing multiple competing concerns: performance, scalability, maintainability, and business requirements. This skill provides the tools and knowledge to make informed decisions throughout the database lifecycle, from initial schema design through production optimization and evolution.

View File

@@ -76,3 +76,185 @@ python3 scripts/env_auditor.py /path/to/repo --json
2. Keep dev env files local and gitignored.
3. Enforce detection in CI before merge.
4. Re-test application paths immediately after credential rotation.
---
## Cloud Secret Store Integration
Production applications should never read secrets from `.env` files or environment variables baked into container images. Use a dedicated secret store instead.
### Provider Comparison
| Provider | Best For | Key Feature |
|----------|----------|-------------|
| **HashiCorp Vault** | Multi-cloud / hybrid | Dynamic secrets, policy engine, pluggable backends |
| **AWS Secrets Manager** | AWS-native workloads | Native Lambda/ECS/EKS integration, automatic RDS rotation |
| **Azure Key Vault** | Azure-native workloads | Managed HSM, Azure AD RBAC, certificate management |
| **GCP Secret Manager** | GCP-native workloads | IAM-based access, automatic replication, versioning |
### Selection Guidance
- **Single cloud provider** — use the cloud-native secret manager. It integrates tightly with IAM, reduces operational overhead, and costs less than self-hosting.
- **Multi-cloud or hybrid** — use HashiCorp Vault. It provides a uniform API across environments and supports dynamic secret generation (database credentials, cloud IAM keys) that expire automatically.
- **Kubernetes-heavy** — combine External Secrets Operator with any backend above to sync secrets into K8s `Secret` objects without hardcoding.
### Application Access Patterns
1. **SDK/API pull** — application fetches secret at startup or on-demand via provider SDK.
2. **Sidecar injection** — a sidecar container (e.g., Vault Agent) writes secrets to a shared volume or injects them as environment variables.
3. **Init container** — a Kubernetes init container fetches secrets before the main container starts.
4. **CSI driver** — secrets mount as a filesystem volume via the Secrets Store CSI Driver.
> **Cross-reference:** See `engineering/secrets-vault-manager` for production vault infrastructure patterns, HA deployment, and disaster recovery procedures.
---
## Secret Rotation Workflow
Stale secrets are a liability. Rotation ensures that even if a credential leaks, its useful lifetime is bounded.
### Phase 1: Detection
- Track secret creation and expiry dates in your secret store metadata.
- Set alerts at 30, 14, and 7 days before expiry.
- Use `scripts/env_auditor.py` to flag secrets with no recorded rotation date.
### Phase 2: Rotation
1. **Generate** a new credential (API key, database password, certificate).
2. **Deploy** the new credential to all consumers (apps, services, pipelines) in parallel.
3. **Verify** each consumer can authenticate using the new credential.
4. **Revoke** the old credential only after all consumers are confirmed healthy.
5. **Update** metadata with the new rotation timestamp and next rotation date.
### Phase 3: Automation
- **AWS Secrets Manager** — use built-in Lambda-based rotation for RDS, Redshift, and DocumentDB.
- **HashiCorp Vault** — configure dynamic secrets with TTLs; credentials are generated on-demand and auto-expire.
- **Azure Key Vault** — use Event Grid notifications to trigger rotation functions.
- **GCP Secret Manager** — use Pub/Sub notifications tied to Cloud Functions for rotation logic.
### Emergency Rotation Checklist
When a secret is confirmed leaked:
1. **Immediately revoke** the compromised credential at the provider level.
2. Generate and deploy a replacement credential to all consumers.
3. Audit access logs for unauthorized usage during the exposure window.
4. Scan git history, CI logs, and artifact registries for the leaked value.
5. File an incident report documenting scope, timeline, and remediation steps.
6. Review and tighten detection controls to prevent recurrence.
---
## CI/CD Secret Injection
Secrets in CI/CD pipelines require careful handling to avoid exposure in logs, artifacts, or pull request contexts.
### GitHub Actions
- Use **repository secrets** or **environment secrets** via `${{ secrets.SECRET_NAME }}`.
- Prefer **OIDC federation** (`aws-actions/configure-aws-credentials` with `role-to-assume`) over long-lived access keys.
- Environment secrets with required reviewers add approval gates for production deployments.
- GitHub automatically masks secrets in logs, but avoid `echo` or `toJSON()` on secret values.
### GitLab CI
- Store secrets as **CI/CD variables** with the `masked` and `protected` flags enabled.
- Use **HashiCorp Vault integration** (`secrets:vault`) for dynamic secret injection without storing values in GitLab.
- Scope variables to specific environments (`production`, `staging`) to enforce least privilege.
### Universal Patterns
- **Never echo or print** secret values in pipeline output, even for debugging.
- **Use short-lived tokens** (OIDC, STS AssumeRole) instead of static credentials wherever possible.
- **Restrict PR access** — do not expose secrets to pipelines triggered by forks or untrusted branches.
- **Rotate CI secrets** on the same schedule as application secrets; pipeline credentials are attack vectors too.
- **Audit pipeline logs** periodically for accidental secret exposure that masking may have missed.
---
## Pre-Commit Secret Detection
Catching secrets before they reach version control is the most cost-effective defense. Two leading tools cover this space.
### gitleaks
```toml
# .gitleaks.toml — minimal configuration
[extend]
useDefault = true
[[rules]]
id = "custom-internal-token"
description = "Internal service token pattern"
regex = '''INTERNAL_TOKEN_[A-Za-z0-9]{32}'''
secretGroup = 0
```
- Install: `brew install gitleaks` or download from GitHub releases.
- Pre-commit hook: `gitleaks git --pre-commit --staged`
- Baseline scanning: `gitleaks detect --source . --report-path gitleaks-report.json`
- Manage false positives in `.gitleaksignore` (one fingerprint per line).
### detect-secrets
```bash
# Generate baseline
detect-secrets scan --all-files > .secrets.baseline
# Pre-commit hook (via pre-commit framework)
# .pre-commit-config.yaml
repos:
- repo: https://github.com/Yelp/detect-secrets
rev: v1.5.0
hooks:
- id: detect-secrets
args: ['--baseline', '.secrets.baseline']
```
- Supports **custom plugins** for organization-specific patterns.
- Audit workflow: `detect-secrets audit .secrets.baseline` interactively marks true/false positives.
### False Positive Management
- Maintain `.gitleaksignore` or `.secrets.baseline` in version control so the whole team shares exclusions.
- Review false positive lists during security audits — patterns may mask real leaks over time.
- Prefer tightening regex patterns over broadly ignoring files.
---
## Audit Logging
Knowing who accessed which secret and when is critical for incident investigation and compliance.
### Cloud-Native Audit Trails
| Provider | Service | What It Captures |
|----------|---------|-----------------|
| **AWS** | CloudTrail | Every `GetSecretValue`, `DescribeSecret`, `RotateSecret` API call |
| **Azure** | Activity Log + Diagnostic Logs | Key Vault access events, including caller identity and IP |
| **GCP** | Cloud Audit Logs | Data access logs for Secret Manager with principal and timestamp |
| **Vault** | Audit Backend | Full request/response logging (file, syslog, or socket backend) |
### Alerting Strategy
- Alert on **access from unknown IP ranges** or service accounts outside the expected set.
- Alert on **bulk secret reads** (more than N secrets accessed within a time window).
- Alert on **access outside deployment windows** when no CI/CD pipeline is running.
- Feed audit logs into your SIEM (Splunk, Datadog, Elastic) for correlation with other security events.
- Review audit logs quarterly as part of access recertification.
---
## Cross-References
This skill covers env hygiene and secret detection. For deeper coverage of related domains, see:
| Skill | Path | Relationship |
|-------|------|-------------|
| **Secrets Vault Manager** | `engineering/secrets-vault-manager` | Production vault infrastructure, HA deployment, DR |
| **Senior SecOps** | `engineering/senior-secops` | Security operations perspective, incident response |
| **CI/CD Pipeline Builder** | `engineering/ci-cd-pipeline-builder` | Pipeline architecture, secret injection patterns |
| **Infrastructure as Code** | `engineering/infrastructure-as-code` | Terraform/Pulumi secret backend configuration |
| **Container Orchestration** | `engineering/container-orchestration` | Kubernetes secret mounting, sealed secrets |