---
title: TDD Guide — Agent Skill & Codex Plugin
description: Test-driven development skill for writing unit tests, generating test fixtures and mocks, analyzing coverage gaps, and guiding red-green-refactor. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw.
---
# TDD Guide

```
claude /plugin install engineering-skills
```
Test-driven development skill for generating tests, analyzing coverage, and guiding red-green-refactor workflows across Jest, Pytest, JUnit, and Vitest.
## Workflows

### Generate Tests from Code

- Provide source code (TypeScript, JavaScript, Python, Java)
- Specify target framework (Jest, Pytest, JUnit, Vitest)
- Run `test_generator.py` with requirements
- Review generated test stubs
- Validation: Tests compile and cover the happy path, error cases, and edge cases
### Analyze Coverage Gaps

- Generate a coverage report from the test runner (`npm test -- --coverage`)
- Run `coverage_analyzer.py` on the LCOV/JSON/XML report
- Review prioritized gaps (P0/P1/P2)
- Generate missing tests for uncovered paths
- Validation: Coverage meets the target threshold (typically 80%+)
### TDD New Feature

- Write a failing test first (RED)
- Run `tdd_workflow.py --phase red` to validate
- Implement minimal code to pass (GREEN)
- Run `tdd_workflow.py --phase green` to validate
- Refactor while keeping tests green (REFACTOR)
- Validation: All tests pass after each cycle (see the sketch below)
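A minimal sketch of one RED → GREEN cycle, using a hypothetical `slugify` helper (the module and function are illustrative, not part of this skill's scripts):

```python
# RED: tests/test_slugify.py, written first; it fails until slugify exists
from slug import slugify  # hypothetical module under test

def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_surrounding_whitespace():
    assert slugify("  Hello  ") == "hello"

# GREEN: slug.py, the minimal implementation that makes both tests pass
def slugify(text: str) -> str:
    return "-".join(text.strip().lower().split())
```

Validate each half with `tdd_workflow.py --phase red` and `--phase green` before moving on to REFACTOR.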
## Examples

### Test Generation — Input → Output (Pytest)

Input source function (`math_utils.py`):

```python
def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b
```

Command:

```bash
python scripts/test_generator.py --input math_utils.py --framework pytest
```
Generated test output (`test_math_utils.py`):

```python
import pytest
from math_utils import divide

class TestDivide:
    def test_divide_positive_numbers(self):
        assert divide(10, 2) == 5.0

    def test_divide_negative_numerator(self):
        assert divide(-10, 2) == -5.0

    def test_divide_float_result(self):
        assert divide(1, 3) == pytest.approx(0.333, rel=1e-3)

    def test_divide_by_zero_raises_value_error(self):
        with pytest.raises(ValueError, match="Cannot divide by zero"):
            divide(10, 0)

    def test_divide_zero_numerator(self):
        assert divide(0, 5) == 0.0
```
### Coverage Analysis — Sample P0/P1/P2 Output

Command:

```bash
python scripts/coverage_analyzer.py --report lcov.info --threshold 80
```

Sample output:

```text
Coverage Report — Overall: 63% (threshold: 80%)

P0 — Critical gaps (uncovered error paths):
  auth/login.py:42-58         handle_expired_token()      0% covered
  payments/process.py:91-110  handle_payment_failure()    0% covered

P1 — High-value gaps (core logic branches):
  users/service.py:77   update_profile() — else branch     0% covered
  orders/cart.py:134    apply_discount() — zero-qty guard  0% covered

P2 — Low-risk gaps (utility / helper functions):
  utils/formatting.py:12  format_currency()  0% covered

Recommended: Generate tests for P0 items first to reach 80% threshold.
```
## Key Tools

| Tool | Purpose | Usage |
|---|---|---|
| `test_generator.py` | Generate test cases from code/requirements | `python scripts/test_generator.py --input source.py --framework pytest` |
| `coverage_analyzer.py` | Parse and analyze coverage reports | `python scripts/coverage_analyzer.py --report lcov.info --threshold 80` |
| `tdd_workflow.py` | Guide red-green-refactor cycles | `python scripts/tdd_workflow.py --phase red --test test_auth.py` |
| `fixture_generator.py` | Generate test data and mocks | `python scripts/fixture_generator.py --entity User --count 5` |
Additional scripts: `framework_adapter.py` (convert between frameworks), `metrics_calculator.py` (quality metrics), `format_detector.py` (detect language/framework), `output_formatter.py` (CLI/desktop/CI output).
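For context, a hand-rolled version of the factory that `fixture_generator.py --entity User --count 5` is meant to automate (the `User` fields and defaults are assumptions for illustration, not the script's actual output):

```python
import itertools
from dataclasses import dataclass

@dataclass
class User:
    # Hypothetical entity; real field names depend on your model.
    id: int
    name: str
    email: str

_ids = itertools.count(1)

def make_user(**overrides) -> User:
    """Build a User with unique defaults; override any field per test."""
    n = next(_ids)
    fields = {"id": n, "name": f"user{n}", "email": f"user{n}@example.com"}
    fields.update(overrides)
    return User(**fields)

# Usage: five users, the last with a pinned email for an admin-path test
users = [make_user() for _ in range(4)] + [make_user(email="admin@example.com")]
```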
## Input Requirements

**For Test Generation:**

- Source code (file path or pasted content)
- Target framework (Jest, Pytest, JUnit, Vitest)
- Coverage scope (unit, integration, edge cases)

**For Coverage Analysis:**

- Coverage report file (LCOV, JSON, or XML format)
- Optional: Source code for context
- Optional: Target threshold percentage

**For TDD Workflow:**

- Feature requirements or user story
- Current phase (RED, GREEN, REFACTOR)
- Test code and implementation status
## Spec-First Workflow

TDD is most effective when driven by a written spec. The flow:

- Write or receive a spec — stored in `specs/<feature>.md`
- Extract acceptance criteria — each criterion becomes one or more test cases
- Write failing tests (RED) — one test per acceptance criterion
- Implement minimal code (GREEN) — satisfy each test in order
- Refactor — clean up while all tests stay green
### Spec Directory Convention

```text
project/
├── specs/
│   ├── user-auth.md             # Feature spec with acceptance criteria
│   ├── payment-processing.md
│   └── notification-system.md
├── tests/
│   ├── test_user_auth.py        # Tests derived from specs/user-auth.md
│   ├── test_payments.py
│   └── test_notifications.py
└── src/
```
### Extracting Tests from Specs

Each acceptance criterion in a spec maps to at least one test:
| Spec Criterion | Test Case |
|---|---|
| "User can log in with valid credentials" | test_login_valid_credentials_returns_token |
| "Invalid password returns 401" | test_login_invalid_password_returns_401 |
| "Account locks after 5 failed attempts" | test_login_locks_after_five_failures |
Tip: Number your acceptance criteria in the spec. Reference the number in the test docstring for traceability (`# AC-3: Account locks after 5 failed attempts`); see the sketch below.
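A minimal sketch of that convention, assuming a hypothetical `app.auth.login` that raises `InvalidCredentialsError` on a bad password and `AccountLockedError` once locked:

```python
# tests/test_user_auth.py
import pytest
from app.auth import login, InvalidCredentialsError, AccountLockedError  # hypothetical API

def test_login_locks_after_five_failures():
    """AC-3: Account locks after 5 failed attempts."""
    for _ in range(5):
        with pytest.raises(InvalidCredentialsError):
            login("alice", "wrong-password")
    # The sixth attempt is rejected even with the correct password.
    with pytest.raises(AccountLockedError):
        login("alice", "correct-password")
```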
Cross-reference: See `engineering/spec-driven-workflow` for the full spec methodology, including spec templates and review checklists.
## Red-Green-Refactor Examples Per Language

### TypeScript / Jest

```typescript
// test/cart.test.ts
import { Cart } from "../src/cart"; // assumed path to the Cart implementation

describe("Cart", () => {
  describe("addItem", () => {
    it("should add a new item to an empty cart", () => {
      const cart = new Cart();
      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
      expect(cart.items).toHaveLength(1);
      expect(cart.items[0].id).toBe("sku-1");
    });

    it("should increment quantity when adding an existing item", () => {
      const cart = new Cart();
      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 1 });
      cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 2 });
      expect(cart.items).toHaveLength(1);
      expect(cart.items[0].qty).toBe(3);
    });

    it("should throw when quantity is zero or negative", () => {
      const cart = new Cart();
      expect(() =>
        cart.addItem({ id: "sku-1", name: "Widget", price: 9.99, qty: 0 })
      ).toThrow("Quantity must be positive");
    });
  });
});
```
### Python / Pytest (Advanced Patterns)

```python
# tests/conftest.py — shared fixtures
import pytest
from app.db import create_engine, Session

@pytest.fixture(scope="session")
def db_engine():
    engine = create_engine("sqlite:///:memory:")
    yield engine
    engine.dispose()

@pytest.fixture
def db_session(db_engine):
    session = Session(bind=db_engine)
    yield session
    session.rollback()
    session.close()
```

```python
# tests/test_pricing.py — parametrize for multiple cases
import pytest
from app.pricing import calculate_discount

@pytest.mark.parametrize("subtotal, expected_discount", [
    (50.0, 0.0),    # Below threshold — no discount
    (100.0, 5.0),   # 5% tier
    (250.0, 25.0),  # 10% tier
    (500.0, 75.0),  # 15% tier
])
def test_calculate_discount(subtotal, expected_discount):
    assert calculate_discount(subtotal) == pytest.approx(expected_discount)
```
### Go — Table-Driven Tests

```go
// cart_test.go
package cart

import "testing"

func TestApplyDiscount(t *testing.T) {
	tests := []struct {
		name     string
		subtotal float64
		want     float64
	}{
		{"no discount below threshold", 50.0, 0.0},
		{"5 percent tier", 100.0, 5.0},
		{"10 percent tier", 250.0, 25.0},
		{"15 percent tier", 500.0, 75.0},
		{"zero subtotal", 0.0, 0.0},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got := ApplyDiscount(tt.subtotal)
			if got != tt.want {
				t.Errorf("ApplyDiscount(%v) = %v, want %v", tt.subtotal, got, tt.want)
			}
		})
	}
}
```
## Bounded Autonomy Rules

When generating tests autonomously, follow these rules to decide when to stop and ask the user.

### Stop and Ask When
- Ambiguous requirements — the spec or user story has conflicting or unclear acceptance criteria
- Missing edge cases — you cannot determine boundary values without domain knowledge (e.g., max allowed transaction amount)
- Test count exceeds 50 — large test suites need human review before committing; present a summary and ask which areas to prioritize
- External dependencies unclear — the feature relies on third-party APIs or services with undocumented behavior
- Security-sensitive logic — authentication, authorization, encryption, or payment flows require human sign-off on test scenarios
### Continue Autonomously When
- Clear spec with numbered acceptance criteria — each criterion maps directly to tests
- Straightforward CRUD operations — create, read, update, delete with well-defined models
- Well-defined API contracts — OpenAPI spec or typed interfaces available
- Pure functions — deterministic input/output with no side effects
- Existing test patterns — the codebase already has similar tests to follow
## Property-Based Testing
Property-based testing generates random inputs to verify invariants instead of relying on hand-picked examples. Use it when the input space is large and the expected behavior can be described as a property.
### Python — Hypothesis

```python
from hypothesis import given, strategies as st
from app.serializers import serialize, deserialize

@given(st.text())
def test_roundtrip_serialization(data):
    """Serialization followed by deserialization returns the original."""
    assert deserialize(serialize(data)) == data

@given(st.integers(), st.integers())
def test_addition_is_commutative(a, b):
    assert a + b == b + a
```
### TypeScript — fast-check

```typescript
import fc from "fast-check";
import { encode, decode } from "./codec";

test("encode/decode roundtrip", () => {
  fc.assert(
    fc.property(fc.string(), (input) => {
      expect(decode(encode(input))).toBe(input);
    })
  );
});
```
### When to Use Property-Based Over Example-Based
| Use Property-Based | Example |
|---|---|
| Data transformations | Serialize/deserialize roundtrips |
| Mathematical properties | Commutativity, associativity, idempotency |
| Encoding/decoding | Base64, URL encoding, compression |
| Sorting and filtering | Output is sorted, length preserved |
| Parser correctness | Valid input always parses without error |
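As one more instance, the sorting row above expressed as a Hypothesis property (using Python's built-in `sorted` as a stand-in for the function under test):

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_preserves_length_and_orders_output(xs):
    result = sorted(xs)  # replace with your own sort/filter pipeline
    assert len(result) == len(xs)
    assert all(a <= b for a, b in zip(result, result[1:]))
```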
## Mutation Testing
Mutation testing modifies your production code (creates "mutants") and checks whether your tests catch the changes. If a mutant survives (tests still pass), your tests have a gap that coverage alone cannot reveal.
### Tools

| Language | Tool | Command |
|---|---|---|
| TypeScript/JavaScript | Stryker | `npx stryker run` |
| Python | mutmut | `mutmut run --paths-to-mutate=src/` |
| Java | PIT | `mvn org.pitest:pitest-maven:mutationCoverage` |
### Why Mutation Testing Matters

- 100% line coverage != good tests — coverage tells you code was executed, not that it was verified
- Catches weak assertions — tests that run code but assert nothing meaningful
- Finds missing boundary tests — mutants that change `<` to `<=` expose off-by-one gaps
- Quantifiable quality metric — mutation score (% of mutants killed) is a stronger signal than coverage %

Recommendation: Run mutation testing on critical paths (auth, payments, data processing) even if overall coverage is high. Target an 85%+ mutation score on P0 modules. A sketch of a surviving mutant follows below.
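A minimal sketch of the surviving-mutant idea, with a hypothetical eligibility guard; a mutation tool would flip `>=` to `>` and check whether any test fails:

```python
# src/discount.py (hypothetical guard under mutation)
def is_discount_eligible(subtotal: float) -> bool:
    return subtotal >= 100  # a mutant flips >= to >

# Weak test: survives the mutant because it never hits the boundary.
def test_eligibility_far_from_boundary():
    assert is_discount_eligible(500)
    assert not is_discount_eligible(10)

# Stronger test: kills the mutant by pinning the boundary value exactly.
def test_eligibility_at_boundary():
    assert is_discount_eligible(100)  # fails under the > mutant
    assert not is_discount_eligible(99.99)
```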
## Cross-References

| Skill | Relationship |
|---|---|
| `engineering/spec-driven-workflow` | Spec → acceptance criteria → test extraction pipeline |
| `engineering-team/focused-fix` | Phase 5 (Verify) uses TDD to confirm the fix with a regression test |
| `engineering-team/senior-qa` | Broader QA strategy; TDD is one layer in the test pyramid |
| `engineering-team/code-reviewer` | Review generated tests for assertion quality and coverage completeness |
| `engineering-team/senior-fullstack` | Project scaffolders include testing infrastructure compatible with TDD workflows |
## Limitations
| Scope | Details |
|---|---|
| Unit test focus | Integration and E2E tests require different patterns |
| Static analysis | Cannot execute tests or measure runtime behavior |
| Language support | Best for TypeScript, JavaScript, Python, Java |
| Report formats | LCOV, JSON, XML only; other formats need conversion |
| Generated tests | Provide scaffolding; require human review for complex logic |
When to use other tools:
- E2E testing: Playwright, Cypress, Selenium
- Performance testing: k6, JMeter, Locust
- Security testing: OWASP ZAP, Burp Suite