# Testing Guide - TDD for Skills Complete methodology for testing skills using RED-GREEN-REFACTOR cycle. ## Testing All Skill Types Different skill types need different test approaches. ### Discipline-Enforcing Skills (rules/requirements) **Examples**: TDD, verification-before-completion, designing-before-coding **Test with**: - Academic questions: Do they understand the rules? - Pressure scenarios: Do they comply under stress? - Multiple pressures combined: time + sunk cost + exhaustion - Identify rationalizations and add explicit counters **Success criteria**: Agent follows rule under maximum pressure ### Technique Skills (how-to guides) **Examples**: condition-based-waiting, root-cause-tracing, defensive-programming **Test with**: - Application scenarios: Can they apply the technique correctly? - Variation scenarios: Do they handle edge cases? - Missing information tests: Do instructions have gaps? **Success criteria**: Agent successfully applies technique to new scenario ### Pattern Skills (mental models) **Examples**: reducing-complexity, information-hiding concepts **Test with**: - Recognition scenarios: Do they recognize when pattern applies? - Application scenarios: Can they use the mental model? - Counter-examples: Do they know when NOT to apply? **Success criteria**: Agent correctly identifies when/how to apply pattern ### Reference Skills (documentation/APIs) **Examples**: API documentation, command references, library guides **Test with**: - Retrieval scenarios: Can they find the right information? - Application scenarios: Can they use what they found correctly? - Gap testing: Are common use cases covered? **Success criteria**: Agent finds and correctly applies reference information ## Pressure Types for Testing ### Time Pressure "You have 5 minutes to complete this task" ### Sunk Cost Pressure "You already spent 2 hours on this, just finish it quickly" ### Authority Pressure "The senior developer said to skip tests for this quick bug fix" ### Exhaustion Pressure "This is the 10th task today, let's wrap it up" ## RED Phase: Baseline Testing **Goal**: Watch the agent fail WITHOUT the skill. **Steps**: 1. Design pressure scenario (combine 2-3 pressures) 2. Give agent the task WITHOUT the skill loaded 3. Document EXACT behavior: - What rationalization did they use? - Which pressure triggered the violation? - How did they justify the shortcut? **Critical**: Copy exact quotes. You'll need them for GREEN phase. **Example Baseline**: ``` Scenario: Implement feature under time pressure Pressure: "You have 10 minutes" Agent response: "Since we're short on time, I'll implement the feature first and add tests after. Testing later achieves the same goal." ``` ## GREEN Phase: Minimal Implementation **Goal**: Write skill that addresses SPECIFIC baseline failures. **Steps**: 1. Review baseline rationalizations 2. Write skill sections that counter THOSE EXACT arguments 3. Re-run scenario WITH skill 4. Agent should now comply **Bad (too general)**: ```markdown ## Testing Always write tests. ``` **Good (addresses specific rationalization)**: ```markdown ## Common Rationalizations | Excuse | Reality | | ----------------------------------- | ----------------------------------------------------------------------- | | "Testing after achieves same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" | | "Too simple to test" | Simple code breaks. Test takes 30 seconds. | ``` ## REFACTOR Phase: Loophole Closing **Goal**: Find and plug new rationalizations. **Steps**: 1. Agent found new workaround? Document it. 2. Add explicit counter to skill 3. Re-test same scenario 4. Repeat until bulletproof **Pattern**: ```markdown ## Red Flags - STOP and Start Over - Code before test - "I already manually tested it" - "Tests after achieve the same purpose" - "It's about spirit not ritual" - "This is different because..." **All of these mean**: Delete code. Start over with TDD. ``` ## Complete Test Checklist Before deploying a skill: **Baseline (RED)**: - [ ] Designed 3+ pressure scenarios - [ ] Ran scenarios WITHOUT skill - [ ] Documented verbatim agent responses - [ ] Identified pattern in rationalizations **Implementation (GREEN)**: - [ ] Skill addresses SPECIFIC baseline failures - [ ] Re-ran scenarios WITH skill - [ ] Agent complied in all scenarios - [ ] No hand-waving or generic advice **Bulletproofing (REFACTOR)**: - [ ] Tested with combined pressures - [ ] Found and documented new rationalizations - [ ] Added explicit counters - [ ] Re-tested until no more loopholes - [ ] Created "Red Flags" section ## Common Testing Mistakes | Mistake | Fix | | ------------------------------ | --------------------------------------------------------- | | "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. | | "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. | | "Testing is overkill" | Untested skills have issues. Always. | | "Academic review is enough" | Reading ≠ using. Test application scenarios. | ## Meta-Testing **Test the test**: If agent passes too easily, your test is weak. **Good test indicators**: - Agent fails WITHOUT skill (proves skill is needed) - Agent p asses WITH skill (proves skill works) - Multiple pressures needed to trigger failure (proves realistic) **Bad test indicators**: - Agent passes even without skill (test is irrelevant) - Agent fails even with skill (skill is unclear) - Single obvious scenario (test is too simple)