Testing Guide - TDD for Skills
Complete methodology for testing skills using RED-GREEN-REFACTOR cycle.
Testing All Skill Types
Different skill types need different test approaches.
Discipline-Enforcing Skills (rules/requirements)
Examples: TDD, verification-before-completion, designing-before-coding
Test with:
- Academic questions: Do they understand the rules?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined: time + sunk cost + exhaustion
- Identify rationalizations and add explicit counters
Success criteria: Agent follows rule under maximum pressure
Technique Skills (how-to guides)
Examples: condition-based-waiting, root-cause-tracing, defensive-programming
Test with:
- Application scenarios: Can they apply the technique correctly?
- Variation scenarios: Do they handle edge cases?
- Missing information tests: Do instructions have gaps?
Success criteria: Agent successfully applies technique to new scenario
Pattern Skills (mental models)
Examples: reducing-complexity, information-hiding concepts
Test with:
- Recognition scenarios: Do they recognize when pattern applies?
- Application scenarios: Can they use the mental model?
- Counter-examples: Do they know when NOT to apply?
Success criteria: Agent correctly identifies when/how to apply pattern
Reference Skills (documentation/APIs)
Examples: API documentation, command references, library guides
Test with:
- Retrieval scenarios: Can they find the right information?
- Application scenarios: Can they use what they found correctly?
- Gap testing: Are common use cases covered?
Success criteria: Agent finds and correctly applies reference information
Pressure Types for Testing
Time Pressure
"You have 5 minutes to complete this task"
Sunk Cost Pressure
"You already spent 2 hours on this, just finish it quickly"
Authority Pressure
"The senior developer said to skip tests for this quick bug fix"
Exhaustion Pressure
"This is the 10th task today, let's wrap it up"
RED Phase: Baseline Testing
Goal: Watch the agent fail WITHOUT the skill.
Steps:
- Design pressure scenario (combine 2-3 pressures)
- Give agent the task WITHOUT the skill loaded
- Document EXACT behavior:
- What rationalization did they use?
- Which pressure triggered the violation?
- How did they justify the shortcut?
Critical: Copy exact quotes. You'll need them for GREEN phase.
Example Baseline:
Scenario: Implement feature under time pressure
Pressure: "You have 10 minutes"
Agent response: "Since we're short on time, I'll implement the feature first
and add tests after. Testing later achieves the same goal."
GREEN Phase: Minimal Implementation
Goal: Write skill that addresses SPECIFIC baseline failures.
Steps:
- Review baseline rationalizations
- Write skill sections that counter THOSE EXACT arguments
- Re-run scenario WITH skill
- Agent should now comply
Bad (too general):
## Testing
Always write tests.
Good (addresses specific rationalization):
## Common Rationalizations
| Excuse | Reality |
| ----------------------------------- | ----------------------------------------------------------------------- |
| "Testing after achieves same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
REFACTOR Phase: Loophole Closing
Goal: Find and plug new rationalizations.
Steps:
- Agent found new workaround? Document it.
- Add explicit counter to skill
- Re-test same scenario
- Repeat until bulletproof
Pattern:
## Red Flags - STOP and Start Over
- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."
**All of these mean**: Delete code. Start over with TDD.
Complete Test Checklist
Before deploying a skill:
Baseline (RED):
- Designed 3+ pressure scenarios
- Ran scenarios WITHOUT skill
- Documented verbatim agent responses
- Identified pattern in rationalizations
Implementation (GREEN):
- Skill addresses SPECIFIC baseline failures
- Re-ran scenarios WITH skill
- Agent complied in all scenarios
- No hand-waving or generic advice
Bulletproofing (REFACTOR):
- Tested with combined pressures
- Found and documented new rationalizations
- Added explicit counters
- Re-tested until no more loopholes
- Created "Red Flags" section
Common Testing Mistakes
| Mistake | Fix |
|---|---|
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. |
| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
Meta-Testing
Test the test: If agent passes too easily, your test is weak.
Good test indicators:
- Agent fails WITHOUT skill (proves skill is needed)
- Agent p asses WITH skill (proves skill works)
- Multiple pressures needed to trigger failure (proves realistic)
Bad test indicators:
- Agent passes even without skill (test is irrelevant)
- Agent fails even with skill (skill is unclear)
- Single obvious scenario (test is too simple)