# Skill Evaluation Pipeline

Automated quality evaluation for skills using promptfoo.

## Quick Start

```bash
# Run a single skill eval
npx promptfoo@latest eval -c eval/skills/copywriting.yaml

# View results in browser
npx promptfoo@latest view

# Run all pilot skill evals
for config in eval/skills/*.yaml; do
  npx promptfoo@latest eval -c "$config" --no-cache
done
```
## Requirements

- Node.js 18+
- `ANTHROPIC_API_KEY` environment variable set
- No additional dependencies (promptfoo runs via npx)
## How It Works

Each skill has an eval config in `eval/skills/<skill-name>.yaml` that:

- Loads the skill's `SKILL.md` content as context
- Sends realistic task prompts to an LLM with the skill loaded
- Evaluates outputs against quality assertions (LLM rubrics + programmatic checks)
- Reports pass/fail per assertion
## CI/CD Integration

The GitHub Action (`.github/workflows/skill-eval.yml`) runs automatically when:

- A PR to `dev` changes any `SKILL.md` file
- The changed skill has an eval config in `eval/skills/`

Results are posted as PR comments.
Currently non-blocking — evals are informational, not gates.
## Adding Evals for a New Skill

### Option 1: Auto-generate

```bash
python eval/scripts/generate-eval-config.py marketing-skill/my-new-skill
```
This creates a boilerplate config with default prompts and assertions. Always customize the generated config with domain-specific test cases.
### Option 2: Manual

Copy an existing config and modify:

```bash
cp eval/skills/copywriting.yaml eval/skills/my-skill.yaml
```
## Eval Config Structure

```yaml
description: "What this eval tests"

prompts:
  - |
    You are an expert AI assistant with this skill:
    ---BEGIN SKILL---
    {{skill_content}}
    ---END SKILL---
    Task: {{task}}

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096

tests:
  - vars:
      skill_content: file://../../path/to/SKILL.md
      task: "A realistic user request"
    assert:
      - type: llm-rubric
        value: "What good output looks like"
      - type: javascript
        value: "output.length > 200"
```
## Assertion Types

| Type | Use For | Example |
|---|---|---|
| `llm-rubric` | Qualitative checks (expertise, relevance) | "Response includes actionable next steps" |
| `contains` | Required terms | "React" |
| `javascript` | Programmatic checks | `output.length > 500` |
| `similar` | Semantic similarity | Compare against reference output |
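Longer `javascript` checks can be factored into a shared helper module, which is the role `eval/assertions/skill-quality.js` plays. A minimal sketch of that pattern, where the function name and threshold are illustrative and the exact file-assertion contract (arguments passed, accepted return shapes) should be checked against your promptfoo version:

```javascript
// Minimal sketch of a reusable assertion helper in the style of
// eval/assertions/skill-quality.js. Name and threshold are illustrative;
// promptfoo generally accepts a boolean or a { pass, score, reason } result.
function hasMinimumSubstance(output, minLength = 200) {
  const text = typeof output === "string" ? output.trim() : "";
  const pass = text.length >= minLength;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `output has ${text.length} chars (>= ${minLength})`
      : `output too short: ${text.length} chars (< ${minLength})`,
  };
}

module.exports = { hasMinimumSubstance };
```

Returning a `reason` string (rather than a bare boolean) makes failures much easier to read in the web viewer and in PR comments.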
## Reading Results

```bash
# Terminal output (after eval)
npx promptfoo@latest eval -c eval/skills/copywriting.yaml

# Web UI (interactive)
npx promptfoo@latest view

# JSON output (for scripting)
npx promptfoo@latest eval -c eval/skills/copywriting.yaml --output results.json
```
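For scripting against `results.json`, a small summarizer can look like the sketch below. The nested `results.results` array with a per-test `success` flag is an assumption about promptfoo's output schema, which varies by version; verify against a file your installation actually produces.

```javascript
// Hedged sketch: tally pass/fail counts from a promptfoo --output JSON file.
// Assumes resultsJson.results.results is an array of test results, each
// carrying a boolean `success` field; check your promptfoo version's schema.
function summarize(resultsJson) {
  const items = (resultsJson.results && resultsJson.results.results) || [];
  const passed = items.filter((r) => r.success).length;
  return { total: items.length, passed, failed: items.length - passed };
}

module.exports = { summarize };
```

Pair it with `JSON.parse(fs.readFileSync("results.json", "utf8"))` and gate a script on `failed === 0`.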
## File Structure

```
eval/
├── promptfooconfig.yaml         # Master config (reference)
├── skills/                      # Per-skill eval configs
│   ├── copywriting.yaml         # ← 10 pilot skills
│   ├── cto-advisor.yaml
│   └── ...
├── assertions/
│   └── skill-quality.js         # Reusable assertion helpers
├── scripts/
│   └── generate-eval-config.py  # Config generator
└── README.md                    # This file
```
## Running Locally vs CI

| | Local | CI |
|---|---|---|
| Command | `npx promptfoo@latest eval -c eval/skills/X.yaml` | Automatic on PR |
| Results | Terminal + web viewer | PR comment + artifact |
| Caching | Enabled (faster iteration) | Disabled (`--no-cache`) |
| Cost | Your API key | Repo secret `ANTHROPIC_API_KEY` |
## Cost Estimate

Each skill eval runs 2-3 test cases × ~4K tokens output ≈ 12K tokens per skill.

At Sonnet pricing ($3/M input, $15/M output): **$0.05-0.10 per skill eval**.
Full 10-skill pilot batch: ~$0.50-1.00 per run.
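The estimate above can be sanity-checked with back-of-envelope arithmetic. The split between input and output tokens below is an assumption for illustration; adjust it to your actual prompts.

```javascript
// Back-of-envelope check of the per-skill cost, using an assumed split of
// ~8K input tokens (skill content + task prompts) and ~4K output tokens
// out of the ~12K total per skill.
const inputTokens = 8000;
const outputTokens = 4000;
const usdPerSkill = (inputTokens / 1e6) * 3 + (outputTokens / 1e6) * 15;
// ~$0.084 per skill, inside the $0.05-0.10 range; × 10 skills ≈ $0.84 per batch
console.log(usdPerSkill.toFixed(3));
```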