feat: add promptfoo eval pipeline for skill quality testing
- Add eval/ directory with 10 pilot skill eval configs - Add GitHub Action (skill-eval.yml) for automated eval on PR - Add generate-eval-config.py script for bootstrapping new evals - Add reusable assertion helpers (skill-quality.js) - Add eval README with setup and usage docs Skills covered: copywriting, cto-advisor, seo-audit, content-strategy, aws-solution-architect, agile-product-owner, senior-frontend, senior-security, mcp-server-builder, launch-strategy CI integration: - Triggers on PR to dev when SKILL.md files change - Detects which skills changed and runs only those evals - Posts results as PR comments (non-blocking) - Uploads full results as artifacts No existing files modified.
This commit is contained in:
142
eval/README.md
Normal file
142
eval/README.md
Normal file
@@ -0,0 +1,142 @@
|
||||
# Skill Evaluation Pipeline
|
||||
|
||||
Automated quality evaluation for skills using [promptfoo](https://promptfoo.dev).
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Run a single skill eval
|
||||
npx promptfoo@latest eval -c eval/skills/copywriting.yaml
|
||||
|
||||
# View results in browser
|
||||
npx promptfoo@latest view
|
||||
|
||||
# Run all pilot skill evals
|
||||
for config in eval/skills/*.yaml; do
|
||||
npx promptfoo@latest eval -c "$config" --no-cache
|
||||
done
|
||||
```
|
||||
|
||||
## Requirements
|
||||
|
||||
- Node.js 18+
|
||||
- `ANTHROPIC_API_KEY` environment variable set
|
||||
- No additional dependencies (promptfoo runs via npx)
|
||||
|
||||
## How It Works
|
||||
|
||||
Each skill has an eval config in `eval/skills/<skill-name>.yaml` that:
|
||||
|
||||
1. Loads the skill's `SKILL.md` content as context
|
||||
2. Sends realistic task prompts to an LLM with the skill loaded
|
||||
3. Evaluates outputs against quality assertions (LLM rubrics + programmatic checks)
|
||||
4. Reports pass/fail per assertion
|
||||
|
||||
### CI/CD Integration
|
||||
|
||||
The GitHub Action (`.github/workflows/skill-eval.yml`) runs automatically when:
|
||||
- A PR to `dev` changes any `SKILL.md` file
|
||||
- The changed skill has an eval config in `eval/skills/`
|
||||
- Results are posted as PR comments
|
||||
|
||||
Currently **non-blocking** — evals are informational, not gates.
|
||||
|
||||
## Adding Evals for a New Skill
|
||||
|
||||
### Option 1: Auto-generate
|
||||
|
||||
```bash
|
||||
python eval/scripts/generate-eval-config.py marketing-skill/my-new-skill
|
||||
```
|
||||
|
||||
This creates a boilerplate config with default prompts and assertions. **Always customize** the generated config with domain-specific test cases.
|
||||
|
||||
### Option 2: Manual
|
||||
|
||||
Copy an existing config and modify:
|
||||
|
||||
```bash
|
||||
cp eval/skills/copywriting.yaml eval/skills/my-skill.yaml
|
||||
```
|
||||
|
||||
### Eval Config Structure
|
||||
|
||||
```yaml
|
||||
description: "What this eval tests"
|
||||
|
||||
prompts:
|
||||
- |
|
||||
You are an expert AI assistant with this skill:
|
||||
---BEGIN SKILL---
|
||||
{{skill_content}}
|
||||
---END SKILL---
|
||||
Task: {{task}}
|
||||
|
||||
providers:
|
||||
- id: anthropic:messages:claude-sonnet-4-6
|
||||
config:
|
||||
max_tokens: 4096
|
||||
|
||||
tests:
|
||||
- vars:
|
||||
skill_content: file://../../path/to/SKILL.md
|
||||
task: "A realistic user request"
|
||||
assert:
|
||||
- type: llm-rubric
|
||||
value: "What good output looks like"
|
||||
- type: javascript
|
||||
value: "output.length > 200"
|
||||
```
|
||||
|
||||
### Assertion Types
|
||||
|
||||
| Type | Use For | Example |
|
||||
|------|---------|---------|
|
||||
| `llm-rubric` | Qualitative checks (expertise, relevance) | `"Response includes actionable next steps"` |
|
||||
| `contains` | Required terms | `"React"` |
|
||||
| `javascript` | Programmatic checks | `"output.length > 500"` |
|
||||
| `similar` | Semantic similarity | Compare against reference output |
|
||||
|
||||
## Reading Results
|
||||
|
||||
```bash
|
||||
# Terminal output (after eval)
|
||||
npx promptfoo@latest eval -c eval/skills/copywriting.yaml
|
||||
|
||||
# Web UI (interactive)
|
||||
npx promptfoo@latest view
|
||||
|
||||
# JSON output (for scripting)
|
||||
npx promptfoo@latest eval -c eval/skills/copywriting.yaml --output results.json
|
||||
```
|
||||
|
||||
## File Structure
|
||||
|
||||
```
|
||||
eval/
|
||||
├── promptfooconfig.yaml # Master config (reference)
|
||||
├── skills/ # Per-skill eval configs
|
||||
│ ├── copywriting.yaml # ← 10 pilot skills
|
||||
│ ├── cto-advisor.yaml
|
||||
│ └── ...
|
||||
├── assertions/
|
||||
│ └── skill-quality.js # Reusable assertion helpers
|
||||
├── scripts/
|
||||
│ └── generate-eval-config.py # Config generator
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Running Locally vs CI
|
||||
|
||||
| | Local | CI |
|
||||
|---|---|---|
|
||||
| **Command** | `npx promptfoo@latest eval -c eval/skills/X.yaml` | Automatic on PR |
|
||||
| **Results** | Terminal + web viewer | PR comment + artifact |
|
||||
| **Caching** | Enabled (faster iteration) | Disabled (`--no-cache`) |
|
||||
| **Cost** | Your API key | Repo secret `ANTHROPIC_API_KEY` |
|
||||
|
||||
## Cost Estimate
|
||||
|
||||
Each skill eval runs 2-3 test cases × ~4K tokens output = ~12K tokens per skill.
|
||||
At Sonnet pricing (~$3/M input, $15/M output): **~$0.05-0.10 per skill eval**.
|
||||
Full 10-skill pilot batch: **~$0.50-1.00 per run**.
|
||||
Reference in New Issue
Block a user