feat: add promptfoo eval pipeline for skill quality testing

- Add eval/ directory with 10 pilot skill eval configs
- Add GitHub Action (skill-eval.yml) for automated eval on PR
- Add generate-eval-config.py script for bootstrapping new evals
- Add reusable assertion helpers (skill-quality.js)
- Add eval README with setup and usage docs

Skills covered: copywriting, cto-advisor, seo-audit, content-strategy,
aws-solution-architect, agile-product-owner, senior-frontend,
senior-security, mcp-server-builder, launch-strategy

CI integration:
- Triggers on PR to dev when SKILL.md files change
- Detects which skills changed and runs only those evals
- Posts results as PR comments (non-blocking)
- Uploads full results as artifacts

No existing files modified.
2026-03-12 05:39:24 +01:00
parent 713e2deb82
commit 75fa9de2bb
15 changed files with 1055 additions and 0 deletions
--- a/eval/README.md
+++ b/eval/README.md
@@ -0,0 +1,142 @@
+# Skill Evaluation Pipeline
+
+Automated quality evaluation for skills using [promptfoo](https://promptfoo.dev).
+
+## Quick Start
+
+```bash
+# Run a single skill eval
+npx promptfoo@latest eval -c eval/skills/copywriting.yaml
+
+# View results in browser
+npx promptfoo@latest view
+
+# Run all pilot skill evals
+for config in eval/skills/*.yaml; do
+  npx promptfoo@latest eval -c "$config" --no-cache
+done
+```
+
+## Requirements
+
+- Node.js 18+
+- `ANTHROPIC_API_KEY` environment variable set
+- Python, only if you use the config generator (`eval/scripts/generate-eval-config.py`)
+- No other dependencies (promptfoo runs via npx)
+
+## How It Works
+
+Each skill has an eval config in `eval/skills/<skill-name>.yaml` that:
+
+1. Loads the skill's `SKILL.md` content as context
+2. Sends realistic task prompts to an LLM with the skill loaded
+3. Evaluates outputs against quality assertions (LLM rubrics + programmatic checks)
+4. Reports pass/fail per assertion
+
+### CI/CD Integration
+
+The GitHub Action (`.github/workflows/skill-eval.yml`) runs automatically when both of these hold:
+- A PR to `dev` changes any `SKILL.md` file
+- The changed skill has an eval config in `eval/skills/`
+
+Results are posted as PR comments and uploaded as artifacts.
+
+Currently **non-blocking** — evals are informational, not gates.
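+
+A workflow with this behavior could look roughly like the sketch below. It is not the contents of `skill-eval.yml`: the job name, step layout, action versions, and the fixed `copywriting.yaml` config are assumptions, and the changed-skill detection and PR comment steps are left out.
+
+```yaml
+# Hypothetical sketch, not the actual skill-eval.yml
+name: Skill Eval (sketch)
+on:
+  pull_request:
+    branches: [dev]
+    paths:
+      - "**/SKILL.md"            # only PRs that touch a SKILL.md
+
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 20       # satisfies the Node.js 18+ requirement
+      - name: Run eval (fixed config here; the real workflow picks configs per changed skill)
+        continue-on-error: true  # a failed eval does not fail the job
+        run: npx promptfoo@latest eval -c eval/skills/copywriting.yaml --no-cache --output results.json
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+      - uses: actions/upload-artifact@v4
+        with:
+          name: eval-results
+          path: results.json
+```
+
+Putting `continue-on-error` on the eval step, rather than on the whole job, lets the artifact upload still run when assertions fail.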
+
+## Adding Evals for a New Skill
+
+### Option 1: Auto-generate
+
+```bash
+python eval/scripts/generate-eval-config.py marketing-skill/my-new-skill
+```
+
+This creates a boilerplate config with default prompts and assertions. **Always customize** the generated config with domain-specific test cases.
+
+### Option 2: Manual
+
+Copy an existing config and modify:
+
+```bash
+cp eval/skills/copywriting.yaml eval/skills/my-skill.yaml
+```
+
+### Eval Config Structure
+
+```yaml
+description: "What this eval tests"
+
+prompts:
+  - |
+    You are an expert AI assistant with this skill:
+    ---BEGIN SKILL---
+    {{skill_content}}
+    ---END SKILL---
+    Task: {{task}}
+
+providers:
+  - id: anthropic:messages:claude-sonnet-4-6
+    config:
+      max_tokens: 4096
+
+tests:
+  - vars:
+      skill_content: file://../../path/to/SKILL.md
+      task: "A realistic user request"
+    assert:
+      - type: llm-rubric
+        value: "What good output looks like"
+      - type: javascript
+        value: "output.length > 200"
+```
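+
+When a config has several test cases for the same skill, promptfoo's `defaultTest` block can hold the shared variables so each test lists only its task. The sketch below uses placeholder tasks and rubric text and is not taken from any of the pilot configs:
+
+```yaml
+# Sketch only: the tasks and rubric text are placeholders.
+defaultTest:
+  vars:
+    skill_content: file://../../path/to/SKILL.md   # shared by every test
+  assert:
+    - type: javascript
+      value: "output.length > 200"                 # applied to every test
+
+tests:
+  - vars:
+      task: "First realistic user request"
+    assert:
+      - type: llm-rubric
+        value: "What good output looks like for the first task"
+  - vars:
+      task: "Second realistic user request"
+    assert:
+      - type: llm-rubric
+        value: "What good output looks like for the second task"
+```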
+
+### Assertion Types
+
+| Type | Use For | Example |
+|------|---------|---------|
+| `llm-rubric` | Qualitative checks (expertise, relevance) | `"Response includes actionable next steps"` |
+| `contains` | Required terms | `"React"` |
+| `javascript` | Programmatic checks | `"output.length > 500"` |
+| `similar` | Semantic similarity | Compare against reference output |
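+
+As a rough illustration, the four types in the table can be combined in one test case. The task text, reference answer, and `threshold` value are assumptions chosen for the example. Note that `similar` relies on an embedding provider, which may need configuration beyond `ANTHROPIC_API_KEY`.
+
+```yaml
+# Sketch only: task, reference text, and threshold are illustrative.
+tests:
+  - vars:
+      skill_content: file://../../path/to/SKILL.md
+      task: "Review this React component for performance issues"
+    assert:
+      - type: contains
+        value: "React"                               # required term
+      - type: javascript
+        value: "output.length > 500"                 # programmatic check
+      - type: llm-rubric
+        value: "Response includes actionable next steps"
+      - type: similar
+        value: "A reference answer to compare against"
+        threshold: 0.8                               # minimum similarity score
+```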
+
+## Reading Results
+
+```bash
+# Terminal output (after eval)
+npx promptfoo@latest eval -c eval/skills/copywriting.yaml
+
+# Web UI (interactive)
+npx promptfoo@latest view
+
+# JSON output (for scripting)
+npx promptfoo@latest eval -c eval/skills/copywriting.yaml --output results.json
+```
+
+## File Structure
+
+```
+eval/
+├── promptfooconfig.yaml      # Master config (reference)
+├── skills/                   # Per-skill eval configs
+│   ├── copywriting.yaml      # ← 10 pilot skills
+│   ├── cto-advisor.yaml
+│   └── ...
+├── assertions/
+│   └── skill-quality.js      # Reusable assertion helpers
+├── scripts/
+│   └── generate-eval-config.py  # Config generator
+└── README.md                 # This file
+```
+
+## Running Locally vs CI
+
+| | Local | CI |
+|---|---|---|
+| **Command** | `npx promptfoo@latest eval -c eval/skills/X.yaml` | Automatic on PR |
+| **Results** | Terminal + web viewer | PR comment + artifact |
+| **Caching** | Enabled (faster iteration) | Disabled (`--no-cache`) |
+| **Billed to** | Your API key | Repo secret `ANTHROPIC_API_KEY` |
+
+## Cost Estimate
+
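+Each skill eval runs 2-3 test cases × up to ~4K output tokens (the `max_tokens: 4096` cap) = at most ~12K output tokens per skill.  
+At Sonnet pricing (~$3/M input, $15/M output): **~$0.05-0.10 per skill eval** when responses stay well under the cap; a full 12K output tokens would cost ~$0.18 on their own.  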
+Full 10-skill pilot batch: **~$0.50-1.00 per run**.