claude-skills-reference/eval/README.md
Leo 75fa9de2bb feat: add promptfoo eval pipeline for skill quality testing
- Add eval/ directory with 10 pilot skill eval configs
- Add GitHub Action (skill-eval.yml) for automated eval on PR
- Add generate-eval-config.py script for bootstrapping new evals
- Add reusable assertion helpers (skill-quality.js)
- Add eval README with setup and usage docs

Skills covered: copywriting, cto-advisor, seo-audit, content-strategy,
aws-solution-architect, agile-product-owner, senior-frontend,
senior-security, mcp-server-builder, launch-strategy

CI integration:
- Triggers on PR to dev when SKILL.md files change
- Detects which skills changed and runs only those evals
- Posts results as PR comments (non-blocking)
- Uploads full results as artifacts

No existing files modified.
2026-03-12 05:39:24 +01:00


Skill Evaluation Pipeline

Automated quality evaluation for skills using promptfoo.

Quick Start

# Run a single skill eval
npx promptfoo@latest eval -c eval/skills/copywriting.yaml

# View results in browser
npx promptfoo@latest view

# Run all pilot skill evals
for config in eval/skills/*.yaml; do
  npx promptfoo@latest eval -c "$config" --no-cache
done

Requirements

  • Node.js 18+
  • ANTHROPIC_API_KEY environment variable set
  • No additional dependencies (promptfoo runs via npx)
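A quick way to sanity-check these prerequisites from a shell (a convenience sketch, not a script shipped in the repo):

```shell
# Check Node.js availability and that the API key is exported
node --version 2>/dev/null || echo "Node.js not found (need 18+)"
if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
  echo "ANTHROPIC_API_KEY is not set"
else
  echo "ANTHROPIC_API_KEY is set"
fi
```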

How It Works

Each skill has an eval config in eval/skills/<skill-name>.yaml that:

  1. Loads the skill's SKILL.md content as context
  2. Sends realistic task prompts to an LLM with the skill loaded
  3. Evaluates outputs against quality assertions (LLM rubrics + programmatic checks)
  4. Reports pass/fail per assertion

CI/CD Integration

The GitHub Action (.github/workflows/skill-eval.yml) runs automatically when:

  • A PR to dev changes any SKILL.md file
  • The changed skill has an eval config in eval/skills/

Results are posted as PR comments, and full results are uploaded as workflow artifacts.

Currently non-blocking — evals are informational, not gates.
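The trigger section of such a workflow might look like the following sketch (branch name and path filter are assumptions based on the behavior described above; the actual skill-eval.yml may differ):

```yaml
# Hypothetical trigger sketch for .github/workflows/skill-eval.yml
on:
  pull_request:
    branches: [dev]
    paths:
      - "**/SKILL.md"
```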

Adding Evals for a New Skill

Option 1: Auto-generate

python eval/scripts/generate-eval-config.py marketing-skill/my-new-skill

This creates a boilerplate config with default prompts and assertions. Always customize the generated config with domain-specific test cases.

Option 2: Manual

Copy an existing config and modify:

cp eval/skills/copywriting.yaml eval/skills/my-skill.yaml

Eval Config Structure

description: "What this eval tests"

prompts:
  - |
    You are an expert AI assistant with this skill:
    ---BEGIN SKILL---
    {{skill_content}}
    ---END SKILL---
    Task: {{task}}

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096

tests:
  - vars:
      skill_content: file://../../path/to/SKILL.md
      task: "A realistic user request"
    assert:
      - type: llm-rubric
        value: "What good output looks like"
      - type: javascript
        value: "output.length > 200"

Assertion Types

| Type       | Use For                                    | Example                                   |
|------------|--------------------------------------------|-------------------------------------------|
| llm-rubric | Qualitative checks (expertise, relevance)  | "Response includes actionable next steps" |
| contains   | Required terms                             | "React"                                   |
| javascript | Programmatic checks                        | output.length > 500                       |
| similar    | Semantic similarity                        | Compare against reference output          |
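For checks shared across several skill configs, promptfoo javascript assertions can also point at a file that exports a function; the function receives the model output and may return a boolean, a score, or a GradingResult object. A sketch of what a reusable helper in eval/assertions/skill-quality.js might look like (the function name and heuristics here are illustrative assumptions, not the actual file):

```javascript
// Illustrative sketch of a reusable assertion helper. The heuristic and
// name are assumptions, not the actual skill-quality.js contents.
function hasActionableStructure(output) {
  // Heuristic: useful skill output has some structure (markdown headings
  // or numbered steps) and is substantial enough to act on.
  const hasStructure = /^(#{1,3}\s|\d+\.\s)/m.test(output);
  const longEnough = output.length > 200;
  return {
    pass: hasStructure && longEnough,
    score: (hasStructure ? 0.5 : 0) + (longEnough ? 0.5 : 0),
    reason: hasStructure
      ? (longEnough ? "Structured and substantial" : "Structured but too short")
      : "No headings or numbered steps found",
  };
}

module.exports = (output, context) => hasActionableStructure(output);
```

A config would reference it with type: javascript and a file:// value pointing at the helper.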

Reading Results

# Terminal output (after eval)
npx promptfoo@latest eval -c eval/skills/copywriting.yaml

# Web UI (interactive)
npx promptfoo@latest view

# JSON output (for scripting)
npx promptfoo@latest eval -c eval/skills/copywriting.yaml --output results.json

File Structure

eval/
├── promptfooconfig.yaml      # Master config (reference)
├── skills/                   # Per-skill eval configs
│   ├── copywriting.yaml      # ← 10 pilot skills
│   ├── cto-advisor.yaml
│   └── ...
├── assertions/
│   └── skill-quality.js      # Reusable assertion helpers
├── scripts/
│   └── generate-eval-config.py  # Config generator
└── README.md                 # This file

Running Locally vs CI

|         | Local                                           | CI                            |
|---------|-------------------------------------------------|-------------------------------|
| Command | npx promptfoo@latest eval -c eval/skills/X.yaml | Automatic on PR               |
| Results | Terminal + web viewer                           | PR comment + artifact         |
| Caching | Enabled (faster iteration)                      | Disabled (--no-cache)         |
| Cost    | Your API key                                    | Repo secret ANTHROPIC_API_KEY |

Cost Estimate

Each skill eval runs 2-3 test cases × up to 4K output tokens ≈ 12K tokens per skill.
At Sonnet pricing ($3/M input, $15/M output): $0.05-0.10 per skill eval.
Full 10-skill pilot batch: ~$0.50-1.00 per run.