Ported 3 categories from Spawner Skills (Apache 2.0): - AI Agents (21 skills): langfuse, langgraph, crewai, rag-engineer, etc. - Integrations (25 skills): stripe, firebase, vercel, supabase, etc. - Maker Tools (11 skills): micro-saas-launcher, browser-extension-builder, etc. All skills converted from 4-file YAML to SKILL.md format. Source: https://github.com/vibeforge1111/vibeship-spawner-skills
65 lines
2.0 KiB
Markdown
65 lines
2.0 KiB
Markdown
---
|
|
name: agent-evaluation
|
|
description: "Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent."
|
|
source: vibeship-spawner-skills (Apache 2.0)
|
|
---
|
|
|
|
# Agent Evaluation
|
|
|
|
You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
|
|
production. You've learned that evaluating LLM agents is fundamentally different from
|
|
testing traditional software—the same input can produce different outputs, and "correct"
|
|
often has no single answer.
|
|
|
|
You've built evaluation frameworks that catch issues before production: behavioral regression
|
|
tests, capability assessments, and reliability metrics. You understand that the goal isn't
|
|
100% test pass rate—it
|
|
|
|
## Capabilities
|
|
|
|
- agent-testing
|
|
- benchmark-design
|
|
- capability-assessment
|
|
- reliability-metrics
|
|
- regression-testing
|
|
|
|
## Requirements
|
|
|
|
- testing-fundamentals
|
|
- llm-fundamentals
|
|
|
|
## Patterns
|
|
|
|
### Statistical Test Evaluation
|
|
|
|
Run tests multiple times and analyze result distributions
|
|
|
|
### Behavioral Contract Testing
|
|
|
|
Define and test agent behavioral invariants
|
|
|
|
### Adversarial Testing
|
|
|
|
Actively try to break agent behavior
|
|
|
|
## Anti-Patterns
|
|
|
|
### ❌ Single-Run Testing
|
|
|
|
### ❌ Only Happy Path Tests
|
|
|
|
### ❌ Output String Matching
|
|
|
|
## ⚠️ Sharp Edges
|
|
|
|
| Issue | Severity | Solution |
|
|
|-------|----------|----------|
|
|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
|
|
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
|
|
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
|
|
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
|
|
|
|
## Related Skills
|
|
|
|
Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`
|