# LLM Evaluation Frameworks

Concrete metrics, scoring methods, comparison tables, and A/B testing frameworks.
## Frameworks Index

- Evaluation Metrics Overview
- Text Generation Metrics
- RAG-Specific Metrics
- Human Evaluation Frameworks
- A/B Testing for Prompts
- Benchmark Datasets
- Evaluation Pipeline Design
## 1. Evaluation Metrics Overview

### Metric Categories

| Category | Metrics | When to Use |
|---|---|---|
| Lexical | BLEU, ROUGE, Exact Match | Reference-based comparison |
| Semantic | BERTScore, Embedding similarity | Meaning preservation |
| Task-specific | F1, Accuracy, Precision/Recall | Classification, extraction |
| Quality | Coherence, Fluency, Relevance | Open-ended generation |
| Safety | Toxicity, Bias scores | Content moderation |
### Choosing the Right Metric

```
Is there a single correct answer?
├── Yes → Exact Match or F1
└── No
    └── Is there a reference output?
        ├── Yes → BLEU, ROUGE, or BERTScore
        └── No
            └── Can you define quality criteria?
                ├── Yes → Human evaluation + LLM-as-judge
                └── No → A/B testing with user metrics
```
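The tree above maps directly onto a small helper; the function name and its boolean flags are illustrative, not part of the accompanying scripts.

```python
def choose_metric(single_correct_answer: bool,
                  has_reference: bool,
                  quality_criteria_defined: bool) -> str:
    """Walk the decision tree above and suggest an evaluation approach."""
    if single_correct_answer:
        return "Exact Match or F1"
    if has_reference:
        return "BLEU, ROUGE, or BERTScore"
    if quality_criteria_defined:
        return "Human evaluation + LLM-as-judge"
    return "A/B testing with user metrics"

print(choose_metric(False, True, False))  # → BLEU, ROUGE, or BERTScore
```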
## 2. Text Generation Metrics

### BLEU (Bilingual Evaluation Understudy)

What it measures: N-gram overlap between generated and reference text.

Score range: 0 to 1 (higher is better)

Calculation:

```
BLEU = BP × exp(Σ wn × log(pn))

Where:
- BP = brevity penalty (penalizes short outputs)
- pn = precision of n-grams
- wn = weight (typically 0.25 for BLEU-4)
```
Interpretation:
| BLEU Score | Quality |
|---|---|
| > 0.6 | Excellent |
| 0.4 - 0.6 | Good |
| 0.2 - 0.4 | Acceptable |
| < 0.2 | Poor |
Example:

```
Reference: "The quick brown fox jumps over the lazy dog"
Generated: "A fast brown fox leaps over the lazy dog"

1-gram precision: 6/9 ≈ 0.67 (matched: brown, fox, over, the, lazy, dog)
2-gram precision: 4/8 = 0.50 (matched: brown fox, over the, the lazy, lazy dog)

BLEU-4: ~0.35
```
Limitations:
- Doesn't capture meaning (synonyms penalized)
- Position-independent
- Requires reference text
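The calculation above can be sketched in a few lines of pure Python (clipped n-gram precision plus the brevity penalty). This is a minimal sketch; production code should use a library such as sacrebleu, which also handles smoothing and tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: candidate counts are capped at reference counts."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(reference, candidate, max_n=4):
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # log undefined; real implementations smooth instead
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(log_avg)

ref = "the quick brown fox jumps over the lazy dog".split()
cand = "a fast brown fox leaps over the lazy dog".split()
print(round(bleu(ref, cand), 2))  # ≈ 0.35, matching the worked example
```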
### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

What it measures: Overlap focused on recall (coverage of the reference).
Variants:
| Variant | Measures |
|---|---|
| ROUGE-1 | Unigram overlap |
| ROUGE-2 | Bigram overlap |
| ROUGE-L | Longest common subsequence |
| ROUGE-Lsum | LCS with sentence-level computation |
Calculation:

```
ROUGE-N Recall = (matching n-grams) / (n-grams in reference)
ROUGE-N Precision = (matching n-grams) / (n-grams in generated)
ROUGE-N F1 = 2 × (Precision × Recall) / (Precision + Recall)
```
Example:

```
Reference: "The cat sat on the mat"
Generated: "The cat was sitting on the mat"

ROUGE-1:
  Recall: 5/6 ≈ 0.83 (matched: the, cat, on, the, mat)
  Precision: 5/7 ≈ 0.71
  F1: 0.77

ROUGE-2:
  Recall: 3/5 = 0.60 (matched: "the cat", "on the", "the mat")
  Precision: 3/6 = 0.50
  F1: 0.55
```
Best for: Summarization, text compression
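The ROUGE-N formulas above as a short sketch (the function name is illustrative; libraries such as rouge-score additionally handle stemming and ROUGE-L).

```python
from collections import Counter

def rouge_n(reference: str, generated: str, n: int):
    """ROUGE-N recall, precision, and F1 from clipped n-gram overlap."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, gen = grams(reference), grams(generated)
    overlap = sum(min(count, gen[g]) for g, count in ref.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

r, p, f = rouge_n("The cat sat on the mat", "The cat was sitting on the mat", 1)
print(round(r, 2), round(p, 2), round(f, 2))  # 0.83 0.71 0.77
```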
### BERTScore

What it measures: Semantic similarity using contextual embeddings.
How it works:
- Generate BERT embeddings for each token
- Compute cosine similarity between token pairs
- Apply greedy matching to find best alignment
- Aggregate into Precision, Recall, F1
Advantages over lexical metrics:
- Captures synonyms and paraphrases
- Context-aware matching
- Better correlation with human judgment
Example:

```
Reference: "The movie was excellent"
Generated: "The film was outstanding"

Lexical (BLEU): Low score (only "The" and "was" match)
BERTScore: High score (semantic meaning preserved)
```
Interpretation:
| BERTScore F1 | Quality |
|---|---|
| > 0.9 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Acceptable |
| < 0.7 | Review needed |
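The matching step can be sketched independently of the model. Real BERTScore uses contextual BERT embeddings (e.g. via the bert-score package); this illustrative sketch assumes per-token embeddings are already available and shows only the greedy-matching aggregation from steps 2–4.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bertscore(cand_embs, ref_embs):
    """Precision/recall/F1 via greedy max-similarity matching of token embeddings."""
    sims = [[cosine(c, r) for r in ref_embs] for c in cand_embs]
    # best reference match per candidate token → precision
    precision = sum(max(row) for row in sims) / len(cand_embs)
    # best candidate match per reference token → recall
    recall = sum(max(row[j] for row in sims) for j in range(len(ref_embs))) / len(ref_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With identical embeddings on both sides all three scores are 1.0; a candidate that covers only half the reference tokens halves the recall.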
## 3. RAG-Specific Metrics

### Context Relevance

What it measures: How relevant retrieved documents are to the query.
Calculation methods:

Method 1: Embedding similarity

```
relevance = cosine_similarity(
    embed(query),
    embed(context)
)
```

Method 2: LLM-as-judge

```
Prompt: "Rate the relevance of this context to the question.
Question: {question}
Context: {context}
Rate from 1-5 where 5 is highly relevant."
```

Target: > 0.8 for top-k contexts
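Method 1 can be sketched as follows, assuming query and context embeddings have already been computed by some embedding model (the function names are illustrative):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def relevance_scores(query_emb, context_embs, target=0.8):
    """Score each retrieved context against the query; flag those below target."""
    scores = [cosine_similarity(query_emb, c) for c in context_embs]
    return [(s, s >= target) for s in scores]

print(relevance_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
# [(1.0, True), (0.0, False)]
```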
### Answer Faithfulness

What it measures: Whether the answer is supported by the context (no hallucination).

Evaluation prompt:

```
Given the context and answer, determine if every claim in the
answer is supported by the context.

Context: {context}
Answer: {answer}

For each claim in the answer:
1. Identify the claim
2. Find supporting evidence in context (or mark as unsupported)
3. Rate: Supported / Partially Supported / Not Supported

Overall faithfulness score: [0-1]
```

Scoring:

```
Faithfulness = (supported claims) / (total claims)
```

Target: > 0.95 for production systems
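The scoring formula above, applied to claim verdicts parsed from the judge's output (a sketch; parsing the judge's free-text response into these labels is assumed to happen elsewhere):

```python
def faithfulness_score(claim_verdicts):
    """Fraction of claims rated 'Supported', per the formula above.

    claim_verdicts: list of labels parsed from the judge's output, e.g.
    ["Supported", "Partially Supported", "Not Supported"].
    """
    if not claim_verdicts:
        return 0.0
    supported = sum(1 for v in claim_verdicts if v == "Supported")
    return supported / len(claim_verdicts)

print(round(faithfulness_score(["Supported", "Supported", "Not Supported"]), 2))  # 0.67
```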
### Retrieval Metrics

| Metric | Formula | What it measures |
|---|---|---|
| Precision@k | (relevant in top-k) / k | Quality of top results |
| Recall@k | (relevant in top-k) / (total relevant) | Coverage |
| MRR | 1 / (rank of first relevant) | Position of first hit |
| NDCG@k | DCG@k / IDCG@k | Ranking quality |
Example:

```
Query: "What is photosynthesis?"
Retrieved docs (k=5): [R, N, R, N, R]  (R=relevant, N=not relevant)
Total relevant in corpus: 10

Precision@5 = 3/5 = 0.6
Recall@5 = 3/10 = 0.3
MRR = 1/1 = 1.0 (first doc is relevant)
```
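The four metrics in the table, as a sketch that reproduces the example's numbers (function names are illustrative):

```python
import math

def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    return sum(relevance[:k]) / total_relevant

def mrr(relevance):
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

def ndcg_at_k(relevance, k, total_relevant):
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
    ideal = [1] * min(k, total_relevant)  # best possible ordering
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

relevance = [1, 0, 1, 0, 1]  # the [R, N, R, N, R] example above
print(precision_at_k(relevance, 5), recall_at_k(relevance, 5, 10), mrr(relevance))
# 0.6 0.3 1.0
```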
## 4. Human Evaluation Frameworks

### Likert Scale Evaluation

Setup:

```
Rate the following response on a scale of 1-5:

Response: {generated_response}

Criteria:
- Relevance (1-5): Does it address the question?
- Accuracy (1-5): Is the information correct?
- Fluency (1-5): Is it well-written?
- Helpfulness (1-5): Would this be useful to the user?
```
Sample size guidance:
| Confidence Level | Margin of Error | Required Samples |
|---|---|---|
| 95% | ±5% | 385 |
| 95% | ±10% | 97 |
| 90% | ±10% | 68 |
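The table's values follow from the standard sample-size formula for estimating a proportion, n = z² × p(1−p) / e², at the worst case p = 0.5 (the formula itself is not stated above, so treat this as the assumed derivation):

```python
import math

def sample_size(confidence_z: float, margin_of_error: float, p: float = 0.5) -> int:
    """Samples needed to estimate a proportion within +/- margin_of_error."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(sample_size(1.96, 0.05))   # 385 (95% confidence, +/-5%)
print(sample_size(1.96, 0.10))   # 97
print(sample_size(1.645, 0.10))  # 68
```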
### Comparative Evaluation (Side-by-Side)

Setup:

```
Compare these two responses to the question:

Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better?
[ ] A is much better
[ ] A is slightly better
[ ] About the same
[ ] B is slightly better
[ ] B is much better

Why? _______________
```
Advantages:
- Easier for humans than absolute scoring
- Reduces calibration issues
- Clear winner for A/B decisions
Analysis:

```
Win rate = (A wins + 0.5 × ties) / total
```

Use a Bradley-Terry model to rank more than two variants.
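The win-rate formula as a minimal sketch:

```python
def win_rate(a_wins: int, b_wins: int, ties: int) -> float:
    """Win rate for variant A, counting each tie as half a win."""
    total = a_wins + b_wins + ties
    return (a_wins + 0.5 * ties) / total if total else 0.0

print(win_rate(55, 35, 10))  # 0.6
```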
### LLM-as-Judge

Setup:

```
You are an expert evaluator. Rate the quality of this response.

Question: {question}
Response: {response}
Reference (if available): {reference}

Evaluate on:
1. Correctness (0-10): Is the information accurate?
2. Completeness (0-10): Does it fully address the question?
3. Clarity (0-10): Is it easy to understand?
4. Conciseness (0-10): Is it appropriately brief?

Provide scores and brief justification for each.
Overall score (0-10):
```
Calibration techniques:
- Include reference responses with known scores
- Use chain-of-thought for reasoning
- Compare against human baseline periodically
Known biases:
| Bias | Mitigation |
|---|---|
| Position bias | Randomize order |
| Length bias | Normalize or specify length |
| Self-preference | Use different model as judge |
| Verbosity preference | Penalize unnecessary length |
## 5. A/B Testing for Prompts

### Experiment Design

Hypothesis template:

```
H0: Prompt A and Prompt B have equal performance on [metric]
H1: Prompt B improves [metric] by at least [minimum detectable effect]
```

Sample size calculation:

```
n = 2 × ((z_α + z_β)² × σ²) / δ²

Where:
- z_α = 1.96 for 95% confidence
- z_β = 0.84 for 80% power
- σ = standard deviation of metric
- δ = minimum detectable effect
```
Quick reference (binary metric at a 50% baseline, so σ² = 0.25):

| MDE | Baseline Rate | Required n/variant |
|---|---|---|
| 5% relative | 50% | ~6,300 |
| 10% relative | 50% | ~1,600 |
| 20% relative | 50% | ~400 |
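Applying the formula with z_α = 1.96 and z_β = 0.84 (a sketch; for a binary metric at a 50% baseline, σ² = 0.5 × 0.5 = 0.25, and a relative MDE converts to an absolute δ):

```python
import math

def ab_sample_size(sigma_sq: float, delta: float,
                   z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Per-variant n = 2 * (z_a + z_b)^2 * sigma^2 / delta^2."""
    n = 2 * (z_alpha + z_beta) ** 2 * sigma_sq / delta ** 2
    return math.ceil(round(n, 6))  # round first to absorb float noise

for relative_mde in (0.05, 0.10, 0.20):
    delta = 0.5 * relative_mde  # absolute effect at a 50% baseline
    print(relative_mde, ab_sample_size(0.25, delta))
```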
### Metrics to Track

Primary metrics:
| Metric | Measurement |
|---|---|
| Task success rate | % of queries with correct/helpful response |
| User satisfaction | Thumbs up/down or 1-5 rating |
| Engagement | Follow-up questions, session length |
Guardrail metrics:
| Metric | Threshold |
|---|---|
| Error rate | < 1% |
| Latency P95 | < 2s |
| Toxicity rate | < 0.1% |
| Cost per query | Within budget |
### Analysis Framework

Statistical test selection:

```
Is the metric binary (success/failure)?
├── Yes → Chi-squared test or Z-test for proportions
└── No
    └── Is the data normally distributed?
        ├── Yes → Two-sample t-test
        └── No → Mann-Whitney U test
```
Interpreting results:

```
p-value < 0.05: Statistically significant

Effect size (Cohen's d):
- Small: 0.2
- Medium: 0.5
- Large: 0.8
```

Decision: Ship if p < 0.05 AND effect size meets threshold AND guardrails pass
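For the binary branch of the tree, a two-proportion z-test can be sketched with just the standard library (scipy.stats provides equivalent, more robust tests):

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 60/100 successes for B's variant A vs 50/100 for B: not significant at 0.05
z, p = two_proportion_z_test(60, 100, 50, 100)
print(round(z, 2), round(p, 3))
```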
## 6. Benchmark Datasets

### General NLP Benchmarks

| Benchmark | Task | Size | Metric |
|---|---|---|---|
| MMLU | Knowledge QA | 14K | Accuracy |
| HellaSwag | Commonsense | 10K | Accuracy |
| TruthfulQA | Factuality | 817 | % Truthful |
| HumanEval | Code generation | 164 | pass@k |
| GSM8K | Math reasoning | 8.5K | Accuracy |
### RAG Benchmarks

| Benchmark | Focus | Metrics |
|---|---|---|
| Natural Questions | Wikipedia QA | EM, F1 |
| HotpotQA | Multi-hop reasoning | EM, F1 |
| MS MARCO | Web search | MRR, Recall |
| BEIR | Zero-shot retrieval | NDCG@10 |
### Creating Custom Benchmarks

Template:

```json
{
  "id": "custom-001",
  "input": "What are the symptoms of diabetes?",
  "expected_output": "Common symptoms include...",
  "metadata": {
    "category": "medical",
    "difficulty": "easy",
    "source": "internal docs"
  },
  "evaluation": {
    "type": "semantic_similarity",
    "threshold": 0.85
  }
}
```
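A minimal loader that checks entries against this template's shape (a sketch; the key names follow the template above, and the similarity scoring itself is assumed to live elsewhere):

```python
import json

REQUIRED_KEYS = {"id", "input", "expected_output", "metadata", "evaluation"}

def load_benchmark_entry(raw: str) -> dict:
    """Parse one benchmark entry and validate it against the template's shape."""
    entry = json.loads(raw)
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    threshold = entry["evaluation"].get("threshold", 0.0)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return entry

entry = load_benchmark_entry("""
{"id": "custom-001",
 "input": "What are the symptoms of diabetes?",
 "expected_output": "Common symptoms include...",
 "metadata": {"category": "medical", "difficulty": "easy", "source": "internal docs"},
 "evaluation": {"type": "semantic_similarity", "threshold": 0.85}}
""")
print(entry["evaluation"]["type"])  # semantic_similarity
```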
Best practices:
- Minimum 100 examples per category
- Include edge cases (10-20%)
- Balance difficulty levels
- Version control your benchmark
- Update quarterly
## 7. Evaluation Pipeline Design

### Automated Evaluation Pipeline

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│   LLM API   │────▶│   Output    │
│   Version   │     │             │     │   Storage   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                          ┌────────────────────┘
                          ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Metrics   │◀────│  Evaluator  │◀────│  Benchmark  │
│  Dashboard  │     │   Service   │     │   Dataset   │
└─────────────┘     └─────────────┘     └─────────────┘
```
### Implementation Checklist

```
□ Define success metrics
  □ Primary metric (what you're optimizing)
  □ Guardrail metrics (what must not regress)
  □ Monitoring metrics (operational health)
□ Create benchmark dataset
  □ Representative samples from production
  □ Edge cases and failure modes
  □ Golden answers or human labels
□ Set up evaluation infrastructure
  □ Automated scoring pipeline
  □ Version control for prompts
  □ Results tracking and comparison
□ Establish baseline
  □ Run current prompt against benchmark
  □ Document scores for all metrics
  □ Set improvement targets
□ Run experiments
  □ Test one change at a time
  □ Use statistical significance testing
  □ Check all guardrail metrics
□ Deploy and monitor
  □ Gradual rollout (canary)
  □ Real-time metric monitoring
  □ Rollback plan if regression
```
### Quick Reference: Metric Selection

| Use Case | Primary Metric | Secondary Metrics |
|---|---|---|
| Summarization | ROUGE-L | BERTScore, Compression ratio |
| Translation | BLEU | chrF, Human pref |
| QA (extractive) | Exact Match, F1 | |
| QA (generative) | BERTScore | Faithfulness, Relevance |
| Code generation | pass@k | Syntax errors |
| Classification | Accuracy, F1 | Precision, Recall |
| RAG | Faithfulness | Context relevance, MRR |
| Open-ended chat | Human eval | Helpfulness, Safety |