# LLM Evaluation Frameworks

Concrete metrics, scoring methods, comparison tables, and A/B testing frameworks.
## Frameworks Index

- Evaluation Metrics Overview
- Text Generation Metrics
- RAG-Specific Metrics
- Human Evaluation Frameworks
- A/B Testing for Prompts
- Benchmark Datasets
- Evaluation Pipeline Design
## 1. Evaluation Metrics Overview

### Metric Categories

| Category | Metrics | When to Use |
|---|---|---|
| Lexical | BLEU, ROUGE, Exact Match | Reference-based comparison |
| Semantic | BERTScore, Embedding similarity | Meaning preservation |
| Task-specific | F1, Accuracy, Precision/Recall | Classification, extraction |
| Quality | Coherence, Fluency, Relevance | Open-ended generation |
| Safety | Toxicity, Bias scores | Content moderation |
### Choosing the Right Metric

```
Is there a single correct answer?
├── Yes → Exact Match or F1
└── No
    └── Is there a reference output?
        ├── Yes → BLEU, ROUGE, or BERTScore
        └── No
            └── Can you define quality criteria?
                ├── Yes → Human evaluation + LLM-as-judge
                └── No → A/B testing with user metrics
```
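The tree above maps directly onto a small helper; the function name and its boolean flags are illustrative, not part of the accompanying scripts.

```python
def choose_metric(single_correct_answer: bool,
                  has_reference: bool,
                  quality_criteria_defined: bool) -> str:
    """Walk the decision tree above and suggest an evaluation approach."""
    if single_correct_answer:
        return "Exact Match or F1"
    if has_reference:
        return "BLEU, ROUGE, or BERTScore"
    if quality_criteria_defined:
        return "Human evaluation + LLM-as-judge"
    return "A/B testing with user metrics"

print(choose_metric(False, True, False))  # → BLEU, ROUGE, or BERTScore
```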
## 2. Text Generation Metrics

### BLEU (Bilingual Evaluation Understudy)

What it measures: N-gram overlap between generated and reference text.

Score range: 0 to 1 (higher is better)

Calculation:

```
BLEU = BP × exp(Σ wn × log(pn))

Where:
- BP = brevity penalty (penalizes short outputs)
- pn = precision of n-grams
- wn = weight (typically 0.25 for BLEU-4)
```
Interpretation:
| BLEU Score | Quality |
|---|---|
| > 0.6 | Excellent |
| 0.4 - 0.6 | Good |
| 0.2 - 0.4 | Acceptable |
| < 0.2 | Poor |
Example:

```
Reference: "The quick brown fox jumps over the lazy dog"
Generated: "A fast brown fox leaps over the lazy dog"

1-gram precision: 6/9 ≈ 0.67 (matched: brown, fox, over, the, lazy, dog)
2-gram precision: 4/8 = 0.50 (matched: brown fox, over the, the lazy, lazy dog)

BLEU-4: ~0.35
```
Limitations:
- Doesn't capture meaning (synonyms penalized)
- Position-independent
- Requires reference text
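The calculation above can be sketched in a few lines of pure Python (clipped n-gram precision plus the brevity penalty). This is a minimal sketch; production code should use a library such as sacrebleu, which also handles smoothing and tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: candidate counts are capped at reference counts."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(reference, candidate, max_n=4):
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # log undefined; real implementations smooth instead
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(log_avg)

ref = "the quick brown fox jumps over the lazy dog".split()
cand = "a fast brown fox leaps over the lazy dog".split()
print(round(bleu(ref, cand), 2))  # ≈ 0.35, matching the worked example
```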
### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

What it measures: Overlap focused on recall (coverage of the reference).
Variants:
| Variant | Measures |
|---|---|
| ROUGE-1 | Unigram overlap |
| ROUGE-2 | Bigram overlap |
| ROUGE-L | Longest common subsequence |
| ROUGE-Lsum | LCS with sentence-level computation |
Calculation:

```
ROUGE-N Recall = (matching n-grams) / (n-grams in reference)
ROUGE-N Precision = (matching n-grams) / (n-grams in generated)
ROUGE-N F1 = 2 × (Precision × Recall) / (Precision + Recall)
```
Example:

```
Reference: "The cat sat on the mat"
Generated: "The cat was sitting on the mat"

ROUGE-1:
  Recall: 5/6 ≈ 0.83 (matched: the, cat, on, the, mat)
  Precision: 5/7 ≈ 0.71
  F1: 0.77

ROUGE-2:
  Recall: 3/5 = 0.60 (matched: "the cat", "on the", "the mat")
  Precision: 3/6 = 0.50
  F1: 0.55
```
Best for: Summarization, text compression
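The ROUGE-N formulas above as a short sketch (the function name is illustrative; libraries such as rouge-score additionally handle stemming and ROUGE-L).

```python
from collections import Counter

def rouge_n(reference: str, generated: str, n: int):
    """ROUGE-N recall, precision, and F1 from clipped n-gram overlap."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, gen = grams(reference), grams(generated)
    overlap = sum(min(count, gen[g]) for g, count in ref.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

r, p, f = rouge_n("The cat sat on the mat", "The cat was sitting on the mat", 1)
print(round(r, 2), round(p, 2), round(f, 2))  # 0.83 0.71 0.77
```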
### BERTScore

What it measures: Semantic similarity using contextual embeddings.
How it works:
- Generate BERT embeddings for each token
- Compute cosine similarity between token pairs
- Apply greedy matching to find best alignment
- Aggregate into Precision, Recall, F1
Advantages over lexical metrics:
- Captures synonyms and paraphrases
- Context-aware matching
- Better correlation with human judgment
Example:

```
Reference: "The movie was excellent"
Generated: "The film was outstanding"

Lexical (BLEU): Low score (only "The" and "was" match)
BERTScore: High score (semantic meaning preserved)
```
Interpretation:
| BERTScore F1 | Quality |
|---|---|
| > 0.9 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Acceptable |
| < 0.7 | Review needed |
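The matching step can be sketched independently of the model. Real BERTScore uses contextual BERT embeddings (e.g. via the bert-score package); this illustrative sketch assumes per-token embeddings are already available and shows only the greedy-matching aggregation from steps 2–4.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bertscore(cand_embs, ref_embs):
    """Precision/recall/F1 via greedy max-similarity matching of token embeddings."""
    sims = [[cosine(c, r) for r in ref_embs] for c in cand_embs]
    # best reference match per candidate token → precision
    precision = sum(max(row) for row in sims) / len(cand_embs)
    # best candidate match per reference token → recall
    recall = sum(max(row[j] for row in sims) for j in range(len(ref_embs))) / len(ref_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With identical embeddings on both sides all three scores are 1.0; a candidate that covers only half the reference tokens halves the recall.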
## 3. RAG-Specific Metrics

### Context Relevance

What it measures: How relevant retrieved documents are to the query.
Calculation methods:

Method 1: Embedding similarity

```
relevance = cosine_similarity(
    embed(query),
    embed(context)
)
```

Method 2: LLM-as-judge

```
Prompt: "Rate the relevance of this context to the question.
Question: {question}
Context: {context}
Rate from 1-5 where 5 is highly relevant."
```

Target: > 0.8 for top-k contexts
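Method 1 can be sketched as follows, assuming query and context embeddings have already been computed by some embedding model (the function names are illustrative):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def relevance_scores(query_emb, context_embs, target=0.8):
    """Score each retrieved context against the query; flag those below target."""
    scores = [cosine_similarity(query_emb, c) for c in context_embs]
    return [(s, s >= target) for s in scores]

print(relevance_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
# [(1.0, True), (0.0, False)]
```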
### Answer Faithfulness

What it measures: Whether the answer is supported by the context (no hallucination).

Evaluation prompt:

```
Given the context and answer, determine if every claim in the
answer is supported by the context.

Context: {context}
Answer: {answer}

For each claim in the answer:
1. Identify the claim
2. Find supporting evidence in context (or mark as unsupported)
3. Rate: Supported / Partially Supported / Not Supported

Overall faithfulness score: [0-1]
```

Scoring:

```
Faithfulness = (supported claims) / (total claims)
```

Target: > 0.95 for production systems
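The scoring formula above, applied to claim verdicts parsed from the judge's output (a sketch; parsing the judge's free-text response into these labels is assumed to happen elsewhere):

```python
def faithfulness_score(claim_verdicts):
    """Fraction of claims rated 'Supported', per the formula above.

    claim_verdicts: list of labels parsed from the judge's output, e.g.
    ["Supported", "Partially Supported", "Not Supported"].
    """
    if not claim_verdicts:
        return 0.0
    supported = sum(1 for v in claim_verdicts if v == "Supported")
    return supported / len(claim_verdicts)

print(round(faithfulness_score(["Supported", "Supported", "Not Supported"]), 2))  # 0.67
```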
### Retrieval Metrics

| Metric | Formula | What it measures |
|---|---|---|
| Precision@k | (relevant in top-k) / k | Quality of top results |
| Recall@k | (relevant in top-k) / (total relevant) | Coverage |
| MRR | 1 / (rank of first relevant) | Position of first hit |
| NDCG@k | DCG@k / IDCG@k | Ranking quality |
Example:

```
Query: "What is photosynthesis?"
Retrieved docs (k=5): [R, N, R, N, R]  (R=relevant, N=not relevant)
Total relevant in corpus: 10

Precision@5 = 3/5 = 0.6
Recall@5 = 3/10 = 0.3
MRR = 1/1 = 1.0 (first doc is relevant)
```
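The four metrics in the table, as a sketch that reproduces the example's numbers (function names are illustrative):

```python
import math

def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    return sum(relevance[:k]) / total_relevant

def mrr(relevance):
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

def ndcg_at_k(relevance, k, total_relevant):
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
    ideal = [1] * min(k, total_relevant)  # best possible ordering
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

relevance = [1, 0, 1, 0, 1]  # the [R, N, R, N, R] example above
print(precision_at_k(relevance, 5), recall_at_k(relevance, 5, 10), mrr(relevance))
# 0.6 0.3 1.0
```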
## 4. Human Evaluation Frameworks

### Likert Scale Evaluation

Setup:

```
Rate the following response on a scale of 1-5:

Response: {generated_response}

Criteria:
- Relevance (1-5): Does it address the question?
- Accuracy (1-5): Is the information correct?
- Fluency (1-5): Is it well-written?
- Helpfulness (1-5): Would this be useful to the user?
```
Sample size guidance:
| Confidence Level | Margin of Error | Required Samples |
|---|---|---|
| 95% | ±5% | 385 |
| 95% | ±10% | 97 |
| 90% | ±10% | 68 |
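The table's values follow from the standard sample-size formula for estimating a proportion, n = z² × p(1−p) / e², at the worst case p = 0.5 (the formula itself is not stated above, so treat this as the assumed derivation):

```python
import math

def sample_size(confidence_z: float, margin_of_error: float, p: float = 0.5) -> int:
    """Samples needed to estimate a proportion within +/- margin_of_error."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(sample_size(1.96, 0.05))   # 385 (95% confidence, +/-5%)
print(sample_size(1.96, 0.10))   # 97
print(sample_size(1.645, 0.10))  # 68
```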
### Comparative Evaluation (Side-by-Side)

Setup:

```
Compare these two responses to the question:

Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better?
[ ] A is much better
[ ] A is slightly better
[ ] About the same
[ ] B is slightly better
[ ] B is much better

Why? _______________
```
Advantages:
- Easier for humans than absolute scoring
- Reduces calibration issues
- Clear winner for A/B decisions
Analysis:

```
Win rate = (A wins + 0.5 × ties) / total
```

Use a Bradley-Terry model to rank more than two variants.
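The win-rate formula as a minimal sketch:

```python
def win_rate(a_wins: int, b_wins: int, ties: int) -> float:
    """Win rate for variant A, counting each tie as half a win."""
    total = a_wins + b_wins + ties
    return (a_wins + 0.5 * ties) / total if total else 0.0

print(win_rate(55, 35, 10))  # 0.6
```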
### LLM-as-Judge

Setup:

```
You are an expert evaluator. Rate the quality of this response.

Question: {question}
Response: {response}
Reference (if available): {reference}

Evaluate on:
1. Correctness (0-10): Is the information accurate?
2. Completeness (0-10): Does it fully address the question?
3. Clarity (0-10): Is it easy to understand?
4. Conciseness (0-10): Is it appropriately brief?

Provide scores and brief justification for each.
Overall score (0-10):
```
Calibration techniques:
- Include reference responses with known scores
- Use chain-of-thought for reasoning
- Compare against human baseline periodically
Known biases:
| Bias | Mitigation |
|---|---|
| Position bias | Randomize order |
| Length bias | Normalize or specify length |
| Self-preference | Use different model as judge |
| Verbosity preference | Penalize unnecessary length |
## 5. A/B Testing for Prompts

### Experiment Design

Hypothesis template:

```
H0: Prompt A and Prompt B have equal performance on [metric]
H1: Prompt B improves [metric] by at least [minimum detectable effect]
```

Sample size calculation:

```
n = 2 × ((z_α + z_β)² × σ²) / δ²

Where:
- z_α = 1.96 for 95% confidence
- z_β = 0.84 for 80% power
- σ = standard deviation of metric
- δ = minimum detectable effect
```
Quick reference (binary metric at a 50% baseline, so σ² = 0.25):

| MDE | Baseline Rate | Required n/variant |
|---|---|---|
| 5% relative | 50% | ~6,300 |
| 10% relative | 50% | ~1,600 |
| 20% relative | 50% | ~400 |
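Applying the formula with z_α = 1.96 and z_β = 0.84 (a sketch; for a binary metric at a 50% baseline, σ² = 0.5 × 0.5 = 0.25, and a relative MDE converts to an absolute δ):

```python
import math

def ab_sample_size(sigma_sq: float, delta: float,
                   z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Per-variant n = 2 * (z_a + z_b)^2 * sigma^2 / delta^2."""
    n = 2 * (z_alpha + z_beta) ** 2 * sigma_sq / delta ** 2
    return math.ceil(round(n, 6))  # round first to absorb float noise

for relative_mde in (0.05, 0.10, 0.20):
    delta = 0.5 * relative_mde  # absolute effect at a 50% baseline
    print(relative_mde, ab_sample_size(0.25, delta))
```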
### Metrics to Track

Primary metrics:
| Metric | Measurement |
|---|---|
| Task success rate | % of queries with correct/helpful response |
| User satisfaction | Thumbs up/down or 1-5 rating |
| Engagement | Follow-up questions, session length |
Guardrail metrics:
| Metric | Threshold |
|---|---|
| Error rate | < 1% |
| Latency P95 | < 2s |
| Toxicity rate | < 0.1% |
| Cost per query | Within budget |
### Analysis Framework

Statistical test selection:

```
Is the metric binary (success/failure)?
├── Yes → Chi-squared test or Z-test for proportions
└── No
    └── Is the data normally distributed?
        ├── Yes → Two-sample t-test
        └── No → Mann-Whitney U test
```
Interpreting results:

```
p-value < 0.05: Statistically significant

Effect size (Cohen's d):
- Small: 0.2
- Medium: 0.5
- Large: 0.8
```

Decision: Ship if p < 0.05 AND effect size meets threshold AND guardrails pass
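For the binary branch of the tree, a two-proportion z-test can be sketched with just the standard library (scipy.stats provides equivalent, more robust tests):

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 60/100 successes for B's variant A vs 50/100 for B: not significant at 0.05
z, p = two_proportion_z_test(60, 100, 50, 100)
print(round(z, 2), round(p, 3))
```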
## 6. Benchmark Datasets

### General NLP Benchmarks

| Benchmark | Task | Size | Metric |
|---|---|---|---|
| MMLU | Knowledge QA | 14K | Accuracy |
| HellaSwag | Commonsense | 10K | Accuracy |
| TruthfulQA | Factuality | 817 | % Truthful |
| HumanEval | Code generation | 164 | pass@k |
| GSM8K | Math reasoning | 8.5K | Accuracy |
### RAG Benchmarks

| Benchmark | Focus | Metrics |
|---|---|---|
| Natural Questions | Wikipedia QA | EM, F1 |
| HotpotQA | Multi-hop reasoning | EM, F1 |
| MS MARCO | Web search | MRR, Recall |
| BEIR | Zero-shot retrieval | NDCG@10 |
### Creating Custom Benchmarks

Template:

```json
{
  "id": "custom-001",
  "input": "What are the symptoms of diabetes?",
  "expected_output": "Common symptoms include...",
  "metadata": {
    "category": "medical",
    "difficulty": "easy",
    "source": "internal docs"
  },
  "evaluation": {
    "type": "semantic_similarity",
    "threshold": 0.85
  }
}
```
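A minimal loader that checks entries against this template's shape (a sketch; the key names follow the template above, and the similarity scoring itself is assumed to live elsewhere):

```python
import json

REQUIRED_KEYS = {"id", "input", "expected_output", "metadata", "evaluation"}

def load_benchmark_entry(raw: str) -> dict:
    """Parse one benchmark entry and validate it against the template's shape."""
    entry = json.loads(raw)
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    threshold = entry["evaluation"].get("threshold", 0.0)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return entry

entry = load_benchmark_entry("""
{"id": "custom-001",
 "input": "What are the symptoms of diabetes?",
 "expected_output": "Common symptoms include...",
 "metadata": {"category": "medical", "difficulty": "easy", "source": "internal docs"},
 "evaluation": {"type": "semantic_similarity", "threshold": 0.85}}
""")
print(entry["evaluation"]["type"])  # semantic_similarity
```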
Best practices:
- Minimum 100 examples per category
- Include edge cases (10-20%)
- Balance difficulty levels
- Version control your benchmark
- Update quarterly
## 7. Evaluation Pipeline Design

### Automated Evaluation Pipeline

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│   LLM API   │────▶│   Output    │
│   Version   │     │             │     │   Storage   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                          ┌────────────────────┘
                          ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Metrics   │◀────│  Evaluator  │◀────│  Benchmark  │
│  Dashboard  │     │   Service   │     │   Dataset   │
└─────────────┘     └─────────────┘     └─────────────┘
```
### Implementation Checklist

```
□ Define success metrics
  □ Primary metric (what you're optimizing)
  □ Guardrail metrics (what must not regress)
  □ Monitoring metrics (operational health)
□ Create benchmark dataset
  □ Representative samples from production
  □ Edge cases and failure modes
  □ Golden answers or human labels
□ Set up evaluation infrastructure
  □ Automated scoring pipeline
  □ Version control for prompts
  □ Results tracking and comparison
□ Establish baseline
  □ Run current prompt against benchmark
  □ Document scores for all metrics
  □ Set improvement targets
□ Run experiments
  □ Test one change at a time
  □ Use statistical significance testing
  □ Check all guardrail metrics
□ Deploy and monitor
  □ Gradual rollout (canary)
  □ Real-time metric monitoring
  □ Rollback plan if regression
```
### Quick Reference: Metric Selection

| Use Case | Primary Metric | Secondary Metrics |
|---|---|---|
| Summarization | ROUGE-L | BERTScore, Compression ratio |
| Translation | BLEU | chrF, Human pref |
| QA (extractive) | Exact Match, F1 | |
| QA (generative) | BERTScore | Faithfulness, Relevance |
| Code generation | pass@k | Syntax errors |
| Classification | Accuracy, F1 | Precision, Recall |
| RAG | Faithfulness | Context relevance, MRR |
| Open-ended chat | Human eval | Helpfulness, Safety |