# LLM Evaluation Frameworks

Concrete metrics, scoring methods, comparison tables, and A/B testing frameworks.

## Frameworks Index

1. [Evaluation Metrics Overview](#1-evaluation-metrics-overview)
2. [Text Generation Metrics](#2-text-generation-metrics)
3. [RAG-Specific Metrics](#3-rag-specific-metrics)
4. [Human Evaluation Frameworks](#4-human-evaluation-frameworks)
5. [A/B Testing for Prompts](#5-ab-testing-for-prompts)
6. [Benchmark Datasets](#6-benchmark-datasets)
7. [Evaluation Pipeline Design](#7-evaluation-pipeline-design)

---

## 1. Evaluation Metrics Overview

### Metric Categories

| Category | Metrics | When to Use |
|----------|---------|-------------|
| **Lexical** | BLEU, ROUGE, Exact Match | Reference-based comparison |
| **Semantic** | BERTScore, Embedding similarity | Meaning preservation |
| **Task-specific** | F1, Accuracy, Precision/Recall | Classification, extraction |
| **Quality** | Coherence, Fluency, Relevance | Open-ended generation |
| **Safety** | Toxicity, Bias scores | Content moderation |

### Choosing the Right Metric

```
Is there a single correct answer?
├── Yes → Exact Match or F1
└── No
    └── Is there a reference output?
        ├── Yes → BLEU, ROUGE, or BERTScore
        └── No
            └── Can you define quality criteria?
                ├── Yes → Human evaluation + LLM-as-judge
                └── No → A/B testing with user metrics
```

---

## 2. Text Generation Metrics

### BLEU (Bilingual Evaluation Understudy)

**What it measures:** N-gram overlap between generated and reference text.

**Score range:** 0 to 1 (higher is better)

**Calculation:**

```
BLEU = BP × exp(Σ w_n × log(p_n))

Where:
- BP  = brevity penalty (penalizes short outputs)
- p_n = precision of n-grams
- w_n = weight per n-gram order (uniform 0.25 for BLEU-4)
```

**Interpretation (rough guide; thresholds vary by task and language pair):**

| BLEU Score | Quality |
|------------|---------|
| > 0.6 | Excellent |
| 0.4 - 0.6 | Good |
| 0.2 - 0.4 | Acceptable |
| < 0.2 | Poor |

**Example:**

```
Reference: "The quick brown fox jumps over the lazy dog"
Generated: "A fast brown fox leaps over the lazy dog"

1-gram precision: 6/9 = 0.67 (matched: brown, fox, over, the, lazy, dog)
2-gram precision: 4/8 = 0.50 (matched: brown fox, over the, the lazy, lazy dog)
BLEU-4: ~0.35
```

**Limitations:**
- Doesn't capture meaning (synonyms are penalized)
- Position-independent
- Requires reference text

---

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**What it measures:** Overlap focused on recall (coverage of the reference).

**Variants:**

| Variant | Measures |
|---------|----------|
| ROUGE-1 | Unigram overlap |
| ROUGE-2 | Bigram overlap |
| ROUGE-L | Longest common subsequence |
| ROUGE-Lsum | LCS with sentence-level computation |

**Calculation:**

```
ROUGE-N Recall    = (matching n-grams) / (n-grams in reference)
ROUGE-N Precision = (matching n-grams) / (n-grams in generated)
ROUGE-N F1        = 2 × (Precision × Recall) / (Precision + Recall)
```

**Example:**

```
Reference: "The cat sat on the mat"
Generated: "The cat was sitting on the mat"

ROUGE-1:
  Recall:    5/6 = 0.83 (matched: the, cat, on, the, mat)
  Precision: 5/7 = 0.71
  F1:        0.77

ROUGE-2:
  Recall:    3/5 = 0.60 (matched: "the cat", "on the", "the mat")
  Precision: 3/6 = 0.50
  F1:        0.55
```

**Best for:** Summarization, text compression

---

### BERTScore

**What it measures:** Semantic similarity using contextual embeddings.

**How it works:**
1. Generate BERT embeddings for each token
2. Compute cosine similarity between token pairs
3. Apply greedy matching to find the best alignment
4. Aggregate into Precision, Recall, F1

**Advantages over lexical metrics:**
- Captures synonyms and paraphrases
- Context-aware matching
- Better correlation with human judgment

**Example:**

```
Reference: "The movie was excellent"
Generated: "The film was outstanding"

Lexical (BLEU): Low score (only "The" and "was" match)
BERTScore: High score (semantic meaning preserved)
```
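In practice you rarely compute this by hand; the reference `bert-score` package wraps the embedding, matching, and aggregation steps. A minimal sketch on the example above (assuming the package is installed; it downloads a model on first use):

```python
from bert_score import score  # pip install bert-score

# Parallel lists of candidate and reference strings
P, R, F1 = score(
    ["The film was outstanding"],  # generated
    ["The movie was excellent"],   # reference
    lang="en",                     # selects a default English model
)
print(f"BERTScore F1: {F1.mean().item():.3f}")  # high for paraphrases
```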
**Interpretation:**

| BERTScore F1 | Quality |
|--------------|---------|
| > 0.9 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Acceptable |
| < 0.7 | Review needed |

---

## 3. RAG-Specific Metrics

### Context Relevance

**What it measures:** How relevant the retrieved documents are to the query.

**Calculation methods:**

**Method 1: Embedding similarity**

```python
from sentence_transformers import SentenceTransformer  # any embedding model works
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
relevance = cosine_similarity(
    model.encode([query]),
    model.encode([context]),
)[0][0]
```

**Method 2: LLM-as-judge**

```
Prompt: "Rate the relevance of this context to the question.

Question: {question}
Context: {context}

Rate from 1-5 where 5 is highly relevant."
```

**Target:** > 0.8 for top-k contexts

---

### Answer Faithfulness

**What it measures:** Whether the answer is supported by the context (no hallucination).

**Evaluation prompt:**

```
Given the context and answer, determine if every claim in the
answer is supported by the context.

Context: {context}
Answer: {answer}

For each claim in the answer:
1. Identify the claim
2. Find supporting evidence in context (or mark as unsupported)
3. Rate: Supported / Partially Supported / Not Supported

Overall faithfulness score: [0-1]
```

**Scoring:**

```
Faithfulness = (supported claims) / (total claims)
```

**Target:** > 0.95 for production systems

---

### Retrieval Metrics

| Metric | Formula | What it measures |
|--------|---------|------------------|
| **Precision@k** | (relevant in top-k) / k | Quality of top results |
| **Recall@k** | (relevant in top-k) / (total relevant) | Coverage |
| **MRR** | mean over queries of 1 / (rank of first relevant) | Position of first hit |
| **NDCG@k** | DCG@k / IDCG@k | Ranking quality |

**Example:**

```
Query: "What is photosynthesis?"
Retrieved docs (k=5): [R, N, R, N, R]  (R=relevant, N=not relevant)
Total relevant in corpus: 10

Precision@5 = 3/5 = 0.6
Recall@5    = 3/10 = 0.3
Reciprocal rank (this query) = 1/1 = 1.0 (first doc is relevant)
```

---

## 4. Human Evaluation Frameworks

### Likert Scale Evaluation

**Setup:**

```
Rate the following response on a scale of 1-5:

Response: {generated_response}

Criteria:
- Relevance (1-5): Does it address the question?
- Accuracy (1-5): Is the information correct?
- Fluency (1-5): Is it well-written?
- Helpfulness (1-5): Would this be useful to the user?
```

**Sample size guidance:**

| Confidence Level | Margin of Error | Required Samples |
|-----------------|-----------------|------------------|
| 95% | ±5% | 385 |
| 95% | ±10% | 97 |
| 90% | ±10% | 68 |

---

### Comparative Evaluation (Side-by-Side)

**Setup:**

```
Compare these two responses to the question:

Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response is better?
[ ] A is much better
[ ] A is slightly better
[ ] About the same
[ ] B is slightly better
[ ] B is much better

Why? _______________
```

**Advantages:**
- Easier for humans than absolute scoring
- Reduces calibration issues
- Clear winner for A/B decisions

**Analysis:**

```
Win rate = (A wins + 0.5 × ties) / total

Bradley-Terry model for ranking multiple variants
```
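A minimal sketch of both computations, assuming pairwise judgments have already been tallied (the `bradley_terry` helper is illustrative, not a library function; it uses the standard minorization-maximization update):

```python
import numpy as np

# Two variants: win rate directly
a_wins, b_wins, ties = 57, 38, 25
win_rate_a = (a_wins + 0.5 * ties) / (a_wins + b_wins + ties)  # 0.579

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from wins[i, j] = times variant i beat j.

    Split ties as 0.5 wins to each side before calling. Strengths are
    normalized to sum to 1; higher means stronger.
    """
    n = wins.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        games = wins + wins.T                      # comparisons per pair
        denom = games / (p[:, None] + p[None, :])  # MM denominator terms
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p /= p.sum()
    return p

# Three prompt variants ranked from pairwise judgments (made-up counts)
wins = np.array([[0., 12, 15],
                 [8,  0., 11],
                 [5,  9,  0.]])
print(bradley_terry(wins))  # strength per variant
```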
---

### LLM-as-Judge

**Setup:**

```
You are an expert evaluator. Rate the quality of this response.

Question: {question}
Response: {response}
Reference (if available): {reference}

Evaluate on:
1. Correctness (0-10): Is the information accurate?
2. Completeness (0-10): Does it fully address the question?
3. Clarity (0-10): Is it easy to understand?
4. Conciseness (0-10): Is it appropriately brief?

Provide scores and a brief justification for each.
Overall score (0-10):
```

**Calibration techniques:**
- Include reference responses with known scores
- Use chain-of-thought for reasoning
- Compare against a human baseline periodically

**Known biases:**

| Bias | Mitigation |
|------|------------|
| Position bias | Randomize order |
| Length bias | Normalize or specify length |
| Self-preference | Use a different model as judge |
| Verbosity preference | Penalize unnecessary length |

---

## 5. A/B Testing for Prompts

### Experiment Design

**Hypothesis template:**

```
H0: Prompt A and Prompt B have equal performance on [metric]
H1: Prompt B improves [metric] by at least [minimum detectable effect]
```

**Sample size calculation:**

```
n = 2 × ((z_α + z_β)² × σ²) / δ²

Where:
- z_α = 1.96 for 95% confidence (two-sided)
- z_β = 0.84 for 80% power
- σ   = standard deviation of the metric
- δ   = minimum detectable effect (absolute)
```

**Quick reference** (per the formula above, with a 50% baseline rate so σ² = 0.25):

| MDE | Baseline Rate | Required n/variant |
|-----|---------------|-------------------|
| 5% relative (2.5 pts) | 50% | ~6,300 |
| 10% relative (5 pts) | 50% | ~1,600 |
| 20% relative (10 pts) | 50% | ~400 |

---

### Metrics to Track

**Primary metrics:**

| Metric | Measurement |
|--------|-------------|
| Task success rate | % of queries with a correct/helpful response |
| User satisfaction | Thumbs up/down or 1-5 rating |
| Engagement | Follow-up questions, session length |

**Guardrail metrics:**

| Metric | Threshold |
|--------|-----------|
| Error rate | < 1% |
| Latency P95 | < 2s |
| Toxicity rate | < 0.1% |
| Cost per query | Within budget |

---

### Analysis Framework

**Statistical test selection:**

```
Is the metric binary (success/failure)?
├── Yes → Chi-squared test or Z-test for proportions
└── No
    └── Is the data normally distributed?
        ├── Yes → Two-sample t-test
        └── No → Mann-Whitney U test
```

**Interpreting results:**

```
p-value < 0.05: Statistically significant

Effect size (Cohen's d):
- Small:  0.2
- Medium: 0.5
- Large:  0.8

Decision: Ship if p < 0.05 AND effect size meets threshold
          AND guardrails pass
```

---

## 6. Benchmark Datasets

### General NLP Benchmarks

| Benchmark | Task | Size | Metric |
|-----------|------|------|--------|
| **MMLU** | Knowledge QA | 14K | Accuracy |
| **HellaSwag** | Commonsense | 10K | Accuracy |
| **TruthfulQA** | Factuality | 817 | % Truthful |
| **HumanEval** | Code generation | 164 | pass@k |
| **GSM8K** | Math reasoning | 8.5K | Accuracy |

### RAG Benchmarks

| Benchmark | Focus | Metrics |
|-----------|-------|---------|
| **Natural Questions** | Wikipedia QA | EM, F1 |
| **HotpotQA** | Multi-hop reasoning | EM, F1 |
| **MS MARCO** | Web search | MRR, Recall |
| **BEIR** | Zero-shot retrieval | NDCG@10 |

### Creating Custom Benchmarks

**Template:**

```json
{
  "id": "custom-001",
  "input": "What are the symptoms of diabetes?",
  "expected_output": "Common symptoms include...",
  "metadata": {
    "category": "medical",
    "difficulty": "easy",
    "source": "internal docs"
  },
  "evaluation": {
    "type": "semantic_similarity",
    "threshold": 0.85
  }
}
```

**Best practices** (a scorer sketch for the template above follows this list):
- Minimum 100 examples per category
- Include edge cases (10-20%)
- Balance difficulty levels
- Version control your benchmark
- Update quarterly
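A minimal scorer for cases in this format; a sketch only, assuming `sentence-transformers` embeddings for the `semantic_similarity` type, a hypothetical `benchmark.jsonl` file with one case per line, and a `generate_answer` stand-in for the system under test:

```python
import json
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def passes(case: dict, generated: str) -> bool:
    """Return True if a generated answer clears one benchmark case's bar."""
    evaluation = case["evaluation"]
    if evaluation["type"] == "semantic_similarity":
        similarity = cosine_similarity(
            model.encode([generated]),
            model.encode([case["expected_output"]]),
        )[0][0]
        return similarity >= evaluation["threshold"]
    raise ValueError(f"unsupported evaluation type: {evaluation['type']}")

# Hypothetical usage: cases stored one JSON object per line
with open("benchmark.jsonl") as f:
    cases = [json.loads(line) for line in f]
results = [passes(c, generate_answer(c["input"])) for c in cases]
print(f"pass rate: {sum(results) / len(results):.2%}")
```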
---

## 7. Evaluation Pipeline Design

### Automated Evaluation Pipeline

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│   LLM API   │────▶│   Output    │
│   Version   │     │             │     │   Storage   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                           ┌───────────────────┘
                           ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Metrics   │◀────│  Evaluator  │◀────│  Benchmark  │
│  Dashboard  │     │   Service   │     │   Dataset   │
└─────────────┘     └─────────────┘     └─────────────┘
```

### Implementation Checklist

```
□ Define success metrics
  □ Primary metric (what you're optimizing)
  □ Guardrail metrics (what must not regress)
  □ Monitoring metrics (operational health)

□ Create benchmark dataset
  □ Representative samples from production
  □ Edge cases and failure modes
  □ Golden answers or human labels

□ Set up evaluation infrastructure
  □ Automated scoring pipeline
  □ Version control for prompts
  □ Results tracking and comparison

□ Establish baseline
  □ Run current prompt against benchmark
  □ Document scores for all metrics
  □ Set improvement targets

□ Run experiments
  □ Test one change at a time
  □ Use statistical significance testing (see the sketch at the end of this document)
  □ Check all guardrail metrics

□ Deploy and monitor
  □ Gradual rollout (canary)
  □ Real-time metric monitoring
  □ Rollback plan if metrics regress
```

---

## Quick Reference: Metric Selection

| Use Case | Primary Metric | Secondary Metrics |
|----------|---------------|-------------------|
| Summarization | ROUGE-L | BERTScore, Compression ratio |
| Translation | BLEU | chrF, Human preference |
| QA (extractive) | Exact Match, F1 | |
| QA (generative) | BERTScore | Faithfulness, Relevance |
| Code generation | pass@k | Syntax errors |
| Classification | Accuracy, F1 | Precision, Recall |
| RAG | Faithfulness | Context relevance, MRR |
| Open-ended chat | Human eval | Helpfulness, Safety |
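Finally, the significance-testing step from the checklist above, as a minimal sketch for a binary success metric (the counts are made up for illustration; `scipy` provides the chi-squared test):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical A/B results: [successes, failures] per prompt variant
table = np.array([
    [412, 388],  # Prompt A: 412/800 = 51.5% success
    [455, 345],  # Prompt B: 455/800 = 56.9% success
])

chi2, p_value, dof, expected = chi2_contingency(table)
rate_a, rate_b = table[:, 0] / table.sum(axis=1)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p = {p_value:.4f}")
# Ship B only if p < 0.05, the lift clears the MDE, and guardrails pass
```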