
# RAG Evaluation Framework
## Overview
Evaluating Retrieval-Augmented Generation (RAG) systems requires a comprehensive approach that measures both retrieval quality and generation performance. This framework provides methodologies, metrics, and tools for systematic RAG evaluation across different stages of the pipeline.
## Evaluation Dimensions
### 1. Retrieval Quality (Information Retrieval Metrics)
**Precision@K**: Fraction of retrieved documents that are relevant
- Formula: `Precision@K = Relevant Retrieved@K / K`
- Use Case: Measuring result quality at different cutoff points
- Target Values: >0.7 for K=1, >0.5 for K=5, >0.3 for K=10
**Recall@K**: Fraction of relevant documents that are retrieved
- Formula: `Recall@K = Relevant Retrieved@K / Total Relevant`
- Use Case: Measuring coverage of relevant information
- Target Values: >0.8 for K=10, >0.9 for K=20
**Mean Reciprocal Rank (MRR)**: Average reciprocal rank of first relevant result
- Formula: `MRR = (1/Q) × Σ(1/rank_i)` where rank_i is position of first relevant result
- Use Case: Measuring how quickly users find relevant information
- Target Values: >0.6 for good systems, >0.8 for excellent systems
**Normalized Discounted Cumulative Gain (NDCG@K)**: Position-aware relevance metric
- Formula: `NDCG@K = DCG@K / IDCG@K`
- Use Case: Penalizing relevant documents that appear lower in rankings
- Target Values: >0.7 for K=5, >0.6 for K=10
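The four retrieval metrics above can be sketched in plain Python. Here `retrieved` is a ranked list of document IDs, `relevant` is a set of judged-relevant IDs, and graded gains for NDCG arrive as a `{doc_id: gain}` map; all names are illustrative:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@K with graded relevance given as {doc_id: gain}."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

The `i + 2` in the discount term reflects the convention that rank 1 is discounted by `log2(2) = 1`.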
### 2. Generation Quality (RAG-Specific Metrics)
**Faithfulness**: How well the generated answer is grounded in retrieved context
- Measurement: NLI-based entailment scoring, fact verification
- Implementation: Check if each claim in answer is supported by context
- Target Values: >0.95 for factual systems, >0.85 for general applications
**Answer Relevance**: How well the generated answer addresses the original question
- Measurement: Semantic similarity between question and answer
- Implementation: Embedding similarity, keyword overlap, LLM-as-judge
- Target Values: >0.8 for focused answers, >0.7 for comprehensive responses
**Context Relevance**: How relevant the retrieved context is to the question
- Measurement: Relevance scoring of each retrieved chunk
- Implementation: Question-context similarity, manual annotation
- Target Values: >0.7 for average relevance of top-5 chunks
**Context Precision**: Fraction of relevant sentences in retrieved context
- Measurement: Sentence-level relevance annotation
- Implementation: Binary classification of each sentence's relevance
- Target Values: >0.6 for efficient context usage
**Context Recall**: Coverage of necessary information for answering the question
- Measurement: Whether all required facts are present in context
- Implementation: Expert annotation or automated fact extraction
- Target Values: >0.8 for comprehensive coverage
### 3. End-to-End Quality
**Correctness**: Factual accuracy of the generated answer
- Measurement: Expert evaluation, automated fact-checking
- Implementation: Compare against ground truth, verify claims
- Scoring: Binary (correct/incorrect) or scaled (1-5)
**Completeness**: Whether the answer addresses all aspects of the question
- Measurement: Coverage of question components
- Implementation: Aspect-based evaluation, expert annotation
- Scoring: Fraction of question aspects covered
**Helpfulness**: Overall utility of the response to the user
- Measurement: User ratings, task completion rates
- Implementation: Human evaluation, A/B testing
- Scoring: 1-5 Likert scale or thumbs up/down
## Evaluation Methodologies
### 1. Offline Evaluation
**Dataset Requirements**:
- Diverse query set (100+ queries for statistical significance)
- Ground truth relevance judgments
- Reference answers (for generation evaluation)
- Representative document corpus
**Evaluation Pipeline**:
1. Query Processing: Standardize query format and preprocessing
2. Retrieval Execution: Run retrieval with consistent parameters
3. Generation Execution: Generate answers using retrieved context
4. Metric Calculation: Compute all relevant metrics
5. Statistical Analysis: Significance testing, confidence intervals
**Best Practices**:
- Stratify queries by type (factual, analytical, conversational)
- Include edge cases (ambiguous queries, no-answer situations)
- Use multiple annotators with inter-rater agreement analysis
- Regular re-evaluation as the system evolves
### 2. Online Evaluation (A/B Testing)
**Metrics to Track**:
- User engagement: Click-through rates, time on page
- User satisfaction: Explicit ratings, implicit feedback
- Task completion: Success rates for specific user goals
- System performance: Latency, error rates
**Experimental Design**:
- Randomized assignment to treatment/control groups
- Sufficient sample size (typically 1000+ users per group)
- Adequate run duration (typically 1-4 weeks for stable results)
- Proper randomization and bias mitigation
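As a sketch of the significance check behind such an experiment, a two-proportion z-test on a conversion-style metric (e.g. thumbs-up rate per group) can be written with the standard library alone; the function name and arguments are illustrative:

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in success rates between
    control (a) and treatment (b)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With 1000+ users per group, a ~6-point lift in a rate near 50% is typically significant at the 0.05 level under this test.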
### 3. Human Evaluation
**Evaluation Aspects**:
- Factual Accuracy: Is the information correct?
- Relevance: Does the answer address the question?
- Completeness: Are all aspects covered?
- Clarity: Is the answer easy to understand?
- Conciseness: Is the answer appropriately brief?
**Annotation Guidelines**:
- Clear scoring rubrics (e.g., 1-5 scales with examples)
- Multiple annotators per sample (typically 3-5)
- Training and calibration sessions
- Regular quality checks and inter-rater agreement
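Inter-rater agreement on categorical labels is commonly summarized with Cohen's kappa; a minimal sketch for two annotators labeling the same items:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels on the same items:
    observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

Kappa is 1.0 for perfect agreement and 0.0 when agreement is no better than chance; values above roughly 0.6 are usually treated as acceptable for relevance judgments.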
## Implementation Framework
### 1. Automated Evaluation Pipeline
```python
class RAGEvaluator:
    def __init__(self, retriever, generator, metrics_config):
        self.retriever = retriever
        self.generator = generator
        self.metrics = self._initialize_metrics(metrics_config)

    def evaluate_query(self, query, ground_truth):
        # Retrieval evaluation
        retrieved_docs = self.retriever.search(query)
        retrieval_metrics = self.evaluate_retrieval(
            retrieved_docs, ground_truth['relevant_docs']
        )

        # Generation evaluation
        generated_answer = self.generator.generate(query, retrieved_docs)
        generation_metrics = self.evaluate_generation(
            query, generated_answer, retrieved_docs, ground_truth['answer']
        )

        return {**retrieval_metrics, **generation_metrics}
```
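A hypothetical aggregation step over the per-query metric dicts might compute corpus-level means with a normal-approximation 95% interval (function and field names are illustrative):

```python
from statistics import mean, stdev

def aggregate_metrics(per_query_results):
    """Collapse a list of per-query metric dicts into a per-metric
    mean plus an approximate 95% confidence interval."""
    summary = {}
    for key in per_query_results[0]:
        values = [r[key] for r in per_query_results]
        m = mean(values)
        # Normal approximation; reasonable for 100+ queries
        half = (1.96 * stdev(values) / len(values) ** 0.5
                if len(values) > 1 else 0.0)
        summary[key] = {'mean': m, 'ci95': (m - half, m + half)}
    return summary
```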
### 2. Metric Implementations
**Faithfulness Score**:
```python
def calculate_faithfulness(answer, context):
    # Split the answer into individual claims.
    # extract_claims and is_supported_by_context are pluggable helpers,
    # e.g. sentence splitting plus NLI-based entailment scoring.
    claims = extract_claims(answer)

    # Count claims that the retrieved context supports
    faithful_claims = 0
    for claim in claims:
        if is_supported_by_context(claim, context):
            faithful_claims += 1

    return faithful_claims / len(claims) if claims else 0.0
```
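`is_supported_by_context` is left abstract in the faithfulness calculation; in practice it is usually an NLI- or LLM-based entailment check. Purely as a hedged placeholder, a crude lexical-overlap version could look like this (stopword list and threshold are illustrative):

```python
def is_supported_by_context(claim, context, threshold=0.6):
    """Crude lexical proxy for entailment: a claim counts as supported
    when enough of its content words appear in the context."""
    stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'were',
                 'of', 'to', 'in', 'and', 'or'}
    claim_words = {w for w in claim.lower().split() if w not in stopwords}
    if not claim_words:
        return False
    context_words = set(context.lower().split())
    overlap = len(claim_words & context_words) / len(claim_words)
    return overlap >= threshold
```

This proxy misses paraphrase and negation, which is exactly why the framework above recommends entailment-based scoring for factual systems.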
**Context Relevance Score**:
```python
from statistics import mean

def calculate_context_relevance(query, contexts, k=5):
    # embedding_similarity is a pluggable helper, e.g. cosine
    # similarity of sentence embeddings for (query, context)
    relevance_scores = [embedding_similarity(query, context)
                        for context in contexts]
    return {
        'average_relevance': mean(relevance_scores),
        'top_k_relevance': mean(relevance_scores[:k]),
        'relevance_distribution': relevance_scores,
    }
```
### 3. Evaluation Dataset Creation
**Query Collection Strategies**:
1. **User Log Analysis**: Extract real user queries from production systems
2. **Expert Generation**: Domain experts create representative queries
3. **Synthetic Generation**: LLM-generated queries based on document content
4. **Community Sourcing**: Crowdsourced query collection
**Ground Truth Creation**:
1. **Document Relevance**: Expert annotation of relevant documents per query
2. **Answer Creation**: Expert-written reference answers
3. **Aspect Annotation**: Mark which aspects of complex questions are addressed
4. **Quality Control**: Multiple annotators with disagreement resolution
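A hypothetical evaluation record tying these pieces together might look like the following; all field names and values are illustrative, not a standard schema:

```python
# One ground-truth record for the evaluation set (illustrative schema)
record = {
    "query_id": "q-0042",
    "query": "What retention period does the policy require for audit logs?",
    "query_type": "factual",                    # used for stratified reporting
    "relevant_docs": ["doc-117", "doc-208"],    # binary or graded judgments
    "reference_answer": "Audit logs must be retained for seven years.",
    "required_facts": ["seven-year retention", "applies to audit logs"],
    "annotators": ["ann-1", "ann-2", "ann-3"],  # for agreement analysis
}
```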
## Evaluation Datasets and Benchmarks
### 1. General Domain Benchmarks
**MS MARCO**: Large-scale reading comprehension dataset
- 100K real user queries from Bing search
- Passage-level and document-level evaluation
- Both retrieval and generation evaluation supported
**Natural Questions**: Google search queries with Wikipedia answers
- 307K training examples, 8K development examples
- Natural language questions from real users
- Both short and long answer evaluation
**SQuAD 2.0**: Reading comprehension with unanswerable questions
- 150K question-answer pairs
- Includes questions that cannot be answered from context
- Tests system's ability to recognize unanswerable queries
### 2. Domain-Specific Benchmarks
**TREC-COVID**: Scientific literature search
- 50 queries on COVID-19 research topics
- 171K scientific papers as corpus
- Expert relevance judgments
**FiQA**: Financial question answering
- 648 questions from financial forums
- 57K financial forum posts as corpus
- Domain-specific terminology and concepts
**BioASQ**: Biomedical semantic indexing and question answering
- 3K biomedical questions
- PubMed abstracts as corpus
- Expert physician annotations
### 3. Multilingual Benchmarks
**Mr. TyDi**: Multilingual question answering
- 11 languages including Arabic, Bengali, Korean
- Wikipedia passages in each language
- Cultural and linguistic diversity testing
**MLQA**: Cross-lingual question answering
- Questions in one language, answers in another
- 7 languages with all pair combinations
- Tests multilingual retrieval capabilities
## Continuous Evaluation Framework
### 1. Monitoring Pipeline
**Real-time Metrics**:
- System latency (p50, p95, p99)
- Error rates and failure modes
- User satisfaction scores
- Query volume and patterns
**Batch Evaluation**:
- Weekly/monthly evaluation on test sets
- Performance trend analysis
- Regression detection
- Model drift monitoring
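Regression detection against a stored baseline run can be as simple as an absolute-drop threshold per metric; a minimal sketch, with an illustrative default tolerance:

```python
def detect_regressions(baseline, current, tolerance=0.02):
    """Flag metrics that dropped more than `tolerance` (absolute)
    below the stored baseline evaluation run."""
    regressions = {}
    for metric, base in baseline.items():
        value = current.get(metric, 0.0)  # missing metric counts as a drop
        if value < base - tolerance:
            regressions[metric] = {'baseline': base, 'current': value,
                                   'delta': value - base}
    return regressions
```

Flagged metrics would then feed the alerting described under Operational Integration rather than failing the build outright, since small fluctuations are expected between batches.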
### 2. Quality Assurance
**Automated Quality Checks**:
- Hallucination detection
- Toxicity and bias screening
- Factual consistency verification
- Output format validation
**Human Review Process**:
- Random sampling of responses (1-5% of production queries)
- Expert review of edge cases and failures
- User feedback integration
- Regular calibration of automated metrics
### 3. Performance Optimization
**A/B Testing Framework**:
- Infrastructure for controlled experiments
- Statistical significance testing
- Multi-armed bandit optimization
- Gradual rollout procedures
**Feedback Loop Integration**:
- User feedback incorporation into training data
- Error analysis and root cause identification
- Iterative improvement processes
- Model fine-tuning based on evaluation results
## Tools and Libraries
### 1. Open Source Tools
**RAGAS**: RAG Assessment framework
- Comprehensive metric implementations
- Easy integration with popular RAG frameworks
- Support for both synthetic and human evaluation
**TruEra TruLens**: ML observability for RAG
- Real-time monitoring and evaluation
- Comprehensive metric tracking
- Integration with popular vector databases
**LangSmith**: LangChain evaluation and monitoring
- End-to-end RAG pipeline evaluation
- Human feedback integration
- Performance analytics and debugging
### 2. Commercial Solutions
**Weights & Biases**: ML experiment tracking
- A/B testing infrastructure
- Comprehensive metrics dashboard
- Team collaboration features
**Neptune**: ML metadata store
- Experiment comparison and analysis
- Model performance monitoring
- Integration with popular ML frameworks
**Comet**: ML platform for tracking experiments
- Real-time monitoring
- Model comparison and selection
- Automated report generation
## Best Practices
### 1. Evaluation Design
**Metric Selection**:
- Choose metrics aligned with business objectives
- Use multiple complementary metrics
- Include both automated and human evaluation
- Consider computational cost vs. insight value
**Dataset Preparation**:
- Ensure representative query distribution
- Include edge cases and failure modes
- Maintain high annotation quality
- Regular dataset updates and validation
### 2. Statistical Rigor
**Sample Sizes**:
- Minimum 100 queries for basic evaluation
- 1000+ queries for robust statistical analysis
- Power analysis for A/B testing
- Confidence interval reporting
**Significance Testing**:
- Use appropriate statistical tests (t-tests, Mann-Whitney U)
- Multiple comparison corrections (Bonferroni, FDR)
- Effect size reporting alongside p-values
- Bootstrap confidence intervals for stability
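A percentile-bootstrap interval for a mean metric needs only the standard library; a minimal sketch, where the resample count and seed are arbitrary defaults:

```python
import random
from statistics import mean

def bootstrap_ci(values, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-query metric values."""
    rng = random.Random(seed)
    resampled_means = sorted(
        mean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(num_resamples)
    )
    lo = resampled_means[int((alpha / 2) * num_resamples)]
    hi = resampled_means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi
```

Bootstrapping makes no normality assumption, which suits bounded metrics like Precision@K whose per-query distributions are often skewed.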
### 3. Operational Integration
**Automated Pipelines**:
- Continuous integration/deployment integration
- Automated regression testing
- Performance threshold enforcement
- Alert systems for quality degradation
**Human-in-the-Loop**:
- Regular expert review processes
- User feedback collection and analysis
- Annotation quality control
- Bias detection and mitigation
## Common Pitfalls and Solutions
### 1. Evaluation Bias
**Problem**: Test set not representative of production queries
**Solution**: Continuous test set updates from production data
**Problem**: Annotator bias in relevance judgments
**Solution**: Multiple annotators, clear guidelines, bias training
### 2. Metric Gaming
**Problem**: Optimizing for metrics rather than user satisfaction
**Solution**: Multiple complementary metrics, regular metric validation
**Problem**: Overfitting to evaluation set
**Solution**: Hold-out validation sets, temporal splits
### 3. Scale Challenges
**Problem**: Evaluation becomes too expensive at scale
**Solution**: Sampling strategies, automated metrics, efficient tooling
**Problem**: Human evaluation bottlenecks
**Solution**: Active learning for annotation, LLM-as-judge validation
## Future Directions
### 1. Advanced Metrics
- **Semantic Coherence**: Measuring logical flow in generated answers
- **Factual Consistency**: Cross-document fact verification
- **Personalization Quality**: User-specific relevance assessment
- **Multimodal Evaluation**: Text, image, audio integration metrics
### 2. Automated Evaluation
- **LLM-as-Judge**: Using large language models for quality assessment
- **Adversarial Testing**: Systematic stress testing of RAG systems
- **Causal Evaluation**: Understanding why systems fail
- **Real-time Adaptation**: Dynamic metric adjustment based on context
### 3. Holistic Assessment
- **User Journey Evaluation**: Multi-turn conversation quality
- **Task Success Measurement**: Goal completion rather than single query
- **Temporal Consistency**: Performance stability over time
- **Fairness and Bias**: Systematic bias detection and measurement
## Conclusion
Effective RAG evaluation requires a multi-faceted approach combining automated metrics, human judgment, and continuous monitoring. The key principles are:
1. **Comprehensive Coverage**: Evaluate all pipeline components
2. **Multiple Perspectives**: Combine different evaluation methodologies
3. **Continuous Improvement**: Regular evaluation and iteration
4. **Business Alignment**: Metrics should reflect actual user value
5. **Statistical Rigor**: Proper experimental design and analysis
This framework provides the foundation for building robust, high-quality RAG systems that deliver real value to users while maintaining reliability and trustworthiness.