# RAG Evaluation Framework

## Overview

Evaluating Retrieval-Augmented Generation (RAG) systems requires a comprehensive approach that measures both retrieval quality and generation performance. This framework provides methodologies, metrics, and tools for systematic RAG evaluation across different stages of the pipeline.

## Evaluation Dimensions

### 1. Retrieval Quality (Information Retrieval Metrics)

Minimal reference implementations of the four metrics below are sketched at the end of this subsection.

**Precision@K**: Fraction of retrieved documents that are relevant
- Formula: `Precision@K = Relevant Retrieved@K / K`
- Use Case: Measuring result quality at different cutoff points
- Target Values: >0.7 for K=1, >0.5 for K=5, >0.3 for K=10

**Recall@K**: Fraction of relevant documents that are retrieved
- Formula: `Recall@K = Relevant Retrieved@K / Total Relevant`
- Use Case: Measuring coverage of relevant information
- Target Values: >0.8 for K=10, >0.9 for K=20

**Mean Reciprocal Rank (MRR)**: Average reciprocal rank of the first relevant result
- Formula: `MRR = (1/|Q|) × Σ(1/rank_i)`, where `rank_i` is the position of the first relevant result for query `i`
- Use Case: Measuring how quickly users find relevant information
- Target Values: >0.6 for good systems, >0.8 for excellent systems

**Normalized Discounted Cumulative Gain (NDCG@K)**: Position-aware relevance metric
- Formula: `NDCG@K = DCG@K / IDCG@K`
- Use Case: Penalizing relevant documents that appear lower in rankings
- Target Values: >0.7 for K=5, >0.6 for K=10
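
Below is a minimal sketch of these four metrics in plain Python, assuming binary relevance judgments (graded relevance for NDCG); names and signatures are illustrative rather than a standard API.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean over queries of 1/rank of the first relevant result."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@K with graded relevance given as a dict of doc -> gain."""
    gains = [relevance.get(doc, 0) for doc in retrieved[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```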

### 2. Generation Quality (RAG-Specific Metrics)

**Faithfulness**: How well the generated answer is grounded in the retrieved context
- Measurement: NLI-based entailment scoring, fact verification
- Implementation: Check whether each claim in the answer is supported by the context
- Target Values: >0.95 for factual systems, >0.85 for general applications

**Answer Relevance**: How well the generated answer addresses the original question
- Measurement: Semantic similarity between question and answer
- Implementation: Embedding similarity, keyword overlap, LLM-as-judge (a keyword-overlap baseline is sketched below)
- Target Values: >0.8 for focused answers, >0.7 for comprehensive responses
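
As a lightweight baseline for answer relevance, the sketch below uses Jaccard overlap of word sets; in practice this would be replaced or complemented by embedding similarity or an LLM judge.

```python
import re

def answer_relevance(question, answer):
    """Crude answer-relevance baseline: Jaccard overlap of word sets."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    q_tokens, a_tokens = tokenize(question), tokenize(answer)
    if not q_tokens or not a_tokens:
        return 0.0
    return len(q_tokens & a_tokens) / len(q_tokens | a_tokens)
```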

**Context Relevance**: How relevant the retrieved context is to the question
- Measurement: Relevance scoring of each retrieved chunk
- Implementation: Question-context similarity, manual annotation
- Target Values: >0.7 average relevance across the top-5 chunks

**Context Precision**: Fraction of relevant sentences in the retrieved context
- Measurement: Sentence-level relevance annotation
- Implementation: Binary classification of each sentence's relevance
- Target Values: >0.6 for efficient context usage

**Context Recall**: Coverage of the information needed to answer the question
- Measurement: Whether all required facts are present in the context
- Implementation: Expert annotation or automated fact extraction
- Target Values: >0.8 for comprehensive coverage

### 3. End-to-End Quality

**Correctness**: Factual accuracy of the generated answer
- Measurement: Expert evaluation, automated fact-checking
- Implementation: Compare against ground truth, verify claims
- Scoring: Binary (correct/incorrect) or scaled (1-5)

**Completeness**: Whether the answer addresses all aspects of the question
- Measurement: Coverage of question components
- Implementation: Aspect-based evaluation, expert annotation
- Scoring: Fraction of question aspects covered

**Helpfulness**: Overall utility of the response to the user
- Measurement: User ratings, task completion rates
- Implementation: Human evaluation, A/B testing
- Scoring: 1-5 Likert scale or thumbs up/down

## Evaluation Methodologies

### 1. Offline Evaluation

**Dataset Requirements**:
- Diverse query set (100+ queries for statistical significance)
- Ground-truth relevance judgments
- Reference answers (for generation evaluation)
- Representative document corpus

**Evaluation Pipeline** (a driver sketch follows this list):
1. Query Processing: Standardize query format and preprocessing
2. Retrieval Execution: Run retrieval with consistent parameters
3. Generation Execution: Generate answers using the retrieved context
4. Metric Calculation: Compute all relevant metrics
5. Statistical Analysis: Significance testing, confidence intervals
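
A minimal driver for this pipeline might look like the following; `evaluator.evaluate_query` is assumed to return a dict of metric values per query (the `RAGEvaluator` sketched under Implementation Framework is one such interface).

```python
from collections import defaultdict
from statistics import mean, stdev

def run_offline_evaluation(evaluator, dataset):
    """Run evaluation over a query set and aggregate per-metric statistics.

    `dataset` is assumed to be a list of dicts with 'query' and
    'ground_truth' keys.
    """
    per_metric = defaultdict(list)
    for example in dataset:
        scores = evaluator.evaluate_query(example["query"], example["ground_truth"])
        for metric, value in scores.items():
            per_metric[metric].append(value)

    # Summarize each metric with mean and standard deviation.
    return {
        metric: {
            "mean": mean(values),
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
        for metric, values in per_metric.items()
    }
```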

**Best Practices**:
- Stratify queries by type (factual, analytical, conversational)
- Include edge cases (ambiguous queries, no-answer situations)
- Use multiple annotators with inter-rater agreement analysis
- Re-evaluate regularly as the system evolves

### 2. Online Evaluation (A/B Testing)

**Metrics to Track**:
- User engagement: Click-through rates, time on page
- User satisfaction: Explicit ratings, implicit feedback
- Task completion: Success rates for specific user goals
- System performance: Latency, error rates

**Experimental Design** (a significance-test sketch follows this list):
- Randomized assignment to treatment/control groups
- Sufficient sample size (typically 1000+ users per group)
- Adequate duration (1-4 weeks for stable results)
- Proper randomization and bias mitigation
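
For binary outcomes such as task completion, a two-proportion z-test is a common significance check; a minimal sketch, assuming success counts and group sizes for control (A) and treatment (B):

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test under the normal approximation."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: 520/1000 successes in control vs. 565/1000 in treatment.
z, p = two_proportion_z_test(520, 1000, 565, 1000)
```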

### 3. Human Evaluation

**Evaluation Aspects**:
- Factual Accuracy: Is the information correct?
- Relevance: Does the answer address the question?
- Completeness: Are all aspects covered?
- Clarity: Is the answer easy to understand?
- Conciseness: Is the answer appropriately brief?

**Annotation Guidelines** (an agreement-metric sketch follows this list):
- Clear scoring rubrics (e.g., 1-5 scales with examples)
- Multiple annotators per sample (typically 3-5)
- Training and calibration sessions
- Regular quality checks and inter-rater agreement
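
Cohen's kappa is a standard inter-rater agreement measure for two annotators; a minimal sketch, assuming categorical labels over the same samples:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```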

## Implementation Framework

### 1. Automated Evaluation Pipeline

```python
class RAGEvaluator:
    def __init__(self, retriever, generator, metrics_config):
        self.retriever = retriever
        self.generator = generator
        self.metrics = self._initialize_metrics(metrics_config)

    def evaluate_query(self, query, ground_truth):
        # Retrieval evaluation: score the ranked documents against
        # the labeled relevant set.
        retrieved_docs = self.retriever.search(query)
        retrieval_metrics = self.evaluate_retrieval(
            retrieved_docs, ground_truth['relevant_docs']
        )

        # Generation evaluation: generate an answer from the retrieved
        # context and score it against the reference answer.
        generated_answer = self.generator.generate(query, retrieved_docs)
        generation_metrics = self.evaluate_generation(
            query, generated_answer, retrieved_docs, ground_truth['answer']
        )

        return {**retrieval_metrics, **generation_metrics}
```

### 2. Metric Implementations

**Faithfulness Score**:
```python
def calculate_faithfulness(answer, context):
    # Split the answer into atomic claims; `extract_claims` is a hook
    # for a sentence/claim splitter.
    claims = extract_claims(answer)

    # Check each claim against the context; `is_supported_by_context`
    # is a hook for an NLI/entailment check.
    faithful_claims = 0
    for claim in claims:
        if is_supported_by_context(claim, context):
            faithful_claims += 1

    return faithful_claims / len(claims) if claims else 0
```

**Context Relevance Score**:
```python
from statistics import mean

def calculate_context_relevance(query, contexts, k=5):
    # Score each retrieved chunk; `embedding_similarity` is a hook for
    # any query-context similarity function. `contexts` is assumed to
    # be in retrieval rank order.
    relevance_scores = []
    for context in contexts:
        similarity = embedding_similarity(query, context)
        relevance_scores.append(similarity)

    return {
        'average_relevance': mean(relevance_scores),
        'top_k_relevance': mean(relevance_scores[:k]),
        'relevance_distribution': relevance_scores
    }
```

### 3. Evaluation Dataset Creation

**Query Collection Strategies**:
1. **User Log Analysis**: Extract real user queries from production systems
2. **Expert Generation**: Domain experts create representative queries
3. **Synthetic Generation**: LLM-generated queries based on document content
4. **Community Sourcing**: Crowdsourced query collection

**Ground Truth Creation** (an example record format follows this list):
1. **Document Relevance**: Expert annotation of relevant documents per query
2. **Answer Creation**: Expert-written reference answers
3. **Aspect Annotation**: Mark which aspects of complex questions are addressed
4. **Quality Control**: Multiple annotators with disagreement resolution
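
A convenient storage format is one record per query; the field names below are illustrative, not a prescribed schema.

```python
# Illustrative ground-truth record for a single evaluation query.
example_record = {
    "query_id": "q-0042",
    "query": "What are the side effects of drug X?",
    "query_type": "factual",                  # stratification label
    "relevant_docs": ["doc-17", "doc-203"],   # expert relevance judgments
    "reference_answer": "Reported side effects include ...",
    "aspects": ["side effects", "severity"],  # components a complete answer covers
    "annotators": ["a1", "a2", "a3"],         # provenance for quality control
}
```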

## Evaluation Datasets and Benchmarks

### 1. General Domain Benchmarks

**MS MARCO**: Large-scale reading comprehension dataset
- 100K real user queries from Bing search
- Passage-level and document-level evaluation
- Supports both retrieval and generation evaluation

**Natural Questions**: Google search queries with Wikipedia answers
- 307K training examples, 8K development examples
- Natural language questions from real users
- Both short and long answer evaluation

**SQuAD 2.0**: Reading comprehension with unanswerable questions
- 150K question-answer pairs
- Includes questions that cannot be answered from the context
- Tests a system's ability to recognize unanswerable queries

### 2. Domain-Specific Benchmarks

**TREC-COVID**: Scientific literature search
- 50 queries on COVID-19 research topics
- 171K scientific papers as the corpus
- Expert relevance judgments

**FiQA**: Financial question answering
- 648 questions from financial forums
- 57K financial forum posts as the corpus
- Domain-specific terminology and concepts

**BioASQ**: Biomedical semantic indexing and question answering
- 3K biomedical questions
- PubMed abstracts as the corpus
- Expert physician annotations

### 3. Multilingual Benchmarks

**Mr. TyDi**: Multilingual question answering
- 11 languages, including Arabic, Bengali, and Korean
- Wikipedia passages in each language
- Cultural and linguistic diversity testing

**MLQA**: Cross-lingual question answering
- Questions in one language, answers in another
- 7 languages with all pair combinations
- Tests multilingual retrieval capabilities

## Continuous Evaluation Framework

### 1. Monitoring Pipeline

**Real-time Metrics**:
- System latency (p50, p95, p99)
- Error rates and failure modes
- User satisfaction scores
- Query volume and patterns

**Batch Evaluation**:
- Weekly/monthly evaluation on test sets
- Performance trend analysis
- Regression detection
- Model drift monitoring

### 2. Quality Assurance

**Automated Quality Checks**:
- Hallucination detection
- Toxicity and bias screening
- Factual consistency verification
- Output format validation

**Human Review Process**:
- Random sampling of responses (1-5% of production queries)
- Expert review of edge cases and failures
- User feedback integration
- Regular calibration of automated metrics

### 3. Performance Optimization

**A/B Testing Framework**:
- Infrastructure for controlled experiments
- Statistical significance testing
- Multi-armed bandit optimization (a minimal sketch follows this list)
- Gradual rollout procedures
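
For the bandit item above, epsilon-greedy is the simplest allocation policy: route a small fraction of traffic to exploration and the rest to the best-performing variant. A sketch, with illustrative names:

```python
import random

def epsilon_greedy_select(variant_stats, epsilon=0.1, rng=random):
    """Pick a variant: explore with probability epsilon, else exploit.

    `variant_stats` maps variant name -> (successes, trials).
    """
    if rng.random() < epsilon:
        return rng.choice(list(variant_stats))

    def success_rate(name):
        successes, trials = variant_stats[name]
        return successes / trials if trials else 0.0

    return max(variant_stats, key=success_rate)
```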

**Feedback Loop Integration**:
- User feedback incorporation into training data
- Error analysis and root cause identification
- Iterative improvement processes
- Model fine-tuning based on evaluation results

## Tools and Libraries

### 1. Open Source Tools

**RAGAS**: RAG Assessment framework
- Comprehensive metric implementations
- Easy integration with popular RAG frameworks
- Support for both synthetic and human evaluation

**TruLens** (TruEra): ML observability for RAG
- Real-time monitoring and evaluation
- Comprehensive metric tracking
- Integration with popular vector databases

**LangSmith**: LangChain evaluation and monitoring
- End-to-end RAG pipeline evaluation
- Human feedback integration
- Performance analytics and debugging

### 2. Commercial Solutions

**Weights & Biases**: ML experiment tracking
- A/B testing infrastructure
- Comprehensive metrics dashboard
- Team collaboration features

**Neptune**: ML metadata store
- Experiment comparison and analysis
- Model performance monitoring
- Integration with popular ML frameworks

**Comet**: ML platform for tracking experiments
- Real-time monitoring
- Model comparison and selection
- Automated report generation

## Best Practices

### 1. Evaluation Design

**Metric Selection**:
- Choose metrics aligned with business objectives
- Use multiple complementary metrics
- Include both automated and human evaluation
- Consider computational cost vs. insight value

**Dataset Preparation**:
- Ensure a representative query distribution
- Include edge cases and failure modes
- Maintain high annotation quality
- Update and validate datasets regularly

### 2. Statistical Rigor

**Sample Sizes**:
- Minimum 100 queries for basic evaluation
- 1000+ queries for robust statistical analysis
- Power analysis for A/B testing
- Confidence interval reporting

**Significance Testing** (a bootstrap sketch follows this list):
- Use appropriate statistical tests (t-tests, Mann-Whitney U)
- Multiple-comparison corrections (Bonferroni, FDR)
- Effect size reporting alongside p-values
- Bootstrap confidence intervals for stability
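
A minimal sketch of a percentile-bootstrap confidence interval for a metric's mean, assuming per-query scores are available:

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    Resamples the per-query scores with replacement and takes the
    alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```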

### 3. Operational Integration

**Automated Pipelines** (a threshold-gate sketch follows this list):
- Integration with CI/CD pipelines
- Automated regression testing
- Performance threshold enforcement
- Alert systems for quality degradation
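
Threshold enforcement can run as a CI gate that fails the build when any metric drops below its floor; the metric names and floors below are illustrative.

```python
import sys

# Illustrative quality floors; tune to your system's baselines.
THRESHOLDS = {
    "precision@5": 0.5,
    "faithfulness": 0.85,
    "answer_relevance": 0.7,
}

def enforce_thresholds(metrics):
    """Exit non-zero if any metric falls below its floor."""
    failed = False
    for name, floor in THRESHOLDS.items():
        value = metrics.get(name, 0.0)
        if value < floor:
            print(f"FAIL {name}: {value:.3f} < {floor}", file=sys.stderr)
            failed = True
    if failed:
        sys.exit(1)
```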

**Human-in-the-Loop**:
- Regular expert review processes
- User feedback collection and analysis
- Annotation quality control
- Bias detection and mitigation

## Common Pitfalls and Solutions

### 1. Evaluation Bias

**Problem**: Test set not representative of production queries
**Solution**: Continuous test set updates from production data

**Problem**: Annotator bias in relevance judgments
**Solution**: Multiple annotators, clear guidelines, bias training

### 2. Metric Gaming

**Problem**: Optimizing for metrics rather than user satisfaction
**Solution**: Multiple complementary metrics, regular metric validation

**Problem**: Overfitting to the evaluation set
**Solution**: Hold-out validation sets, temporal splits

### 3. Scale Challenges

**Problem**: Evaluation becomes too expensive at scale
**Solution**: Sampling strategies, automated metrics, efficient tooling

**Problem**: Human evaluation bottlenecks
**Solution**: Active learning for annotation, LLM-as-judge validation

## Future Directions

### 1. Advanced Metrics

- **Semantic Coherence**: Measuring logical flow in generated answers
- **Factual Consistency**: Cross-document fact verification
- **Personalization Quality**: User-specific relevance assessment
- **Multimodal Evaluation**: Metrics integrating text, image, and audio

### 2. Automated Evaluation

- **LLM-as-Judge**: Using large language models for quality assessment
- **Adversarial Testing**: Systematic stress testing of RAG systems
- **Causal Evaluation**: Understanding why systems fail
- **Real-time Adaptation**: Dynamic metric adjustment based on context

### 3. Holistic Assessment

- **User Journey Evaluation**: Multi-turn conversation quality
- **Task Success Measurement**: Goal completion rather than single-query success
- **Temporal Consistency**: Performance stability over time
- **Fairness and Bias**: Systematic bias detection and measurement

## Conclusion

Effective RAG evaluation requires a multi-faceted approach combining automated metrics, human judgment, and continuous monitoring. The key principles are:

1. **Comprehensive Coverage**: Evaluate all pipeline components
2. **Multiple Perspectives**: Combine different evaluation methodologies
3. **Continuous Improvement**: Regular evaluation and iteration
4. **Business Alignment**: Metrics should reflect actual user value
5. **Statistical Rigor**: Proper experimental design and analysis

This framework provides the foundation for building robust, high-quality RAG systems that deliver real value to users while maintaining reliability and trustworthiness.