fix(skill): rewrite senior-prompt-engineer with unique, actionable content (#91)

Issue #49 feedback implementation.

SKILL.md:
- Added YAML frontmatter with trigger phrases
- Removed marketing language ("world-class", etc.)
- Added Table of Contents
- Converted vague bullets to concrete workflows
- Added input/output examples for all tools

Reference files (all 3 previously 100% identical):
- prompt_engineering_patterns.md: 10 patterns with examples (Zero-Shot, Few-Shot, CoT, Role, Structured Output, etc.)
- llm_evaluation_frameworks.md: 7 sections on metrics (BLEU, ROUGE, BERTScore, RAG metrics, A/B testing)
- agentic_system_design.md: 6 agent architecture sections (ReAct, Plan-Execute, Tool Use, Multi-Agent, Memory)

Python scripts (all 3 previously identical placeholders):
- prompt_optimizer.py: Token counting, clarity analysis, few-shot extraction, optimization suggestions
- rag_evaluator.py: Context relevance, faithfulness, retrieval metrics (Precision@K, MRR, NDCG)
- agent_orchestrator.py: Config parsing, validation, ASCII/Mermaid visualization, cost estimation

Total: 3,571 lines added, 587 deleted
Before: ~785 lines duplicate boilerplate
After: 3,750 lines unique, actionable content

Closes #49

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,226 +1,355 @@
---
name: senior-prompt-engineer
description: This skill should be used when the user asks to "optimize prompts", "design prompt templates", "evaluate LLM outputs", "build agentic systems", "implement RAG", "create few-shot examples", "analyze token usage", or "design AI workflows". Use for prompt engineering patterns, LLM evaluation frameworks, agent architectures, and structured output design.
---

# Senior Prompt Engineer

Prompt engineering patterns, LLM evaluation frameworks, and agentic system design.

## Table of Contents
- [Quick Start](#quick-start)
- [Tools Overview](#tools-overview)
  - [Prompt Optimizer](#1-prompt-optimizer)
  - [RAG Evaluator](#2-rag-evaluator)
  - [Agent Orchestrator](#3-agent-orchestrator)
- [Prompt Engineering Workflows](#prompt-engineering-workflows)
  - [Prompt Optimization Workflow](#prompt-optimization-workflow)
  - [Few-Shot Example Design](#few-shot-example-design-workflow)
  - [Structured Output Design](#structured-output-design-workflow)
- [Reference Documentation](#reference-documentation)
- [Common Patterns Quick Reference](#common-patterns-quick-reference)

---

## Quick Start

### Main Capabilities

```bash
# Analyze and optimize a prompt file
python scripts/prompt_optimizer.py prompts/my_prompt.txt --analyze

# Evaluate RAG retrieval quality
python scripts/rag_evaluator.py --contexts contexts.json --questions questions.json

# Visualize agent workflow from definition
python scripts/agent_orchestrator.py agent_config.yaml --visualize
```

---

## Tools Overview

### 1. Prompt Optimizer

Analyzes prompts for token efficiency, clarity, and structure. Generates optimized versions.

**Input:** Prompt text file or string
**Output:** Analysis report with optimization suggestions

**Usage:**
```bash
# Analyze a prompt file
python scripts/prompt_optimizer.py prompt.txt --analyze

# Output:
# Token count: 847
# Estimated cost: $0.0025 (GPT-4)
# Clarity score: 72/100
# Issues found:
#   - Ambiguous instruction at line 3
#   - Missing output format specification
#   - Redundant context (lines 12-15 repeat lines 5-8)
# Suggestions:
#   1. Add explicit output format: "Respond in JSON with keys: ..."
#   2. Remove redundant context to save 89 tokens
#   3. Clarify "analyze" -> "list the top 3 issues with severity ratings"

# Generate optimized version
python scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt

# Count tokens for cost estimation
python scripts/prompt_optimizer.py prompt.txt --tokens --model gpt-4

# Extract and manage few-shot examples
python scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json
```

---

### 2. RAG Evaluator

Evaluates Retrieval-Augmented Generation quality by measuring context relevance and answer faithfulness.

**Input:** Retrieved contexts (JSON) and questions/answers
**Output:** Evaluation metrics and quality report

**Usage:**
```bash
# Evaluate retrieval quality
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json

# Output:
# === RAG Evaluation Report ===
# Questions evaluated: 50
#
# Retrieval Metrics:
#   Context Relevance: 0.78 (target: >0.80)
#   Retrieval Precision@5: 0.72
#   Coverage: 0.85
#
# Generation Metrics:
#   Answer Faithfulness: 0.91
#   Groundedness: 0.88
#
# Issues Found:
#   - 8 questions had no relevant context in top-5
#   - 3 answers contained information not in context
#
# Recommendations:
#   1. Improve chunking strategy for technical documents
#   2. Add metadata filtering for date-sensitive queries

# Evaluate with custom metrics
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json \
    --metrics relevance,faithfulness,coverage

# Export detailed results
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json \
    --output report.json --verbose
```

---

### 3. Agent Orchestrator

Parses agent definitions and visualizes execution flows. Validates tool configurations.

**Input:** Agent configuration (YAML/JSON)
**Output:** Workflow visualization, validation report

**Usage:**
```bash
# Validate agent configuration
python scripts/agent_orchestrator.py agent.yaml --validate

# Output:
# === Agent Validation Report ===
# Agent: research_assistant
# Pattern: ReAct
#
# Tools (4 registered):
#   [OK]   web_search  - API key configured
#   [OK]   calculator  - No config needed
#   [WARN] file_reader - Missing allowed_paths
#   [OK]   summarizer  - Prompt template valid
#
# Flow Analysis:
#   Max depth: 5 iterations
#   Estimated tokens/run: 2,400-4,800
#   Potential infinite loop: No
#
# Recommendations:
#   1. Add allowed_paths to file_reader for security
#   2. Consider adding early exit condition for simple queries

# Visualize agent workflow (ASCII)
python scripts/agent_orchestrator.py agent.yaml --visualize

# Output:
# ┌─────────────────────────────────────────┐
# │           research_assistant            │
# │            (ReAct Pattern)              │
# └─────────────────┬───────────────────────┘
#                   │
#          ┌────────▼────────┐
#          │   User Query    │
#          └────────┬────────┘
#                   │
#          ┌────────▼────────┐
#          │     Think       │◄──────┐
#          └────────┬────────┘       │
#                   │                │
#          ┌────────▼────────┐       │
#          │   Select Tool   │       │
#          └────────┬────────┘       │
#                   │                │
#     ┌─────────────┼─────────────┐  │
#     ▼             ▼             ▼  │
# [web_search] [calculator] [file_reader]
#     │             │             │  │
#     └─────────────┼─────────────┘  │
#                   │                │
#          ┌────────▼────────┐       │
#          │     Observe     │───────┘
#          └────────┬────────┘
#                   │
#          ┌────────▼────────┐
#          │  Final Answer   │
#          └─────────────────┘

# Export workflow as Mermaid diagram
python scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
```

---

## Prompt Engineering Workflows

### Prompt Optimization Workflow

Use when improving an existing prompt's performance or reducing token costs.

**Step 1: Baseline current prompt**
```bash
python scripts/prompt_optimizer.py current_prompt.txt --analyze --output baseline.json
```

**Step 2: Identify issues**
Review the analysis report for:
- Token waste (redundant instructions, verbose examples)
- Ambiguous instructions (unclear output format, vague verbs)
- Missing constraints (no length limits, no format specification)

**Step 3: Apply optimization patterns**

| Issue | Pattern to Apply |
|-------|------------------|
| Ambiguous output | Add explicit format specification |
| Too verbose | Extract to few-shot examples |
| Inconsistent results | Add role/persona framing |
| Missing edge cases | Add constraint boundaries |

**Step 4: Generate optimized version**
```bash
python scripts/prompt_optimizer.py current_prompt.txt --optimize --output optimized.txt
```

**Step 5: Compare results**
```bash
python scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json
# Shows: token reduction, clarity improvement, issues resolved
```

**Step 6: Validate with test cases**
Run both prompts against your evaluation set and compare outputs.
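The comparison in Step 5 can be sketched without the script. The snippet below is a minimal stand-in, assuming a crude ~4-characters-per-token heuristic; real tooling would use the model's actual tokenizer (e.g. tiktoken), and the example prompts are illustrative only.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def compare_prompts(baseline: str, optimized: str) -> dict:
    """Summarize the token delta between two prompt versions."""
    b, o = estimate_tokens(baseline), estimate_tokens(optimized)
    return {
        "baseline_tokens": b,
        "optimized_tokens": o,
        "reduction_pct": round(100 * (b - o) / b, 1),
    }

baseline = (
    "You are an assistant. Please analyze the following text and please "
    "analyze it carefully and give your analysis of the text below."
)
optimized = "List the top 3 issues in the text below, with severity ratings."

report = compare_prompts(baseline, optimized)
print(report)
```

The same delta report is what `--compare baseline.json` produces, with real tokenizer counts instead of the heuristic.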

---

### Few-Shot Example Design Workflow

Use when creating examples for in-context learning.

**Step 1: Define the task clearly**
```
Task: Extract product entities from customer reviews
Input: Review text
Output: JSON with {product_name, sentiment, features_mentioned}
```

**Step 2: Select diverse examples (3-5 recommended)**

| Example Type | Purpose |
|--------------|---------|
| Simple case | Shows basic pattern |
| Edge case | Handles ambiguity |
| Complex case | Multiple entities |
| Negative case | What NOT to extract |

**Step 3: Format consistently**
```
Example 1:
Input: "Love my new iPhone 15, the camera is amazing!"
Output: {"product_name": "iPhone 15", "sentiment": "positive", "features_mentioned": ["camera"]}

Example 2:
Input: "The laptop was okay but battery life is terrible."
Output: {"product_name": "laptop", "sentiment": "mixed", "features_mentioned": ["battery life"]}
```

**Step 4: Validate example quality**
```bash
python scripts/prompt_optimizer.py prompt_with_examples.txt --validate-examples
# Checks: consistency, coverage, format alignment
```

**Step 5: Test with held-out cases**
Ensure model generalizes beyond your examples.
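The consistency requirement from Step 3 can be enforced programmatically. This is a minimal sketch, assuming examples are stored as simple dicts (an illustrative layout, not the skill's actual data format): render them into the Input/Output shape above, and check that every output shares the same keys.

```python
import json

EXAMPLES = [
    {"input": "Love my new iPhone 15, the camera is amazing!",
     "output": {"product_name": "iPhone 15", "sentiment": "positive",
                "features_mentioned": ["camera"]}},
    {"input": "The laptop was okay but battery life is terrible.",
     "output": {"product_name": "laptop", "sentiment": "mixed",
                "features_mentioned": ["battery life"]}},
]

def render_examples(examples):
    """Render examples in a consistent Input/Output block."""
    blocks = []
    for i, ex in enumerate(examples, 1):
        blocks.append(
            f'Example {i}:\nInput: "{ex["input"]}"\n'
            f'Output: {json.dumps(ex["output"])}'
        )
    return "\n\n".join(blocks)

def check_consistency(examples):
    """Format alignment: all outputs must expose the same keys."""
    keys = {frozenset(ex["output"]) for ex in examples}
    return len(keys) == 1

prompt_block = render_examples(EXAMPLES)
print(prompt_block)
```

Generating the block from data rather than hand-editing it keeps the formats aligned as the example set grows.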

---

### Structured Output Design Workflow

Use when you need reliable JSON/XML/structured responses.

**Step 1: Define schema**
```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string", "maxLength": 200},
    "sentiment": {"enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```

**Step 2: Include schema in prompt**
```
Respond with JSON matching this schema:
- summary (string, max 200 chars): Brief summary of the content
- sentiment (enum): One of "positive", "negative", "neutral"
- confidence (number 0-1): Your confidence in the sentiment
```

**Step 3: Add format enforcement**
```
IMPORTANT: Respond ONLY with valid JSON. No markdown, no explanation.
Start your response with { and end with }
```

**Step 4: Validate outputs**
```bash
python scripts/prompt_optimizer.py structured_prompt.txt --validate-schema schema.json
```
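The validation loop behind Step 4 can be sketched with the standard library. This minimal version, assuming model replies arrive as raw strings, checks exactly what the workflow enforces: the reply parses as JSON, the required keys from Step 1 are present, and values obey the schema's constraints. A production version would run a full JSON Schema validator instead of hand-written checks.

```python
import json

REQUIRED = {"summary", "sentiment"}
ALLOWED_SENTIMENT = {"positive", "negative", "neutral"}

def validate_reply(raw: str):
    """Return (ok, reason) for one model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"not valid JSON: {e.msg}"
    missing = REQUIRED - data.keys()
    if missing:
        return False, f"missing required keys: {sorted(missing)}"
    if data["sentiment"] not in ALLOWED_SENTIMENT:
        return False, f"bad sentiment value: {data['sentiment']!r}"
    if len(data["summary"]) > 200:
        return False, "summary exceeds 200 chars"
    return True, "ok"

good = '{"summary": "Ships on time.", "sentiment": "positive"}'
bad = 'Sure! Here is the JSON: {"summary": "..."}'
print(validate_reply(good))  # (True, 'ok')
print(validate_reply(bad))   # (False, 'not valid JSON: ...')
```

The second reply fails because of the conversational preamble, which is precisely what the Step 3 enforcement text is there to prevent.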

---

## Reference Documentation

| File | Contains | Load when user asks about |
|------|----------|---------------------------|
| `references/prompt_engineering_patterns.md` | 10 prompt patterns with input/output examples | "which pattern?", "few-shot", "chain-of-thought", "role prompting" |
| `references/llm_evaluation_frameworks.md` | Evaluation metrics, scoring methods, A/B testing | "how to evaluate?", "measure quality", "compare prompts" |
| `references/agentic_system_design.md` | Agent architectures (ReAct, Plan-Execute, Tool Use) | "build agent", "tool calling", "multi-agent" |

---

## Common Patterns Quick Reference

| Pattern | When to Use | Example |
|---------|-------------|---------|
| **Zero-shot** | Simple, well-defined tasks | "Classify this email as spam or not spam" |
| **Few-shot** | Complex tasks, consistent format needed | Provide 3-5 examples before the task |
| **Chain-of-Thought** | Reasoning, math, multi-step logic | "Think step by step..." |
| **Role Prompting** | Expertise needed, specific perspective | "You are an expert tax accountant..." |
| **Structured Output** | Need parseable JSON/XML | Include schema + format enforcement |
---

## Common Commands

```bash
# Prompt Analysis
python scripts/prompt_optimizer.py prompt.txt --analyze     # Full analysis
python scripts/prompt_optimizer.py prompt.txt --tokens      # Token count only
python scripts/prompt_optimizer.py prompt.txt --optimize    # Generate optimized version

# RAG Evaluation
python scripts/rag_evaluator.py --contexts ctx.json --questions q.json   # Evaluate
python scripts/rag_evaluator.py --contexts ctx.json --compare baseline   # Compare to baseline

# Agent Development
python scripts/agent_orchestrator.py agent.yaml --validate        # Validate config
python scripts/agent_orchestrator.py agent.yaml --visualize       # Show workflow
python scripts/agent_orchestrator.py agent.yaml --estimate-cost   # Token estimation
```
@@ -1,80 +1,646 @@

# Agentic System Design

## Overview

Agent architectures, tool use patterns, and multi-agent orchestration with pseudocode.

## Architectures Index

1. [ReAct Pattern](#1-react-pattern)
2. [Plan-and-Execute](#2-plan-and-execute)
3. [Tool Use / Function Calling](#3-tool-use--function-calling)
4. [Multi-Agent Collaboration](#4-multi-agent-collaboration)
5. [Memory and State Management](#5-memory-and-state-management)
6. [Agent Design Patterns](#6-agent-design-patterns)

---
## 1. ReAct Pattern

**Reasoning + Acting**: The agent alternates between thinking about what to do and taking actions.

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         ReAct Loop                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ Thought │───▶│ Action  │───▶│  Tool   │───▶│Observat.│   │
│  └─────────┘    └─────────┘    └─────────┘    └────┬────┘   │
│       ▲                                            │        │
│       └────────────────────────────────────────────┘        │
│                     (loop until done)                       │
└─────────────────────────────────────────────────────────────┘
```

### Pseudocode

```python
def react_agent(query, tools, max_iterations=10):
    """
    ReAct agent implementation.

    Args:
        query: User question
        tools: Dict of available tools {name: function}
        max_iterations: Safety limit
    """
    context = f"Question: {query}\n"

    for i in range(max_iterations):
        # Generate thought and action
        response = llm.generate(
            REACT_PROMPT.format(
                tools=format_tools(tools),
                context=context
            )
        )

        # Parse response
        thought = extract_thought(response)
        action = extract_action(response)

        context += f"Thought: {thought}\n"

        # Check for final answer
        if action.name == "finish":
            return action.argument

        # Execute tool
        if action.name in tools:
            observation = tools[action.name](action.argument)
            context += f"Action: {action.name}({action.argument})\n"
            context += f"Observation: {observation}\n"
        else:
            context += f"Error: Unknown tool {action.name}\n"

    return "Max iterations reached"
```
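The pseudocode above leaves `llm`, `REACT_PROMPT`, and the parsing helpers abstract. A self-contained toy version of the same loop is sketched below; the "LLM" is a scripted stub (`scripted_llm`, an illustrative name) that requests the calculator once and then finishes, so the Thought/Action/Observation wiring can be exercised without a model.

```python
import re

def calculator(expr: str) -> str:
    # Restricted eval: arithmetic characters only, to keep the toy safe.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))

def scripted_llm(context: str) -> str:
    # Stub standing in for llm.generate(): one tool call, then finish.
    if "Observation:" in context:
        obs = context.rsplit("Observation: ", 1)[1].strip()
        return f"Thought: I have the result.\nAction: finish({obs})"
    return "Thought: I should compute this.\nAction: calculator(17 * 3)"

def react_agent(query, tools, max_iterations=5):
    context = f"Question: {query}\n"
    for _ in range(max_iterations):
        response = scripted_llm(context)
        thought = re.search(r"Thought: (.*)", response).group(1)
        name, arg = re.search(r"Action: (\w+)\((.*)\)", response).groups()
        context += f"Thought: {thought}\n"
        if name == "finish":
            return arg
        if name in tools:
            context += f"Action: {name}({arg})\n"
            context += f"Observation: {tools[name](arg)}\n"
        else:
            context += f"Error: Unknown tool {name}\n"
    return "Max iterations reached"

print(react_agent("What is 17 * 3?", {"calculator": calculator}))  # 51
```

Swapping `scripted_llm` for a real model call (plus robust parsing) recovers the full pattern.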

### Prompt Template

```
You are a helpful assistant that can use tools to answer questions.

Available tools:
{tools}

Answer format:
Thought: [your reasoning about what to do next]
Action: [tool_name(argument)] OR finish(final_answer)

{context}

Continue:
```

### When to Use

| Scenario | ReAct Fit |
|----------|-----------|
| Simple Q&A with lookup | Good |
| Multi-step research | Good |
| Math calculations | Good |
| Creative writing | Poor |
| Real-time conversation | Poor |

---

## 2. Plan-and-Execute

**Two-phase approach**: First create a plan, then execute each step.

### Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                      Plan-and-Execute                        │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Phase 1: Planning                                           │
│  ┌──────────┐    ┌──────────────────────────────────────┐    │
│  │  Query   │───▶│  Generate step-by-step plan          │    │
│  └──────────┘    └──────────────────────────────────────┘    │
│                              │                               │
│                              ▼                               │
│                  ┌──────────────────────┐                    │
│                  │  Plan: [S1, S2, S3]  │                    │
│                  └──────────┬───────────┘                    │
│                             │                                │
│  Phase 2: Execution         │                                │
│                  ┌──────────▼───────────┐                    │
│                  │    Execute Step 1    │                    │
│                  └──────────┬───────────┘                    │
│                             │                                │
│                  ┌──────────▼───────────┐                    │
│                  │    Execute Step 2    │──▶ Replan?         │
│                  └──────────┬───────────┘                    │
│                             │                                │
│                  ┌──────────▼───────────┐                    │
│                  │    Execute Step 3    │                    │
│                  └──────────┬───────────┘                    │
│                             │                                │
│                  ┌──────────▼───────────┐                    │
│                  │     Final Answer     │                    │
│                  └──────────────────────┘                    │
└──────────────────────────────────────────────────────────────┘
```

### Pseudocode

```python
def plan_and_execute(query, tools):
    """
    Plan-and-Execute agent.

    Separates planning from execution for complex tasks.
    """
    # Phase 1: Generate plan
    plan = generate_plan(query)

    results = []

    # Phase 2: Execute each step
    for i, step in enumerate(plan.steps):
        # Execute step
        result = execute_step(step, tools, results)
        results.append(result)

        # Optional: Check if replanning needed
        if should_replan(step, result, plan):
            remaining_steps = plan.steps[i+1:]
            new_plan = replan(query, results, remaining_steps)
            plan.steps = plan.steps[:i+1] + new_plan.steps

    # Synthesize final answer
    return synthesize_answer(query, results)


def generate_plan(query):
    """Generate execution plan from query."""
    prompt = f"""
    Create a step-by-step plan to answer this question:
    {query}

    Format each step as:
    Step N: [action description]

    Keep the plan concise (3-7 steps).
    """
    response = llm.generate(prompt)
    return parse_plan(response)


def execute_step(step, tools, previous_results):
    """Execute a single step using available tools."""
    prompt = f"""
    Execute this step: {step.description}

    Previous results:
    {format_results(previous_results)}

    Available tools: {format_tools(tools)}

    Provide the result of this step.
    """
    return llm.generate(prompt)
```
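The plan/execute split above can be exercised end to end with scripted stand-ins: here the "plan" is a fixed step list and each step is a pure function, so the control flow (plan once, execute in order, synthesize) runs without an LLM. All names are illustrative; replanning is omitted for brevity.

```python
def generate_plan(query):
    # Stub planner: a real version would prompt the LLM as shown above.
    return ["gather facts", "compare options", "write summary"]

def execute_step(step, previous_results):
    # Stub executor: records how much context each step received.
    return f"{step}: done (after {len(previous_results)} prior steps)"

def plan_and_execute(query):
    plan = generate_plan(query)          # Phase 1: plan once
    results = []
    for step in plan:                    # Phase 2: execute in order
        results.append(execute_step(step, results))
    return " | ".join(results)           # synthesize

out = plan_and_execute("Which vector DB should we use?")
print(out)
```

Because each step sees the accumulated `results`, later steps can build on earlier ones, which is the property that distinguishes this pattern from independent tool calls.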

### When to Use

| Task Complexity | Recommendation |
|-----------------|----------------|
| Simple (1-2 steps) | Use ReAct |
| Medium (3-5 steps) | Plan-and-Execute |
| Complex (6+ steps) | Plan-and-Execute with replanning |
| Highly dynamic | ReAct with adaptive planning |

---

## 3. Tool Use / Function Calling

**Structured tool invocation**: LLM generates structured calls that are executed externally.

### Tool Definition Schema

```json
{
  "name": "search_web",
  "description": "Search the web for current information",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search query"
      },
      "num_results": {
        "type": "integer",
        "default": 5,
        "description": "Number of results to return"
      }
    },
    "required": ["query"]
  }
}
```

### Implementation Pattern

```python
import json


class ToolRegistry:
    """Registry for agent tools."""

    def __init__(self):
        self.tools = {}

    def register(self, name, func, schema):
        """Register a tool with its schema."""
        self.tools[name] = {
            "function": func,
            "schema": schema
        }

    def get_schemas(self):
        """Get all tool schemas for LLM."""
        return [t["schema"] for t in self.tools.values()]

    def execute(self, name, arguments):
        """Execute a tool by name."""
        if name not in self.tools:
            raise ValueError(f"Unknown tool: {name}")

        func = self.tools[name]["function"]
        return func(**arguments)


def tool_use_agent(query, registry):
    """Agent with function calling."""
    messages = [{"role": "user", "content": query}]

    while True:
        # Call LLM with tools
        response = llm.chat(
            messages=messages,
            tools=registry.get_schemas(),
            tool_choice="auto"
        )

        # Check if done
        if response.finish_reason == "stop":
            return response.content

        # Execute tool calls
        if response.tool_calls:
            for call in response.tool_calls:
                result = registry.execute(
                    call.function.name,
                    json.loads(call.function.arguments)
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result)
                })
```
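The registry half of the pattern runs without an LLM. Below is a usage sketch with the registry re-declared inline so the snippet is self-contained; the weather tool and its schema are illustrative assumptions, shaped like the `search_web` definition above.

```python
class ToolRegistry:
    def __init__(self):
        self.tools = {}

    def register(self, name, func, schema):
        self.tools[name] = {"function": func, "schema": schema}

    def get_schemas(self):
        return [t["schema"] for t in self.tools.values()]

    def execute(self, name, arguments):
        if name not in self.tools:
            raise ValueError(f"Unknown tool: {name}")
        return self.tools[name]["function"](**arguments)

def get_weather(city, unit="C"):
    # Stub: a real tool would call a weather API here.
    return f"22 degrees {unit} in {city}"

registry = ToolRegistry()
registry.register("get_weather", get_weather, {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "default": "C"},
        },
        "required": ["city"],
    },
})

# get_schemas() is what gets sent to the LLM; execute() runs one call,
# exactly as tool_use_agent does after the model requests a tool.
print(registry.execute("get_weather", {"city": "Paris"}))  # 22 degrees C in Paris
```

Note that `execute(**arguments)` relies on the model's argument names matching the function signature, which is why the schema's `properties` should mirror the parameters exactly.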

### Tool Design Best Practices

| Practice | Example |
|----------|---------|
| Clear descriptions | "Search web for query" not "search" |
| Type hints | Use JSON Schema types |
| Default values | Provide sensible defaults |
| Error handling | Return error messages, not exceptions |
| Idempotency | Same input = same output |
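
One way to honor the error-handling row is to wrap each tool so failures come back as data the model can read instead of exceptions that crash the agent loop. A minimal sketch (`safe_tool` and `divide` are illustrative names, not part of any framework):

```python
def safe_tool(func):
    """Wrap a tool so failures come back as readable messages
    instead of exceptions that crash the agent loop."""
    def wrapper(**kwargs):
        try:
            return {"ok": True, "result": func(**kwargs)}
        except Exception as exc:
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return wrapper

@safe_tool
def divide(a, b):
    """A deliberately fragile example tool."""
    return a / b

divide(a=1, b=0)
# → {"ok": False, "error": "ZeroDivisionError: division by zero"}
```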

---

## 4. Multi-Agent Collaboration

### Orchestration Patterns

**Pattern 1: Sequential Pipeline**
```
Agent A → Agent B → Agent C → Output

Use case: Research → Analysis → Writing
```
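
The sequential pipeline reduces to function composition; a minimal sketch with stand-in agents (the `research`/`analyze`/`write` functions are placeholders for real LLM-backed agents):

```python
from functools import reduce

def sequential_pipeline(stages, task):
    """Run a task through agents in order: each stage's output
    becomes the next stage's input."""
    return reduce(lambda out, stage: stage(out), stages, task)

# Stand-in "agents" for illustration (real ones would call an LLM):
def research(t): return f"research({t})"
def analyze(t): return f"analysis({t})"
def write(t): return f"report({t})"

sequential_pipeline([research, analyze, write], "topic")
# → "report(analysis(research(topic)))"
```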

**Pattern 2: Hierarchical**
```
        ┌─────────────┐
        │ Coordinator │
        └──────┬──────┘
    ┌──────────┼──────────┐
    ▼          ▼          ▼
┌───────┐  ┌───────┐  ┌───────┐
│Agent A│  │Agent B│  │Agent C│
└───────┘  └───────┘  └───────┘

Use case: Complex task decomposition
```

**Pattern 3: Debate/Consensus**
```
┌───────┐     ┌───────┐
│Agent A│◄───▶│Agent B│
└───┬───┘     └───┬───┘
    │             │
    └──────┬──────┘
           ▼
    ┌─────────────┐
    │   Arbiter   │
    └─────────────┘

Use case: Critical decisions, fact-checking
```

### Pseudocode: Hierarchical Multi-Agent

```python
class CoordinatorAgent:
    """Coordinates multiple specialized agents."""

    def __init__(self, agents):
        self.agents = agents  # Dict[str, Agent]

    def process(self, query):
        # Decompose task
        subtasks = self.decompose(query)

        # Assign to agents
        results = {}
        for subtask in subtasks:
            agent_name = self.select_agent(subtask)
            result = self.agents[agent_name].execute(subtask)
            results[subtask.id] = result

        # Synthesize
        return self.synthesize(query, results)

    def decompose(self, query):
        """Break query into subtasks."""
        prompt = f"""
        Break this task into subtasks for specialized agents:

        Task: {query}

        Available agents:
        - researcher: Gathers information
        - analyst: Analyzes data
        - writer: Produces content

        Format:
        1. [agent]: [subtask description]
        """
        response = llm.generate(prompt)
        return parse_subtasks(response)

    def select_agent(self, subtask):
        """Select best agent for subtask."""
        return subtask.assigned_agent

    def synthesize(self, query, results):
        """Combine agent results into final answer."""
        prompt = f"""
        Combine these results to answer: {query}

        Results:
        {format_results(results)}

        Provide a coherent final answer.
        """
        return llm.generate(prompt)
```

### Communication Protocols

| Protocol | Description | Use When |
|----------|-------------|----------|
| Direct | Agent calls agent | Simple pipelines |
| Message queue | Async message passing | High throughput |
| Shared state | Shared memory/database | Collaborative editing |
| Broadcast | One-to-many | Status updates |

---

## 5. Memory and State Management

### Memory Types

```
┌─────────────────────────────────────────────────────────────┐
│                     Agent Memory System                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────┐        ┌─────────────────┐            │
│   │  Working Memory │        │ Episodic Memory │            │
│   │  (Current task) │        │ (Past sessions) │            │
│   └────────┬────────┘        └────────┬────────┘            │
│            │                          │                     │
│            └────────────┬─────────────┘                     │
│                         ▼                                   │
│    ┌─────────────────────────────────────────┐              │
│    │             Semantic Memory             │              │
│    │    (Long-term knowledge, embeddings)    │              │
│    └─────────────────────────────────────────┘              │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Implementation

```python
from datetime import datetime


class AgentMemory:
    """Memory system for conversational agents."""

    def __init__(self, embedding_model, vector_store):
        self.embedding_model = embedding_model
        self.vector_store = vector_store
        self.working_memory = []  # Current conversation
        self.buffer_size = 10     # Recent messages to keep

    def add_message(self, role, content):
        """Add message to working memory."""
        self.working_memory.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now()
        })

        # Trim if too long
        if len(self.working_memory) > self.buffer_size:
            # Summarize old messages before removing
            old_messages = self.working_memory[:5]
            summary = self.summarize(old_messages)
            self.store_long_term(summary)
            self.working_memory = self.working_memory[5:]

    def store_long_term(self, content):
        """Store in semantic memory (vector store)."""
        embedding = self.embedding_model.embed(content)
        self.vector_store.add(
            embedding=embedding,
            metadata={"content": content, "type": "summary"}
        )

    def retrieve_relevant(self, query, k=5):
        """Retrieve relevant memories for context."""
        query_embedding = self.embedding_model.embed(query)
        results = self.vector_store.search(query_embedding, k=k)
        return [r.metadata["content"] for r in results]

    def get_context(self, query):
        """Build context for LLM from memories."""
        relevant = self.retrieve_relevant(query)
        recent = self.working_memory[-self.buffer_size:]

        return {
            "relevant_memories": relevant,
            "recent_conversation": recent
        }

    def summarize(self, messages):
        """Summarize messages for long-term storage."""
        content = "\n".join([
            f"{m['role']}: {m['content']}"
            for m in messages
        ])
        prompt = f"Summarize this conversation:\n{content}"
        return llm.generate(prompt)
```

### State Persistence Patterns

| Pattern | Storage | Use Case |
|---------|---------|----------|
| In-memory | Dict/List | Single session |
| Redis | Key-value | Multi-session, fast |
| PostgreSQL | Relational | Complex queries |
| Vector DB | Embeddings | Semantic search |
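
A minimal sketch of the in-memory pattern, structured so the dict can later be swapped for a Redis client for multi-session persistence (`SessionStore` is an illustrative name, not a library class):

```python
import json


class SessionStore:
    """In-memory session persistence; swap the dict for a Redis
    client to get multi-session durability (illustrative sketch)."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, state):
        # JSON round-trip keeps the stored state serializable
        self._data[session_id] = json.dumps(state)

    def load(self, session_id, default=None):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw is not None else default


store = SessionStore()
store.save("s1", {"turns": 3, "topic": "pricing"})
store.load("s1")  # → {"turns": 3, "topic": "pricing"}
```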

---

## 6. Agent Design Patterns

### Pattern: Reflection

Agent reviews and critiques its own output.

```python
def reflective_agent(query, tools):
    """Agent that reflects on its answers."""
    # Initial response
    response = react_agent(query, tools)

    # Reflection
    critique = llm.generate(f"""
    Review this answer for:
    1. Accuracy - Is the information correct?
    2. Completeness - Does it fully answer the question?
    3. Clarity - Is it easy to understand?

    Question: {query}
    Answer: {response}

    Critique:
    """)

    # Check if revision needed
    if needs_revision(critique):
        revised = llm.generate(f"""
        Improve this answer based on the critique:

        Original: {response}
        Critique: {critique}

        Improved answer:
        """)
        return revised

    return response
```

### Pattern: Self-Ask

Break complex questions into simpler sub-questions.

```python
def self_ask_agent(query, tools):
    """Agent that asks itself follow-up questions."""
    context = []

    while True:
        prompt = f"""
        Question: {query}

        Previous Q&A:
        {format_qa(context)}

        Do you need to ask a follow-up question to answer this?
        If yes: "Follow-up: [question]"
        If no: "Final Answer: [answer]"
        """

        response = llm.generate(prompt)

        if response.startswith("Final Answer:"):
            return response.replace("Final Answer:", "").strip()

        # Answer follow-up question
        follow_up = response.replace("Follow-up:", "").strip()
        answer = simple_qa(follow_up, tools)
        context.append({"q": follow_up, "a": answer})
```

### Pattern: Expert Routing

Route queries to specialized sub-agents.

```python
class ExpertRouter:
    """Routes queries to expert agents."""

    def __init__(self):
        self.experts = {
            "code": CodeAgent(),
            "math": MathAgent(),
            "research": ResearchAgent(),
            "general": GeneralAgent()
        }

    def route(self, query):
        """Determine best expert for query."""
        prompt = f"""
        Classify this query into one category:
        - code: Programming questions
        - math: Mathematical calculations
        - research: Fact-finding, current events
        - general: Everything else

        Query: {query}
        Category:
        """
        category = llm.generate(prompt).strip().lower()
        return self.experts.get(category, self.experts["general"])

    def process(self, query):
        expert = self.route(query)
        return expert.execute(query)
```

---

## Quick Reference: Pattern Selection

| Need | Pattern |
|------|---------|
| Simple tool use | ReAct |
| Complex multi-step | Plan-and-Execute |
| API integration | Function Calling |
| Multiple perspectives | Multi-Agent Debate |
| Quality assurance | Reflection |
| Complex reasoning | Self-Ask |
| Domain expertise | Expert Routing |
| Conversation continuity | Memory System |

# LLM Evaluation Frameworks

## Overview

Concrete metrics, scoring methods, comparison tables, and A/B testing frameworks.

## Frameworks Index

1. [Evaluation Metrics Overview](#1-evaluation-metrics-overview)
2. [Text Generation Metrics](#2-text-generation-metrics)
3. [RAG-Specific Metrics](#3-rag-specific-metrics)
4. [Human Evaluation Frameworks](#4-human-evaluation-frameworks)
5. [A/B Testing for Prompts](#5-ab-testing-for-prompts)
6. [Benchmark Datasets](#6-benchmark-datasets)
7. [Evaluation Pipeline Design](#7-evaluation-pipeline-design)

---

## 1. Evaluation Metrics Overview

### Metric Categories

| Category | Metrics | When to Use |
|----------|---------|-------------|
| **Lexical** | BLEU, ROUGE, Exact Match | Reference-based comparison |
| **Semantic** | BERTScore, Embedding similarity | Meaning preservation |
| **Task-specific** | F1, Accuracy, Precision/Recall | Classification, extraction |
| **Quality** | Coherence, Fluency, Relevance | Open-ended generation |
| **Safety** | Toxicity, Bias scores | Content moderation |

### Choosing the Right Metric

```
Is there a single correct answer?
├── Yes → Exact Match or F1
└── No
    └── Is there a reference output?
        ├── Yes → BLEU, ROUGE, or BERTScore
        └── No
            └── Can you define quality criteria?
                ├── Yes → Human evaluation + LLM-as-judge
                └── No → A/B testing with user metrics
```

---

## 2. Text Generation Metrics

### BLEU (Bilingual Evaluation Understudy)

**What it measures:** N-gram overlap between generated and reference text.

**Score range:** 0 to 1 (higher is better)

**Calculation:**
```
BLEU = BP × exp(Σ wn × log(pn))

Where:
- BP = brevity penalty (penalizes short outputs)
- pn = precision of n-grams
- wn = weight (typically 0.25 for BLEU-4)
```

**Interpretation:**
| BLEU Score | Quality |
|------------|---------|
| > 0.6 | Excellent |
| 0.4 - 0.6 | Good |
| 0.2 - 0.4 | Acceptable |
| < 0.2 | Poor |

**Example:**
```
Reference: "The quick brown fox jumps over the lazy dog"
Generated: "A fast brown fox leaps over the lazy dog"

1-gram precision: 6/9 = 0.67 (matched: brown, fox, over, the, lazy, dog)
2-gram precision: 4/8 = 0.50 (matched: brown fox, over the, the lazy, lazy dog)
BLEU-4: ~0.35
```

**Limitations:**
- Doesn't capture meaning (synonyms penalized)
- Position-independent
- Requires reference text
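
The clipped n-gram precision inside BLEU takes only a few lines; a sketch (no brevity penalty or smoothing; production use would reach for a library such as `sacrebleu`):

```python
from collections import Counter

def ngram_precision(reference, generated, n=1):
    """Clipped n-gram precision, the quantity inside BLEU
    (no brevity penalty or smoothing in this sketch)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, gen = ngrams(reference), ngrams(generated)
    # Clip each generated n-gram count by its count in the reference
    matched = sum(min(count, ref[gram]) for gram, count in gen.items())
    return matched / max(sum(gen.values()), 1)

ref = "The quick brown fox jumps over the lazy dog"
gen = "A fast brown fox leaps over the lazy dog"
ngram_precision(ref, gen, n=1)  # → 0.666… (6/9)
ngram_precision(ref, gen, n=2)  # → 0.5 (4/8)
```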

---

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**What it measures:** Overlap focused on recall (coverage of reference).

**Variants:**
| Variant | Measures |
|---------|----------|
| ROUGE-1 | Unigram overlap |
| ROUGE-2 | Bigram overlap |
| ROUGE-L | Longest common subsequence |
| ROUGE-Lsum | LCS with sentence-level computation |

**Calculation:**
```
ROUGE-N Recall = (matching n-grams) / (n-grams in reference)
ROUGE-N Precision = (matching n-grams) / (n-grams in generated)
ROUGE-N F1 = 2 × (Precision × Recall) / (Precision + Recall)
```

**Example:**
```
Reference: "The cat sat on the mat"
Generated: "The cat was sitting on the mat"

ROUGE-1:
  Recall: 5/6 = 0.83 (matched: the, cat, on, the, mat)
  Precision: 5/7 = 0.71
  F1: 0.77

ROUGE-2:
  Recall: 3/5 = 0.60 (matched: "the cat", "on the", "the mat")
  Precision: 3/6 = 0.50
  F1: 0.55
```
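
The ROUGE-N arithmetic above can be sketched directly (a simplified implementation; production use would reach for a library such as `rouge-score`):

```python
from collections import Counter

def rouge_n(reference, generated, n=1):
    """ROUGE-N recall, precision, and F1 (minimal sketch)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, gen = ngrams(reference), ngrams(generated)
    overlap = sum(min(gen[g], count) for g, count in ref.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

rouge_n("The cat sat on the mat", "The cat was sitting on the mat", n=1)
# → (0.833…, 0.714…, 0.769…)
```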

**Best for:** Summarization, text compression

---

### BERTScore

**What it measures:** Semantic similarity using contextual embeddings.

**How it works:**
1. Generate BERT embeddings for each token
2. Compute cosine similarity between token pairs
3. Apply greedy matching to find best alignment
4. Aggregate into Precision, Recall, F1

**Advantages over lexical metrics:**
- Captures synonyms and paraphrases
- Context-aware matching
- Better correlation with human judgment

**Example:**
```
Reference: "The movie was excellent"
Generated: "The film was outstanding"

Lexical (BLEU): Low score (only "The" and "was" match)
BERTScore: High score (semantic meaning preserved)
```

**Interpretation:**
| BERTScore F1 | Quality |
|--------------|---------|
| > 0.9 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Acceptable |
| < 0.7 | Review needed |

---

## 3. RAG-Specific Metrics

### Context Relevance

**What it measures:** How relevant retrieved documents are to the query.

**Calculation methods:**

**Method 1: Embedding similarity**
```python
relevance = cosine_similarity(
    embed(query),
    embed(context)
)
```
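
Method 1 needs only a cosine over the two vectors; a dependency-free sketch (the `embed` step itself is assumed to come from your embedding model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 0.0])  # → 0.707…
```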

**Method 2: LLM-as-judge**
```
Prompt: "Rate the relevance of this context to the question.
Question: {question}
Context: {context}
Rate from 1-5 where 5 is highly relevant."
```

**Target:** > 0.8 for top-k contexts

---

### Answer Faithfulness

**What it measures:** Whether the answer is supported by the context (no hallucination).

**Evaluation prompt:**
```
Given the context and answer, determine if every claim in the
answer is supported by the context.

Context: {context}
Answer: {answer}

For each claim in the answer:
1. Identify the claim
2. Find supporting evidence in context (or mark as unsupported)
3. Rate: Supported / Partially Supported / Not Supported

Overall faithfulness score: [0-1]
```

**Scoring:**
```
Faithfulness = (supported claims) / (total claims)
```

**Target:** > 0.95 for production systems

---

### Retrieval Metrics

| Metric | Formula | What it measures |
|--------|---------|------------------|
| **Precision@k** | (relevant in top-k) / k | Quality of top results |
| **Recall@k** | (relevant in top-k) / (total relevant) | Coverage |
| **MRR** | 1 / (rank of first relevant) | Position of first hit |
| **NDCG@k** | DCG@k / IDCG@k | Ranking quality |

**Example:**
```
Query: "What is photosynthesis?"
Retrieved docs (k=5): [R, N, R, N, R] (R=relevant, N=not relevant)
Total relevant in corpus: 10

Precision@5 = 3/5 = 0.6
Recall@5 = 3/10 = 0.3
MRR = 1/1 = 1.0 (first doc is relevant)
```
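
These formulas translate directly to code; a sketch using the example's ranking:

```python
def precision_at_k(relevance, k):
    """relevance: booleans for ranked results, best first."""
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    return sum(relevance[:k]) / total_relevant

def mrr(relevance):
    """Reciprocal rank of the first relevant result."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

ranked = [True, False, True, False, True]  # R, N, R, N, R
precision_at_k(ranked, 5)   # → 0.6
recall_at_k(ranked, 5, 10)  # → 0.3
mrr(ranked)                 # → 1.0
```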

---

## 4. Human Evaluation Frameworks

### Likert Scale Evaluation

**Setup:**
```
Rate the following response on a scale of 1-5:

Response: {generated_response}

Criteria:
- Relevance (1-5): Does it address the question?
- Accuracy (1-5): Is the information correct?
- Fluency (1-5): Is it well-written?
- Helpfulness (1-5): Would this be useful to the user?
```

**Sample size guidance:**
| Confidence Level | Margin of Error | Required Samples |
|-----------------|-----------------|------------------|
| 95% | ±5% | 385 |
| 95% | ±10% | 97 |
| 90% | ±10% | 68 |

---

### Comparative Evaluation (Side-by-Side)

**Setup:**
```
Compare these two responses to the question:

Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response is better?
[ ] A is much better
[ ] A is slightly better
[ ] About the same
[ ] B is slightly better
[ ] B is much better

Why? _______________
```

**Advantages:**
- Easier for humans than absolute scoring
- Reduces calibration issues
- Clear winner for A/B decisions

**Analysis:**
```
Win rate = (A wins + 0.5 × ties) / total
Bradley-Terry model for ranking multiple variants
```
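
The win-rate formula as a function (a trivial sketch; Bradley-Terry fitting for multiple variants is out of scope here):

```python
def win_rate(a_wins, b_wins, ties):
    """Win rate for variant A in side-by-side comparisons;
    ties count as half a win."""
    total = a_wins + b_wins + ties
    return (a_wins + 0.5 * ties) / total

win_rate(60, 30, 10)  # → 0.65
```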

---

### LLM-as-Judge

**Setup:**
```
You are an expert evaluator. Rate the quality of this response.

Question: {question}
Response: {response}
Reference (if available): {reference}

Evaluate on:
1. Correctness (0-10): Is the information accurate?
2. Completeness (0-10): Does it fully address the question?
3. Clarity (0-10): Is it easy to understand?
4. Conciseness (0-10): Is it appropriately brief?

Provide scores and brief justification for each.
Overall score (0-10):
```

**Calibration techniques:**
- Include reference responses with known scores
- Use chain-of-thought for reasoning
- Compare against human baseline periodically

**Known biases:**
| Bias | Mitigation |
|------|------------|
| Position bias | Randomize order |
| Length bias | Normalize or specify length |
| Self-preference | Use different model as judge |
| Verbosity preference | Penalize unnecessary length |

---

## 5. A/B Testing for Prompts

### Experiment Design

**Hypothesis template:**
```
H0: Prompt A and Prompt B have equal performance on [metric]
H1: Prompt B improves [metric] by at least [minimum detectable effect]
```

**Sample size calculation:**
```
n = 2 × ((z_α + z_β)² × σ²) / δ²

Where:
- z_α = 1.96 for 95% confidence
- z_β = 0.84 for 80% power
- σ = standard deviation of metric
- δ = minimum detectable effect
```
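
The formula above as a function; the defaults encode the two-sided 95% confidence and 80% power constants listed (for a binary metric, σ² = p(1 − p)):

```python
def sample_size(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Per-variant n for a two-sample comparison (two-sided 95%
    confidence and 80% power by default)."""
    return 2 * ((z_alpha + z_beta) ** 2) * sigma ** 2 / delta ** 2

# 50% baseline, detecting a 5-point absolute change:
sample_size(sigma=0.5, delta=0.05)  # → ≈1568 per variant
```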

**Quick reference** (from the formula above, binary metric at a 50% baseline):
| MDE | Baseline Rate | Required n/variant |
|-----|---------------|-------------------|
| 5% relative | 50% | ~6,300 |
| 10% relative | 50% | ~1,600 |
| 20% relative | 50% | ~400 |

---

### Metrics to Track

**Primary metrics:**
| Metric | Measurement |
|--------|-------------|
| Task success rate | % of queries with correct/helpful response |
| User satisfaction | Thumbs up/down or 1-5 rating |
| Engagement | Follow-up questions, session length |

**Guardrail metrics:**
| Metric | Threshold |
|--------|-----------|
| Error rate | < 1% |
| Latency P95 | < 2s |
| Toxicity rate | < 0.1% |
| Cost per query | Within budget |

---

### Analysis Framework

**Statistical test selection:**
```
Is the metric binary (success/failure)?
├── Yes → Chi-squared test or Z-test for proportions
└── No
    └── Is the data normally distributed?
        ├── Yes → Two-sample t-test
        └── No → Mann-Whitney U test
```

**Interpreting results:**
```
p-value < 0.05: Statistically significant
Effect size (Cohen's d):
- Small: 0.2
- Medium: 0.5
- Large: 0.8

Decision: Ship if p < 0.05 AND effect size meets threshold AND guardrails pass
```
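
For a binary metric, the z-test branch can be sketched with only the standard library (a simplified two-sided test; real analyses typically use `scipy.stats` or an experimentation platform):

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for comparing a binary metric between
    two variants (pooled standard error)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; p-value is the two-tailed area
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```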

---

## 6. Benchmark Datasets

### General NLP Benchmarks

| Benchmark | Task | Size | Metric |
|-----------|------|------|--------|
| **MMLU** | Knowledge QA | 14K | Accuracy |
| **HellaSwag** | Commonsense | 10K | Accuracy |
| **TruthfulQA** | Factuality | 817 | % Truthful |
| **HumanEval** | Code generation | 164 | pass@k |
| **GSM8K** | Math reasoning | 8.5K | Accuracy |

### RAG Benchmarks

| Benchmark | Focus | Metrics |
|-----------|-------|---------|
| **Natural Questions** | Wikipedia QA | EM, F1 |
| **HotpotQA** | Multi-hop reasoning | EM, F1 |
| **MS MARCO** | Web search | MRR, Recall |
| **BEIR** | Zero-shot retrieval | NDCG@10 |

### Creating Custom Benchmarks

**Template:**
```json
{
  "id": "custom-001",
  "input": "What are the symptoms of diabetes?",
  "expected_output": "Common symptoms include...",
  "metadata": {
    "category": "medical",
    "difficulty": "easy",
    "source": "internal docs"
  },
  "evaluation": {
    "type": "semantic_similarity",
    "threshold": 0.85
  }
}
```

**Best practices:**
- Minimum 100 examples per category
- Include edge cases (10-20%)
- Balance difficulty levels
- Version control your benchmark
- Update quarterly

---

## 7. Evaluation Pipeline Design

### Automated Evaluation Pipeline

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│   LLM API   │────▶│   Output    │
│   Version   │     │             │     │   Storage   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                          ┌────────────────────┘
                          ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Metrics   │◀────│  Evaluator  │◀────│  Benchmark  │
│  Dashboard  │     │   Service   │     │   Dataset   │
└─────────────┘     └─────────────┘     └─────────────┘
```

### Implementation Checklist

```
□ Define success metrics
  □ Primary metric (what you're optimizing)
  □ Guardrail metrics (what must not regress)
  □ Monitoring metrics (operational health)

□ Create benchmark dataset
  □ Representative samples from production
  □ Edge cases and failure modes
  □ Golden answers or human labels

□ Set up evaluation infrastructure
  □ Automated scoring pipeline
  □ Version control for prompts
  □ Results tracking and comparison

□ Establish baseline
  □ Run current prompt against benchmark
  □ Document scores for all metrics
  □ Set improvement targets

□ Run experiments
  □ Test one change at a time
  □ Use statistical significance testing
  □ Check all guardrail metrics

□ Deploy and monitor
  □ Gradual rollout (canary)
  □ Real-time metric monitoring
  □ Rollback plan if regression
```

---

## Quick Reference: Metric Selection

| Use Case | Primary Metric | Secondary Metrics |
|----------|---------------|-------------------|
| Summarization | ROUGE-L | BERTScore, Compression ratio |
| Translation | BLEU | chrF, Human pref |
| QA (extractive) | Exact Match, F1 | |
| QA (generative) | BERTScore | Faithfulness, Relevance |
| Code generation | pass@k | Syntax errors |
| Classification | Accuracy, F1 | Precision, Recall |
| RAG | Faithfulness | Context relevance, MRR |
| Open-ended chat | Human eval | Helpfulness, Safety |

# Prompt Engineering Patterns

## Overview

Specific prompt techniques with example inputs and expected outputs.

## Patterns Index

1. [Zero-Shot Prompting](#1-zero-shot-prompting)
2. [Few-Shot Prompting](#2-few-shot-prompting)
3. [Chain-of-Thought (CoT)](#3-chain-of-thought-cot)
4. [Role Prompting](#4-role-prompting)
5. [Structured Output](#5-structured-output)
6. [Self-Consistency](#6-self-consistency)
7. [ReAct (Reasoning + Acting)](#7-react-reasoning--acting)
8. [Tree of Thoughts](#8-tree-of-thoughts)
9. [Retrieval-Augmented Generation](#9-retrieval-augmented-generation)
10. [Meta-Prompting](#10-meta-prompting)

---

## 1. Zero-Shot Prompting

**When to use:** Simple, well-defined tasks where the model has sufficient training knowledge.

**Pattern:**
```
[Task instruction]
[Input]
```

**Example:**

Input:
```
Classify the following customer review as positive, negative, or neutral.

Review: "The shipping was fast but the product quality was disappointing."
```

Expected Output:
```
negative
```

**Best practices:**
- Be explicit about output format
- Use clear, unambiguous verbs (classify, extract, summarize)
- Specify constraints (word limits, format requirements)

**When to avoid:**
- Tasks requiring specific formatting the model hasn't seen
- Domain-specific tasks requiring specialized knowledge
- Tasks where consistency is critical

---

## 2. Few-Shot Prompting

**When to use:** Tasks requiring consistent formatting or domain-specific patterns.

**Pattern:**
```
[Task description]

Example 1:
Input: [example input]
Output: [example output]

Example 2:
Input: [example input]
Output: [example output]

Now process:
Input: [actual input]
Output:
```

**Example:**

Input:
```
Extract the company name and founding year from the text.

Example 1:
Input: "Apple Inc. was founded in 1976 by Steve Jobs."
Output: {"company": "Apple Inc.", "year": 1976}

Example 2:
Input: "Microsoft Corporation started in 1975."
Output: {"company": "Microsoft Corporation", "year": 1975}

Example 3:
Input: "Founded in 1994, Amazon has grown into a tech giant."
Output: {"company": "Amazon", "year": 1994}

Now process:
Input: "Tesla, Inc. was established in 2003 by Martin Eberhard."
Output:
```

Expected Output:
```
{"company": "Tesla, Inc.", "year": 2003}
```
|
||||
|
||||
**Example selection guidelines:**
|
||||
| Example Type | Purpose | Count |
|
||||
|--------------|---------|-------|
|
||||
| Simple/typical | Establish basic pattern | 1-2 |
|
||||
| Edge case | Handle ambiguity | 1 |
|
||||
| Different format | Show variations | 1 |
|
||||
| **Total** | | **3-5** |
|
||||
|
||||
**Common mistakes:**
|
||||
- Too many examples (wastes tokens, may confuse)
|
||||
- Inconsistent formatting between examples
|
||||
- Examples too similar (model doesn't learn variation)
|
||||
|
||||
---
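A few-shot prompt in this shape can be assembled programmatically from a list of example pairs; a minimal sketch (the function name and example data are illustrative, not part of this skill's scripts):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    parts = [task, ""]
    for i, (ex_in, ex_out) in enumerate(examples, 1):
        parts += [f"Example {i}:", f"Input: {ex_in}", f"Output: {ex_out}", ""]
    parts += ["Now process:", f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Extract the company name and founding year from the text.",
    [('"Apple Inc. was founded in 1976 by Steve Jobs."',
      '{"company": "Apple Inc.", "year": 1976}')],
    '"Tesla, Inc. was established in 2003 by Martin Eberhard."',
)
```

Keeping the template in code rather than in a copy-pasted string makes it easy to stay within the 3-5 example budget from the table above.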

## 3. Chain-of-Thought (CoT)

**When to use:** Math problems, logical reasoning, multi-step analysis.

**Pattern:**
```
[Problem]

Think through this step by step:
```

**Example:**

Input:
```
A store sells apples for $2 each. If you buy 5 or more, you get 20% off.
How much do 7 apples cost?

Think through this step by step:
```

Expected Output:
```
Step 1: Calculate the regular price for 7 apples
7 apples × $2 = $14

Step 2: Check if discount applies
7 apples ≥ 5 apples, so 20% discount applies

Step 3: Calculate the discount amount
20% of $14 = 0.20 × $14 = $2.80

Step 4: Calculate final price
$14 - $2.80 = $11.20

The 7 apples cost $11.20
```

**Variations:**

| Variation | Trigger Phrase | Use Case |
|-----------|----------------|----------|
| Basic CoT | "Think step by step" | General reasoning |
| Explicit steps | "Break this into numbered steps" | Complex procedures |
| Show work | "Show your calculations" | Math problems |
| Reasoning chain | "Explain your reasoning" | Justification needed |

**Zero-shot CoT:**
Simply append "Let's think step by step." to any prompt.

---

## 4. Role Prompting

**When to use:** Tasks requiring expertise, specific perspective, or consistent persona.

**Pattern:**
```
You are a [role/expert description].

[Context about the task]

[Specific instruction]
```

**Example:**

Input:
```
You are a senior security engineer with 15 years of experience in web application security.

Review the following code snippet for security vulnerabilities:

def login(username, password):
    query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
    result = db.execute(query)
    return result

Identify vulnerabilities and provide specific fixes.
```

Expected Output:
```
Security Vulnerabilities Identified:

1. SQL Injection (Critical)
   - The code directly interpolates user input into the SQL query
   - An attacker can input: username=' OR '1'='1' --
   - This bypasses authentication entirely

   Fix:
   def login(username, password):
       query = "SELECT * FROM users WHERE username=? AND password=?"
       result = db.execute(query, (username, password))
       return result

2. Plain-text Password Storage (Critical)
   - Passwords appear to be stored and compared in plain text
   - Fix: Use bcrypt or argon2 for password hashing

3. Missing Rate Limiting (Medium)
   - No protection against brute force attacks
   - Fix: Implement login attempt throttling
```

**Effective role descriptions:**

| Weak | Strong |
|------|--------|
| "You are an expert" | "You are a senior data scientist with 10 years of experience in NLP" |
| "You are helpful" | "You are a patient teacher explaining concepts to beginners" |
| "You know about X" | "You are a certified AWS solutions architect specializing in serverless" |

---

## 5. Structured Output

**When to use:** When you need parseable responses (JSON, XML, CSV).

**Pattern:**
```
[Task instruction]

Respond in JSON format with exactly these fields:
- field1 (type): description
- field2 (type): description

[Input]

Return ONLY valid JSON, no markdown or explanation.
```

**Example:**

Input:
```
Extract meeting details from this email.

Respond in JSON format with exactly these fields:
- date (string, ISO format): Meeting date
- time (string, 24h format): Meeting time
- attendees (array of strings): List of attendees
- topic (string): Meeting topic
- location (string or null): Meeting location if mentioned

Email: "Hi team, let's meet tomorrow at 2pm to discuss Q4 planning.
Sarah, Mike, and Lisa should attend. We'll use Conference Room B."

Today's date is 2024-01-15.

Return ONLY valid JSON, no markdown or explanation.
```

Expected Output:
```json
{
  "date": "2024-01-16",
  "time": "14:00",
  "attendees": ["Sarah", "Mike", "Lisa"],
  "topic": "Q4 planning",
  "location": "Conference Room B"
}
```

**Format enforcement techniques:**
```
# Strong enforcement
"Return ONLY valid JSON. Start with { and end with }"

# Schema validation hint
"The output must be valid JSON matching this TypeScript type:
type Output = { name: string; age: number; active: boolean }"

# Negative instruction
"Do NOT include markdown code blocks. Do NOT add explanations."
```

---
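Even with strong enforcement, models occasionally wrap the JSON in a markdown fence anyway, so the consuming code should parse defensively. A minimal sketch (a hypothetical helper, not part of this repo's scripts):

```python
import json
import re

def parse_json_response(text, required_fields):
    """Strip a markdown fence if present, parse JSON, and check required fields."""
    # Remove a ```json ... ``` wrapper if the model added one despite instructions
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    data = json.loads(payload)
    missing = [f for f in required_fields if f not in data]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    return data

data = parse_json_response(
    '```json\n{"date": "2024-01-16", "time": "14:00"}\n```',
    ["date", "time"],
)
```

On a `json.JSONDecodeError` or missing field, a common recovery is one retry that feeds the error message back to the model.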

## 6. Self-Consistency

**When to use:** Complex reasoning where multiple valid paths exist — math word problems, logic puzzles, and any task where answers can be checked for agreement.

**Pattern:**
1. Generate multiple reasoning paths (temperature > 0)
2. Extract final answers from each path
3. Select the most common answer (majority vote)

**Example approach:**
```
# Run this prompt 5 times with temperature=0.7

Solve this logic puzzle. Think through it step by step.

Three friends (Alice, Bob, Carol) each have a different pet (cat, dog, bird).
- Alice doesn't have the dog
- The person with the bird is not Carol
- Bob's pet has four legs

What pet does each person have?
```

**Implementation:**
```python
# Pseudo-code for self-consistency
answers = []
for i in range(5):
    response = llm.generate(prompt, temperature=0.7)
    answer = extract_final_answer(response)
    answers.append(answer)

final_answer = majority_vote(answers)
```

---
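The `majority_vote` helper in the pseudo-code is left undefined; it is just a mode over the extracted answers. A minimal runnable version that also reports the agreement ratio (an illustrative addition, not part of this repo's scripts):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its agreement ratio across samples."""
    counts = Counter(a for a in answers if a is not None)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(answers)

# Five sampled reasoning paths for the apple problem; four agree
answer, agreement = majority_vote(["$11.20", "$11.20", "$14.00", "$11.20", "$11.20"])
```

A low agreement ratio is itself a useful signal: it flags inputs where the model's reasoning is unstable and a human review may be warranted.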

## 7. ReAct (Reasoning + Acting)

**When to use:** Tasks requiring external tool use or information gathering.

**Pattern:**
```
Answer the question using the available tools.

Tools available:
- search(query): Search the web
- calculate(expression): Evaluate math
- lookup(term): Look up in knowledge base

Format your response as:
Thought: [reasoning about what to do]
Action: [tool_name(arguments)]
Observation: [result from tool]
... (repeat as needed)
Thought: I now have enough information
Final Answer: [answer]

Question: [user question]
```

**Example:**

Input:
```
Question: What is the population of France divided by the population of Belgium?
```

Expected Output:
```
Thought: I need to find the population of France and Belgium, then calculate the ratio.

Action: search("population of France 2024")
Observation: France has a population of approximately 67.75 million.

Thought: Now I need Belgium's population.

Action: search("population of Belgium 2024")
Observation: Belgium has a population of approximately 11.7 million.

Thought: Now I can calculate the ratio.

Action: calculate(67.75 / 11.7)
Observation: 5.79

Thought: I now have enough information to answer.

Final Answer: The population of France is approximately 5.79 times the population of Belgium.
```

---
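The `Action:` lines above are parsed and executed by orchestration code, not by the model. A minimal single-step dispatch sketch (the tool registry is illustrative; the `eval`-based calculator is for demonstration only and should be replaced with a safe expression parser in production):

```python
import re

TOOLS = {
    # Demo-only calculator: eval with builtins stripped; unsafe for untrusted input
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_action(line):
    """Parse 'Action: tool(arguments)' and return the observation string."""
    match = re.match(r"Action:\s*(\w+)\((.*)\)\s*$", line)
    if not match:
        return "Observation: could not parse action"
    name, args = match.groups()
    if name not in TOOLS:
        return f"Observation: unknown tool '{name}'"
    return f"Observation: {TOOLS[name](args)}"

print(run_action("Action: calculate(67.75 / 11.7)"))
```

A full ReAct loop repeats: append the observation to the transcript, call the model again, and stop when a `Final Answer:` line appears or the iteration budget is exhausted.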

## 8. Tree of Thoughts

**When to use:** Complex problems requiring exploration of multiple solution paths.

**Pattern:**
```
Problem: [complex problem]

Generate 3 different approaches to solve this:

Approach 1: [description]
Evaluation: [pros/cons]

Approach 2: [description]
Evaluation: [pros/cons]

Approach 3: [description]
Evaluation: [pros/cons]

Best approach: [selection with reasoning]
Solution using best approach: [detailed solution]
```

**Example:**

Input:
```
Problem: Design a system to handle 10,000 concurrent user sessions with sub-100ms response times.

Generate 3 different architectural approaches:
```

Expected Output:
```
Approach 1: Horizontal scaling with load balancer
- Multiple stateless app servers behind load balancer
- Session state in Redis cluster
Evaluation: Simple, proven pattern. May have Redis as bottleneck.

Approach 2: Serverless with edge computing
- AWS Lambda@Edge or Cloudflare Workers
- Session state in DynamoDB Global Tables
Evaluation: Auto-scales, but cold starts may impact latency.

Approach 3: Actor model with sticky sessions
- Akka/Orleans cluster with virtual actors
- Session state co-located with actor
Evaluation: Lowest latency, but complex to operate.

Best approach: Approach 1 (Horizontal scaling)
Reasoning: Best balance of simplicity, reliability, and performance for the given requirements.

Solution: [detailed architecture follows]
```

---

## 9. Retrieval-Augmented Generation

**When to use:** Tasks requiring factual accuracy or current information.

**Pattern:**
```
Use the following context to answer the question. If the answer is not in the context, say "I don't have information about that."

Context:
[retrieved documents]

Question: [user question]

Answer based only on the context provided:
```

**Example:**

Input:
```
Use the following context to answer the question.

Context:
Document 1: "The Acme Widget 3000 was released in March 2024. It features
a 12-hour battery life and weighs 250 grams."

Document 2: "Pricing for the Acme Widget 3000 starts at $299 for the base
model. The Pro version with extended battery is $399."

Question: How much does the Acme Widget 3000 cost and what is its battery life?

Answer based only on the context provided:
```

Expected Output:
```
The Acme Widget 3000 starts at $299 for the base model, with a Pro version
available at $399. The standard model has a 12-hour battery life.
```

**Chunking strategies:**

| Strategy | Chunk Size | Overlap | Best For |
|----------|------------|---------|----------|
| Fixed | 500 tokens | 50 tokens | General purpose |
| Semantic | Paragraph | None | Well-structured docs |
| Sentence | 3-5 sentences | 1 sentence | Q&A systems |

---
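The fixed-size strategy from the table can be sketched as follows, approximating tokens with whitespace-separated words (a real pipeline would count with the model's tokenizer; assumes overlap < chunk_size):

```python
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap, measured in words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # each chunk starts `step` words after the last
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks

chunks = chunk_fixed("word " * 1000, chunk_size=500, overlap=50)
# chunks start at words 0, 450, and 900
```

The overlap ensures a fact straddling a chunk boundary is fully contained in at least one chunk, at the cost of some duplicated index entries.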

## 10. Meta-Prompting

**When to use:** Generating or improving prompts programmatically.

**Pattern:**
```
You are a prompt engineering expert.

Task: [description of what the prompt should do]
Target model: [GPT-4/Claude/etc.]
Constraints: [length limits, format requirements]

Generate an optimized prompt for this task.
```

**Example:**

Input:
```
You are a prompt engineering expert.

Task: Create a prompt that extracts action items from meeting notes
Target model: GPT-4
Constraints:
- Output must be valid JSON
- Each action item needs: task, owner, due_date
- Handle cases where information is missing

Generate an optimized prompt for this task.
```

Expected Output:
```
Extract action items from the following meeting notes.

For each action item, identify:
- task: The specific action to be taken
- owner: Person responsible (use "Unassigned" if not mentioned)
- due_date: Deadline in YYYY-MM-DD format (use null if not specified)

Meeting Notes:
{meeting_notes}

Respond with a JSON array. Example format:
[
  {"task": "Review proposal", "owner": "Sarah", "due_date": "2024-01-20"},
  {"task": "Send update", "owner": "Unassigned", "due_date": null}
]

Return ONLY the JSON array, no additional text.
```

---

## Pattern Selection Guide

| Task Type | Recommended Pattern |
|-----------|---------------------|
| Simple classification | Zero-shot |
| Consistent formatting needed | Few-shot |
| Math/logic problems | Chain-of-Thought |
| Need expertise/perspective | Role Prompting |
| API integration | Structured Output |
| High-stakes decisions | Self-Consistency |
| Tool use required | ReAct |
| Complex problem solving | Tree of Thoughts |
| Factual Q&A | RAG |
| Prompt generation | Meta-Prompting |

@@ -1,100 +1,560 @@
#!/usr/bin/env python3
"""
Agent Orchestrator - Tool for designing and validating agent workflows

Features:
- Parse agent configurations (YAML/JSON)
- Validate tool registrations
- Visualize execution flows (ASCII/Mermaid)
- Estimate token usage per run
- Detect potential issues (loops, missing tools)

Usage:
    python agent_orchestrator.py agent.yaml --validate
    python agent_orchestrator.py agent.yaml --visualize
    python agent_orchestrator.py agent.yaml --visualize --format mermaid
    python agent_orchestrator.py agent.yaml --estimate-cost
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Set, Tuple, Any
from dataclasses import dataclass, asdict, field
from enum import Enum


class AgentPattern(Enum):
    """Supported agent patterns"""
    REACT = "react"
    PLAN_EXECUTE = "plan-execute"
    TOOL_USE = "tool-use"
    MULTI_AGENT = "multi-agent"
    CUSTOM = "custom"


@dataclass
class ToolDefinition:
    """Definition of an agent tool"""
    name: str
    description: str
    parameters: Dict[str, Any] = field(default_factory=dict)
    required_config: List[str] = field(default_factory=list)
    estimated_tokens: int = 100


@dataclass
class AgentConfig:
    """Agent configuration"""
    name: str
    pattern: AgentPattern
    description: str
    tools: List[ToolDefinition]
    max_iterations: int = 10
    system_prompt: str = ""
    temperature: float = 0.7
    model: str = "gpt-4"


@dataclass
class ValidationResult:
    """Result of agent validation"""
    is_valid: bool
    errors: List[str]
    warnings: List[str]
    tool_status: Dict[str, str]
    estimated_tokens_per_run: Tuple[int, int]  # (min, max)
    potential_infinite_loop: bool
    max_depth: int


def parse_yaml_simple(content: str) -> Dict[str, Any]:
    """Simple YAML parser for agent configs (no external dependencies)"""
    result = {}
    current_list = None
    indent_stack = [(0, result)]

    lines = content.split('\n')

    for idx, line in enumerate(lines):
        # Skip empty lines and comments
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue

        # Calculate indent
        indent = len(line) - len(line.lstrip())

        # Check for list item
        if stripped.startswith('- '):
            item = stripped[2:].strip()
            if current_list is not None:
                # Check if it's a key-value pair
                if ':' in item and not item.startswith('{'):
                    key, _, value = item.partition(':')
                    current_list.append({key.strip(): value.strip().strip('"\'')})
                else:
                    current_list.append(item.strip('"\''))
            continue

        # Check for key-value pair
        if ':' in stripped:
            key, _, value = stripped.partition(':')
            key = key.strip()
            value = value.strip().strip('"\'')

            # Pop indent stack as needed
            while indent_stack and indent <= indent_stack[-1][0] and len(indent_stack) > 1:
                indent_stack.pop()

            current_dict = indent_stack[-1][1]

            if value:
                # Simple key-value
                current_dict[key] = value
                current_list = None
            else:
                # Start of nested structure or list: peek at the next line.
                # Use the loop index rather than lines.index(line), which returns
                # the first occurrence and misbehaves on duplicate lines.
                if idx + 1 < len(lines):
                    next_stripped = lines[idx + 1].strip()
                    if next_stripped.startswith('- '):
                        current_dict[key] = []
                        current_list = current_dict[key]
                    else:
                        current_dict[key] = {}
                        indent_stack.append((indent + 2, current_dict[key]))
                        current_list = None

    return result


def load_config(path: Path) -> AgentConfig:
    """Load agent configuration from file"""
    content = path.read_text(encoding='utf-8')

    # Try JSON first
    if path.suffix == '.json':
        data = json.loads(content)
    else:
        # Try YAML
        try:
            data = parse_yaml_simple(content)
        except Exception:
            # Fallback to JSON if YAML parsing fails
            data = json.loads(content)

    # Parse pattern
    pattern_str = data.get('pattern', 'react').lower()
    try:
        pattern = AgentPattern(pattern_str)
    except ValueError:
        pattern = AgentPattern.CUSTOM

    # Parse tools
    tools = []
    for tool_data in data.get('tools', []):
        if isinstance(tool_data, dict):
            tools.append(ToolDefinition(
                name=tool_data.get('name', 'unknown'),
                description=tool_data.get('description', ''),
                parameters=tool_data.get('parameters', {}),
                required_config=tool_data.get('required_config', []),
                # YAML values arrive as strings; coerce to int
                estimated_tokens=int(tool_data.get('estimated_tokens', 100))
            ))
        elif isinstance(tool_data, str):
            tools.append(ToolDefinition(name=tool_data, description=''))

    return AgentConfig(
        name=data.get('name', 'agent'),
        pattern=pattern,
        description=data.get('description', ''),
        tools=tools,
        max_iterations=int(data.get('max_iterations', 10)),
        system_prompt=data.get('system_prompt', ''),
        temperature=float(data.get('temperature', 0.7)),
        model=data.get('model', 'gpt-4')
    )


def validate_agent(config: AgentConfig) -> ValidationResult:
    """Validate agent configuration"""
    errors = []
    warnings = []
    tool_status = {}

    # Validate name
    if not config.name:
        errors.append("Agent name is required")

    # Validate tools
    if not config.tools:
        warnings.append("No tools defined - agent will have limited capabilities")

    tool_names = set()
    for tool in config.tools:
        # Check for duplicates
        if tool.name in tool_names:
            errors.append(f"Duplicate tool name: {tool.name}")
        tool_names.add(tool.name)

        # Check required config
        if tool.required_config:
            missing = [c for c in tool.required_config if not c.startswith('$')]
            if missing:
                tool_status[tool.name] = f"WARN: Missing config: {missing}"
            else:
                tool_status[tool.name] = "OK"
        else:
            tool_status[tool.name] = "OK - No config needed"

        # Check description
        if not tool.description:
            warnings.append(f"Tool '{tool.name}' has no description")

    # Validate pattern-specific requirements
    if config.pattern == AgentPattern.MULTI_AGENT:
        if len(config.tools) < 2:
            warnings.append("Multi-agent pattern typically requires 2+ specialized tools")

    # Check for potential infinite loops
    potential_loop = config.max_iterations > 50

    # Estimate tokens
    base_tokens = len(config.system_prompt.split()) * 1.3 if config.system_prompt else 200
    tool_tokens = sum(t.estimated_tokens for t in config.tools)

    min_tokens = int(base_tokens + tool_tokens)
    max_tokens = int((base_tokens + tool_tokens * 2) * config.max_iterations)

    return ValidationResult(
        is_valid=len(errors) == 0,
        errors=errors,
        warnings=warnings,
        tool_status=tool_status,
        estimated_tokens_per_run=(min_tokens, max_tokens),
        potential_infinite_loop=potential_loop,
        max_depth=config.max_iterations
    )


def generate_ascii_diagram(config: AgentConfig) -> str:
    """Generate ASCII workflow diagram"""
    lines = []

    # Header
    width = max(40, len(config.name) + 10)
    lines.append("┌" + "─" * width + "┐")
    lines.append("│" + config.name.center(width) + "│")
    lines.append("│" + f"({config.pattern.value} Pattern)".center(width) + "│")
    lines.append("└" + "─" * (width // 2 - 1) + "┬" + "─" * (width // 2) + "┘")
    lines.append(" " * (width // 2) + "│")

    # User Query
    lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
    lines.append(" " * (width // 2 - 8) + "│  User Query   │")
    lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")
    lines.append(" " * (width // 2) + "│")

    if config.pattern == AgentPattern.REACT:
        # ReAct loop
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│     Think     │◄──────┐")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘       │")
        lines.append(" " * (width // 2) + "│               │")
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐       │")
        lines.append(" " * (width // 2 - 8) + "│  Select Tool  │       │")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘       │")
        lines.append(" " * (width // 2) + "│               │")

        # Tools
        if config.tools:
            tool_line = " ".join([f"[{t.name}]" for t in config.tools[:4]])
            if len(config.tools) > 4:
                tool_line += " ..."
            lines.append(" " * 4 + tool_line)
            lines.append(" " * (width // 2) + "│               │")

        lines.append(" " * (width // 2 - 8) + "┌───────────────┐       │")
        lines.append(" " * (width // 2 - 8) + "│    Observe    │───────┘")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")

    elif config.pattern == AgentPattern.PLAN_EXECUTE:
        # Plan phase
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│  Create Plan  │")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")
        lines.append(" " * (width // 2) + "│")

        # Execute loop
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│ Execute Step  │◄──────┐")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘       │")
        lines.append(" " * (width // 2) + "│               │")

        if config.tools:
            tool_line = " ".join([f"[{t.name}]" for t in config.tools[:4]])
            lines.append(" " * 4 + tool_line)
            lines.append(" " * (width // 2) + "│               │")

        lines.append(" " * (width // 2 - 8) + "┌───────────────┐       │")
        lines.append(" " * (width // 2 - 8) + "│  Check Done?  │───────┘")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")

    else:
        # Generic tool use
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│ Process Query │")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")
        lines.append(" " * (width // 2) + "│")

        if config.tools:
            for tool in config.tools[:6]:
                lines.append(" " * (width // 2 - 8) + f"├──▶ [{tool.name}]")
            if len(config.tools) > 6:
                lines.append(" " * (width // 2 - 8) + "├──▶ [...]")

    # Final answer
    lines.append(" " * (width // 2) + "│")
    lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
    lines.append(" " * (width // 2 - 8) + "│ Final Answer  │")
    lines.append(" " * (width // 2 - 8) + "└───────────────┘")

    return '\n'.join(lines)


def generate_mermaid_diagram(config: AgentConfig) -> str:
    """Generate Mermaid flowchart"""
    lines = ["```mermaid", "flowchart TD"]

    # Start and query
    lines.append(f"    subgraph {config.name}[{config.name}]")
    lines.append("    direction TB")
    lines.append("    A[User Query] --> B{Process}")

    if config.pattern == AgentPattern.REACT:
        lines.append("    B --> C[Think]")
        lines.append("    C --> D{Select Tool}")

        for i, tool in enumerate(config.tools[:6]):
            lines.append(f"    D -->|{tool.name}| T{i}[{tool.name}]")
            lines.append(f"    T{i} --> E[Observe]")

        lines.append("    E -->|Continue| C")
        lines.append("    E -->|Done| F[Final Answer]")

    elif config.pattern == AgentPattern.PLAN_EXECUTE:
        lines.append("    B --> P[Create Plan]")
        lines.append("    P --> X{Execute Step}")

        for i, tool in enumerate(config.tools[:6]):
            lines.append(f"    X -->|{tool.name}| T{i}[{tool.name}]")
            lines.append(f"    T{i} --> R[Review]")

        lines.append("    R -->|More Steps| X")
        lines.append("    R -->|Complete| F[Final Answer]")

    else:
        for i, tool in enumerate(config.tools[:6]):
            lines.append(f"    B -->|use| T{i}[{tool.name}]")
            lines.append(f"    T{i} --> F[Final Answer]")

    lines.append("    end")
    lines.append("```")

    return '\n'.join(lines)


def estimate_cost(config: AgentConfig, runs: int = 100) -> Dict[str, Any]:
    """Estimate token costs for agent runs"""
    validation = validate_agent(config)
    min_tokens, max_tokens = validation.estimated_tokens_per_run

    # Cost per 1K tokens
    costs = {
        'gpt-4': {'input': 0.03, 'output': 0.06},
        'gpt-4-turbo': {'input': 0.01, 'output': 0.03},
        'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
        'claude-3-opus': {'input': 0.015, 'output': 0.075},
        'claude-3-sonnet': {'input': 0.003, 'output': 0.015},
    }

    model_cost = costs.get(config.model, costs['gpt-4'])

    # Assume 60% input, 40% output
    input_tokens = min_tokens * 0.6
    output_tokens = min_tokens * 0.4

    cost_per_run_min = (input_tokens / 1000 * model_cost['input'] +
                        output_tokens / 1000 * model_cost['output'])

    input_tokens_max = max_tokens * 0.6
    output_tokens_max = max_tokens * 0.4
    cost_per_run_max = (input_tokens_max / 1000 * model_cost['input'] +
                        output_tokens_max / 1000 * model_cost['output'])

    return {
        'model': config.model,
        'tokens_per_run': {'min': min_tokens, 'max': max_tokens},
        'cost_per_run': {'min': round(cost_per_run_min, 4), 'max': round(cost_per_run_max, 4)},
        'estimated_monthly': {
            'runs': runs * 30,
            'cost_min': round(cost_per_run_min * runs * 30, 2),
            'cost_max': round(cost_per_run_max * runs * 30, 2)
        }
    }


def format_validation_report(config: AgentConfig, result: ValidationResult) -> str:
    """Format validation result as human-readable report"""
    lines = []
    lines.append("=" * 50)
    lines.append("AGENT VALIDATION REPORT")
    lines.append("=" * 50)
    lines.append("")

    lines.append("📋 AGENT INFO")
    lines.append(f"   Name: {config.name}")
    lines.append(f"   Pattern: {config.pattern.value}")
    lines.append(f"   Model: {config.model}")
    lines.append("")

    lines.append(f"🔧 TOOLS ({len(config.tools)} registered)")
    for tool in config.tools:
        status = result.tool_status.get(tool.name, "Unknown")
        emoji = "✅" if status.startswith("OK") else "⚠️"
        lines.append(f"   {emoji} {tool.name} - {status}")
    lines.append("")

    lines.append("📊 FLOW ANALYSIS")
    lines.append(f"   Max iterations: {result.max_depth}")
    lines.append(f"   Estimated tokens: {result.estimated_tokens_per_run[0]:,} - {result.estimated_tokens_per_run[1]:,}")
    lines.append(f"   Potential loop: {'⚠️ Yes' if result.potential_infinite_loop else '✅ No'}")
    lines.append("")

    if result.errors:
        lines.append(f"❌ ERRORS ({len(result.errors)})")
        for error in result.errors:
            lines.append(f"   • {error}")
        lines.append("")

    if result.warnings:
        lines.append(f"⚠️ WARNINGS ({len(result.warnings)})")
        for warning in result.warnings:
            lines.append(f"   • {warning}")
        lines.append("")

    # Overall status
    if result.is_valid:
        lines.append("✅ VALIDATION PASSED")
    else:
        lines.append("❌ VALIDATION FAILED")

    lines.append("")
    lines.append("=" * 50)

    return '\n'.join(lines)
|
||||
|
||||
|
||||
def main():
    """Main entry point"""
    parser = argparse.ArgumentParser(
        description="Agent Orchestrator - Design and validate agent workflows",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s agent.yaml --validate
  %(prog)s agent.yaml --visualize
  %(prog)s agent.yaml --visualize --format mermaid
  %(prog)s agent.yaml --estimate-cost --runs 100

Agent config format (YAML):

  name: research_assistant
  pattern: react
  model: gpt-4
  max_iterations: 10
  tools:
    - name: web_search
      description: Search the web
      required_config: [api_key]
    - name: calculator
      description: Evaluate math expressions
"""
    )
    parser.add_argument('config', help='Agent configuration file (YAML or JSON)')
    parser.add_argument('--validate', '-V', action='store_true', help='Validate agent configuration')
    parser.add_argument('--visualize', '-v', action='store_true', help='Visualize agent workflow')
    parser.add_argument('--format', '-f', choices=['ascii', 'mermaid'], default='ascii',
                        help='Visualization format (default: ascii)')
    parser.add_argument('--estimate-cost', '-e', action='store_true', help='Estimate token costs')
    parser.add_argument('--runs', '-r', type=int, default=100, help='Daily runs for cost estimation')
    parser.add_argument('--output', '-o', help='Output file path')
    parser.add_argument('--json', '-j', action='store_true', help='Output as JSON')

    args = parser.parse_args()
    # Load config
    config_path = Path(args.config)
    if not config_path.exists():
        print(f"Error: Config file not found: {args.config}", file=sys.stderr)
        sys.exit(1)

    try:
        config = load_config(config_path)
    except Exception as e:
        print(f"Error parsing config: {e}", file=sys.stderr)
        sys.exit(1)

    # Default to validate if no action specified
    if not any([args.validate, args.visualize, args.estimate_cost]):
        args.validate = True

    output_parts = []

    # Validate
    if args.validate:
        result = validate_agent(config)
        if args.json:
            output_parts.append(json.dumps(asdict(result), indent=2))
        else:
            output_parts.append(format_validation_report(config, result))

    # Visualize
    if args.visualize:
        if args.format == 'mermaid':
            diagram = generate_mermaid_diagram(config)
        else:
            diagram = generate_ascii_diagram(config)
        output_parts.append(diagram)

    # Cost estimation
    if args.estimate_cost:
        costs = estimate_cost(config, args.runs)
        if args.json:
            output_parts.append(json.dumps(costs, indent=2))
        else:
            output_parts.append("")
            output_parts.append("💰 COST ESTIMATION")
            output_parts.append(f"  Model: {costs['model']}")
            output_parts.append(f"  Tokens per run: {costs['tokens_per_run']['min']:,} - {costs['tokens_per_run']['max']:,}")
            output_parts.append(f"  Cost per run: ${costs['cost_per_run']['min']:.4f} - ${costs['cost_per_run']['max']:.4f}")
            output_parts.append(f"  Monthly ({costs['estimated_monthly']['runs']:,} runs):")
            output_parts.append(f"    Min: ${costs['estimated_monthly']['cost_min']:.2f}")
            output_parts.append(f"    Max: ${costs['estimated_monthly']['cost_max']:.2f}")

    # Output
    output = '\n'.join(output_parts)
    print(output)

    if args.output:
        Path(args.output).write_text(output)
        print(f"\nOutput saved to {args.output}")


if __name__ == '__main__':
    main()
@@ -1,100 +1,519 @@
#!/usr/bin/env python3
"""
Prompt Optimizer - Static analysis tool for prompt engineering

Features:
- Token estimation (GPT-4/Claude approximation)
- Prompt structure analysis
- Clarity scoring
- Few-shot example extraction and management
- Optimization suggestions

Usage:
    python prompt_optimizer.py prompt.txt --analyze
    python prompt_optimizer.py prompt.txt --tokens --model gpt-4
    python prompt_optimizer.py prompt.txt --optimize --output optimized.txt
    python prompt_optimizer.py prompt.txt --extract-examples --output examples.json
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, asdict
# Token estimation ratios (chars per token approximation)
TOKEN_RATIOS = {
    'gpt-4': 4.0,
    'gpt-3.5': 4.0,
    'claude': 3.5,
    'default': 4.0
}

# Cost per 1K tokens (input)
COST_PER_1K = {
    'gpt-4': 0.03,
    'gpt-4-turbo': 0.01,
    'gpt-3.5-turbo': 0.0005,
    'claude-3-opus': 0.015,
    'claude-3-sonnet': 0.003,
    'claude-3-haiku': 0.00025,
    'default': 0.01
}


@dataclass
class PromptAnalysis:
    """Results of prompt analysis"""
    token_count: int
    estimated_cost: float
    model: str
    clarity_score: int
    structure_score: int
    issues: List[Dict[str, str]]
    suggestions: List[str]
    sections: List[Dict[str, any]]
    has_examples: bool
    example_count: int
    has_output_format: bool
    word_count: int
    line_count: int


@dataclass
class FewShotExample:
    """A single few-shot example"""
    input_text: str
    output_text: str
    index: int


def estimate_tokens(text: str, model: str = 'default') -> int:
    """Estimate token count based on character ratio"""
    ratio = TOKEN_RATIOS.get(model, TOKEN_RATIOS['default'])
    return int(len(text) / ratio)


def estimate_cost(token_count: int, model: str = 'default') -> float:
    """Estimate cost based on token count"""
    cost_per_1k = COST_PER_1K.get(model, COST_PER_1K['default'])
    return round((token_count / 1000) * cost_per_1k, 6)
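The two estimators above are pure functions of string length, so they can be exercised standalone. A minimal sketch with the constants inlined rather than imported (the chars-per-token ratio is an approximation, not real tokenizer output):

```python
# Sketch mirroring estimate_tokens/estimate_cost above; chars-per-token is an
# approximation, not a real tokenizer.
TOKEN_RATIOS = {'gpt-4': 4.0, 'claude': 3.5, 'default': 4.0}
COST_PER_1K = {'gpt-4': 0.03, 'default': 0.01}

def estimate_tokens(text: str, model: str = 'default') -> int:
    ratio = TOKEN_RATIOS.get(model, TOKEN_RATIOS['default'])
    return int(len(text) / ratio)

def estimate_cost(token_count: int, model: str = 'default') -> float:
    return round((token_count / 1000) * COST_PER_1K.get(model, COST_PER_1K['default']), 6)

tokens = estimate_tokens('a' * 400, 'gpt-4')
print(tokens, estimate_cost(tokens, 'gpt-4'))  # 100 tokens -> 0.003
```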
def find_ambiguous_instructions(text: str) -> List[Dict[str, str]]:
    """Find vague or ambiguous instructions"""
    issues = []

    # Vague verbs that need specificity
    vague_patterns = [
        (r'\b(analyze|process|handle|deal with)\b', 'Vague verb - specify the exact action'),
        (r'\b(good|nice|appropriate|suitable)\b', 'Subjective term - define specific criteria'),
        (r'\b(etc\.|and so on|and more)\b', 'Open-ended list - enumerate all items explicitly'),
        (r'\b(if needed|as necessary|when appropriate)\b', 'Conditional without criteria - specify when'),
        (r'\b(some|several|many|few|various)\b', 'Vague quantity - use specific numbers'),
    ]

    lines = text.split('\n')
    for i, line in enumerate(lines, 1):
        for pattern, message in vague_patterns:
            matches = re.finditer(pattern, line, re.IGNORECASE)
            for match in matches:
                issues.append({
                    'type': 'ambiguity',
                    'line': i,
                    'text': match.group(),
                    'message': message,
                    'context': line.strip()[:80]
                })

    return issues
def find_redundant_content(text: str) -> List[Dict[str, str]]:
    """Find potentially redundant content"""
    issues = []
    lines = text.split('\n')

    # Check for repeated phrases (3+ words)
    seen_phrases = {}
    for i, line in enumerate(lines, 1):
        words = line.split()
        for j in range(len(words) - 2):
            phrase = ' '.join(words[j:j+3]).lower()
            phrase = re.sub(r'[^\w\s]', '', phrase)
            if phrase and len(phrase) > 10:
                if phrase in seen_phrases:
                    issues.append({
                        'type': 'redundancy',
                        'line': i,
                        'text': phrase,
                        'message': f'Phrase repeated from line {seen_phrases[phrase]}',
                        'context': line.strip()[:80]
                    })
                else:
                    seen_phrases[phrase] = i

    return issues
def check_output_format(text: str) -> Tuple[bool, List[str]]:
    """Check if prompt specifies output format"""
    suggestions = []

    format_indicators = [
        r'respond\s+(in|with)\s+(json|xml|csv|markdown)',
        r'output\s+format',
        r'return\s+(only|just)',
        r'format:\s*\n',
        r'\{["\']?\w+["\']?\s*:',  # JSON-like structure
        r'```\w*\n',  # Code block
    ]

    has_format = any(re.search(p, text, re.IGNORECASE) for p in format_indicators)

    if not has_format:
        suggestions.append('Add explicit output format specification (e.g., "Respond in JSON with keys: ...")')

    return has_format, suggestions
def extract_sections(text: str) -> List[Dict[str, any]]:
    """Extract logical sections from prompt"""
    sections = []

    # Common section patterns
    section_patterns = [
        r'^#+\s+(.+)$',  # Markdown headers
        r'^([A-Z][A-Za-z\s]+):\s*$',  # Title Case Label:
        r'^(Instructions|Context|Examples?|Input|Output|Task|Role|Format)[:.]',
    ]

    lines = text.split('\n')
    current_section = {'name': 'Introduction', 'start': 1, 'content': []}

    for i, line in enumerate(lines, 1):
        is_header = False
        for pattern in section_patterns:
            match = re.match(pattern, line.strip(), re.IGNORECASE)
            if match:
                if current_section['content']:
                    current_section['end'] = i - 1
                    current_section['line_count'] = len(current_section['content'])
                    sections.append(current_section)
                current_section = {
                    'name': match.group(1).strip() if match.groups() else line.strip(),
                    'start': i,
                    'content': []
                }
                is_header = True
                break

        if not is_header:
            current_section['content'].append(line)

    # Add last section
    if current_section['content']:
        current_section['end'] = len(lines)
        current_section['line_count'] = len(current_section['content'])
        sections.append(current_section)

    return sections
def extract_few_shot_examples(text: str) -> List[FewShotExample]:
    """Extract few-shot examples from prompt"""
    examples = []

    # Pattern 1: "Example N:" or "Example:" blocks
    example_pattern = r'Example\s*\d*:\s*\n(Input:\s*(.+?)\n(?:Output:\s*(.+?)(?=\n\nExample|\n\n[A-Z]|\Z)))'

    matches = re.finditer(example_pattern, text, re.DOTALL | re.IGNORECASE)
    for i, match in enumerate(matches, 1):
        examples.append(FewShotExample(
            input_text=match.group(2).strip() if match.group(2) else '',
            output_text=match.group(3).strip() if match.group(3) else '',
            index=i
        ))

    # Pattern 2: Input/Output pairs without "Example" label
    if not examples:
        io_pattern = r'Input:\s*["\']?(.+?)["\']?\s*\nOutput:\s*(.+?)(?=\nInput:|\Z)'
        matches = re.finditer(io_pattern, text, re.DOTALL)
        for i, match in enumerate(matches, 1):
            examples.append(FewShotExample(
                input_text=match.group(1).strip(),
                output_text=match.group(2).strip(),
                index=i
            ))

    return examples
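The Pattern 2 fallback regex above can be checked in isolation. A small sketch with a hypothetical two-example prompt (the prompt text is illustrative, not from the skill):

```python
import re

# Same Input:/Output: fallback regex as in extract_few_shot_examples above.
io_pattern = r'Input:\s*["\']?(.+?)["\']?\s*\nOutput:\s*(.+?)(?=\nInput:|\Z)'

prompt = """Input: The movie was great
Output: positive
Input: Terrible service
Output: negative"""

pairs = [(m.group(1).strip(), m.group(2).strip())
         for m in re.finditer(io_pattern, prompt, re.DOTALL)]
print(pairs)  # [('The movie was great', 'positive'), ('Terrible service', 'negative')]
```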
def calculate_clarity_score(text: str, issues: List[Dict]) -> int:
    """Calculate clarity score (0-100)"""
    score = 100

    # Deduct for issues
    score -= len([i for i in issues if i['type'] == 'ambiguity']) * 5
    score -= len([i for i in issues if i['type'] == 'redundancy']) * 3

    # Check for structure
    if not re.search(r'^#+\s|^[A-Z][a-z]+:', text, re.MULTILINE):
        score -= 10  # No clear sections

    # Check for instruction clarity
    if not re.search(r'(you (should|must|will)|please|your task)', text, re.IGNORECASE):
        score -= 5  # No clear directives

    return max(0, min(100, score))
def calculate_structure_score(sections: List[Dict], has_format: bool, has_examples: bool) -> int:
    """Calculate structure score (0-100)"""
    score = 50  # Base score

    # Bonus for clear sections
    if len(sections) >= 2:
        score += 15
    if len(sections) >= 4:
        score += 10

    # Bonus for output format
    if has_format:
        score += 15

    # Bonus for examples
    if has_examples:
        score += 10

    return min(100, score)
def generate_suggestions(analysis: PromptAnalysis) -> List[str]:
    """Generate optimization suggestions"""
    suggestions = []

    if not analysis.has_output_format:
        suggestions.append('Add explicit output format: "Respond in JSON with keys: ..."')

    if analysis.example_count == 0:
        suggestions.append('Consider adding 2-3 few-shot examples for consistent outputs')
    elif analysis.example_count == 1:
        suggestions.append('Add 1-2 more examples to improve consistency')
    elif analysis.example_count > 5:
        suggestions.append(f'Consider reducing examples from {analysis.example_count} to 3-5 to save tokens')

    if analysis.clarity_score < 70:
        suggestions.append('Improve clarity: replace vague terms with specific instructions')

    if analysis.token_count > 2000:
        suggestions.append(f'Prompt is {analysis.token_count} tokens - consider condensing for cost efficiency')

    # Check for role prompting in the opening section
    # (note: analysis.sections entries carry only 'name'/'lines', so .get('content')
    # falls back to the empty string here and the suggestion always fires)
    first_section_text = analysis.sections[0].get('content', [''])[0] if analysis.sections else ''
    if not re.search(r'you are|act as|as a\s+\w+', first_section_text, re.IGNORECASE):
        suggestions.append('Consider adding role context: "You are an expert..."')

    return suggestions
def analyze_prompt(text: str, model: str = 'gpt-4') -> PromptAnalysis:
    """Perform comprehensive prompt analysis"""

    # Basic metrics
    token_count = estimate_tokens(text, model)
    cost = estimate_cost(token_count, model)
    word_count = len(text.split())
    line_count = len(text.split('\n'))

    # Find issues
    ambiguity_issues = find_ambiguous_instructions(text)
    redundancy_issues = find_redundant_content(text)
    all_issues = ambiguity_issues + redundancy_issues

    # Extract structure
    sections = extract_sections(text)
    examples = extract_few_shot_examples(text)
    has_format, format_suggestions = check_output_format(text)

    # Calculate scores
    clarity_score = calculate_clarity_score(text, all_issues)
    structure_score = calculate_structure_score(sections, has_format, len(examples) > 0)

    analysis = PromptAnalysis(
        token_count=token_count,
        estimated_cost=cost,
        model=model,
        clarity_score=clarity_score,
        structure_score=structure_score,
        issues=all_issues,
        suggestions=[],
        sections=[{'name': s['name'], 'lines': f"{s['start']}-{s.get('end', s['start'])}"} for s in sections],
        has_examples=len(examples) > 0,
        example_count=len(examples),
        has_output_format=has_format,
        word_count=word_count,
        line_count=line_count
    )

    analysis.suggestions = generate_suggestions(analysis) + format_suggestions

    return analysis
def optimize_prompt(text: str) -> str:
    """Generate optimized version of prompt"""
    optimized = text

    # Remove redundant whitespace
    optimized = re.sub(r'\n{3,}', '\n\n', optimized)
    optimized = re.sub(r' {2,}', ' ', optimized)

    # Trim lines
    lines = [line.rstrip() for line in optimized.split('\n')]
    optimized = '\n'.join(lines)

    return optimized.strip()
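optimize_prompt above is a pure whitespace normalization, so its effect is easy to show standalone (the `raw` string is an illustrative input, not from the skill):

```python
import re

def compact(text: str) -> str:
    # Same steps as optimize_prompt: collapse blank-line runs, squeeze spaces, trim.
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    return '\n'.join(line.rstrip() for line in text.split('\n')).strip()

raw = "Role:  expert\n\n\n\nTask: summarize   the text.  \n"
print(repr(compact(raw)))  # 'Role: expert\n\nTask: summarize the text.'
```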
def format_report(analysis: PromptAnalysis) -> str:
    """Format analysis as human-readable report"""
    report = []
    report.append("=" * 50)
    report.append("PROMPT ANALYSIS REPORT")
    report.append("=" * 50)
    report.append("")

    report.append("📊 METRICS")
    report.append(f"  Token count: {analysis.token_count:,}")
    report.append(f"  Estimated cost: ${analysis.estimated_cost:.4f} ({analysis.model})")
    report.append(f"  Word count: {analysis.word_count:,}")
    report.append(f"  Line count: {analysis.line_count}")
    report.append("")

    report.append("📈 SCORES")
    report.append(f"  Clarity: {analysis.clarity_score}/100 {'✅' if analysis.clarity_score >= 70 else '⚠️'}")
    report.append(f"  Structure: {analysis.structure_score}/100 {'✅' if analysis.structure_score >= 70 else '⚠️'}")
    report.append("")

    report.append("📋 STRUCTURE")
    report.append(f"  Sections: {len(analysis.sections)}")
    report.append(f"  Examples: {analysis.example_count} {'✅' if analysis.has_examples else '❌'}")
    report.append(f"  Output format: {'✅ Specified' if analysis.has_output_format else '❌ Missing'}")
    report.append("")

    if analysis.sections:
        report.append("  Detected sections:")
        for section in analysis.sections:
            report.append(f"    - {section['name']} (lines {section['lines']})")
        report.append("")

    if analysis.issues:
        report.append(f"⚠️ ISSUES FOUND ({len(analysis.issues)})")
        for issue in analysis.issues[:10]:  # Limit to first 10
            report.append(f"  Line {issue['line']}: {issue['message']}")
            report.append(f"    Found: \"{issue['text']}\"")
        if len(analysis.issues) > 10:
            report.append(f"  ... and {len(analysis.issues) - 10} more issues")
        report.append("")

    if analysis.suggestions:
        report.append("💡 SUGGESTIONS")
        for i, suggestion in enumerate(analysis.suggestions, 1):
            report.append(f"  {i}. {suggestion}")
        report.append("")

    report.append("=" * 50)

    return '\n'.join(report)
def main():
    """Main entry point"""
    parser = argparse.ArgumentParser(
        description="Prompt Optimizer - Analyze and optimize prompts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s prompt.txt --analyze
  %(prog)s prompt.txt --tokens --model claude-3-sonnet
  %(prog)s prompt.txt --optimize --output optimized.txt
  %(prog)s prompt.txt --extract-examples --output examples.json
"""
    )
    parser.add_argument('prompt', help='Prompt file to analyze')
    parser.add_argument('--analyze', '-a', action='store_true', help='Run full analysis')
    parser.add_argument('--tokens', '-t', action='store_true', help='Count tokens only')
    parser.add_argument('--optimize', '-O', action='store_true', help='Generate optimized version')
    parser.add_argument('--extract-examples', '-e', action='store_true', help='Extract few-shot examples')
    parser.add_argument('--model', '-m', default='gpt-4',
                        choices=['gpt-4', 'gpt-4-turbo', 'gpt-3.5-turbo', 'claude-3-opus', 'claude-3-sonnet', 'claude-3-haiku'],
                        help='Model for token/cost estimation')
    parser.add_argument('--output', '-o', help='Output file path')
    parser.add_argument('--json', '-j', action='store_true', help='Output as JSON')
    parser.add_argument('--compare', '-c', help='Compare with baseline analysis JSON')

    args = parser.parse_args()
    # Read prompt file
    prompt_path = Path(args.prompt)
    if not prompt_path.exists():
        print(f"Error: File not found: {args.prompt}", file=sys.stderr)
        sys.exit(1)

    text = prompt_path.read_text(encoding='utf-8')

    # Tokens only
    if args.tokens:
        token_count = estimate_tokens(text, args.model)
        cost = estimate_cost(token_count, args.model)
        if args.json:
            print(json.dumps({
                'tokens': token_count,
                'cost': cost,
                'model': args.model
            }, indent=2))
        else:
            print(f"Tokens: {token_count:,}")
            print(f"Estimated cost: ${cost:.4f} ({args.model})")
        sys.exit(0)

    # Extract examples
    if args.extract_examples:
        examples = extract_few_shot_examples(text)
        output = [asdict(ex) for ex in examples]

        if args.output:
            Path(args.output).write_text(json.dumps(output, indent=2))
            print(f"Extracted {len(examples)} examples to {args.output}")
        else:
            print(json.dumps(output, indent=2))
        sys.exit(0)

    # Optimize
    if args.optimize:
        optimized = optimize_prompt(text)

        if args.output:
            Path(args.output).write_text(optimized)
            print(f"Optimized prompt written to {args.output}")

            # Show comparison
            orig_tokens = estimate_tokens(text, args.model)
            new_tokens = estimate_tokens(optimized, args.model)
            saved = orig_tokens - new_tokens
            print(f"Tokens: {orig_tokens:,} -> {new_tokens:,} (saved {saved:,})")
        else:
            print(optimized)
        sys.exit(0)

    # Default: full analysis
    analysis = analyze_prompt(text, args.model)

    # Compare with baseline
    if args.compare:
        baseline_path = Path(args.compare)
        if baseline_path.exists():
            baseline = json.loads(baseline_path.read_text())
            print("\n📊 COMPARISON WITH BASELINE")
            print(f"  Tokens: {baseline.get('token_count', 0):,} -> {analysis.token_count:,}")
            print(f"  Clarity: {baseline.get('clarity_score', 0)} -> {analysis.clarity_score}")
            print(f"  Issues: {len(baseline.get('issues', []))} -> {len(analysis.issues)}")
            print()

    if args.json:
        print(json.dumps(asdict(analysis), indent=2))
    else:
        print(format_report(analysis))

    # Write to output file
    if args.output:
        output_data = asdict(analysis)
        Path(args.output).write_text(json.dumps(output_data, indent=2))
        print(f"\nAnalysis saved to {args.output}")


if __name__ == '__main__':
    main()
@@ -1,100 +1,574 @@
#!/usr/bin/env python3
"""
RAG Evaluator - Evaluation tool for Retrieval-Augmented Generation systems

Features:
- Context relevance scoring (lexical overlap)
- Answer faithfulness checking
- Retrieval metrics (Precision@K, Recall@K, MRR)
- Coverage analysis
- Quality report generation

Usage:
    python rag_evaluator.py --contexts contexts.json --questions questions.json
    python rag_evaluator.py --contexts ctx.json --questions q.json --metrics relevance,faithfulness
    python rag_evaluator.py --contexts ctx.json --questions q.json --output report.json --verbose
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Set, Tuple
from dataclasses import dataclass, asdict, field
from collections import Counter
import math
@dataclass
class RetrievalMetrics:
    """Retrieval quality metrics"""
    precision_at_k: float
    recall_at_k: float
    mrr: float  # Mean Reciprocal Rank
    ndcg_at_k: float
    k: int


@dataclass
class ContextEvaluation:
    """Evaluation of a single context"""
    context_id: str
    relevance_score: float
    token_overlap: float
    key_terms_covered: List[str]
    missing_terms: List[str]


@dataclass
class AnswerEvaluation:
    """Evaluation of an answer against context"""
    question_id: str
    faithfulness_score: float
    groundedness_score: float
    claims: List[Dict[str, any]]
    unsupported_claims: List[str]
    context_used: List[str]


@dataclass
class RAGEvaluationReport:
    """Complete RAG evaluation report"""
    total_questions: int
    avg_context_relevance: float
    avg_faithfulness: float
    avg_groundedness: float
    retrieval_metrics: Dict[str, float]
    coverage: float
    issues: List[Dict[str, str]]
    recommendations: List[str]
    question_details: List[Dict[str, any]] = field(default_factory=list)
def tokenize(text: str) -> List[str]:
    """Simple tokenization for text comparison"""
    # Lowercase and split on non-alphanumeric
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    # Remove common stopwords
    stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been',
                 'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
                 'would', 'could', 'should', 'may', 'might', 'must', 'shall',
                 'can', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
                 'from', 'as', 'into', 'through', 'during', 'before', 'after',
                 'above', 'below', 'up', 'down', 'out', 'off', 'over', 'under',
                 'again', 'further', 'then', 'once', 'here', 'there', 'when',
                 'where', 'why', 'how', 'all', 'each', 'few', 'more', 'most',
                 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
                 'same', 'so', 'than', 'too', 'very', 'just', 'and', 'but',
                 'if', 'or', 'because', 'until', 'while', 'it', 'this', 'that',
                 'these', 'those', 'i', 'you', 'he', 'she', 'we', 'they'}
    return [t for t in tokens if t not in stopwords and len(t) > 2]
def extract_key_terms(text: str, top_n: int = 10) -> List[str]:
    """Extract key terms from text based on frequency"""
    tokens = tokenize(text)
    freq = Counter(tokens)
    return [term for term, _ in freq.most_common(top_n)]
def calculate_token_overlap(text1: str, text2: str) -> float:
    """Calculate Jaccard similarity between two texts"""
    tokens1 = set(tokenize(text1))
    tokens2 = set(tokenize(text2))

    if not tokens1 or not tokens2:
        return 0.0

    intersection = tokens1 & tokens2
    union = tokens1 | tokens2

    return len(intersection) / len(union) if union else 0.0
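A standalone sketch of the Jaccard overlap above (abbreviated stopword list, so scores may differ slightly from the full tokenize):

```python
import re

STOPWORDS = {'the', 'a', 'is', 'of', 'in', 'and'}  # abbreviated for this sketch

def token_set(text: str) -> set:
    return {t for t in re.findall(r'\b\w+\b', text.lower())
            if t not in STOPWORDS and len(t) > 2}

def jaccard(text1: str, text2: str) -> float:
    t1, t2 = token_set(text1), token_set(text2)
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

# {retrieval, augmented, generation} vs {retrieval, generation, pipeline}:
# 2 shared / 4 total
print(jaccard("retrieval augmented generation", "retrieval generation pipeline"))  # 0.5
```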
def calculate_rouge_l(reference: str, candidate: str) -> float:
|
||||
"""Calculate ROUGE-L score (Longest Common Subsequence)"""
|
||||
ref_tokens = tokenize(reference)
|
||||
cand_tokens = tokenize(candidate)
|
||||
|
||||
if not ref_tokens or not cand_tokens:
|
||||
return 0.0
|
||||
|
||||
# LCS using dynamic programming
|
||||
m, n = len(ref_tokens), len(cand_tokens)
|
||||
dp = [[0] * (n + 1) for _ in range(m + 1)]
|
||||
|
||||
for i in range(1, m + 1):
|
||||
for j in range(1, n + 1):
|
||||
if ref_tokens[i-1] == cand_tokens[j-1]:
|
||||
dp[i][j] = dp[i-1][j-1] + 1
|
||||
else:
|
||||
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
|
||||
|
||||
lcs_length = dp[m][n]
|
||||
|
||||
# F1-like score
|
||||
precision = lcs_length / n if n > 0 else 0
|
||||
recall = lcs_length / m if m > 0 else 0
|
||||
|
||||
if precision + recall == 0:
|
||||
return 0.0
|
||||
|
||||
return 2 * precision * recall / (precision + recall)
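The DP above is standard LCS; a condensed, self-contained sketch (whitespace tokens only, hypothetical sentences) shows the F1 framing:

```python
# Condensed ROUGE-L: LCS over whitespace tokens, reported as an F1 score.
def rouge_l_simple(reference: str, candidate: str) -> float:
    r, c = reference.lower().split(), candidate.lower().split()
    if not r or not c:
        return 0.0
    m, n = len(r), len(c)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i-1][j-1] + 1 if r[i-1] == c[j-1] else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[m][n]
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

score = rouge_l_simple("the cat sat on the mat", "the cat lay on the mat")
# LCS = "the cat on the mat" (5 tokens), precision = recall = 5/6 -> F1 ≈ 0.833
```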


def evaluate_context_relevance(question: str, context: str, context_id: str = "") -> ContextEvaluation:
    """Evaluate how relevant a context is to a question"""
    question_terms = set(extract_key_terms(question, 15))
    context_terms = set(extract_key_terms(context, 30))

    covered = question_terms & context_terms
    missing = question_terms - context_terms

    # Relevance combines key-term coverage with raw token overlap
    term_coverage = len(covered) / len(question_terms) if question_terms else 0
    token_overlap = calculate_token_overlap(question, context)

    # Combined relevance score
    relevance = 0.6 * term_coverage + 0.4 * token_overlap

    return ContextEvaluation(
        context_id=context_id,
        relevance_score=round(relevance, 3),
        token_overlap=round(token_overlap, 3),
        key_terms_covered=list(covered),
        missing_terms=list(missing)
    )
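The 0.6/0.4 weighting favors key-term coverage over raw overlap; with hypothetical scores the arithmetic is:

```python
# Hypothetical inputs: 12 of 15 question key terms appear in the context,
# and the token-level Jaccard overlap is 0.25.
term_coverage = 12 / 15          # 0.8
token_overlap = 0.25
relevance = 0.6 * term_coverage + 0.4 * token_overlap
# 0.6 * 0.8 + 0.4 * 0.25 = 0.48 + 0.10 = 0.58
```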


def extract_claims(answer: str) -> List[str]:
    """Extract individual claims from an answer"""
    # Split on sentence boundaries
    sentences = re.split(r'[.!?]+', answer)
    claims = []

    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) > 10:  # Filter out very short fragments
            claims.append(sentence)

    return claims


def check_claim_support(claim: str, context: str) -> Tuple[bool, float]:
    """Check if a claim is supported by the context"""
    claim_terms = set(tokenize(claim))
    context_terms = set(tokenize(context))

    if not claim_terms:
        return True, 1.0  # Empty claim is trivially "supported"

    # Term overlap between claim and context
    overlap = claim_terms & context_terms
    support_ratio = len(overlap) / len(claim_terms)

    # Also check for ROUGE-L style sequence matching
    rouge_score = calculate_rouge_l(context, claim)

    # Combined support score
    support_score = 0.5 * support_ratio + 0.5 * rouge_score

    return support_score > 0.3, support_score


def evaluate_answer_faithfulness(
    question: str,
    answer: str,
    contexts: List[str],
    question_id: str = ""
) -> AnswerEvaluation:
    """Evaluate if answer is faithful to the provided contexts"""
    claims = extract_claims(answer)
    combined_context = ' '.join(contexts)

    claim_evaluations = []
    supported_claims = 0
    unsupported = []
    context_used = []

    for claim in claims:
        is_supported, score = check_claim_support(claim, combined_context)

        claim_eval = {
            'claim': claim[:100] + '...' if len(claim) > 100 else claim,
            'supported': is_supported,
            'score': round(score, 3)
        }

        # Track which contexts support this claim
        for i, ctx in enumerate(contexts):
            _, ctx_score = check_claim_support(claim, ctx)
            if ctx_score > 0.3:
                claim_eval[f'context_{i}'] = round(ctx_score, 3)
                if f'context_{i}' not in context_used:
                    context_used.append(f'context_{i}')

        claim_evaluations.append(claim_eval)

        if is_supported:
            supported_claims += 1
        else:
            unsupported.append(claim[:100])

    # Faithfulness = fraction of claims supported
    faithfulness = supported_claims / len(claims) if claims else 1.0

    # Groundedness = average support score across claims
    avg_score = sum(c['score'] for c in claim_evaluations) / len(claim_evaluations) if claim_evaluations else 1.0

    return AnswerEvaluation(
        question_id=question_id,
        faithfulness_score=round(faithfulness, 3),
        groundedness_score=round(avg_score, 3),
        claims=claim_evaluations,
        unsupported_claims=unsupported,
        context_used=context_used
    )
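Stripped of the dataclass plumbing, the faithfulness loop reduces to "supported claims over total claims". A minimal standalone sketch (hypothetical sentences, overlap-ratio support only with a 0.3 threshold, no ROUGE term):

```python
import re

def simple_faithfulness(answer: str, context: str, threshold: float = 0.3) -> float:
    """Fraction of answer sentences whose terms overlap the context."""
    ctx_terms = set(context.lower().split())
    claims = [s.strip() for s in re.split(r'[.!?]+', answer) if len(s.strip()) > 10]
    if not claims:
        return 1.0
    supported = sum(
        1 for claim in claims
        if (terms := set(claim.lower().split()))
        and len(terms & ctx_terms) / len(terms) > threshold
    )
    return supported / len(claims)

score = simple_faithfulness(
    "The Eiffel Tower is in Paris. It was painted green by aliens.",
    "The Eiffel Tower is in Paris and was completed in 1889."
)
# First claim is fully supported, the second is not -> 1/2 = 0.5
```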


def calculate_retrieval_metrics(
    retrieved: List[str],
    relevant: Set[str],
    k: int = 5
) -> RetrievalMetrics:
    """Calculate standard retrieval metrics"""
    retrieved_k = retrieved[:k]

    # Precision@K: fraction of the top-k that is relevant
    relevant_in_k = sum(1 for doc in retrieved_k if doc in relevant)
    precision = relevant_in_k / k if k > 0 else 0

    # Recall@K: fraction of all relevant docs found in the top-k
    recall = relevant_in_k / len(relevant) if relevant else 0

    # MRR (Mean Reciprocal Rank): 1/rank of the first relevant hit
    mrr = 0.0
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            mrr = 1.0 / (i + 1)
            break

    # NDCG@K: rank-discounted gain, normalized by the ideal ordering
    dcg = 0.0
    for i, doc in enumerate(retrieved_k):
        rel = 1 if doc in relevant else 0
        dcg += rel / math.log2(i + 2)

    # Ideal DCG (all relevant docs at the top)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0

    return RetrievalMetrics(
        precision_at_k=round(precision, 3),
        recall_at_k=round(recall, 3),
        mrr=round(mrr, 3),
        ndcg_at_k=round(ndcg, 3),
        k=k
    )
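Worked example: for a hypothetical ranking with one relevant document at rank 2 and k = 3, the metrics come out as follows:

```python
import math

retrieved = ["d1", "d2", "d3"]   # ranked results (hypothetical ids)
relevant = {"d2", "d9"}          # ground-truth relevant set
k = 3

hits = sum(1 for d in retrieved[:k] if d in relevant)
precision_at_k = hits / k                     # 1/3
recall_at_k = hits / len(relevant)            # 1/2
mrr = next((1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)  # 1/2

dcg = sum((1 if d in relevant else 0) / math.log2(i + 2)
          for i, d in enumerate(retrieved[:k]))
idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
ndcg_at_k = dcg / idcg        # ≈ 0.39: the one relevant hit sits a rank below ideal
```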


def generate_recommendations(report: RAGEvaluationReport) -> List[str]:
    """Generate actionable recommendations based on evaluation"""
    recommendations = []

    if report.avg_context_relevance < 0.8:
        recommendations.append(
            f"Context relevance ({report.avg_context_relevance:.2f}) is below target (0.80). "
            "Consider: improving chunking strategy, adding metadata filtering, or using hybrid search."
        )

    if report.avg_faithfulness < 0.95:
        recommendations.append(
            f"Faithfulness ({report.avg_faithfulness:.2f}) is below target (0.95). "
            "Consider: adding source citations, implementing fact-checking, or adjusting temperature."
        )

    if report.avg_groundedness < 0.85:
        recommendations.append(
            f"Groundedness ({report.avg_groundedness:.2f}) is below target (0.85). "
            "Consider: using more restrictive prompts, adding 'only use provided context' instructions."
        )

    if report.coverage < 0.9:
        recommendations.append(
            f"Coverage ({report.coverage:.2f}) indicates some questions lack relevant context. "
            "Consider: expanding document corpus, improving embedding model, or adding fallback responses."
        )

    retrieval = report.retrieval_metrics
    if retrieval.get('precision_at_k', 0) < 0.7:
        recommendations.append(
            "Retrieval precision is low. Consider: re-ranking retrieved documents, "
            "using a cross-encoder for reranking, or adjusting the similarity threshold."
        )

    if not recommendations:
        recommendations.append("All metrics meet targets. Consider A/B testing new improvements.")

    return recommendations


def evaluate_rag_system(
    questions: List[Dict],
    contexts: List[Dict],
    k: int = 5,
    verbose: bool = False
) -> RAGEvaluationReport:
    """Comprehensive RAG system evaluation"""

    all_context_scores = []
    all_faithfulness_scores = []
    all_groundedness_scores = []
    issues = []
    question_details = []

    questions_with_context = 0

    for idx, q_data in enumerate(questions):
        question = q_data.get('question', q_data.get('query', ''))
        question_id = q_data.get('id', str(idx))
        answer = q_data.get('answer', q_data.get('response', ''))

        # Find the contexts retrieved for this question
        q_contexts = []
        for ctx in contexts:
            if ctx.get('question_id') == question_id or ctx.get('query_id') == question_id:
                q_contexts.append(ctx.get('content', ctx.get('text', '')))

        # If no question-specific contexts, fall back to the first k (simple datasets)
        if not q_contexts:
            q_contexts = [ctx.get('content', ctx.get('text', ''))
                          for ctx in contexts[:k]]

        if q_contexts:
            questions_with_context += 1

        # Evaluate context relevance
        context_evals = []
        for i, ctx in enumerate(q_contexts[:k]):
            eval_result = evaluate_context_relevance(question, ctx, f"ctx_{i}")
            context_evals.append(eval_result)
            all_context_scores.append(eval_result.relevance_score)

        # Evaluate answer faithfulness
        if answer and q_contexts:
            answer_eval = evaluate_answer_faithfulness(question, answer, q_contexts, question_id)
            all_faithfulness_scores.append(answer_eval.faithfulness_score)
            all_groundedness_scores.append(answer_eval.groundedness_score)

            # Track issues
            if answer_eval.unsupported_claims:
                issues.append({
                    'type': 'unsupported_claim',
                    'question_id': question_id,
                    'claims': answer_eval.unsupported_claims[:3]
                })

        # Check for low-relevance contexts
        low_relevance = [e for e in context_evals if e.relevance_score < 0.5]
        if low_relevance:
            issues.append({
                'type': 'low_relevance',
                'question_id': question_id,
                'contexts': [e.context_id for e in low_relevance]
            })

        if verbose:
            question_details.append({
                'question_id': question_id,
                'question': question[:100],
                'context_scores': [asdict(e) for e in context_evals],
                'answer_faithfulness': all_faithfulness_scores[-1] if all_faithfulness_scores else None
            })

    # Calculate aggregates
    avg_context_relevance = sum(all_context_scores) / len(all_context_scores) if all_context_scores else 0
    avg_faithfulness = sum(all_faithfulness_scores) / len(all_faithfulness_scores) if all_faithfulness_scores else 0
    avg_groundedness = sum(all_groundedness_scores) / len(all_groundedness_scores) if all_groundedness_scores else 0
    coverage = questions_with_context / len(questions) if questions else 0

    # Approximate retrieval metrics (derived from relevance scores, not ranked labels)
    high_relevance = sum(1 for s in all_context_scores if s > 0.5)
    retrieval_metrics = {
        'precision_at_k': round(high_relevance / len(all_context_scores) if all_context_scores else 0, 3),
        'estimated_recall': round(coverage, 3),
        'k': k
    }

    report = RAGEvaluationReport(
        total_questions=len(questions),
        avg_context_relevance=round(avg_context_relevance, 3),
        avg_faithfulness=round(avg_faithfulness, 3),
        avg_groundedness=round(avg_groundedness, 3),
        retrieval_metrics=retrieval_metrics,
        coverage=round(coverage, 3),
        issues=issues[:20],  # Cap the issue list at 20
        recommendations=[],
        question_details=question_details if verbose else []
    )

    report.recommendations = generate_recommendations(report)

    return report


def format_report(report: RAGEvaluationReport) -> str:
    """Format report as human-readable text"""
    lines = []
    lines.append("=" * 60)
    lines.append("RAG EVALUATION REPORT")
    lines.append("=" * 60)
    lines.append("")

    lines.append("📊 SUMMARY")
    lines.append(f"  Questions evaluated: {report.total_questions}")
    lines.append(f"  Coverage: {report.coverage:.1%}")
    lines.append("")

    lines.append("📈 RETRIEVAL METRICS")
    lines.append(f"  Context Relevance: {report.avg_context_relevance:.2f} {'✅' if report.avg_context_relevance >= 0.8 else '⚠️'} (target: >0.80)")
    lines.append(f"  Precision@{report.retrieval_metrics.get('k', 5)}: {report.retrieval_metrics.get('precision_at_k', 0):.2f}")
    lines.append("")

    lines.append("📝 GENERATION METRICS")
    lines.append(f"  Answer Faithfulness: {report.avg_faithfulness:.2f} {'✅' if report.avg_faithfulness >= 0.95 else '⚠️'} (target: >0.95)")
    lines.append(f"  Groundedness: {report.avg_groundedness:.2f} {'✅' if report.avg_groundedness >= 0.85 else '⚠️'} (target: >0.85)")
    lines.append("")

    if report.issues:
        lines.append(f"⚠️ ISSUES FOUND ({len(report.issues)})")
        for issue in report.issues[:10]:
            if issue['type'] == 'unsupported_claim':
                lines.append(f"  Q{issue['question_id']}: {len(issue.get('claims', []))} unsupported claim(s)")
            elif issue['type'] == 'low_relevance':
                lines.append(f"  Q{issue['question_id']}: Low relevance contexts: {issue.get('contexts', [])}")
        if len(report.issues) > 10:
            lines.append(f"  ... and {len(report.issues) - 10} more issues")
        lines.append("")

    lines.append("💡 RECOMMENDATIONS")
    for i, rec in enumerate(report.recommendations, 1):
        lines.append(f"  {i}. {rec}")
    lines.append("")

    lines.append("=" * 60)

    return '\n'.join(lines)


def main():
    """Main entry point"""
    parser = argparse.ArgumentParser(
        description="RAG Evaluator - Evaluate Retrieval-Augmented Generation systems",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s --contexts contexts.json --questions questions.json
  %(prog)s --contexts ctx.json --questions q.json --k 10
  %(prog)s --contexts ctx.json --questions q.json --output report.json --verbose

Input file formats:

  questions.json:
    [
      {"id": "q1", "question": "What is X?", "answer": "X is..."},
      {"id": "q2", "question": "How does Y work?", "answer": "Y works by..."}
    ]

  contexts.json:
    [
      {"question_id": "q1", "content": "Retrieved context text..."},
      {"question_id": "q2", "content": "Another context..."}
    ]
"""
    )
    parser.add_argument('--contexts', '-c', required=True, help='JSON file with retrieved contexts')
    parser.add_argument('--questions', '-q', required=True, help='JSON file with questions and answers')
    parser.add_argument('--k', type=int, default=5, help='Number of top contexts to evaluate (default: 5)')
    parser.add_argument('--output', '-o', help='Output file for detailed report (JSON)')
    parser.add_argument('--json', '-j', action='store_true', help='Output as JSON instead of text')
    parser.add_argument('--verbose', '-v', action='store_true', help='Include per-question details')
    parser.add_argument('--compare', help='Compare with baseline report JSON')

    args = parser.parse_args()

    # Load input files
    contexts_path = Path(args.contexts)
    questions_path = Path(args.questions)

    if not contexts_path.exists():
        print(f"Error: Contexts file not found: {args.contexts}", file=sys.stderr)
        sys.exit(1)

    if not questions_path.exists():
        print(f"Error: Questions file not found: {args.questions}", file=sys.stderr)
        sys.exit(1)

    try:
        contexts = json.loads(contexts_path.read_text(encoding='utf-8'))
        questions = json.loads(questions_path.read_text(encoding='utf-8'))
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON format: {e}", file=sys.stderr)
        sys.exit(1)

    # Run evaluation
    report = evaluate_rag_system(questions, contexts, k=args.k, verbose=args.verbose)

    # Compare with baseline
    if args.compare:
        baseline_path = Path(args.compare)
        if baseline_path.exists():
            baseline = json.loads(baseline_path.read_text())
            print("\n📊 COMPARISON WITH BASELINE")
            print(f"  Relevance: {baseline.get('avg_context_relevance', 0):.2f} -> {report.avg_context_relevance:.2f}")
            print(f"  Faithfulness: {baseline.get('avg_faithfulness', 0):.2f} -> {report.avg_faithfulness:.2f}")
            print(f"  Groundedness: {baseline.get('avg_groundedness', 0):.2f} -> {report.avg_groundedness:.2f}")
            print()

    # Output
    if args.json:
        print(json.dumps(asdict(report), indent=2))
    else:
        print(format_report(report))

    # Save to file
    if args.output:
        Path(args.output).write_text(json.dumps(asdict(report), indent=2), encoding='utf-8')
        print(f"\nDetailed report saved to {args.output}")


if __name__ == '__main__':
    main()
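A quick way to exercise the CLI end to end: generate a minimal pair of input files in the documented format (contents are hypothetical) and point the script at them:

```python
import json
from pathlib import Path

questions = [
    {"id": "q1", "question": "What is hybrid search?",
     "answer": "Hybrid search combines keyword and vector retrieval."},
]
contexts = [
    {"question_id": "q1",
     "content": "Hybrid search combines keyword (BM25) and vector retrieval."},
]

Path("questions.json").write_text(json.dumps(questions, indent=2), encoding="utf-8")
Path("contexts.json").write_text(json.dumps(contexts, indent=2), encoding="utf-8")

# Then run, e.g.:
#   python rag_evaluator.py --questions questions.json --contexts contexts.json --verbose
```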