Release v1.18.0: Add iOS-APP-developer and promptfoo-evaluation skills
### Added

- **New Skill**: iOS-APP-developer v1.1.0 - iOS development with XcodeGen, SwiftUI, and SPM
  - XcodeGen project.yml configuration
  - SPM dependency resolution
  - Device deployment and code signing
  - Camera/AVFoundation debugging
  - iOS version compatibility handling
  - "Library not loaded" @rpath framework error fixes
  - State machine testing patterns for @MainActor classes
  - Bundled references: xcodegen-full.md, camera-avfoundation.md, swiftui-compatibility.md, testing-mainactor.md
- **New Skill**: promptfoo-evaluation v1.0.0 - LLM evaluation framework using Promptfoo
  - Promptfoo configuration (promptfooconfig.yaml)
  - Python custom assertions
  - llm-rubric for LLM-as-judge evaluations
  - Few-shot example management
  - Model comparison and prompt testing
  - Bundled reference: promptfoo_api.md

### Changed

- Updated marketplace version from 1.16.0 to 1.18.0
- Updated marketplace skills count from 23 to 25
- Updated skill-creator to v1.2.2:
  - Fixed best practices documentation URL (platform.claude.com)
  - Enhanced quick_validate.py to exclude file:// prefixed paths from validation
- Updated marketplace.json metadata description to include new skills

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---

**promptfoo-evaluation/.security-scan-passed** (new file, 4 lines)
```
Security scan passed
Scanned at: 2025-12-11T22:24:55.327388
Tool: gitleaks + pattern-based validation
Content hash: d04b93ec8a47fa7b64a2d0ee9790997e5ecc212ddbfa4c2c58fddafa2424d49a
```
---

**promptfoo-evaluation/SKILL.md** (new file, 392 lines)
---
name: promptfoo-evaluation
description: Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
---

# Promptfoo Evaluation

## Overview

This skill provides guidance for configuring and running LLM evaluations using [Promptfoo](https://www.promptfoo.dev/), an open-source CLI tool for testing and comparing LLM outputs.

## Quick Start

```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```

## Configuration Structure

A typical Promptfoo project structure:

```
project/
├── promptfooconfig.yaml         # Main configuration
├── prompts/
│   ├── system.md                # System prompt
│   └── chat.json                # Chat format prompt
├── tests/
│   └── cases.yaml               # Test cases
└── scripts/
    └── metrics.py               # Custom Python assertions
```

## Core Configuration (promptfooconfig.yaml)

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```

## Prompt Formats

### Text Prompt (system.md)

```markdown
You are a helpful assistant.

Task: {{task}}
Context: {{context}}
```

### Chat Format (chat.json)

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```

### Few-Shot Pattern

Embed examples directly in the prompt, or use chat format with assistant messages:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```

## Test Cases (tests/cases.yaml)

```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```

## Python Custom Assertions

Create a Python file for custom assertions (e.g., `scripts/metrics.py`):

```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```

**Key points:**
- Default function name is `get_assert`
- Specify a function with `file://path.py:function_name`
- Return `bool`, `float` (score), or `dict` with pass/score/reason
- Access variables via `context['vars']`

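The three return forms from the last bullet can be sketched side by side. This is standalone illustrative code; the `output`/`context` signature follows the example above, and the function names and target words are placeholders:

```python
def passes_exact(output: str, context: dict) -> bool:
    # bool form: pass/fail only, no score
    return "expected" in output


def score_overlap(output: str, context: dict) -> float:
    # float form: interpreted as a 0.0-1.0 score
    words = set(output.lower().split())
    target = {"prompt", "testing"}  # placeholder keywords
    return len(words & target) / len(target)


def graded(output: str, context: dict) -> dict:
    # dict form: full result with pass/score/reason
    score = score_overlap(output, context)
    return {
        "pass": score >= 0.5,
        "score": score,
        "reason": f"keyword overlap score {score:.2f}",
    }
```

A `float` return is useful when the pass/fail decision should come from a `threshold` in the config rather than from the assertion itself.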
## LLM-as-Judge (llm-rubric)

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```

**Best practices:**
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses whichever API keys are available (OpenAI → Anthropic → Google)

## Common Assertion Types

| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive substring | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time (ms) | `threshold: 1000` |

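Several of these types can be stacked on a single test case; a test passes only when every assertion passes. A sketch (the values are placeholders):

```yaml
assert:
  - type: icontains
    value: "hello"
  - type: regex
    value: "\\d{4}"
  - type: latency
    threshold: 1000
```
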
## File References

All paths are relative to the config file location:

```yaml
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```

## Running Evaluations

```bash
# Basic run
npx promptfoo@latest eval

# With a specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```

## Troubleshooting

**Python not found:**

```bash
export PROMPTFOO_PYTHON=python3
```

**Large outputs truncated:**

Outputs over 30000 characters are truncated. Use `head_limit` in assertions.

**File not found errors:**

Ensure paths are relative to the `promptfooconfig.yaml` location.

## Echo Provider (Preview Mode)

Use the **echo provider** to preview rendered prompts without making API calls:

```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns the prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```

**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure

```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```

**Cost:** Free - no API tokens consumed.

## Advanced Few-Shot Implementation

### Multi-turn Conversation Pattern

For complex few-shot learning with full examples:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},

  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  {"role": "user", "content": "Task: {{actual_input}}"}
]
```

The user/assistant pairs are few-shot example 1 and (optional) example 2; the final user message is the actual test input. (JSON does not allow comments, so the structure is annotated here rather than inline.)

**Test case configuration:**

```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```

**Best practices:**
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure

## Long Text Handling

For Chinese/long-form content evaluations (10k+ characters):

**Configuration:**

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```

**Python assertion for text metrics:**

```python
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')

    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))

    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```

## Real-World Example

**Project:** Chinese short-video content curation from long transcripts

**Structure:**

```
tiaogaoren/
├── promptfooconfig.yaml             # Production config
├── promptfooconfig-preview.yaml     # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json       # Chat format with few-shot
│   └── v4/system-v4.md              # System prompt
├── tests/cases.yaml                 # 3 test samples
├── scripts/metrics.py               # Custom metrics (reduction ratio, etc.)
├── data/                            # 5 samples (2 few-shot, 3 eval)
└── results/
```

**See:** `~/workspace/prompts/tiaogaoren/` for the full implementation.

## Resources

For detailed API reference and advanced patterns, see [references/promptfoo_api.md](references/promptfoo_api.md).

---

**promptfoo-evaluation/references/promptfoo_api.md** (new file, 249 lines)
# Promptfoo API Reference

## Provider Configuration

### Echo Provider (No API Calls)

```yaml
providers:
  - echo  # Returns the prompt as-is, no API calls
```

**Use cases:**
- Preview rendered prompts without cost
- Debug variable substitution
- Verify few-shot structure
- Test configuration before production runs

**Cost:** Free - no tokens consumed.

### Anthropic

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 4096
      temperature: 0.7
```

### OpenAI

```yaml
providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048
```

### Multiple Providers (A/B Testing)

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1
```

## Assertion Reference

### Python Assertion Context

```python
class AssertionContext:
    prompt: str             # Raw prompt sent to the LLM
    vars: dict              # Test case variables
    test: dict              # Complete test case
    config: dict            # Assertion config
    provider: Any           # Provider info
    providerResponse: Any   # Full response
```

### GradingResult Format

```python
{
    "pass": bool,             # Required: pass/fail
    "score": float,           # 0.0-1.0 score
    "reason": str,            # Explanation
    "named_scores": dict,     # Custom metrics
    "component_results": []   # Nested results
}
```

### Assertion Types

| Type | Description | Parameters |
|------|-------------|------------|
| `contains` | Substring check | `value` |
| `icontains` | Case-insensitive substring | `value` |
| `equals` | Exact match | `value` |
| `regex` | Pattern match | `value` |
| `not-contains` | Absence check | `value` |
| `starts-with` | Prefix check | `value` |
| `contains-any` | Any substring | `value` (array) |
| `contains-all` | All substrings | `value` (array) |
| `cost` | Token cost | `threshold` |
| `latency` | Response time | `threshold` (ms) |
| `perplexity` | Model confidence | `threshold` |
| `python` | Custom Python | `value` (file/code) |
| `javascript` | Custom JS | `value` (code) |
| `llm-rubric` | LLM grading | `value`, `threshold` |
| `factuality` | Fact checking | `value` (reference) |
| `model-graded-closedqa` | Q&A grading | `value` |
| `similar` | Semantic similarity | `value` (reference), `threshold` |

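As a sketch, the model-graded types from the table take a reference string as `value` (the strings below are placeholders; `similar` additionally needs an embedding-capable provider configured):

```yaml
assert:
  - type: factuality
    value: "Paris is the capital of France"
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8
```
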
## Test Case Configuration

### Full Test Case Structure

```yaml
- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"
```

### Loading Variables from Files

```yaml
vars:
  # Text file (loaded as a string)
  content: file://data/input.txt

  # JSON/YAML (parsed to an object)
  config: file://config.json

  # Python script (executed, returns a value)
  dynamic: file://scripts/generate.py

  # PDF (text extracted)
  document: file://docs/report.pdf

  # Image (base64 encoded)
  image: file://images/photo.png
```

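For the Python-script case, a minimal sketch of the referenced script, assuming Promptfoo's `get_var(var_name, prompt, other_vars)` hook that returns a dict with an `output` key (the file name and logic are illustrative):

```python
# scripts/generate.py - hypothetical helper for the `dynamic` var above
def get_var(var_name: str, prompt: str, other_vars: dict) -> dict:
    """Compute a variable value at eval time from the other test vars."""
    # Placeholder logic: derive context from a sibling var
    topic = other_vars.get("topic", "general")
    return {"output": f"Generated context for {topic}"}
```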
## Advanced Patterns

### Dynamic Test Generation (Python)

```python
# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]
```

```yaml
tests: file://tests/generate.py:get_tests
```

### Scenario-based Testing

```yaml
scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    tests:
      - vars:
          text: "Hello"
        assert:
          - type: llm-rubric
            value: "Translation is accurate"
```

### Transform Output

```yaml
defaultTest:
  options:
    transform: |
      output.replace(/\n/g, ' ').trim()
```

### Custom Grading Provider

```yaml
defaultTest:
  options:
    provider: openai:gpt-4.1

assert:
  - type: llm-rubric
    value: "Evaluate quality"
    provider: anthropic:claude-3-haiku  # Override for this assertion
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `PROMPTFOO_PYTHON` | Python binary path |
| `PROMPTFOO_CACHE_ENABLED` | Enable caching (default: true) |
| `PROMPTFOO_CACHE_PATH` | Cache directory |

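A typical pre-run setup using these variables, as a sketch (the key values are placeholders, not real credentials):

```shell
# Placeholder keys - substitute real credentials
export ANTHROPIC_API_KEY="sk-ant-xxxx"
export OPENAI_API_KEY="sk-xxxx"

# Point promptfoo at a specific Python for custom assertions
export PROMPTFOO_PYTHON=python3

# Disable the response cache for a fresh run
export PROMPTFOO_CACHE_ENABLED=false
```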
## CLI Commands

```bash
# Initialize project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval [options]

# Options:
#   --config <path>       Config file path
#   --output <path>       Output file path
#   --grader <provider>   Override grader model
#   --no-cache            Disable caching
#   --filter-metadata     Filter tests by metadata
#   --repeat <n>          Repeat each test n times
#   --delay <ms>          Delay between requests
#   --max-concurrency     Parallel requests

# View results
npx promptfoo@latest view [options]

# Share results
npx promptfoo@latest share

# Generate a test dataset
npx promptfoo@latest generate dataset
```

## Output Formats

```bash
# JSON (default)
--output results.json

# CSV
--output results.csv

# HTML report
--output results.html

# YAML
--output results.yaml
```