# Promptfoo API Reference

## Provider Configuration

### Echo Provider (No API Calls)

```yaml
providers:
  - echo # Returns prompt as-is, no API calls
```

**Use cases:**

- Preview rendered prompts without cost
- Debug variable substitution
- Verify few-shot structure
- Test configuration before production runs

**Cost:** Free; no tokens are consumed.
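
For example, to preview how a prompt template renders for each test case (the prompt file and variable names here are illustrative):

```yaml
prompts:
  - file://prompts/summarize.txt
providers:
  - echo
tests:
  - vars:
      topic: "quarterly earnings"
```

Running `eval` against this config echoes each rendered prompt back as the output, so you can inspect variable substitution before switching to a paid provider.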

### Anthropic

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096
      temperature: 0.7
      # For relay/proxy APIs:
      # apiBaseUrl: https://your-relay.example.com/api
```

### OpenAI

```yaml
providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048
```

### Multiple Providers (A/B Testing)

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1
```
## Assertion Reference

### Python Assertion Context

```python
class AssertionContext:
    prompt: str            # Raw prompt sent to the LLM
    vars: dict             # Test case variables
    test: dict             # Complete test case
    config: dict           # Assertion config
    provider: Any          # Provider info
    providerResponse: Any  # Full response
```

### GradingResult Format

```python
{
    "pass": bool,            # Required: pass/fail
    "score": float,          # 0.0-1.0 score
    "reason": str,           # Explanation
    "named_scores": dict,    # Custom metrics
    "component_results": []  # Nested results
}
```
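
Putting the two together: a file-based `python` assertion defines a `get_assert(output, context)` function that promptfoo calls with the model output and the `AssertionContext` above, returning a `GradingResult`-shaped dict. A minimal sketch; the `topic` variable and the 150-word limit are illustrative:

```python
# assert_quality.py, referenced from YAML as:
#   - type: python
#     value: file://assert_quality.py

def get_assert(output: str, context) -> dict:
    # context.vars holds the test case variables (see AssertionContext above)
    topic = context.vars.get("topic", "")

    mentions_topic = topic.lower() in output.lower()
    word_count = len(output.split())

    return {
        "pass": mentions_topic and word_count <= 150,  # 150 is an arbitrary limit
        "score": 1.0 if mentions_topic else 0.0,
        "reason": f"topic mentioned: {mentions_topic}; {word_count} words",
        "named_scores": {"word_count": word_count},
    }
```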

### Assertion Types

| Type | Description | Parameters |
|------|-------------|------------|
| `contains` | Substring check | `value` |
| `icontains` | Case-insensitive substring check | `value` |
| `equals` | Exact match | `value` |
| `regex` | Pattern match | `value` |
| `not-contains` | Absence check | `value` |
| `starts-with` | Prefix check | `value` |
| `contains-any` | Any substring | `value` (array) |
| `contains-all` | All substrings | `value` (array) |
| `cost` | Token cost | `threshold` |
| `latency` | Response time | `threshold` (ms) |
| `perplexity` | Model confidence | `threshold` |
| `python` | Custom Python | `value` (file/code) |
| `javascript` | Custom JS | `value` (code) |
| `llm-rubric` | LLM grading | `value`, `threshold` |
| `factuality` | Fact checking | `value` (reference) |
| `model-graded-closedqa` | Q&A grading | `value` |
| `similar` | Semantic similarity | `value`, `threshold` |
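
Assertion types compose freely on a single test case; for example, mixing deterministic and model-graded checks (the threshold values are illustrative):

```yaml
assert:
  - type: contains-any
    value: ["refund", "return policy"]
  - type: latency
    threshold: 3000 # ms
  - type: llm-rubric
    value: "Response is polite and answers the question"
    threshold: 0.8
```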

## Test Case Configuration

### Full Test Case Structure

```yaml
- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"
```

### Loading Variables from Files

File paths are always resolved relative to `promptfooconfig.yaml`.

```yaml
vars:
  # Text file (loaded as string)
  content: file://data/input.txt

  # JSON/YAML (parsed to object)
  config: file://config.json

  # Python script (executed, returns value)
  dynamic: file://scripts/generate.py

  # PDF (text extracted)
  document: file://docs/report.pdf

  # Image (base64 encoded)
  image: file://images/photo.png
```
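
For the Python script case, promptfoo invokes a `get_var` hook in the referenced file. A sketch of `scripts/generate.py`, assuming promptfoo's dynamic-variable signature (`var_name`, `prompt`, `other_vars`) and a return value of `{"output": ...}`:

```python
# scripts/generate.py
def get_var(var_name: str, prompt: str, other_vars: dict) -> dict:
    # Called once per test case; other_vars contains the remaining vars
    if var_name == "dynamic":
        seed = other_vars.get("var1", "default")
        return {"output": f"generated from {seed}"}
    return {"error": f"unknown variable: {var_name}"}
```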

## Advanced Patterns

### Dynamic Test Generation (Python)

```python
# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]
```

```yaml
tests: file://tests/generate.py:get_tests
```

### Scenario-based Testing

```yaml
scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    tests:
      - vars:
          text: "Hello"
        assert:
          - type: llm-rubric
            value: "Translation is accurate"
```

### Transform Output

```yaml
defaultTest:
  options:
    transform: |
      output.replace(/\n/g, ' ').trim()
```
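
The transform can also live in a Python file, assuming promptfoo's `get_transform` file hook; a rough equivalent of the inline JavaScript above:

```python
# transform.py, referenced as: transform: file://transform.py
def get_transform(output: str, context) -> str:
    # Collapse all whitespace runs (newlines included) and trim
    return " ".join(output.split())
```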

### Custom Grading Provider

```yaml
defaultTest:
  options:
    provider: openai:gpt-4.1 # Default grader for model-graded assertions
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider: anthropic:claude-3-haiku # Override for this assertion
```

### Relay/Proxy Provider with LLM-as-Judge

When using a relay or proxy API, each `llm-rubric` assertion needs its own `provider` config; the main provider's `apiBaseUrl` is NOT inherited:

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiBaseUrl: https://your-relay.example.com/api

defaultTest:
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider:
        id: anthropic:messages:claude-sonnet-4-6
        config:
          apiBaseUrl: https://your-relay.example.com/api # Must repeat here
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `PROMPTFOO_PYTHON` | Python binary path |
| `PROMPTFOO_CACHE_ENABLED` | Enable caching (default: true) |
| `PROMPTFOO_CACHE_PATH` | Cache directory |

## Concurrency Control

**CRITICAL**: `maxConcurrency` must be placed under `commandLineOptions:` in the YAML config. If placed at the top level, it is silently ignored.

```yaml
# ✅ Correct: under commandLineOptions
commandLineOptions:
  maxConcurrency: 1

# ❌ Wrong: top level (silently ignored, defaults to ~4)
maxConcurrency: 1
```

When using relay APIs with LLM-as-judge, set `maxConcurrency: 1`, because generation and grading share the same pool of concurrent requests.
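
The same limit can also be applied per run from the CLI (the flags are listed under CLI Commands below); the 500 ms delay here is illustrative:

```bash
npx promptfoo@latest eval --max-concurrency 1 --delay 500
```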

## CLI Commands

```bash
# Initialize project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval [options]

# Options:
#   --config <path>        Config file path
#   --output <path>        Output file path
#   --grader <provider>    Override grader model
#   --no-cache             Disable caching (important for re-runs)
#   --filter-metadata      Filter tests by metadata
#   --repeat <n>           Repeat each test n times
#   --delay <ms>           Delay between requests
#   --max-concurrency <n>  Parallel requests (CLI override)

# View results
npx promptfoo@latest view [options]

# Share results
npx promptfoo@latest share

# Generate a test dataset
npx promptfoo@latest generate dataset
```

## Output Formats

The output format is inferred from the file extension:

```bash
# JSON (default)
--output results.json

# CSV
--output results.csv

# HTML report
--output results.html

# YAML
--output results.yaml
```