Update promptfoo-evaluation skill with relay API and concurrency lessons
- Update model ID to claude-sonnet-4-6 (latest, Feb 2026)
- Add Relay/Proxy API Configuration section with apiBaseUrl patterns
- Document that maxConcurrency MUST be under commandLineOptions (not top-level)
- Add LLM-as-judge relay provider config (apiBaseUrl not inherited)
- Add 5 new Troubleshooting entries from real-world title-agent eval
- Add Concurrency Control section to API reference
- Clarify file:// path resolution (always relative to promptfooconfig.yaml)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -51,14 +51,18 @@ prompts:

 # Models to compare
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
-    label: Claude-4.5-Sonnet
+  - id: anthropic:messages:claude-sonnet-4-6
+    label: Claude-Sonnet-4.6
   - id: openai:gpt-4.1
     label: GPT-4.1

 # Test cases
 tests: file://tests/cases.yaml

+# Concurrency control (MUST be under commandLineOptions, NOT top-level)
+commandLineOptions:
+  maxConcurrency: 2
+
 # Default assertions for all tests
 defaultTest:
   assert:
@@ -177,10 +181,25 @@ assert:
     provider: openai:gpt-4.1  # Optional: override grader model
 ```

+**When using a relay/proxy API**, each `llm-rubric` assertion needs its own `provider` config with `apiBaseUrl`. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
+
+```yaml
+assert:
+  - type: llm-rubric
+    value: |
+      Evaluate quality on a 0-1 scale.
+    threshold: 0.7
+    provider:
+      id: anthropic:messages:claude-sonnet-4-6
+      config:
+        apiBaseUrl: https://your-relay.example.com/api
+```
+
 **Best practices:**
 - Provide clear scoring criteria
 - Use `threshold` to set minimum passing score
 - Default grader uses available API keys (OpenAI → Anthropic → Google)
+- **When using relay/proxy**: every `llm-rubric` must have its own `provider` with `apiBaseUrl` — the main provider's `apiBaseUrl` is NOT inherited

 ## Common Assertion Types

@@ -196,7 +215,7 @@ assert:

 ## File References

-All paths are relative to config file location:
+All `file://` paths are resolved relative to the `promptfooconfig.yaml` location (NOT the YAML file containing the reference). This is a common gotcha when `tests:` references a separate YAML file — the `file://` paths inside that test file still resolve from the config root.

 ```yaml
 # Load file content as variable
@@ -235,6 +254,29 @@ npx promptfoo@latest eval --filter-metadata category=math
 npx promptfoo@latest view
 ```

+## Relay / Proxy API Configuration
+
+When using an API relay or proxy instead of the direct Anthropic/OpenAI endpoints:
+
+```yaml
+providers:
+  - id: anthropic:messages:claude-sonnet-4-6
+    label: Claude-Sonnet-4.6
+    config:
+      max_tokens: 4096
+      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages
+
+# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
+commandLineOptions:
+  maxConcurrency: 1  # Respect relay rate limits
+```
+
+**Key rules:**
+- `apiBaseUrl` goes in `providers[].config` — Promptfoo appends `/v1/messages` automatically
+- `maxConcurrency` must be under `commandLineOptions:` — placing it at the top level is silently ignored
+- When using a relay with LLM-as-judge, set `maxConcurrency: 1` to avoid concurrent request limits (generation and grading share the same pool)
+- Pass the relay token via the `ANTHROPIC_API_KEY` env var

 ## Troubleshooting

 **Python not found:**
@@ -246,7 +288,20 @@ export PROMPTFOO_PYTHON=python3
 Outputs over 30000 characters are truncated. Use `head_limit` in assertions.

 **File not found errors:**
-Ensure paths are relative to `promptfooconfig.yaml` location.
+All `file://` paths resolve relative to the `promptfooconfig.yaml` location.

+**maxConcurrency ignored (shows "up to N at a time"):**
+`maxConcurrency` must be under `commandLineOptions:`, not at the YAML top level. This is a common mistake.
+
+**LLM-as-judge returns 401 with a relay API:**
+Each `llm-rubric` assertion must have its own `provider` with `apiBaseUrl`. The main provider config is not inherited by grader assertions.
+
+**HTML tags in model output inflating metrics:**
+Models may output `<br>`, `<b>`, etc. in structured content. Strip HTML in Python assertions before measuring:
+```python
+import re
+clean_text = re.sub(r'<[^>]+>', '', raw_text)
+```
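
That regex can be wrapped in a standalone promptfoo Python assertion, which promptfoo invokes via a `get_assert(output, context)` function returning a grading dict. The sketch below is illustrative: the 60-character limit and the length-based metric are hypothetical, not part of the original eval.

```python
import re

HTML_TAG = re.compile(r'<[^>]+>')

def get_assert(output, context):
    """Promptfoo Python assertion: measure length after stripping HTML tags."""
    clean = HTML_TAG.sub('', output)
    max_len = 60  # hypothetical per-title character limit
    return {
        "pass": len(clean) <= max_len,
        "score": 1.0 if len(clean) <= max_len else 0.0,
        "reason": f"{len(clean)} chars after stripping HTML (limit {max_len})",
    }
```

Referenced from the config as `assert: [{type: python, value: file://strip_html_assert.py}]` (file name illustrative), this keeps `<br>` and `<b>` tags from inflating length metrics.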

 ## Echo Provider (Preview Mode)

@@ -327,7 +382,7 @@ For Chinese/long-form content evaluations (10k+ characters):

 ```yaml
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
+  - id: anthropic:messages:claude-sonnet-4-6
     config:
       max_tokens: 8192  # Increase for long outputs

@@ -21,10 +21,12 @@ providers:

 ```yaml
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
+  - id: anthropic:messages:claude-sonnet-4-6
     config:
       max_tokens: 4096
       temperature: 0.7
+      # For relay/proxy APIs:
+      # apiBaseUrl: https://your-relay.example.com/api
 ```

 ### OpenAI
@@ -41,7 +43,7 @@ providers:

 ```yaml
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
+  - id: anthropic:messages:claude-sonnet-4-6
     label: Claude
   - id: openai:gpt-4.1
     label: GPT-4.1
@@ -193,6 +195,26 @@ defaultTest:
       provider: anthropic:claude-3-haiku  # Override for this assertion
 ```

+### Relay/Proxy Provider with LLM-as-Judge
+
+When using a relay or proxy API, each `llm-rubric` needs its own `provider` config — the main provider's `apiBaseUrl` is NOT inherited:
+
+```yaml
+providers:
+  - id: anthropic:messages:claude-sonnet-4-6
+    config:
+      apiBaseUrl: https://your-relay.example.com/api
+
+defaultTest:
+  assert:
+    - type: llm-rubric
+      value: "Evaluate quality"
+      provider:
+        id: anthropic:messages:claude-sonnet-4-6
+        config:
+          apiBaseUrl: https://your-relay.example.com/api  # Must repeat here
+```
+
 ## Environment Variables

 | Variable | Description |
@@ -203,6 +225,21 @@ defaultTest:
 | `PROMPTFOO_CACHE_ENABLED` | Enable caching (default: true) |
 | `PROMPTFOO_CACHE_PATH` | Cache directory |

+## Concurrency Control
+
+**CRITICAL**: `maxConcurrency` must be placed under `commandLineOptions:` in the YAML config. Placing it at the top level is silently ignored.
+
+```yaml
+# ✅ Correct — under commandLineOptions
+commandLineOptions:
+  maxConcurrency: 1
+
+# ❌ Wrong — top level (silently ignored, defaults to ~4)
+maxConcurrency: 1
+```
+
+When using relay APIs with LLM-as-judge, set `maxConcurrency: 1` because generation and grading share the same concurrent request pool.
+
 ## CLI Commands

 ```bash
@@ -216,11 +253,11 @@ npx promptfoo@latest eval [options]
 #   --config <path>        Config file path
 #   --output <path>        Output file path
 #   --grader <provider>    Override grader model
-#   --no-cache             Disable caching
+#   --no-cache             Disable caching (important for re-runs)
 #   --filter-metadata      Filter tests by metadata
 #   --repeat <n>           Repeat each test n times
 #   --delay <ms>           Delay between requests
-#   --max-concurrency      Parallel requests
+#   --max-concurrency      Parallel requests (CLI override)

 # View results
 npx promptfoo@latest view [options]