Update promptfoo-evaluation skill with relay API and concurrency lessons

- Update model ID to claude-sonnet-4-6 (latest, Feb 2026)
- Add Relay/Proxy API Configuration section with apiBaseUrl patterns
- Document that maxConcurrency MUST be under commandLineOptions (not top-level)
- Add LLM-as-judge relay provider config (apiBaseUrl not inherited)
- Add 5 new Troubleshooting entries from real-world title-agent eval
- Add Concurrency Control section to API reference
- Clarify file:// path resolution (always relative to promptfooconfig.yaml)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
daymade
2026-02-24 22:29:24 +08:00
parent 1fd4969d03
commit 11b7539f10
2 changed files with 101 additions and 9 deletions

View File

@@ -51,14 +51,18 @@ prompts:
# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1
# Test cases
tests: file://tests/cases.yaml
# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2
# Default assertions for all tests
defaultTest:
  assert:
@@ -177,10 +181,25 @@ assert:
provider: openai:gpt-4.1 # Optional: override grader model
```
**When using a relay/proxy API**, each `llm-rubric` assertion needs its own `provider` config with `apiBaseUrl`. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api
```
**Best practices:**
- Provide clear scoring criteria
- Use `threshold` to set minimum passing score
- Default grader uses available API keys (OpenAI → Anthropic → Google)
- **When using relay/proxy**: every `llm-rubric` must have its own `provider` with `apiBaseUrl` — the main provider's `apiBaseUrl` is NOT inherited
## Common Assertion Types
@@ -196,7 +215,7 @@ assert:
## File References
All paths are relative to config file location:
All `file://` paths are resolved relative to `promptfooconfig.yaml` location (NOT the YAML file containing the reference). This is a common gotcha when `tests:` references a separate YAML file — the `file://` paths inside that test file still resolve from the config root.
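For example, with a hypothetical layout where `tests/cases.yaml` lives in a subdirectory, a `file://` reference inside that test file still resolves from the directory containing `promptfooconfig.yaml`:

```yaml
# promptfooconfig.yaml (project root)
tests: file://tests/cases.yaml

# tests/cases.yaml — the path below resolves from the project root,
# NOT from the tests/ directory:
# - vars:
#     article: file://data/article.txt   # i.e. <root>/data/article.txt
```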
```yaml
# Load file content as variable
@@ -235,6 +254,29 @@ npx promptfoo@latest eval --filter-metadata category=math
npx promptfoo@latest view
```
## Relay / Proxy API Configuration
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
    config:
      max_tokens: 4096
      apiBaseUrl: https://your-relay.example.com/api # Promptfoo appends /v1/messages
# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
  maxConcurrency: 1 # Respect relay rate limits
```
**Key rules:**
- `apiBaseUrl` goes in `providers[].config` — Promptfoo appends `/v1/messages` automatically
- `maxConcurrency` must be under `commandLineOptions:` — a top-level `maxConcurrency` is silently ignored
- When using relay with LLM-as-judge, set `maxConcurrency: 1` to avoid concurrent request limits (generation + grading share the same pool)
- Pass relay token as `ANTHROPIC_API_KEY` env var
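Putting the last rule into practice, a minimal run might look like this (`$RELAY_TOKEN` is a placeholder for your relay credential, not a real Anthropic key):

```bash
# Relay tokens are passed via the standard Anthropic env var
export ANTHROPIC_API_KEY="$RELAY_TOKEN"
npx promptfoo@latest eval -c promptfooconfig.yaml
```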
## Troubleshooting
**Python not found:**
@@ -246,7 +288,20 @@ export PROMPTFOO_PYTHON=python3
Outputs over 30000 characters are truncated. Use `head_limit` in assertions.
**File not found errors:**
Ensure paths are relative to `promptfooconfig.yaml` location.
All `file://` paths resolve relative to `promptfooconfig.yaml` location.
**maxConcurrency ignored (shows "up to N at a time"):**
`maxConcurrency` must be under `commandLineOptions:`, not at the YAML top level; a top-level value is silently ignored.
**LLM-as-judge returns 401 with relay API:**
Each `llm-rubric` assertion must have its own `provider` with `apiBaseUrl`. The main provider config is not inherited by grader assertions.
**HTML tags in model output inflating metrics:**
Models may output `<br>`, `<b>`, etc. in structured content. Strip HTML in Python assertions before measuring:
```python
import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
```
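The stripping step above can be packaged as a file-based Python assertion (referenced via `type: python` with a `file://` path). A minimal sketch: `get_assert` is promptfoo's Python assertion entry point, while the 120-character budget is a hypothetical threshold for illustration:

```python
import re

def strip_html(text: str) -> str:
    """Remove HTML tags (<br>, <b>, ...) before measuring the text."""
    return re.sub(r'<[^>]+>', '', text)

def get_assert(output, context):
    """Promptfoo calls this with the model output; return a pass/score dict."""
    clean = strip_html(output)
    max_chars = 120  # hypothetical length budget for a title
    ok = len(clean) <= max_chars
    return {
        "pass": ok,
        "score": 1.0 if ok else 0.0,
        "reason": f"{len(clean)} chars after stripping HTML",
    }
```

This way length metrics are computed on the visible text rather than on markup.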
## Echo Provider (Preview Mode)
@@ -327,7 +382,7 @@ For Chinese/long-form content evaluations (10k+ characters):
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 8192 # Increase for long outputs

View File

@@ -21,10 +21,12 @@ providers:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096
      temperature: 0.7
      # For relay/proxy APIs:
      # apiBaseUrl: https://your-relay.example.com/api
```
### OpenAI
@@ -41,7 +43,7 @@ providers:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1
@@ -193,6 +195,26 @@ defaultTest:
provider: anthropic:claude-3-haiku # Override for this assertion
```
### Relay/Proxy Provider with LLM-as-Judge
When using a relay or proxy API, each `llm-rubric` needs its own `provider` config — the main provider's `apiBaseUrl` is NOT inherited:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiBaseUrl: https://your-relay.example.com/api
defaultTest:
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider:
        id: anthropic:messages:claude-sonnet-4-6
        config:
          apiBaseUrl: https://your-relay.example.com/api # Must repeat here
```
## Environment Variables
| Variable | Description |
@@ -203,6 +225,21 @@ defaultTest:
| `PROMPTFOO_CACHE_ENABLED` | Enable caching (default: true) |
| `PROMPTFOO_CACHE_PATH` | Cache directory |
## Concurrency Control
**CRITICAL**: `maxConcurrency` must be placed under `commandLineOptions:` in the YAML config. A top-level `maxConcurrency` is silently ignored.
```yaml
# ✅ Correct — under commandLineOptions
commandLineOptions:
  maxConcurrency: 1

# ❌ Wrong — top level (silently ignored, defaults to ~4)
maxConcurrency: 1
```
When using relay APIs with LLM-as-judge, set `maxConcurrency: 1` because generation and grading share the same concurrent request pool.
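The same limit can also be set per run with the `--max-concurrency` flag, which overrides the config value for that invocation:

```bash
# One-off override without editing promptfooconfig.yaml
npx promptfoo@latest eval --max-concurrency 1
```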
## CLI Commands
```bash
@@ -216,11 +253,11 @@ npx promptfoo@latest eval [options]
# --config <path> Config file path
# --output <path> Output file path
# --grader <provider> Override grader model
# --no-cache Disable caching
# --no-cache Disable caching (important for re-runs)
# --filter-metadata Filter tests by metadata
# --repeat <n> Repeat each test n times
# --delay <ms> Delay between requests
# --max-concurrency Parallel requests
# --max-concurrency Parallel requests (CLI override)
# View results
npx promptfoo@latest view [options]