From 11b7539f102814ac9540111ac5f22e5bc03cf961 Mon Sep 17 00:00:00 2001
From: daymade
Date: Tue, 24 Feb 2026 22:29:24 +0800
Subject: [PATCH] Update promptfoo-evaluation skill with relay API and
 concurrency lessons

- Update model ID to claude-sonnet-4-6 (latest, Feb 2026)
- Add Relay/Proxy API Configuration section with apiBaseUrl patterns
- Document that maxConcurrency MUST be under commandLineOptions (not top-level)
- Add LLM-as-judge relay provider config (apiBaseUrl not inherited)
- Add 5 new Troubleshooting entries from real-world title-agent eval
- Add Concurrency Control section to API reference
- Clarify file:// path resolution (always relative to promptfooconfig.yaml)

Co-Authored-By: Claude Opus 4.6
---
 promptfoo-evaluation/SKILL.md     | 65 +++++++++++++++++--
 .../references/promptfoo_api.md   | 45 +++++++++++--
 2 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/promptfoo-evaluation/SKILL.md b/promptfoo-evaluation/SKILL.md
index 7af1f11..1b9739a 100644
--- a/promptfoo-evaluation/SKILL.md
+++ b/promptfoo-evaluation/SKILL.md
@@ -51,14 +51,18 @@ prompts:
 
 # Models to compare
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
-    label: Claude-4.5-Sonnet
+  - id: anthropic:messages:claude-sonnet-4-6
+    label: Claude-Sonnet-4.6
   - id: openai:gpt-4.1
     label: GPT-4.1
 
 # Test cases
 tests: file://tests/cases.yaml
 
+# Concurrency control (MUST be under commandLineOptions, NOT top-level)
+commandLineOptions:
+  maxConcurrency: 2
+
 # Default assertions for all tests
 defaultTest:
   assert:
@@ -177,10 +181,25 @@ assert:
       provider: openai:gpt-4.1  # Optional: override grader model
 ```
 
+**When using a relay/proxy API**, each `llm-rubric` assertion needs its own `provider` config with `apiBaseUrl`. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
+
+```yaml
+assert:
+  - type: llm-rubric
+    value: |
+      Evaluate quality on a 0-1 scale.
+    threshold: 0.7
+    provider:
+      id: anthropic:messages:claude-sonnet-4-6
+      config:
+        apiBaseUrl: https://your-relay.example.com/api
+```
+
 **Best practices:**
 - Provide clear scoring criteria
 - Use `threshold` to set minimum passing score
 - Default grader uses available API keys (OpenAI → Anthropic → Google)
+- **When using relay/proxy**: every `llm-rubric` must have its own `provider` with `apiBaseUrl` — the main provider's `apiBaseUrl` is NOT inherited
 
 ## Common Assertion Types
 
@@ -196,7 +215,7 @@ assert:
 
 ## File References
 
-All paths are relative to config file location:
+All `file://` paths are resolved relative to `promptfooconfig.yaml` location (NOT the YAML file containing the reference). This is a common gotcha when `tests:` references a separate YAML file — the `file://` paths inside that test file still resolve from the config root.
 
 ```yaml
 # Load file content as variable
@@ -235,6 +254,29 @@ npx promptfoo@latest eval --filter-metadata category=math
 npx promptfoo@latest view
 ```
 
+## Relay / Proxy API Configuration
+
+When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
+
+```yaml
+providers:
+  - id: anthropic:messages:claude-sonnet-4-6
+    label: Claude-Sonnet-4.6
+    config:
+      max_tokens: 4096
+      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages
+
+# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
+commandLineOptions:
+  maxConcurrency: 1  # Respect relay rate limits
+```
+
+**Key rules:**
+- `apiBaseUrl` goes in `providers[].config` — Promptfoo appends `/v1/messages` automatically
+- `maxConcurrency` must be under `commandLineOptions:` — placing it at top level is silently ignored
+- When using relay with LLM-as-judge, set `maxConcurrency: 1` to avoid concurrent request limits (generation + grading share the same pool)
+- Pass relay token as `ANTHROPIC_API_KEY` env var
+
 ## Troubleshooting
 
 **Python not found:**
@@ -246,7 +288,20 @@ export PROMPTFOO_PYTHON=python3
 Outputs over 30000 characters are truncated. Use `head_limit` in assertions.
 
 **File not found errors:**
-Ensure paths are relative to `promptfooconfig.yaml` location.
+All `file://` paths resolve relative to `promptfooconfig.yaml` location.
+
+**maxConcurrency ignored (shows "up to N at a time"):**
+`maxConcurrency` must be under `commandLineOptions:`, not at the YAML top level. This is a common mistake.
+
+**LLM-as-judge returns 401 with relay API:**
+Each `llm-rubric` assertion must have its own `provider` with `apiBaseUrl`. The main provider config is not inherited by grader assertions.
+
+**HTML tags in model output inflating metrics:**
+Models may output HTML tags (e.g. `<br>`, `<p>`) in structured content. Strip HTML in Python assertions before measuring:
+```python
+import re
+clean_text = re.sub(r'<[^>]+>', '', raw_text)
+```
 
 ## Echo Provider (Preview Mode)
 
@@ -327,7 +382,7 @@ For Chinese/long-form content evaluations (10k+ characters):
 
 ```yaml
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
+  - id: anthropic:messages:claude-sonnet-4-6
     config:
       max_tokens: 8192  # Increase for long outputs
 
diff --git a/promptfoo-evaluation/references/promptfoo_api.md b/promptfoo-evaluation/references/promptfoo_api.md
index 2c723e4..ba37a58 100644
--- a/promptfoo-evaluation/references/promptfoo_api.md
+++ b/promptfoo-evaluation/references/promptfoo_api.md
@@ -21,10 +21,12 @@ providers:
 
 ```yaml
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
+  - id: anthropic:messages:claude-sonnet-4-6
     config:
       max_tokens: 4096
       temperature: 0.7
+      # For relay/proxy APIs:
+      # apiBaseUrl: https://your-relay.example.com/api
 ```
 
 ### OpenAI
@@ -41,7 +43,7 @@ providers:
 
 ```yaml
 providers:
-  - id: anthropic:messages:claude-sonnet-4-5-20250929
+  - id: anthropic:messages:claude-sonnet-4-6
     label: Claude
   - id: openai:gpt-4.1
     label: GPT-4.1
@@ -193,6 +195,26 @@ defaultTest:
       provider: anthropic:claude-3-haiku  # Override for this assertion
 ```
 
+### Relay/Proxy Provider with LLM-as-Judge
+
+When using a relay or proxy API, each `llm-rubric` needs its own `provider` config — the main provider's `apiBaseUrl` is NOT inherited:
+
+```yaml
+providers:
+  - id: anthropic:messages:claude-sonnet-4-6
+    config:
+      apiBaseUrl: https://your-relay.example.com/api
+
+defaultTest:
+  assert:
+    - type: llm-rubric
+      value: "Evaluate quality"
+      provider:
+        id: anthropic:messages:claude-sonnet-4-6
+        config:
+          apiBaseUrl: https://your-relay.example.com/api  # Must repeat here
+```
+
 ## Environment Variables
 
 | Variable | Description |
 |----------|-------------|
 | `ANTHROPIC_API_KEY` | Anthropic API key |
 | `OPENAI_API_KEY` | OpenAI API key |
 | `PROMPTFOO_CACHE_ENABLED` | Enable caching (default: true) |
 | `PROMPTFOO_CACHE_PATH` | Cache directory |
 
+## Concurrency Control
+
+**CRITICAL**: `maxConcurrency` must be placed under `commandLineOptions:` in the YAML config. Placing it at the top level is silently ignored.
+
+```yaml
+# ✅ Correct — under commandLineOptions
+commandLineOptions:
+  maxConcurrency: 1
+
+# ❌ Wrong — top level (silently ignored, defaults to ~4)
+maxConcurrency: 1
+```
+
+When using relay APIs with LLM-as-judge, set `maxConcurrency: 1` because generation and grading share the same concurrent request pool.
+
 ## CLI Commands
 
 ```bash
 # Run evaluation
 npx promptfoo@latest eval [options]
 # --config Config file path
 # --output Output file path
 # --grader Override grader model
-# --no-cache Disable caching
+# --no-cache Disable caching (important for re-runs)
 # --filter-metadata Filter tests by metadata
 # --repeat Repeat each test n times
 # --delay Delay between requests
-# --max-concurrency Parallel requests
+# --max-concurrency Parallel requests (CLI override)
 
 # View results
 npx promptfoo@latest view [options]
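
---

Addendum (not part of the patch): the HTML-stripping one-liner added in the Troubleshooting hunk can be packaged as a standalone promptfoo Python assertion file. This is a minimal sketch, assuming promptfoo's documented `get_assert(output, context)` entry point for `type: python` assertions; the file name `assert_no_html.py` and the non-empty-text pass criterion are illustrative, not prescribed by the patch:

```python
import re


def strip_html(text: str) -> str:
    """Remove HTML tags so length/quality metrics are computed
    on visible text only (same regex as the SKILL.md snippet)."""
    return re.sub(r'<[^>]+>', '', text)


def get_assert(output: str, context) -> dict:
    """promptfoo Python assertion entry point.

    Would be referenced from promptfooconfig.yaml as (illustrative):
      assert:
        - type: python
          value: file://assert_no_html.py
    """
    clean = strip_html(output).strip()
    removed = len(output) - len(strip_html(output))
    return {
        "pass": len(clean) > 0,           # fail if nothing but markup remains
        "score": 1.0 if clean else 0.0,
        "reason": f"{removed} characters of HTML markup removed",
    }
```

Returning a dict (promptfoo's `GradingResult` shape) rather than a bare bool lets the `reason` surface in `promptfoo view`.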