

Promptfoo API Reference

Provider Configuration

Echo Provider (No API Calls)

providers:
  - echo  # Returns prompt as-is, no API calls

Use cases:

  • Preview rendered prompts without cost
  • Debug variable substitution
  • Verify few-shot structure
  • Test configuration before production runs

Cost: Free - no tokens consumed.
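A minimal end-to-end config using echo might look like this (the prompt and variable names here are illustrative):

```yaml
# promptfooconfig.yaml -- sketch for previewing rendered prompts at zero cost
prompts:
  - "Summarize the following text:\n{{text}}"
providers:
  - echo
tests:
  - vars:
      text: "Hello world"
```

Running `promptfoo eval` against this config prints each rendered prompt back as the "output", so you can inspect variable substitution before switching to a paid provider.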

Anthropic

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096
      temperature: 0.7
      # For relay/proxy APIs:
      # apiBaseUrl: https://your-relay.example.com/api

OpenAI

providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048

Multiple Providers (A/B Testing)

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1

Assertion Reference

Python Assertion Context

class AssertionContext:
    prompt: str              # Raw prompt sent to LLM
    vars: dict               # Test case variables
    test: dict               # Complete test case
    config: dict             # Assertion config
    provider: Any            # Provider info
    providerResponse: Any    # Full response

GradingResult Format

{
    "pass": bool,           # Required: pass/fail
    "score": float,         # 0.0-1.0 score
    "reason": str,          # Explanation
    "named_scores": dict,   # Custom metrics
    "component_results": [] # Nested results
}
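As a concrete sketch, a file-based Python assertion returns this structure from a get_assert hook (the file name and section checks below are illustrative, not part of promptfoo):

```python
# assertions/check_sections.py (illustrative)
# For `type: python` assertions, promptfoo calls get_assert(output, context)
# and accepts a bool, a float, or a GradingResult dict like the one returned here.
def get_assert(output, context=None):
    required = ["summary", "conclusion"]
    missing = [s for s in required if s not in output.lower()]
    return {
        "pass": len(missing) == 0,
        "score": 1.0 - len(missing) / len(required),
        "reason": "all sections present" if not missing
                  else f"missing sections: {missing}",
    }
```

Reference it from a test case as `value: file://assertions/check_sections.py`.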

Assertion Types

Type                   Description                        Parameters
contains               Substring check                    value
icontains              Case-insensitive substring check   value
equals                 Exact match                        value
regex                  Pattern match                      value
not-contains           Absence check                      value
starts-with            Prefix check                       value
contains-any           Any substring present              value (array)
contains-all           All substrings present             value (array)
cost                   Token cost below limit             threshold
latency                Response time below limit          threshold (ms)
perplexity             Model confidence                   threshold
python                 Custom Python assertion            value (file/code)
javascript             Custom JS assertion                value (code)
llm-rubric             LLM grading                        value, threshold
factuality             Fact checking                      value (reference)
model-graded-closedqa  Q&A grading                        value
similar                Semantic similarity                value, threshold
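Assertions can be mixed on a single test case. A sketch combining deterministic and model-graded checks (the values here are illustrative):

```yaml
assert:
  - type: icontains
    value: "refund"
  - type: latency
    threshold: 2000    # milliseconds
  - type: llm-rubric
    value: "Response is polite and directly answers the question"
    threshold: 0.8
```

Deterministic checks (icontains, latency) run for free; llm-rubric incurs grader-model cost on every test case, so put the cheap checks first when triaging failures.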

Test Case Configuration

Full Test Case Structure

- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"

Loading Variables from Files

vars:
  # Text file (loaded as string)
  content: file://data/input.txt

  # JSON/YAML (parsed to object)
  config: file://config.json

  # Python script (executed, returns value)
  dynamic: file://scripts/generate.py

  # PDF (text extracted)
  document: file://docs/report.pdf

  # Image (base64 encoded)
  image: file://images/photo.png
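Note that file:// paths resolve relative to the directory containing promptfooconfig.yaml, not the shell's working directory, so a layout like the following works no matter where the eval is launched from (names illustrative):

```yaml
# project/
# ├── promptfooconfig.yaml
# └── data/
#     └── input.txt
vars:
  content: file://data/input.txt   # resolves to project/data/input.txt
```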

Advanced Patterns

Dynamic Test Generation (Python)

# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]

# promptfooconfig.yaml
tests: file://tests/generate.py:get_tests

Scenario-based Testing

scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    tests:
      - vars:
          text: "Hello"
        assert:
          - type: llm-rubric
            value: "Translation is accurate"

Transform Output

defaultTest:
  options:
    transform: |
      output.replace(/\n/g, ' ').trim()

Custom Grading Provider

defaultTest:
  options:
    provider: openai:gpt-4.1
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider: anthropic:messages:claude-3-haiku  # Override for this assertion

Relay/Proxy Provider with LLM-as-Judge

When using a relay or proxy API, each llm-rubric needs its own provider config, because the main provider's apiBaseUrl is NOT inherited by the grader:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiBaseUrl: https://your-relay.example.com/api

defaultTest:
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider:
        id: anthropic:messages:claude-sonnet-4-6
        config:
          apiBaseUrl: https://your-relay.example.com/api  # Must repeat here

Environment Variables

Variable                 Description
ANTHROPIC_API_KEY        Anthropic API key
OPENAI_API_KEY           OpenAI API key
PROMPTFOO_PYTHON         Python binary path
PROMPTFOO_CACHE_ENABLED  Enable caching (default: true)
PROMPTFOO_CACHE_PATH     Cache directory

Concurrency Control

CRITICAL: maxConcurrency must be placed under commandLineOptions: in the YAML config. A top-level maxConcurrency is silently ignored, and the eval falls back to the default of roughly 4 parallel requests.

# ✅ Correct — under commandLineOptions
commandLineOptions:
  maxConcurrency: 1

# ❌ Wrong — top level (silently ignored, defaults to ~4)
maxConcurrency: 1

When using relay APIs with LLM-as-judge, set maxConcurrency: 1 because generation and grading share the same concurrent request pool.

CLI Commands

# Initialize project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval [options]

# Options:
#   --config <path>     Config file path
#   --output <path>     Output file path
#   --grader <provider> Override grader model
#   --no-cache          Disable caching (important for re-runs)
#   --filter-metadata   Filter tests by metadata
#   --repeat <n>        Repeat each test n times
#   --delay <ms>        Delay between requests
#   --max-concurrency   Parallel requests (CLI override)

# View results
npx promptfoo@latest view [options]

# Share results
npx promptfoo@latest share

# Generate a test dataset
npx promptfoo@latest generate dataset

Output Formats

# JSON (default)
--output results.json

# CSV
--output results.csv

# HTML report
--output results.html

# YAML
--output results.yaml