

Promptfoo API Reference

Provider Configuration

Echo Provider (No API Calls)

providers:
  - echo  # Returns prompt as-is, no API calls

Use cases:

  • Preview rendered prompts without cost
  • Debug variable substitution
  • Verify few-shot structure
  • Test configuration before production runs

Cost: Free - no tokens consumed.
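A minimal end-to-end config using echo might look like this (the prompt and variable names here are illustrative):

```yaml
# promptfooconfig.yaml -- sketch for previewing rendered prompts at zero cost
prompts:
  - "Summarize the following text:\n{{text}}"
providers:
  - echo
tests:
  - vars:
      text: "Hello world"
```

Running `promptfoo eval` against this config prints each rendered prompt back as the "output", so you can inspect variable substitution before switching to a paid provider.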

Anthropic

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096
      temperature: 0.7
      # For relay/proxy APIs:
      # apiBaseUrl: https://your-relay.example.com/api

OpenAI

providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048

Multiple Providers (A/B Testing)

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1

Assertion Reference

Python Assertion Context

class AssertionContext:
    prompt: str              # Raw prompt sent to LLM
    vars: dict               # Test case variables
    test: dict               # Complete test case
    config: dict             # Assertion config
    provider: Any            # Provider info
    providerResponse: Any    # Full response

GradingResult Format

{
    "pass": bool,           # Required: pass/fail
    "score": float,         # 0.0-1.0 score
    "reason": str,          # Explanation
    "named_scores": dict,   # Custom metrics
    "component_results": [] # Nested results
}
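As a concrete sketch, a file-based Python assertion returns this structure from a get_assert hook (the file name and section checks below are illustrative, not part of promptfoo):

```python
# assertions/check_sections.py (illustrative)
# For `type: python` assertions, promptfoo calls get_assert(output, context)
# and accepts a bool, a float, or a GradingResult dict like the one returned here.
def get_assert(output, context=None):
    required = ["summary", "conclusion"]
    missing = [s for s in required if s not in output.lower()]
    return {
        "pass": len(missing) == 0,
        "score": 1.0 - len(missing) / len(required),
        "reason": "all sections present" if not missing
                  else f"missing sections: {missing}",
    }
```

Reference it from a test case as `value: file://assertions/check_sections.py`.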

Assertion Types

Type                   Description                        Parameters
contains               Substring check                    value
icontains              Case-insensitive substring check   value
equals                 Exact match                        value
regex                  Pattern match                      value
not-contains           Absence check                      value
starts-with            Prefix check                       value
contains-any           Any substring present              value (array)
contains-all           All substrings present             value (array)
cost                   Token cost below limit             threshold
latency                Response time below limit          threshold (ms)
perplexity             Model confidence                   threshold
python                 Custom Python assertion            value (file/code)
javascript             Custom JS assertion                value (code)
llm-rubric             LLM grading                        value, threshold
factuality             Fact checking                      value (reference)
model-graded-closedqa  Q&A grading                        value
similar                Semantic similarity                value, threshold
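Assertions can be mixed on a single test case. A sketch combining deterministic and model-graded checks (the values here are illustrative):

```yaml
assert:
  - type: icontains
    value: "refund"
  - type: latency
    threshold: 2000    # milliseconds
  - type: llm-rubric
    value: "Response is polite and directly answers the question"
    threshold: 0.8
```

Deterministic checks (icontains, latency) run for free; llm-rubric incurs grader-model cost on every test case, so put the cheap checks first when triaging failures.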

Test Case Configuration

Full Test Case Structure

- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"

Loading Variables from Files

vars:
  # Text file (loaded as string)
  content: file://data/input.txt

  # JSON/YAML (parsed to object)
  config: file://config.json

  # Python script (executed, returns value)
  dynamic: file://scripts/generate.py

  # PDF (text extracted)
  document: file://docs/report.pdf

  # Image (base64 encoded)
  image: file://images/photo.png
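Note that file:// paths resolve relative to the directory containing promptfooconfig.yaml, not the shell's working directory, so a layout like the following works no matter where the eval is launched from (names illustrative):

```yaml
# project/
# ├── promptfooconfig.yaml
# └── data/
#     └── input.txt
vars:
  content: file://data/input.txt   # resolves to project/data/input.txt
```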

Advanced Patterns

Dynamic Test Generation (Python)

# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]

# promptfooconfig.yaml
tests: file://tests/generate.py:get_tests

Scenario-based Testing

scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    tests:
      - vars:
          text: "Hello"
        assert:
          - type: llm-rubric
            value: "Translation is accurate"

Transform Output

defaultTest:
  options:
    transform: |
      output.replace(/\n/g, ' ').trim()

Custom Grading Provider

defaultTest:
  options:
    provider: openai:gpt-4.1
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider: anthropic:messages:claude-3-haiku  # Override for this assertion

Relay/Proxy Provider with LLM-as-Judge

When using a relay or proxy API, each llm-rubric needs its own provider config, because the main provider's apiBaseUrl is NOT inherited by the grader:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiBaseUrl: https://your-relay.example.com/api

defaultTest:
  assert:
    - type: llm-rubric
      value: "Evaluate quality"
      provider:
        id: anthropic:messages:claude-sonnet-4-6
        config:
          apiBaseUrl: https://your-relay.example.com/api  # Must repeat here

Environment Variables

Variable                 Description
ANTHROPIC_API_KEY        Anthropic API key
OPENAI_API_KEY           OpenAI API key
PROMPTFOO_PYTHON         Python binary path
PROMPTFOO_CACHE_ENABLED  Enable caching (default: true)
PROMPTFOO_CACHE_PATH     Cache directory

Concurrency Control

CRITICAL: maxConcurrency must be placed under commandLineOptions: in the YAML config. A top-level maxConcurrency is silently ignored, and the eval falls back to the default of roughly 4 parallel requests.

# ✅ Correct — under commandLineOptions
commandLineOptions:
  maxConcurrency: 1

# ❌ Wrong — top level (silently ignored, defaults to ~4)
maxConcurrency: 1

When using relay APIs with LLM-as-judge, set maxConcurrency: 1 because generation and grading share the same concurrent request pool.

CLI Commands

# Initialize project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval [options]

# Options:
#   --config <path>     Config file path
#   --output <path>     Output file path
#   --grader <provider> Override grader model
#   --no-cache          Disable caching (important for re-runs)
#   --filter-metadata   Filter tests by metadata
#   --repeat <n>        Repeat each test n times
#   --delay <ms>        Delay between requests
#   --max-concurrency   Parallel requests (CLI override)

# View results
npx promptfoo@latest view [options]

# Share results
npx promptfoo@latest share

# Generate a test dataset
npx promptfoo@latest generate dataset

Output Formats

# JSON (default)
--output results.json

# CSV
--output results.csv

# HTML report
--output results.html

# YAML
--output results.yaml