Release v1.18.0: Add iOS-APP-developer and promptfoo-evaluation skills
### Added

- **New Skill**: iOS-APP-developer v1.1.0 - iOS development with XcodeGen, SwiftUI, and SPM
  - XcodeGen project.yml configuration
  - SPM dependency resolution
  - Device deployment and code signing
  - Camera/AVFoundation debugging
  - iOS version compatibility handling
  - "Library not loaded" @rpath framework error fixes
  - State machine testing patterns for @MainActor classes
  - Bundled references: xcodegen-full.md, camera-avfoundation.md, swiftui-compatibility.md, testing-mainactor.md
- **New Skill**: promptfoo-evaluation v1.0.0 - LLM evaluation framework using Promptfoo
  - Promptfoo configuration (promptfooconfig.yaml)
  - Python custom assertions
  - llm-rubric for LLM-as-judge evaluations
  - Few-shot example management
  - Model comparison and prompt testing
  - Bundled reference: promptfoo_api.md

### Changed

- Updated marketplace version from 1.16.0 to 1.18.0
- Updated marketplace skills count from 23 to 25
- Updated skill-creator to v1.2.2:
  - Fixed best practices documentation URL (platform.claude.com)
  - Enhanced quick_validate.py to exclude file:// prefixed paths from validation
- Updated marketplace.json metadata description to include new skills

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---

**promptfoo-evaluation/.security-scan-passed** (new file, 4 lines)
```
Security scan passed
Scanned at: 2025-12-11T22:24:55.327388
Tool: gitleaks + pattern-based validation
Content hash: d04b93ec8a47fa7b64a2d0ee9790997e5ecc212ddbfa4c2c58fddafa2424d49a
```
---

**promptfoo-evaluation/SKILL.md** (new file, 392 lines)
---
name: promptfoo-evaluation
description: Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
---

# Promptfoo Evaluation

## Overview

This skill provides guidance for configuring and running LLM evaluations using [Promptfoo](https://www.promptfoo.dev/), an open-source CLI tool for testing and comparing LLM outputs.

## Quick Start

```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```

## Configuration Structure

A typical Promptfoo project structure:

```
project/
├── promptfooconfig.yaml         # Main configuration
├── prompts/
│   ├── system.md                # System prompt
│   └── chat.json                # Chat format prompt
├── tests/
│   └── cases.yaml               # Test cases
└── scripts/
    └── metrics.py               # Custom Python assertions
```

## Core Configuration (promptfooconfig.yaml)

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```

## Prompt Formats

### Text Prompt (system.md)

```markdown
You are a helpful assistant.

Task: {{task}}
Context: {{context}}
```

### Chat Format (chat.json)

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```

### Few-Shot Pattern

Embed examples directly in the prompt, or use chat format with assistant messages:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```

## Test Cases (tests/cases.yaml)

```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```

## Python Custom Assertions

Create a Python file for custom assertions (e.g., `scripts/metrics.py`):

```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```

**Key points:**
- Default function name is `get_assert`
- Specify a function with `file://path.py:function_name`
- Return `bool`, `float` (score), or `dict` with pass/score/reason
- Access variables via `context['vars']`

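The three return forms from the last bullet can be sketched side by side. This is standalone illustrative code; the `output`/`context` signature follows the example above, and the function names and target words are placeholders:

```python
def passes_exact(output: str, context: dict) -> bool:
    # bool form: pass/fail only, no score
    return "expected" in output


def score_overlap(output: str, context: dict) -> float:
    # float form: interpreted as a 0.0-1.0 score
    words = set(output.lower().split())
    target = {"prompt", "testing"}  # placeholder keywords
    return len(words & target) / len(target)


def graded(output: str, context: dict) -> dict:
    # dict form: full result with pass/score/reason
    score = score_overlap(output, context)
    return {
        "pass": score >= 0.5,
        "score": score,
        "reason": f"keyword overlap score {score:.2f}",
    }
```

A `float` return is useful when the pass/fail decision should come from a `threshold` in the config rather than from the assertion itself.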
## LLM-as-Judge (llm-rubric)

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```

**Best practices:**
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses whichever API keys are available (OpenAI → Anthropic → Google)

## Common Assertion Types

| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive substring | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time (ms) | `threshold: 1000` |

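Several of these types can be stacked on a single test case; a test passes only when every assertion passes. A sketch (the values are placeholders):

```yaml
assert:
  - type: icontains
    value: "hello"
  - type: regex
    value: "\\d{4}"
  - type: latency
    threshold: 1000
```
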
## File References

All paths are relative to the config file location:

```yaml
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```

## Running Evaluations

```bash
# Basic run
npx promptfoo@latest eval

# With a specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```

## Troubleshooting

**Python not found:**

```bash
export PROMPTFOO_PYTHON=python3
```

**Large outputs truncated:**

Outputs over 30000 characters are truncated. Use `head_limit` in assertions.

**File not found errors:**

Ensure paths are relative to the `promptfooconfig.yaml` location.

## Echo Provider (Preview Mode)

Use the **echo provider** to preview rendered prompts without making API calls:

```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns the prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```

**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure

```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```

**Cost:** Free - no API tokens consumed.

## Advanced Few-Shot Implementation

### Multi-turn Conversation Pattern

For complex few-shot learning with full examples:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},

  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  {"role": "user", "content": "Task: {{actual_input}}"}
]
```

The user/assistant pairs are few-shot example 1 and (optional) example 2; the final user message is the actual test input. (JSON does not allow comments, so the structure is annotated here rather than inline.)

**Test case configuration:**

```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```

**Best practices:**
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure

## Long Text Handling

For Chinese/long-form content evaluations (10k+ characters):

**Configuration:**

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```

**Python assertion for text metrics:**

```python
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')

    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))

    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```

## Real-World Example

**Project:** Chinese short-video content curation from long transcripts

**Structure:**

```
tiaogaoren/
├── promptfooconfig.yaml             # Production config
├── promptfooconfig-preview.yaml     # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json       # Chat format with few-shot
│   └── v4/system-v4.md              # System prompt
├── tests/cases.yaml                 # 3 test samples
├── scripts/metrics.py               # Custom metrics (reduction ratio, etc.)
├── data/                            # 5 samples (2 few-shot, 3 eval)
└── results/
```

**See:** `~/workspace/prompts/tiaogaoren/` for the full implementation.

## Resources

For detailed API reference and advanced patterns, see [references/promptfoo_api.md](references/promptfoo_api.md).

---

**promptfoo-evaluation/references/promptfoo_api.md** (new file, 249 lines)
# Promptfoo API Reference

## Provider Configuration

### Echo Provider (No API Calls)

```yaml
providers:
  - echo  # Returns the prompt as-is, no API calls
```

**Use cases:**
- Preview rendered prompts without cost
- Debug variable substitution
- Verify few-shot structure
- Test configuration before production runs

**Cost:** Free - no tokens consumed.

### Anthropic

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 4096
      temperature: 0.7
```

### OpenAI

```yaml
providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048
```

### Multiple Providers (A/B Testing)

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1
```

## Assertion Reference

### Python Assertion Context

```python
class AssertionContext:
    prompt: str             # Raw prompt sent to the LLM
    vars: dict              # Test case variables
    test: dict              # Complete test case
    config: dict            # Assertion config
    provider: Any           # Provider info
    providerResponse: Any   # Full response
```

### GradingResult Format

```python
{
    "pass": bool,             # Required: pass/fail
    "score": float,           # 0.0-1.0 score
    "reason": str,            # Explanation
    "named_scores": dict,     # Custom metrics
    "component_results": []   # Nested results
}
```

### Assertion Types

| Type | Description | Parameters |
|------|-------------|------------|
| `contains` | Substring check | `value` |
| `icontains` | Case-insensitive substring | `value` |
| `equals` | Exact match | `value` |
| `regex` | Pattern match | `value` |
| `not-contains` | Absence check | `value` |
| `starts-with` | Prefix check | `value` |
| `contains-any` | Any substring | `value` (array) |
| `contains-all` | All substrings | `value` (array) |
| `cost` | Token cost | `threshold` |
| `latency` | Response time | `threshold` (ms) |
| `perplexity` | Model confidence | `threshold` |
| `python` | Custom Python | `value` (file/code) |
| `javascript` | Custom JS | `value` (code) |
| `llm-rubric` | LLM grading | `value`, `threshold` |
| `factuality` | Fact checking | `value` (reference) |
| `model-graded-closedqa` | Q&A grading | `value` |
| `similar` | Semantic similarity | `value` (reference), `threshold` |

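As a sketch, the model-graded types from the table take a reference string as `value` (the strings below are placeholders; `similar` additionally needs an embedding-capable provider configured):

```yaml
assert:
  - type: factuality
    value: "Paris is the capital of France"
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8
```
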
## Test Case Configuration

### Full Test Case Structure

```yaml
- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"
```

### Loading Variables from Files

```yaml
vars:
  # Text file (loaded as a string)
  content: file://data/input.txt

  # JSON/YAML (parsed to an object)
  config: file://config.json

  # Python script (executed, returns a value)
  dynamic: file://scripts/generate.py

  # PDF (text extracted)
  document: file://docs/report.pdf

  # Image (base64 encoded)
  image: file://images/photo.png
```

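For the Python-script case, a minimal sketch of the referenced script, assuming Promptfoo's `get_var(var_name, prompt, other_vars)` hook that returns a dict with an `output` key (the file name and logic are illustrative):

```python
# scripts/generate.py - hypothetical helper for the `dynamic` var above
def get_var(var_name: str, prompt: str, other_vars: dict) -> dict:
    """Compute a variable value at eval time from the other test vars."""
    # Placeholder logic: derive context from a sibling var
    topic = other_vars.get("topic", "general")
    return {"output": f"Generated context for {topic}"}
```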
## Advanced Patterns

### Dynamic Test Generation (Python)

```python
# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]
```

```yaml
tests: file://tests/generate.py:get_tests
```

### Scenario-based Testing

```yaml
scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    tests:
      - vars:
          text: "Hello"
        assert:
          - type: llm-rubric
            value: "Translation is accurate"
```

### Transform Output

```yaml
defaultTest:
  options:
    transform: |
      output.replace(/\n/g, ' ').trim()
```

### Custom Grading Provider

```yaml
defaultTest:
  options:
    provider: openai:gpt-4.1

assert:
  - type: llm-rubric
    value: "Evaluate quality"
    provider: anthropic:claude-3-haiku  # Override for this assertion
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `PROMPTFOO_PYTHON` | Python binary path |
| `PROMPTFOO_CACHE_ENABLED` | Enable caching (default: true) |
| `PROMPTFOO_CACHE_PATH` | Cache directory |

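A typical pre-run setup using these variables, as a sketch (the key values are placeholders, not real credentials):

```shell
# Placeholder keys - substitute real credentials
export ANTHROPIC_API_KEY="sk-ant-xxxx"
export OPENAI_API_KEY="sk-xxxx"

# Point promptfoo at a specific Python for custom assertions
export PROMPTFOO_PYTHON=python3

# Disable the response cache for a fresh run
export PROMPTFOO_CACHE_ENABLED=false
```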
## CLI Commands

```bash
# Initialize project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval [options]

# Options:
#   --config <path>       Config file path
#   --output <path>       Output file path
#   --grader <provider>   Override grader model
#   --no-cache            Disable caching
#   --filter-metadata     Filter tests by metadata
#   --repeat <n>          Repeat each test n times
#   --delay <ms>          Delay between requests
#   --max-concurrency     Parallel requests

# View results
npx promptfoo@latest view [options]

# Share results
npx promptfoo@latest share

# Generate a test dataset
npx promptfoo@latest generate dataset
```

## Output Formats

```bash
# JSON (default)
--output results.json

# CSV
--output results.csv

# HTML report
--output results.html

# YAML
--output results.yaml
```