**Bug fixes (run_experiment.py):**
- Fix broken revert logic: previously saved HEAD as pre_commit (a no-op revert); now uses `git reset --hard HEAD~1` for correct rollback
- Remove broken `--loop` mode (the agent IS the loop; the script handles one iteration)
- Fix shell injection: all git commands use subprocess list form
- Replace shell `tail` with a Python file read

**Bug fixes (other scripts):**
- setup_experiment.py: fix shell injection in git branch creation, remove dead `--skip-baseline` flag, fix evaluator docstring parsing
- log_results.py: fix 6 falsy-zero bugs (baseline=0 treated as None), add domain_filter to CSV/markdown export, move `import time` to the top
- evaluators: add FileNotFoundError handling, fix output format mismatch in llm_judge_copy, add peak_kb on macOS, add ValueError handling

**Plugin packaging (NEW):**
- plugin.json, settings.json, CLAUDE.md for the plugin registry
- 5 slash commands: /ar:setup, /ar:run, /ar:loop, /ar:status, /ar:resume
- /ar:loop supports user-selected intervals (10m, 1h, daily, weekly, monthly)
- experiment-runner agent for autonomous loop iterations
- Registered in marketplace.json as plugin #20

**SKILL.md rewrite:**
- Replace ambiguous "Loop Protocol" with a clear "Agent Protocol"
- Add results.tsv format spec, strategy escalation, self-improvement
- Replace "NEVER STOP" with resumable stopping logic

**Docs & sync:**
- Codex (157 skills), Gemini (229 items), convert.sh all pick up the skill
- 6 new MkDocs pages; mkdocs.yml nav updated
- Counts updated: 17 agents, 22 slash commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
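The falsy-zero class of bug fixed in log_results.py can be sketched as follows. The variable names here are illustrative stand-ins, not taken from the actual script:

```python
# A baseline score of 0.0 is real measured data, but it is falsy in Python.
baseline = 0.0

# Buggy check: conflates "baseline is zero" with "no baseline recorded".
if not baseline:
    status_buggy = "no baseline"
else:
    status_buggy = f"baseline={baseline}"

# Fixed check: only None means the value is actually missing.
if baseline is None:
    status_fixed = "no baseline"
else:
    status_fixed = f"baseline={baseline}"

print(status_buggy)  # no baseline  (wrong: 0.0 is a legitimate score)
print(status_fixed)  # baseline=0.0
```

The same `is None` test applies to each of the six fixed call sites: any score, count, or duration that can legitimately be zero must never be tested with bare truthiness.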
78 lines
2.7 KiB
Python
#!/usr/bin/env python3
"""LLM judge for content quality (headlines, titles, descriptions).

Uses the user's existing CLI tool (claude, codex, gemini) for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "content/titles.md"  # File being optimized
CLI_TOOL = "claude"  # or: codex, gemini
# --- END CONFIG ---

# The judge prompt is FIXED — the agent cannot change how it's evaluated
JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly.

Criteria (each scored 1-10):

1. CURIOSITY GAP — Does this make you want to click? Is there an information gap
   that can only be resolved by reading? Generic titles score 1-3. Specific,
   intriguing titles score 7-10.

2. SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved
   performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9.

3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out,
   or recognition? Flat titles score 1-3. Emotionally charged titles score 7-10.

4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or
   search results? Would they pause on this headline? Rate honestly.

5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally?
   Keyword-stuffed = 3. Natural integration of search terms = 8-10.

Output EXACTLY this format (nothing else):
curiosity: <score>
specificity: <score>
emotional: <score>
scroll_stop: <score>
seo: <score>
ctr_score: <average of all 5 scores>

Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""

try:
    content = Path(TARGET_FILE).read_text()
except FileNotFoundError:
    print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
    sys.exit(1)

full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"

# Call the user's CLI tool; guard against a missing binary and a hung call
try:
    result = subprocess.run(
        [CLI_TOOL, "-p", full_prompt],
        capture_output=True, text=True, timeout=120
    )
except FileNotFoundError:
    print(f"CLI tool not found: {CLI_TOOL}", file=sys.stderr)
    sys.exit(1)
except subprocess.TimeoutExpired:
    print("LLM judge timed out after 120s", file=sys.stderr)
    sys.exit(1)

if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

# Parse output — echo the per-criterion lines and the ctr_score line
output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith(("curiosity:", "specificity:", "emotional:",
                        "scroll_stop:", "seo:", "ctr_score:")):
        print(line)

# Verify ctr_score was found
if "ctr_score:" not in output:
    print("Could not parse ctr_score from LLM output", file=sys.stderr)
    print(f"Raw output: {output[:500]}", file=sys.stderr)
    sys.exit(1)