refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators

Major rewrite based on deep study of Karpathy's autoresearch repo.

Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation

New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed

Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output

Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view

SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
Leo
2026-03-13 08:22:14 +01:00
parent c834d71a44
commit 12591282da
13 changed files with 1744 additions and 702 deletions


@@ -1,9 +1,9 @@
---
name: "autoresearch-agent"
description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
license: MIT
metadata:
  version: 2.0.0
  author: Alireza Rezvani
  category: engineering
  updated: 2026-03-13
@@ -13,194 +13,233 @@ metadata:
> You sleep. The agent experiments. You wake up to results.

Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

Not one guess — fifty measured attempts, compounding.

---
## When This Skill Activates

Recognize these patterns from the user:

- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.

---
## Setup

### First Time — Create the Experiment

Run the setup script. The user decides where experiments live:

**Project-level** (inside repo, git-tracked, shareable with team):

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```

**User-level** (personal, in `~/.autoresearch/`):

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```

The `--scope` flag determines where `.autoresearch/` lives:

- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- `user` → `~/.autoresearch/` in the home directory. Everything is personal.

### What Setup Creates

```
.autoresearch/
├── config.yaml          ← Global settings
├── .gitignore           ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md       ← Objectives, constraints, strategy
    ├── config.cfg       ← Target, eval cmd, metric, direction
    ├── results.tsv      ← Experiment log (gitignored)
    └── evaluate.py      ← Evaluation script (if --evaluator used)
```

### Domains

| Domain | Use Cases |
|--------|-----------|
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |

### If `program.md` Already Exists

The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
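What goes into `config.cfg` follows directly from the setup flags; a minimal sketch of a plausible layout (the key names and section are illustrative assumptions, not the script's guaranteed format):

```ini
; .autoresearch/engineering/api-speed/config.cfg (hypothetical layout)
[experiment]
target = src/api/search.py
eval = pytest bench.py --tb=no -q
metric = p50_ms
direction = lower
time_budget = 10m
```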
---
## The Experiment Loop

### Starting an Experiment

```bash
# Run specific experiment
python scripts/run_experiment.py --experiment engineering/api-speed --loop

# Single iteration (test setup)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Resume last active experiment
python scripts/run_experiment.py --resume --loop

# Dry run (show what would happen)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```

### The Loop Protocol

```
LOOP FOREVER:
  1. Read program.md for current strategy and constraints
  2. Review git log: what has been tried? What worked? What crashed?
  3. Review results.tsv: current best metric, trend, recent failures
  4. Propose ONE change to the target file
  5. Apply the change
  6. git commit -m "experiment: [short description of what changed]"
  7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
  8. Parse metric from run.log (grep for metric_name: value)
  9. Decision:
     - Metric improved → KEEP (advance branch, log "keep")
     - Metric equal or worse → REVERT (git reset --hard, log "discard")
     - Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
  10. Append result to results.tsv
  11. Go to 1
```

### Rules

- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code and getting the same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- **No new dependencies.** Only use what's already available in the project.
---
## Evaluators

Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.

### Free Evaluators (no API cost)

| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |

### LLM Judge Evaluators (uses your subscription)

| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.

The user's existing subscription covers the cost:

- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls

### Custom Evaluators

If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.

```python
#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import json
import subprocess

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
# Parse the benchmark's JSON output and print the metric
score = json.loads(result.stdout)["score"]  # adapt the key to your tool's output
print(f"my_metric: {score}")
```
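The `metric_name: value` contract is cheap to verify before committing to an overnight run; a minimal sketch of such a pre-flight check (not part of the shipped scripts):

```python
import re
import subprocess

def check_evaluator(cmd: str, metric_name: str) -> bool:
    """Run the evaluator once; confirm it exits 0 and prints 'metric_name: <number>'."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=600)
    if result.returncode != 0:
        return False
    pattern = rf"^{re.escape(metric_name)}:\s*-?\d+(\.\d+)?\s*$"
    return any(re.match(pattern, line.strip()) for line in result.stdout.splitlines())

# Example with a trivially passing stand-in evaluator
print(check_evaluator("echo 'my_metric: 3.14'", "my_metric"))
```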
---
## Viewing Results

```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```

### Dashboard Output

```
DOMAIN       EXPERIMENT    RUNS  KEPT  BEST    Δ FROM START  STATUS
engineering  api-speed     47    14    185ms   -76.9%        active
engineering  bundle-size   23    8     412KB   -58.3%        paused
marketing    medium-ctr    31    11    8.4/10  +68.0%        active
prompts      support-tone  15    6     82/100  +46.4%        done
```

### Export Formats

- **TSV** — default, tab-separated (compatible with spreadsheets)
- **CSV** — comma-separated, with proper quoting
- **Markdown** — formatted table, readable in GitHub/docs

---

## Proactive Triggers

Flag these without being asked:

- **Evaluation command untested** → Run it once before starting the loop and verify the output.
- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
- **Time budget too short** → If eval takes longer than budget, every run crashes.
- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
- **No improvement in 20+ runs** → Suggest changing strategy in `program.md` or trying a different approach.
---
@@ -214,7 +253,6 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
### Multi-tool install

```bash
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```
@@ -225,22 +263,9 @@ clawhub install autoresearch-agent
---

## Related Skills

- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops.
- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.


@@ -0,0 +1,56 @@
#!/usr/bin/env python3
"""Measure file, bundle, or Docker image size.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import subprocess
import sys

# --- CONFIGURE ONE OF THESE ---
# Option 1: File size
TARGET_FILE = "dist/main.js"
# Option 2: Directory size (uncomment to use)
# TARGET_DIR = "dist/"
# Option 3: Docker image (uncomment to use)
# DOCKER_IMAGE = "myapp:latest"
# DOCKER_BUILD_CMD = "docker build -t myapp:latest ."
# Option 4: Build first, then measure (uncomment to use)
# BUILD_CMD = "npm run build"
# --- END CONFIG ---

# Build if needed (BUILD_CMD exists only when Option 4 is uncommented)
if "BUILD_CMD" in globals():
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)
    if result.returncode != 0:
        print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)

# Measure
if "DOCKER_IMAGE" in globals():
    if "DOCKER_BUILD_CMD" in globals():
        subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
    result = subprocess.run(
        f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
        shell=True, capture_output=True, text=True
    )
    size_bytes = int(result.stdout.strip())
elif "TARGET_DIR" in globals():
    size_bytes = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, fns in os.walk(TARGET_DIR) for f in fns
    )
elif os.path.exists(TARGET_FILE):
    size_bytes = os.path.getsize(TARGET_FILE)
else:
    print(f"Target not found: {TARGET_FILE}", file=sys.stderr)
    sys.exit(1)

size_kb = size_bytes / 1024
size_mb = size_bytes / (1024 * 1024)
print(f"size_bytes: {size_bytes}")
print(f"size_kb: {size_kb:.1f}")
print(f"size_mb: {size_mb:.2f}")


@@ -0,0 +1,40 @@
#!/usr/bin/env python3
"""Measure execution speed of a target function or command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import statistics
import subprocess
import sys
import time

# --- CONFIGURE THESE ---
COMMAND = "python src/module.py"  # Command to benchmark
RUNS = 5                          # Number of runs
WARMUP = 1                        # Warmup runs (not counted)
# --- END CONFIG ---

times = []

# Warmup
for _ in range(WARMUP):
    subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)

# Benchmark
for i in range(RUNS):
    t0 = time.perf_counter()
    result = subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
    elapsed = (time.perf_counter() - t0) * 1000  # ms
    if result.returncode != 0:
        print(f"Run {i+1} failed (exit {result.returncode})", file=sys.stderr)
        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)
    times.append(elapsed)

p50 = statistics.median(times)
p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 5 else max(times)
print(f"p50_ms: {p50:.2f}")
print(f"p95_ms: {p95:.2f}")
print(f"runs: {RUNS}")


@@ -0,0 +1,39 @@
#!/usr/bin/env python3
"""Measure build/compile time.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import statistics
import subprocess
import sys
import time

# --- CONFIGURE THESE ---
BUILD_CMD = "npm run build"  # or: docker build -t test .
CLEAN_CMD = ""               # optional: npm run clean (run before each build)
RUNS = 3                     # Number of builds to average
# --- END CONFIG ---

times = []
for i in range(RUNS):
    # Clean if configured
    if CLEAN_CMD:
        subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)
    t0 = time.perf_counter()
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
    elapsed = time.perf_counter() - t0
    if result.returncode != 0:
        print(f"Build {i+1} failed (exit {result.returncode})", file=sys.stderr)
        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)
    times.append(elapsed)

avg = statistics.mean(times)
median = statistics.median(times)
print(f"build_seconds: {median:.2f}")
print(f"build_avg: {avg:.2f}")
print(f"runs: {RUNS}")


@@ -0,0 +1,72 @@
#!/usr/bin/env python3
"""LLM judge for content quality (headlines, titles, descriptions).
Uses the user's existing CLI tool (claude, codex, gemini) for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "content/titles.md"  # File being optimized
CLI_TOOL = "claude"                # or: codex, gemini
# --- END CONFIG ---

# The judge prompt is FIXED — the agent cannot change how it's evaluated
JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly.

Criteria (each scored 1-10):

1. CURIOSITY GAP — Does this make you want to click? Is there an information gap
   that can only be resolved by reading? Generic titles score 1-3. Specific,
   intriguing titles score 7-10.
2. SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved
   performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9.
3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out,
   or recognition? Flat titles score 1-3. Emotionally charged score 7-10.
4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or
   search results? Would they pause on this headline? Rate honestly.
5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally?
   Keyword-stuffed = 3. Natural integration of search terms = 8-10.

Output EXACTLY this format (nothing else):
curiosity: <score>
specificity: <score>
emotional: <score>
scroll_stop: <score>
seo: <score>
ctr_score: <average of all 5 scores>

Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""

content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"

# Call the user's CLI tool
result = subprocess.run(
    [CLI_TOOL, "-p", full_prompt],
    capture_output=True, text=True, timeout=120
)
if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

# Parse output — echo the score lines
output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith(("ctr_score:", "curiosity:", "specificity:",
                        "emotional:", "scroll_stop:", "seo:")):
        print(line)

# Verify ctr_score was found
if "ctr_score:" not in output:
    print("Could not parse ctr_score from LLM output", file=sys.stderr)
    print(f"Raw output: {output[:500]}", file=sys.stderr)
    sys.exit(1)


@@ -0,0 +1,88 @@
#!/usr/bin/env python3
"""LLM judge for marketing copy (social posts, ads, emails).
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "posts.md"  # Copy being optimized
CLI_TOOL = "claude"       # or: codex, gemini
PLATFORM = "twitter"      # twitter, linkedin, instagram, email, ad
# --- END CONFIG ---

JUDGE_PROMPTS = {
    "twitter": """Score this Twitter/X post strictly:
1. HOOK (1-10) — Does the first line stop the scroll?
2. VALUE (1-10) — Does it provide insight, entertainment, or utility?
3. ENGAGEMENT (1-10) — Would people reply, retweet, or like?
4. BREVITY (1-10) — Is every word earning its place? No filler?
5. CTA (1-10) — Is there a clear next action (even implicit)?""",
    "linkedin": """Score this LinkedIn post strictly:
1. HOOK (1-10) — Does the first line make you click "see more"?
2. STORYTELLING (1-10) — Is there a narrative arc or just statements?
3. CREDIBILITY (1-10) — Does it demonstrate expertise without bragging?
4. ENGAGEMENT (1-10) — Would professionals comment or share?
5. CTA (1-10) — Does it invite discussion or action?""",
    "instagram": """Score this Instagram caption strictly:
1. HOOK (1-10) — Does the first line grab attention?
2. RELATABILITY (1-10) — Does the audience see themselves in this?
3. VISUAL MATCH (1-10) — Does the copy complement visual content?
4. HASHTAG STRATEGY (1-10) — Are hashtags relevant and not spammy?
5. CTA (1-10) — Does it encourage saves, shares, or comments?""",
    "email": """Score this email subject + preview strictly:
1. OPEN INCENTIVE (1-10) — Would you open this in a crowded inbox?
2. SPECIFICITY (1-10) — Is it concrete or vague?
3. URGENCY (1-10) — Is there a reason to open now vs later?
4. PERSONALIZATION (1-10) — Does it feel written for someone, not everyone?
5. PREVIEW SYNC (1-10) — Does the preview text complement the subject?""",
    "ad": """Score this ad copy strictly:
1. ATTENTION (1-10) — Does it stop someone scrolling past ads?
2. DESIRE (1-10) — Does it create want for the product/service?
3. PROOF (1-10) — Is there credibility (numbers, social proof)?
4. ACTION (1-10) — Is the CTA clear and compelling?
5. OBJECTION HANDLING (1-10) — Does it preempt "why not"?""",
}

platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
JUDGE_PROMPT = f"""{platform_prompt}

Output EXACTLY this format:
criterion_1: <score>
criterion_2: <score>
criterion_3: <score>
criterion_4: <score>
criterion_5: <score>
engagement_score: <average of all 5>

Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""

content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"

result = subprocess.run(
    [CLI_TOOL, "-p", full_prompt],
    capture_output=True, text=True, timeout=120
)
if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith("engagement_score:") or line.startswith("criterion_"):
        print(line)

if "engagement_score:" not in output:
    print("Could not parse engagement_score from LLM output", file=sys.stderr)
    print(f"Raw: {output[:500]}", file=sys.stderr)
    sys.exit(1)


@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""LLM judge for prompt/instruction quality.
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import json
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "prompt.md"             # Prompt being optimized
TEST_CASES_FILE = "tests/cases.json"  # Test cases: [{"input": "...", "expected": "..."}]
CLI_TOOL = "claude"                   # or: codex, gemini
# --- END CONFIG ---

JUDGE_PROMPT_TEMPLATE = """You are evaluating a system prompt's effectiveness.

SYSTEM PROMPT BEING TESTED:
{prompt}

TEST INPUT:
{input}

EXPECTED OUTPUT (reference):
{expected}

ACTUAL OUTPUT:
{actual}

Score the actual output on these criteria (each 1-10):
1. ACCURACY — Does it match the expected output's intent and facts?
2. COMPLETENESS — Does it cover all required elements?
3. CLARITY — Is it well-structured and easy to understand?
4. INSTRUCTION_FOLLOWING — Does it follow the system prompt's guidelines?

Output EXACTLY: quality_score: <average of all 4>
Nothing else."""

prompt = Path(TARGET_FILE).read_text()
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())

scores = []
for i, case in enumerate(test_cases):
    # Generate output using the prompt
    gen_prompt = f"{prompt}\n\n{case['input']}"
    gen_result = subprocess.run(
        [CLI_TOOL, "-p", gen_prompt],
        capture_output=True, text=True, timeout=60
    )
    if gen_result.returncode != 0:
        print(f"Generation failed for case {i+1}", file=sys.stderr)
        scores.append(0)
        continue
    actual = gen_result.stdout.strip()

    # Judge the output
    judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
        prompt=prompt[:500],
        input=case["input"],
        expected=case.get("expected", "N/A"),
        actual=actual[:500]
    )
    judge_result = subprocess.run(
        [CLI_TOOL, "-p", judge_prompt],
        capture_output=True, text=True, timeout=60
    )
    if judge_result.returncode != 0:
        scores.append(0)
        continue

    # Parse score; the for-else records 0 when no quality_score line is found
    for line in judge_result.stdout.splitlines():
        if "quality_score:" in line:
            try:
                score = float(line.split(":")[-1].strip())
                scores.append(score)
            except ValueError:
                scores.append(0)
            break
    else:
        scores.append(0)
    print(f"  Case {i+1}/{len(test_cases)}: {scores[-1]:.1f}", file=sys.stderr)

if not scores:
    print("No test cases evaluated", file=sys.stderr)
    sys.exit(1)

avg = sum(scores) / len(scores)
quality = avg * 10  # Scale the 1-10 per-case average to 0-100
print(f"quality_score: {quality:.2f}")
print(f"cases_tested: {len(scores)}")
print(f"avg_per_case: {avg:.2f}")


@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import platform
import subprocess
import sys
# --- CONFIGURE THESE ---
COMMAND = "python src/module.py" # Command to measure
# --- END CONFIG ---
system = platform.system()
if system == "Linux":
# Use /usr/bin/time for peak RSS
result = subprocess.run(
f"/usr/bin/time -v {COMMAND}",
shell=True, capture_output=True, text=True, timeout=300
)
output = result.stderr
for line in output.splitlines():
if "Maximum resident set size" in line:
kb = int(line.split(":")[-1].strip())
mb = kb / 1024
print(f"peak_mb: {mb:.1f}")
print(f"peak_kb: {kb}")
sys.exit(0)
print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
sys.exit(1)
elif system == "Darwin":
# macOS: use /usr/bin/time -l
result = subprocess.run(
f"/usr/bin/time -l {COMMAND}",
shell=True, capture_output=True, text=True, timeout=300
)
output = result.stderr
for line in output.splitlines():
if "maximum resident set size" in line.lower():
# macOS reports in bytes
val = int(line.strip().split()[0])
mb = val / (1024 * 1024)
print(f"peak_mb: {mb:.1f}")
sys.exit(0)
print("Could not parse memory from time output", file=sys.stderr)
sys.exit(1)
else:
print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
sys.exit(1)
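Where `/usr/bin/time` is unavailable (e.g. minimal containers), a rough Unix-only fallback is to run the command as a child process and read its peak RSS via Python's `resource` module. A sketch, not part of the evaluator above; note that `ru_maxrss` is kilobytes on Linux but bytes on macOS, and `RUSAGE_CHILDREN` reports the maximum across all waited-for children, so run it in a fresh process:

```python
import platform
import resource
import subprocess

def peak_child_mb(command: str) -> float:
    """Run a command and report peak RSS of waited-for children, in MiB."""
    subprocess.run(command, shell=True, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss units differ by platform: kilobytes (Linux) vs bytes (macOS)
    divisor = 1024 * 1024 if platform.system() == "Darwin" else 1024
    return usage.ru_maxrss / divisor

print(f"peak_mb: {peak_child_mb('echo hi'):.1f}")
```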


@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""Measure test suite pass rate.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import re
import subprocess
import sys
# --- CONFIGURE THESE ---
TEST_CMD = "pytest tests/ --tb=no -q" # Test command
# --- END CONFIG ---
result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
output = result.stdout + "\n" + result.stderr
# Try to parse pytest output: "X passed, Y failed, Z errors"
passed = failed = errors = 0
# pytest short format: "5 passed, 2 failed in 1.23s"
match = re.search(r"(\d+) passed", output)
if match:
passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
errors = int(match.group(1))
total = passed + failed + errors
if total == 0:
# Try unittest format: "Ran X tests"
match = re.search(r"Ran (\d+) test", output)
if match:
total = int(match.group(1))
if result.returncode == 0:
passed = total
else:
# Count failures from output
fail_match = re.search(r"FAILED \(failures=(\d+)", output)
if fail_match:
failed = int(fail_match.group(1))
passed = total - failed
if total == 0:
print("Could not parse test results", file=sys.stderr)
print(f"Output: {output[:500]}", file=sys.stderr)
sys.exit(1)
rate = passed / total
print(f"pass_rate: {rate:.4f}")
print(f"passed: {passed}")
print(f"failed: {failed}")
print(f"total: {total}")
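Regex-scraping pytest's summary line is fragile across versions. When pytest is the runner, a sturdier variant asks for a machine-readable report via `--junitxml` and reads the counts from the XML attributes. A sketch of the parsing step (the report layout shown is the standard pytest JUnit shape; the helper is an assumption, not part of the evaluator above):

```python
import xml.etree.ElementTree as ET

def junit_pass_rate(xml_text: str) -> float:
    """Compute pass rate from a pytest --junitxml report."""
    root = ET.fromstring(xml_text)
    # pytest wraps a single <testsuite> in <testsuites>
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    total = int(suite.get("tests", "0"))
    bad = int(suite.get("failures", "0")) + int(suite.get("errors", "0"))
    return (total - bad) / total if total else 0.0

report = '<testsuites><testsuite tests="8" failures="1" errors="1" skipped="0"/></testsuites>'
print(f"pass_rate: {junit_pass_rate(report):.4f}")  # → pass_rate: 0.7500
```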


@@ -1,175 +1,255 @@
# Experiment Domains Guide

## Domain: Engineering

### Code Speed Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --evaluator benchmark_speed
```

**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.
**Cost:** Free — just runs benchmarks.
**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.

### Bundle Size Reduction

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name bundle-size \
  --target webpack.config.js \
  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
  --metric size_bytes \
  --direction lower \
  --evaluator benchmark_size
```

Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.

### Test Pass Rate

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name fix-flaky-tests \
  --target src/utils/parser.py \
  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
  --metric pass_rate \
  --direction higher \
  --evaluator test_pass_rate
```

### Docker Build Speed

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name docker-build \
  --target Dockerfile \
  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
  --metric build_seconds \
  --direction lower \
  --evaluator build_speed
```

### Memory Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name memory-usage \
  --target src/processor.py \
  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
  --metric peak_mb \
  --direction lower \
  --evaluator memory_usage
```

### ML Training (Karpathy-style)

Requires an NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name ml-training \
  --target train.py \
  --eval "uv run train.py" \
  --metric val_bpb \
  --direction lower \
  --time-budget 5
```

---

## Domain: Marketing

### Medium Article Headlines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.

**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.
**Cost:** Uses your CLI subscription (Claude Max = unlimited).
**Speed:** ~2 min/experiment, ~30/hour.

### Social Media Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name twitter-engagement \
  --target social/tweets.md \
  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "twitter"` (or `linkedin`, `instagram`).

### Email Subject Lines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name email-open-rate \
  --target emails/subjects.md \
  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "email"`.

### Ad Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name ad-copy-q2 \
  --target ads/google-search.md \
  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "ad"`.

---

## Domain: Content

### Article Structure & Readability

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name article-structure \
  --target drafts/my-article.md \
  --eval "python .autoresearch/content/article-structure/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

### SEO Descriptions

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name seo-meta \
  --target seo/descriptions.md \
  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

---

## Domain: Prompts

### System Prompt Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name support-bot \
  --target prompts/support-system.md \
  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```

Requires `tests/cases.json` with test inputs and expected outputs:

```json
[
  {
    "input": "I can't log in to my account",
    "expected": "Ask for email, check account status, offer password reset"
  },
  {
    "input": "How do I cancel my subscription?",
    "expected": "Empathetic response, explain cancellation steps, offer retention"
  }
]
```

### Agent Skill Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name skill-improvement \
  --target SKILL.md \
  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```

---

## Choosing Your Domain

| I want to... | Domain | Evaluator | Cost |
|--------------|--------|-----------|------|
| Speed up my code | engineering | benchmark_speed | Free |
| Shrink my bundle | engineering | benchmark_size | Free |
| Fix flaky tests | engineering | test_pass_rate | Free |
| Speed up Docker builds | engineering | build_speed | Free |
| Reduce memory usage | engineering | memory_usage | Free |
| Train ML models | engineering | (custom) | Free + GPU |
| Write better headlines | marketing | llm_judge_content | Subscription |
| Improve social posts | marketing | llm_judge_copy | Subscription |
| Optimize email subjects | marketing | llm_judge_copy | Subscription |
| Improve ad copy | marketing | llm_judge_copy | Subscription |
| Optimize article structure | content | llm_judge_content | Subscription |
| Improve SEO descriptions | content | llm_judge_content | Subscription |
| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
| Improve agent skills | prompts | llm_judge_prompt | Subscription |

**First time?** Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.
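For a `custom` domain, the only contract an evaluator must honor is the one the built-ins follow: print `<metric_name>: <number>` on stdout and exit non-zero on failure. A minimal skeleton (the `measure` function and `my_metric` name are placeholders to replace):

```python
#!/usr/bin/env python3
"""Custom evaluator skeleton.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import sys

def measure() -> float:
    # Replace with your real measurement (benchmark run, judge call, file size...)
    return 42.0

try:
    value = measure()
except Exception as exc:
    # Non-zero exit -> the loop logs a crash and discards the change
    print(f"Evaluation failed: {exc}", file=sys.stderr)
    sys.exit(1)

print(f"my_metric: {value:.4f}")
```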


@@ -1,125 +1,389 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Results Viewer

View experiment results in multiple formats: terminal, CSV, Markdown.
Supports single experiment, domain, or cross-experiment dashboard.

Usage:
    python scripts/log_results.py --experiment engineering/api-speed
    python scripts/log_results.py --domain engineering
    python scripts/log_results.py --dashboard
    python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
    python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
    python scripts/log_results.py --dashboard --format markdown --output dashboard.md
"""
import argparse
import csv
import io
import sys
import time
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg."""
    cfg_file = experiment_dir / "config.cfg"
    config = {}
    if cfg_file.exists():
        for line in cfg_file.read_text().splitlines():
            if ":" in line:
                k, v = line.split(":", 1)
                config[k.strip()] = v.strip()
    return config


def load_results(experiment_dir):
    """Load results.tsv into a list of dicts."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return []
    results = []
    for line in tsv.read_text().splitlines()[1:]:  # skip header
        parts = line.split("\t")
        if len(parts) >= 4:
            try:
                metric = float(parts[1]) if parts[1] != "N/A" else None
            except ValueError:
                metric = None
            results.append({
                "commit": parts[0],
                "metric": metric,
                "status": parts[2],
                "description": parts[3],
            })
    return results


def compute_stats(results, direction):
    """Compute summary statistics from results."""
    keeps = [r for r in results if r["status"] == "keep"]
    discards = [r for r in results if r["status"] == "discard"]
    crashes = [r for r in results if r["status"] == "crash"]
    valid_keeps = [r for r in keeps if r["metric"] is not None]
    baseline = valid_keeps[0]["metric"] if valid_keeps else None
    if valid_keeps:
        best = (min if direction == "lower" else max)(r["metric"] for r in valid_keeps)
    else:
        best = None
    pct_change = None
    if baseline is not None and best is not None and baseline != 0:
        if direction == "lower":
            pct_change = (baseline - best) / baseline * 100
        else:
            pct_change = (best - baseline) / baseline * 100
    return {
        "total": len(results),
        "keeps": len(keeps),
        "discards": len(discards),
        "crashes": len(crashes),
        "baseline": baseline,
        "best": best,
        "pct_change": pct_change,
    }


def experiment_status(exp_dir, stats):
    """Classify an experiment as active/paused/done/idle by results.tsv age."""
    tsv = exp_dir / "results.tsv"
    if not tsv.exists() or stats["total"] == 0:
        return "idle"
    age_hours = (time.time() - tsv.stat().st_mtime) / 3600
    return "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"


def iter_experiments(root, domain=None):
    """Yield (domain_dir, exp_dir) pairs, optionally filtered to one domain."""
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        if domain is not None and domain_dir.name != domain:
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
                yield domain_dir, exp_dir


# --- Terminal Output ---

def print_experiment(experiment_dir, experiment_path):
    """Print single experiment results to terminal."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")
    if not results:
        print(f"No results for {experiment_path}")
        return
    stats = compute_stats(results, direction)
    print(f"\n{'=' * 65}")
    print(f"  {experiment_path}")
    print(f"  Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
    print(f"{'=' * 65}")
    print(f"  Total: {stats['total']} | Keep: {stats['keeps']} | "
          f"Discard: {stats['discards']} | Crash: {stats['crashes']}")
    if stats["baseline"] is not None and stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        print(f"  Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")
    print(f"\n  {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
    print(f"  {'-' * 60}")
    for r in results:
        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
        icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
        print(f"  {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
    print()


def print_dashboard(root, domain=None):
    """Print cross-experiment dashboard."""
    rows = []
    for domain_dir, exp_dir in iter_experiments(root, domain):
        config = load_config(exp_dir)
        results = load_results(exp_dir)
        stats = compute_stats(results, config.get("metric_direction", "lower"))
        rows.append({
            "domain": domain_dir.name,
            "name": exp_dir.name,
            "runs": stats["total"],
            "kept": stats["keeps"],
            "best": f"{stats['best']:.4f}" if stats["best"] is not None else "",
            "change": f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "",
            "status": experiment_status(exp_dir, stats),
        })
    if not rows:
        print("No experiments found.")
        return rows
    print(f"\n{'=' * 90}")
    print("  autoresearch — Dashboard")
    print(f"{'=' * 90}")
    print(f"  {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} "
          f"{'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
    print(f"  {'-' * 85}")
    for e in rows:
        print(f"  {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} "
              f"{e['best']:>12} {e['change']:>10} {e['status']:<8}")
    print()
    return rows


# --- CSV Export ---

def export_experiment_csv(experiment_dir, experiment_path):
    """Export single experiment as a CSV string."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    stats = compute_stats(results, direction)
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Header with metadata
    writer.writerow(["# Experiment", experiment_path])
    writer.writerow(["# Target", config.get("target", "")])
    writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
    if stats["baseline"] is not None:
        writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
    if stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
    writer.writerow(["# Total", stats["total"]])
    writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
    writer.writerow([])
    writer.writerow(["Commit", "Metric", "Status", "Description"])
    for r in results:
        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
        writer.writerow([r["commit"], m, r["status"], r["description"]])
    return buf.getvalue()


def export_dashboard_csv(root, domain=None):
    """Export dashboard as a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept",
                     "Discarded", "Crashed", "Best", "Change"])
    for domain_dir, exp_dir in iter_experiments(root, domain):
        config = load_config(exp_dir)
        results = load_results(exp_dir)
        stats = compute_stats(results, config.get("metric_direction", "lower"))
        writer.writerow([
            domain_dir.name, exp_dir.name, config.get("metric", ""),
            stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
            f"{stats['best']:.6f}" if stats["best"] is not None else "",
            f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "",
        ])
    return buf.getvalue()


# --- Markdown Export ---

def export_experiment_markdown(experiment_dir, experiment_path):
    """Export single experiment as a Markdown string."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")
    stats = compute_stats(results, direction)
    lines = [f"# Autoresearch: {experiment_path}\n"]
    lines.append(f"**Target:** `{config.get('target', '?')}`  ")
    lines.append(f"**Metric:** `{metric_name}` ({direction} is better)  ")
    lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, "
                 f"{stats['discards']} discarded, {stats['crashes']} crashed\n")
    if stats["baseline"] is not None and stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
    lines.append("| Commit | Metric | Status | Description |")
    lines.append("|--------|--------|--------|-------------|")
    for r in results:
        m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
        lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
    lines.append("")
    return "\n".join(lines)


def export_dashboard_markdown(root, domain=None):
    """Export dashboard as a Markdown string."""
    lines = ["# Autoresearch Dashboard\n"]
    lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
    lines.append("|--------|-----------|--------|------|------|------|--------|--------|")
    for domain_dir, exp_dir in iter_experiments(root, domain):
        config = load_config(exp_dir)
        results = load_results(exp_dir)
        stats = compute_stats(results, config.get("metric_direction", "lower"))
        best = f"`{stats['best']:.4f}`" if stats["best"] is not None else ""
        pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
        status = experiment_status(exp_dir, stats)
        lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | "
                     f"{stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")
    lines.append("")
    return "\n".join(lines)


# --- Main ---

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
    parser.add_argument("--experiment", help="Show one experiment: domain/name")
    parser.add_argument("--domain", help="Show all experiments in a domain")
    parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
    parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
    parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
                        help="Output format (default: terminal)")
    parser.add_argument("--output", "-o", help="Write to file instead of stdout")
    args = parser.parse_args()

    root = find_autoresearch_root()
    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    output_text = None

    # Single experiment
    if args.experiment:
        experiment_dir = root / args.experiment
        if not experiment_dir.exists():
            print(f"Experiment not found: {args.experiment}")
            sys.exit(1)
        if args.format == "csv":
            output_text = export_experiment_csv(experiment_dir, args.experiment)
        elif args.format == "markdown":
            output_text = export_experiment_markdown(experiment_dir, args.experiment)
        else:
            print_experiment(experiment_dir, args.experiment)
            return

    # Domain
    elif args.domain:
        if not (root / args.domain).exists():
            print(f"Domain not found: {args.domain}")
            sys.exit(1)
        if args.format == "csv":
            output_text = export_dashboard_csv(root, domain=args.domain)
        elif args.format == "markdown":
            output_text = export_dashboard_markdown(root, domain=args.domain)
        else:
            for _, exp_dir in iter_experiments(root, args.domain):
                print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
            return

    # Dashboard (default, --dashboard, or --all)
    else:
        if args.format == "csv":
            output_text = export_dashboard_csv(root)
        elif args.format == "markdown":
            output_text = export_dashboard_markdown(root)
        else:
            print_dashboard(root)
            return

    # Write output
    if args.output:
        Path(args.output).write_text(output_text)
        print(f"Written to {args.output}")
    else:
        print(output_text)


if __name__ == "__main__":
    main()


@@ -2,17 +2,15 @@
"""
autoresearch-agent: Experiment Runner

Executes the autonomous experiment loop for a specific experiment.
Reads config from .autoresearch/{domain}/{name}/config.cfg.

Usage:
    python scripts/run_experiment.py --experiment engineering/api-speed --loop
    python scripts/run_experiment.py --experiment engineering/api-speed --single
    python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
    python scripts/run_experiment.py --resume --loop
    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
"""
import argparse
@@ -25,11 +23,22 @@ from datetime import datetime
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg from experiment directory."""
    cfg_file = experiment_dir / "config.cfg"
    if not cfg_file.exists():
        print(f"  Error: no config.cfg in {experiment_dir}")
        sys.exit(1)
    config = {}
    for line in cfg_file.read_text().splitlines():
@@ -49,239 +58,293 @@ def run_cmd(cmd, cwd=None, timeout=None):
def get_current_commit(path):
    """Get short hash of current HEAD."""
    _, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
    return commit


def get_best_metric(experiment_dir, direction):
    """Read the best metric from results.tsv."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return None
    lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
    if not lines:
        return None
    metrics = []
    for line in lines:
        parts = line.split("\t")
        try:
            if parts[1] != "N/A":
                metrics.append(float(parts[1]))
        except (ValueError, IndexError):
            continue
    if not metrics:
        return None
    return min(metrics) if direction == "lower" else max(metrics)
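The selection rule above (only `keep` rows count, `N/A` metrics are skipped, `min` for "lower", `max` for "higher") can be exercised against an in-memory sample. A standalone sketch; the rows below are invented for illustration, not taken from a real run:

```python
# Sample results.tsv content (made-up rows, same columns the runner writes).
SAMPLE_TSV = """commit\tmetric\tstatus\tdescription
a1b2c3d\t412.500000\tkeep\tbaseline
d4e5f6a\t398.100000\tkeep\timproved_p50_ms_398.1000
b7c8d9e\t405.000000\tdiscard\tno_improvement_405.0000_vs_398.1000
e0f1a2b\tN/A\tcrash\ttimeout_750s
"""

def best_metric(tsv_text, direction):
    """Mirror of get_best_metric's selection logic, operating on a string."""
    rows = [l for l in tsv_text.splitlines()[1:] if "\tkeep\t" in l]
    metrics = []
    for row in rows:
        parts = row.split("\t")
        if len(parts) > 1 and parts[1] != "N/A":
            metrics.append(float(parts[1]))
    if not metrics:
        return None
    return min(metrics) if direction == "lower" else max(metrics)

print(best_metric(SAMPLE_TSV, "lower"))  # prints 398.1
```

Discard and crash rows are ignored entirely, so a regression or a failed run never moves the frontier the next iteration is compared against.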

def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
    """Run evaluation with time limit. Output goes to log_file."""
    hard_limit = time_budget_minutes * 60 * 2.5
    t0 = time.time()
    try:
        code, _, _ = run_cmd(
            f"{eval_cmd} > {log_file} 2>&1",
            cwd=str(project_root),
            timeout=hard_limit
        )
        elapsed = time.time() - t0
        return code, elapsed
    except subprocess.TimeoutExpired:
        elapsed = time.time() - t0
        return -1, elapsed

def extract_metric(log_file, metric_grep):
    """Extract metric value from log file."""
    log_path = Path(log_file)
    if not log_path.exists():
        return None
    for line in reversed(log_path.read_text().splitlines()):
        stripped = line.strip()
        if stripped.startswith(metric_grep.lstrip("^")):
            try:
                return float(stripped.split(":")[-1].strip())
            except ValueError:
                continue
    return None
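extract_metric only looks at what the evaluation command printed, so any script qualifies as an evaluator if it emits a single `name: value` line. A minimal sketch of such a script; the metric name `p50_ms` and the throwaway workload are illustrative assumptions, not files from the repo:

```python
# Minimal evaluator sketch: the whole contract is one printed line of the
# form "<metric>: <float>", matched later by metric_grep (here "^p50_ms:").
import statistics
import time

def measure_p50(trials=50):
    """Median wall-clock time (ms) of a stand-in workload."""
    samples_ms = []
    for _ in range(trials):
        t0 = time.perf_counter()
        sum(i * i for i in range(1000))  # replace with the real workload
        samples_ms.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples_ms)

if __name__ == "__main__":
    print(f"p50_ms: {measure_p50():.4f}")
```

Anything else the script prints is ignored, since the parser scans the log bottom-up for the first line starting with the metric name, so noisy test output does not confuse it as long as the final metric line is well formed.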

def is_improvement(new_val, old_val, direction):
    """Check if new result is better than old."""
    if old_val is None:
        return True
    if direction == "lower":
        return new_val < old_val
    return new_val > old_val


def log_result(experiment_dir, commit, metric_val, status, description):
    """Append result to results.tsv."""
    tsv = experiment_dir / "results.tsv"
    metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
    with open(tsv, "a") as f:
        f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")


def get_experiment_count(experiment_dir):
    """Count experiments run so far."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return 0
    return max(0, len(tsv.read_text().splitlines()) - 1)  # subtract header

def get_last_active(root):
    """Find the most recently modified experiment."""
    latest = None
    latest_time = 0
    for domain_dir in root.iterdir():
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in domain_dir.iterdir():
            if not exp_dir.is_dir():
                continue
            cfg = exp_dir / "config.cfg"
            if cfg.exists() and cfg.stat().st_mtime > latest_time:
                latest_time = cfg.stat().st_mtime
                latest = f"{domain_dir.name}/{exp_dir.name}"
    return latest

def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
    """Run one experiment iteration."""
    direction = config.get("metric_direction", "lower")
    metric_grep = config.get("metric_grep", "^metric:")
    eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
    time_budget = int(config.get("time_budget_minutes", 5))
    metric_name = config.get("metric", "metric")
    log_file = str(experiment_dir / "run.log")

    best = get_best_metric(experiment_dir, direction)
    ts = datetime.now().strftime("%H:%M:%S")
    print(f"\n[{ts}] Experiment #{exp_num}")
    print(f"  Best {metric_name}: {best}")

    if dry_run:
        print("  [DRY RUN] Would run evaluation and check metric")
        return "dry_run"

    # Save state for rollback
    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
    if code != 0:
        print("  Error: can't get git state")
        return "error"

    # Run evaluation
    print(f"  Running: {eval_cmd} (budget: {time_budget}m)")
    ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)
    commit = get_current_commit(str(project_root))

    # Timeout
    if ret_code == -1:
        print(f"  TIMEOUT after {elapsed:.0f}s — discarding")
        run_cmd("git checkout -- .", cwd=str(project_root))
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
        return "crash"

    # Crash
    if ret_code != 0:
        _, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
        print(f"  CRASH (exit {ret_code}) after {elapsed:.0f}s")
        print(f"  Last output: {tail[:200]}")
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
        return "crash"

    # Extract metric
    metric_val = extract_metric(log_file, metric_grep)
    if metric_val is None:
        print(f"  Could not parse {metric_name} from run.log")
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
        return "crash"

    delta = ""
    if best is not None:
        diff = metric_val - best
        delta = f" (delta {diff:+.4f})"
    print(f"  {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")

    # Keep or discard
    if is_improvement(metric_val, best, direction):
        print("  KEEP — improvement")
        log_result(experiment_dir, commit, metric_val, "keep",
                   f"improved_{metric_name}_{metric_val:.4f}")
        return "keep"
    else:
        print("  DISCARD — no improvement")
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        best_str = f"{best:.4f}" if best else "?"
        log_result(experiment_dir, commit, metric_val, "discard",
                   f"no_improvement_{metric_val:.4f}_vs_{best_str}")
        return "discard"

def print_summary(experiment_dir, config):
    """Print session summary."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return
    lines = tsv.read_text().splitlines()[1:]
    if not lines:
        return
    keeps = [l for l in lines if "\tkeep\t" in l]
    discards = [l for l in lines if "\tdiscard\t" in l]
    crashes = [l for l in lines if "\tcrash\t" in l]
    metric_name = config.get("metric", "metric")
    direction = config.get("metric_direction", "lower")

    print(f"\n{'=' * 55}")
    print("  autoresearch — Session Summary")
    print(f"  Experiments: {len(lines)} total")
    print(f"  Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")
    if keeps:
        try:
            valid = []
            for l in keeps:
                parts = l.split("\t")
                if parts[1] != "N/A":
                    valid.append(float(parts[1]))
            if len(valid) >= 2:
                first = valid[0]
                best = min(valid) if direction == "lower" else max(valid)
                pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
                print(f"  {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
        except (ValueError, IndexError):
            pass
    print(f"{'=' * 55}\n")

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent runner")
    parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
    parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
    parser.add_argument("--loop", action="store_true", help="Run forever")
    parser.add_argument("--single", action="store_true", help="Run one experiment")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
    parser.add_argument("--path", default=".", help="Project root")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()
    root = find_autoresearch_root()
    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    # Resolve experiment
    experiment_path = args.experiment
    if args.resume:
        experiment_path = get_last_active(root)
        if not experiment_path:
            print("No experiments found to resume.")
            sys.exit(1)
        print(f"Resuming: {experiment_path}")
    if not experiment_path:
        print("Specify --experiment domain/name or --resume")
        sys.exit(1)

    experiment_dir = root / experiment_path
    if not experiment_dir.exists():
        print(f"Experiment not found: {experiment_dir}")
        print("Run: python scripts/setup_experiment.py --list")
        sys.exit(1)

    config = load_config(experiment_dir)
    domain, name = experiment_path.split("/", 1)

    print(f"\n  autoresearch-agent")
    print(f"  Experiment: {experiment_path}")
    print(f"  Target: {config.get('target', '?')}")
    print(f"  Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
    print(f"  Budget: {config.get('time_budget_minutes', '?')} min/experiment")
    print(f"  Mode: {'loop' if args.loop else 'single'}")

    if args.single or args.dry_run:
        exp_num = get_experiment_count(experiment_dir) + 1
        run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
        return

    if not args.loop:
        print("\nSpecify --loop (forever) or --single (one experiment)")
        sys.exit(1)

    # Graceful shutdown
    def handle_interrupt(sig, frame):
        print_summary(experiment_dir, config)
        print("\nStopped by user.")
        sys.exit(0)
    signal.signal(signal.SIGINT, handle_interrupt)
    signal.signal(signal.SIGTERM, handle_interrupt)

    consecutive_crashes = 0
    exp_num = get_experiment_count(experiment_dir) + 1
    print("\nStarting loop. Ctrl+C to stop.\n")

    while True:
        result = run_single(project_root, experiment_dir, config, exp_num, False)
        exp_num += 1
        if result == "crash":
@@ -289,21 +352,16 @@ def main():
        else:
            consecutive_crashes = 0

        if consecutive_crashes >= 5:
            print("\n  5 consecutive crashes. Pausing.")
            print("  Check .autoresearch/{}/run.log".format(experiment_path))
            break

        if 0 < args.max_experiments < exp_num:
            print(f"\n  Reached max experiments ({args.max_experiments})")
            break

    print_summary(experiment_dir, config)
if __name__ == "__main__":

View File

@@ -1,65 +1,52 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Setup Experiment

Initialize a new experiment with domain, target, evaluator, and git branch.
Creates the .autoresearch/{domain}/{name}/ directory structure.

Usage:
    python scripts/setup_experiment.py --domain engineering --name api-speed \
        --target src/api/search.py --eval "pytest bench.py" \
        --metric p50_ms --direction lower

    python scripts/setup_experiment.py --domain marketing --name medium-ctr \
        --target content/titles.md --eval "python evaluate.py" \
        --metric ctr_score --direction higher --evaluator llm_judge_content

    python scripts/setup_experiment.py --list              # List all experiments
    python scripts/setup_experiment.py --list-evaluators   # List available evaluators
"""
import argparse
import os
import shutil
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]
DOMAINS = { EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"
"ml": {
"target": "train.py", DEFAULT_CONFIG = """# autoresearch global config
"evaluate_cmd": "uv run train.py", default_time_budget_minutes: 5
"metric": "val_bpb", default_scope: project
"metric_direction": "lower", dashboard_format: markdown
"time_budget_minutes": 5, """
"metric_grep": "^val_bpb:",
}, GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
"prompt": { **/results.tsv
"target": "prompt.md", **/run.log
"evaluate_cmd": "python evaluate.py", **/run.*.log
"metric": "eval_score", config.yaml
"metric_direction": "higher", """
"time_budget_minutes": 2,
"metric_grep": "^eval_score:",
},
"code": {
"target": "src/module.py",
"evaluate_cmd": "python benchmark.py",
"metric": "p50_ms",
"metric_direction": "lower",
"time_budget_minutes": 10,
"metric_grep": "^p50_ms:",
},
"skill": {
"target": "SKILL.md",
"evaluate_cmd": "python scripts/skill_evaluator.py",
"metric": "pass_rate",
"metric_direction": "higher",
"time_budget_minutes": 5,
"metric_grep": "^pass_rate:",
},
}

def run_cmd(cmd, cwd=None, timeout=None):
    """Run shell command, return (returncode, stdout, stderr)."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=cwd, timeout=timeout
@@ -67,188 +54,315 @@ def run_cmd(cmd, cwd=None, timeout=None):
    return result.returncode, result.stdout.strip(), result.stderr.strip()

def get_autoresearch_root(scope, project_root=None):
    """Get the .autoresearch root directory based on scope."""
    if scope == "user":
        return Path.home() / ".autoresearch"
    return Path(project_root or ".") / ".autoresearch"


def init_root(root):
    """Initialize .autoresearch root if it doesn't exist."""
    created = False
    if not root.exists():
        root.mkdir(parents=True)
        created = True
        print(f"  Created {root}/")
    config_file = root / "config.yaml"
    if not config_file.exists():
        config_file.write_text(DEFAULT_CONFIG)
        print(f"  Created {config_file}")
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text(GITIGNORE_CONTENT)
        print(f"  Created {gitignore}")
    return created


def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
    """Generate a program.md template for the experiment."""
    direction_word = "Minimize" if direction == "lower" else "Maximize"
    content = f"""# autoresearch — {name}

## Goal
{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.

## What the Agent Can Change
- Only `{target}` — this is the single file being optimized.
- Everything inside that file is fair game unless constrained below.

## What the Agent Cannot Change
- The evaluation script (`evaluate.py` or the eval command). It is read-only.
- Dependencies — do not add new packages or imports that aren't already available.
- Any other files in the project unless explicitly noted here.
{f"- Additional constraints: {constraints}" if constraints else ""}

## Strategy
1. First run: establish baseline. Do not change anything.
2. Profile/analyze the current state — understand why the metric is what it is.
3. Try the most obvious improvement first (low-hanging fruit).
4. If that works, push further in the same direction.
5. If stuck, try something orthogonal or radical.
6. Read the git log of previous experiments. Don't repeat failed approaches.

## Simplicity Rule
A small improvement that adds ugly complexity is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.

## Stop When
You don't stop. The human will interrupt you when they're satisfied.
If no improvement in 20+ consecutive runs, change strategy drastically.
"""
    (experiment_dir / "program.md").write_text(content)


def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget):
    """Write experiment config."""
    content = f"""target: {target}
evaluate_cmd: {eval_cmd}
metric: {metric}
metric_direction: {direction}
metric_grep: ^{metric}:
time_budget_minutes: {time_budget}
created: {datetime.now().strftime('%Y-%m-%d %H:%M')}
"""
    (experiment_dir / "config.cfg").write_text(content)


def init_results_tsv(experiment_dir):
    """Create results.tsv with header."""
    tsv = experiment_dir / "results.tsv"
    if tsv.exists():
        print(f"  results.tsv already exists ({tsv.stat().st_size} bytes)")
        return
    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
    print("  Created results.tsv")

def copy_evaluator(experiment_dir, evaluator_name):
    """Copy a built-in evaluator to the experiment directory."""
    source = EVALUATOR_DIR / f"{evaluator_name}.py"
    if not source.exists():
        print(f"  Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}")
        print(f"  Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}")
        return False
    dest = experiment_dir / "evaluate.py"
    shutil.copy2(source, dest)
    print(f"  Copied evaluator: {evaluator_name}.py -> evaluate.py")
    return True

def create_branch(path, domain, name):
    """Create and checkout the experiment branch."""
    branch = f"autoresearch/{domain}/{name}"
    code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
    if code != 0:
        if "already exists" in err:
            print(f"  Branch '{branch}' already exists. Checking out...")
            run_cmd(f"git checkout {branch}", cwd=path)
            return branch
        print(f"  Warning: could not create branch: {err}")
        return None
    print(f"  Created branch: {branch}")
    return branch

def list_experiments(root):
    """List all experiments across all domains."""
    if not root.exists():
        print("No experiments found. Run setup to create your first experiment.")
        return

    experiments = []
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir():
                continue
            cfg_file = exp_dir / "config.cfg"
            if not cfg_file.exists():
                continue
            config = {}
            for line in cfg_file.read_text().splitlines():
                if ":" in line:
                    k, v = line.split(":", 1)
                    config[k.strip()] = v.strip()
            # Count results
            tsv = exp_dir / "results.tsv"
            runs = 0
            if tsv.exists():
                runs = max(0, len(tsv.read_text().splitlines()) - 1)
            experiments.append({
                "domain": domain_dir.name,
                "name": exp_dir.name,
                "target": config.get("target", "?"),
                "metric": config.get("metric", "?"),
                "runs": runs,
            })

    if not experiments:
        print("No experiments found.")
        return

    print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}")
    print("-" * 95)
    for e in experiments:
        print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}")
    print(f"\nTotal: {len(experiments)} experiments")

def list_evaluators():
    """List available built-in evaluators."""
    if not EVALUATOR_DIR.exists():
        print("No evaluators directory found.")
        return
    print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n")
    for f in sorted(EVALUATOR_DIR.glob("*.py")):
        # Read first docstring line
        desc = ""
        for line in f.read_text().splitlines():
            if line.strip().startswith('"""') or line.strip().startswith("'''"):
                continue
            if line.strip() and not line.startswith("#!"):
                desc = line.strip().strip('"').strip("'")
                break
        print(f"  {f.stem:<25} {desc}")

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent setup")
    parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain")
    parser.add_argument("--name", help="Experiment name (e.g. api-speed, medium-ctr)")
    parser.add_argument("--target", help="Target file to optimize")
    parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
    parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
    parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
                        help="Is lower or higher better?")
    parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
    parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
    parser.add_argument("--scope", choices=["project", "user"], default="project",
                        help="Where to store experiments: project (./) or user (~/)")
    parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
    parser.add_argument("--path", default=".", help="Project root path")
    parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
    parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
    parser.add_argument("--list", action="store_true", help="List all experiments")
    parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()

    # List mode
    if args.list:
        root = get_autoresearch_root("project", project_root)
        list_experiments(root)
        user_root = get_autoresearch_root("user")
        if user_root.exists() and user_root != root:
            print(f"\n--- User-level experiments ({user_root}) ---")
            list_experiments(user_root)
        return

    if args.list_evaluators:
        list_evaluators()
        return

    # Validate required args for setup
    if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
        parser.error("Required: --domain, --name, --target, --eval, --metric")

    root = get_autoresearch_root(args.scope, project_root)

    print(f"\n  autoresearch-agent setup")
    print(f"  Project: {project_root}")
    print(f"  Scope: {args.scope}")
    print(f"  Domain: {args.domain}")
    print(f"  Experiment: {args.name}")
    print(f"  Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")

    # Check git
    code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
    if code != 0:
        print("  Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
        sys.exit(1)
    print("  Git repository found")

    # Check target file
    target_path = project_root / args.target
    if not target_path.exists():
        print(f"  Error: target file not found: {args.target}")
        sys.exit(1)
    print(f"  Target file found: {args.target}")

    # Init root
    init_root(root)

    # Create experiment directory
    experiment_dir = root / args.domain / args.name
    if experiment_dir.exists():
        print(f"  Warning: experiment '{args.domain}/{args.name}' already exists.")
        print(f"  Use --name with a different name, or delete {experiment_dir}")
        sys.exit(1)
    experiment_dir.mkdir(parents=True)
    print(f"  Created {experiment_dir}/")

    # Create files
    create_program_md(experiment_dir, args.domain, args.name,
                      args.target, args.metric, args.direction, args.constraints)
    print("  Created program.md")
    create_config(experiment_dir, args.target, args.eval_cmd,
                  args.metric, args.direction, args.time_budget)
    print("  Created config.cfg")
    init_results_tsv(experiment_dir)

    # Copy evaluator if specified
    if args.evaluator:
        copy_evaluator(experiment_dir, args.evaluator)

    # Create git branch
    if not args.skip_branch:
        create_branch(str(project_root), args.domain, args.name)

    # Test evaluation command
    print(f"\n  Testing evaluation: {args.eval_cmd}")
    code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
    if code != 0:
        print(f"  Warning: eval command failed (exit {code})")
        if err:
            print(f"  stderr: {err[:200]}")
        print("  Fix the eval command before running the experiment loop.")
    else:
        # Check metric is parseable
        full_output = out + "\n" + err
        metric_found = False
        for line in full_output.splitlines():
            if line.strip().startswith(f"{args.metric}:"):
                metric_found = True
                print(f"  Eval works. Baseline: {line.strip()}")
                break
        if not metric_found:
            print(f"  Warning: eval ran but '{args.metric}:' not found in output.")
            print(f"  Make sure your eval command outputs: {args.metric}: <value>")

    # Summary
print(f"\n Setup complete!")
# Validation checks print(f" Experiment: {args.domain}/{args.name}")
checks = [ print(f" Target: {args.target}")
check_git_repo(path), print(f" Metric: {args.metric} ({args.direction} is better)")
check_program_md(path), print(f" Budget: {args.time_budget} min/experiment")
check_target_file(path, config["target"]), if not args.skip_branch:
check_evaluate_script(path), print(f" Branch: autoresearch/{args.domain}/{args.name}")
] print(f"\n To start:")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")
if not all(checks):
print("\n⚠ Fix the above issues before running experiments.")
sys.exit(1)
# Create branch
branch = create_branch(path, tag)
if not branch:
sys.exit(1)
# Init results TSV
init_results_tsv(path)
# Save config for run_experiment.py
config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
(path / ".autoresearch.cfg").write_text(config_content + "\n")
print("✓ Saved .autoresearch.cfg")
# Run baseline
if not args.skip_baseline:
baseline = run_baseline(
path,
config["evaluate_cmd"],
config["metric_grep"],
config["time_budget_minutes"]
)
if baseline:
# Log baseline to TSV
code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
with open(path / "results.tsv", "a") as f:
f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
print(f"✓ Baseline logged to results.tsv")
print(f"\n✅ Setup complete!")
print(f" Branch: {branch}")
print(f" Target: {config['target']}")
print(f" Metric: {config['metric']} ({config['metric_direction']} is better)")
print(f" Budget: {config['time_budget_minutes']} min/experiment")
print(f"\nTo start the autonomous loop:")
print(f" python scripts/run_experiment.py --loop")
print(f"\nOr run a single experiment:")
print(f" python scripts/run_experiment.py --single")
if __name__ == "__main__": if __name__ == "__main__":
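
The metric check in the new eval-test branch relies on a simple output contract: the evaluation command must print `<metric>: <value>` on a line of its own, and the setup script takes the first matching line as the baseline. A minimal sketch of that contract follows; the helper name `parse_metric` is illustrative and not part of the repo.

```python
def parse_metric(output: str, metric: str):
    """Return the first '<metric>: <value>' found in eval output, else None."""
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith(f"{metric}:"):
            try:
                # Everything after the first colon is the metric value.
                return float(stripped.split(":", 1)[1].strip())
            except ValueError:
                # Line matched the prefix but the value is not numeric.
                return None
    return None

if __name__ == "__main__":
    sample = "loading model...\nspeed_ms: 42.7\ndone"
    print(parse_metric(sample, "speed_ms"))  # 42.7
```

Any eval script that emits such a line (e.g. `print(f"speed_ms: {elapsed * 1000:.1f}")`) satisfies both the setup-time check and, presumably, the same parsing in the run loop.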