refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators

Major rewrite based on deep study of Karpathy's autoresearch repo. Architecture changes: - Multi-experiment support: .autoresearch/{domain}/{name}/ structure - Domain categories: engineering, marketing, content, prompts, custom - Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope - User chooses scope during setup, not installation New evaluators (8 ready-to-use): - Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage - LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy - LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed Script improvements: - setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators - run_experiment.py: --experiment domain/name, --resume, --loop, --single - log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output Results export: - Terminal (default), CSV, and Markdown formats - Per-experiment, per-domain, or cross-experiment dashboard view SKILL.md rewritten: - Clear activation triggers (when the skill should activate) - Practical examples for each domain - Evaluator documentation with cost transparency - Simplified loop protocol matching Karpathy's original philosophy
2026-03-13 08:22:14 +01:00
parent c834d71a44
commit 12591282da
13 changed files with 1744 additions and 702 deletions
--- a/engineering/autoresearch-agent/SKILL.md
+++ b/engineering/autoresearch-agent/SKILL.md
@@ -1,9 +1,9 @@
 ---
 name: "autoresearch-agent"
-description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo."
+description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
 license: MIT
 metadata:
-  version: 1.0.0
+  version: 2.0.0
  author: Alireza Rezvani
  category: engineering
  updated: 2026-03-13
@@ -13,194 +13,233 @@ metadata:

 > You sleep. The agent experiments. You wake up to results.

-Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop.
+Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

-**Works for any domain with a measurable metric:**
- ML training optimization (original use case — optimize `train.py` by `val_bpb`)
- Prompt engineering (optimize system prompts by LLM-eval quality score)
- Code performance (optimize a module by benchmark runtime/score)
- Agent skill improvement (optimize `SKILL.md` by task completion rate)
+Not one guess — fifty measured attempts, compounding.

 ---

-## Before Starting
+## When This Skill Activates

-Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing.
+Recognize these patterns from the user:

-If no `program.md` exists, run the **Setup Wizard** below.
+- "Make this faster / smaller / better"
+- "Optimize [file] for [metric]"
+- "Improve my [headlines / copy / prompts]"
+- "Run experiments overnight"
+- "I want to get [metric] from X to Y"
+- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
+
+If the user describes a target file + a way to measure success → this skill applies.

 ---

-## Setup Wizard
+## Setup

-Answer these 5 questions to configure the experiment:
+### First Time — Create the Experiment

-### 1. What are we optimizing?
-The **target** — one file the agent is allowed to modify:
- `train.py` — ML training loop (Karpathy-style)
- `prompt.md` — system prompt for an LLM
- `src/module.py` — a specific code module
- `SKILL.md` — an agent skill definition
+Run the setup script. The user decides where experiments live:

-### 2. What's the metric?
-A **single number** that defines success. Lower or higher = better:
- `val_bpb` — validation bits per byte (ML, lower is better)
- `eval_score` — LLM quality score 0-100 (higher is better)
- `p50_ms` — median latency in milliseconds (lower is better)
- `pass_rate` — test pass rate 0-1 (higher is better)
-
-### 3. What's the time budget per experiment?
-How long one experiment run takes:
- `5m` — fast iteration (Karpathy default, ~12 experiments/hour)
- `10m` — moderate (6/hour)
- `30m` — slow but thorough (2/hour)
-
-### 4. What can the agent change?
-Constraints on the target file:
- Architecture only? Hyperparameters only? Everything?
- What packages/imports are available?
- What's explicitly off-limits?
-
-### 5. What's the evaluation function?
-How we score each experiment:
- Fixed script that outputs the metric (e.g. `python evaluate.py`)
- API call that returns a score
- Test suite with a pass rate
-
-Once answered, run: `python scripts/setup_experiment.py` to initialize.
-
---
-
-## The Three Files
-
-Every autoresearch project has the same structure:
-
-```
-project/
-├── program.md       ← Human writes this: objectives, constraints, strategy
-├── target.*         ← Agent modifies this: the thing being optimized
-├── evaluate.py      ← Fixed: the measurement function (never touch)
-├── results.tsv      ← Auto-generated: experiment log (git-tracked for continuity)
-└── scripts/
-    ├── setup_experiment.py   ← Initialize a new run
-    ├── run_experiment.py     ← Execute one experiment iteration
-    └── log_results.py        ← Record results to TSV
+**Project-level** (inside repo, git-tracked, shareable with team):
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name api-speed \
+  --target src/api/search.py \
+  --eval "pytest bench.py --tb=no -q" \
+  --metric p50_ms \
+  --direction lower \
+  --scope project
 ```

-### `program.md` — Your Research Directions
-Write this once. The agent reads it before every experiment. It should contain:
- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code)
- **Strategy:** What directions to explore first
- **Constraints:** What the agent cannot change
- **Stopping criteria:** When a result is "good enough"
+**User-level** (personal, in `~/.autoresearch/`):
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name medium-ctr \
+  --target content/titles.md \
+  --eval "python evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content \
+  --scope user
+```

-See `references/program-template.md` for domain-specific templates.
+The `--scope` flag determines where `.autoresearch/` lives:
+- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
+- `user` → `~/.autoresearch/` in the home directory. Everything is personal.

-### Target File — The Only File the Agent Edits
-Whatever you're optimizing. Strict scope: **one file, one metric**.
+### What Setup Creates

-### `evaluate.py` — Fixed Evaluation (Never Modified)
-The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured.
+```
+.autoresearch/
+├── config.yaml                        ← Global settings
+├── .gitignore                         ← Ignores results.tsv, *.log
+└── {domain}/{experiment-name}/
+    ├── program.md                     ← Objectives, constraints, strategy
+    ├── config.cfg                     ← Target, eval cmd, metric, direction
+    ├── results.tsv                    ← Experiment log (gitignored)
+    └── evaluate.py                    ← Evaluation script (if --evaluator used)
+```
+
+### Domains
+
+| Domain | Use Cases |
+|--------|-----------|
+| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
+| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
+| `content` | Article structure, SEO descriptions, readability, CTR |
+| `prompts` | System prompts, chatbot tone, agent instructions |
+| `custom` | Anything else with a measurable metric |
+
+### If `program.md` Already Exists
+
+The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.

 ---

 ## The Experiment Loop

-Run: `python scripts/run_experiment.py --loop`
+### Starting an Experiment
+
+```bash
+# Run specific experiment
+python scripts/run_experiment.py --experiment engineering/api-speed --loop
+
+# Single iteration (test setup)
+python scripts/run_experiment.py --experiment engineering/api-speed --single
+
+# Resume last active experiment
+python scripts/run_experiment.py --resume --loop
+
+# Dry run (show what would happen)
+python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
+```
+
+### The Loop Protocol

 ```
 LOOP FOREVER:

-1. Read program.md for current strategy
-2. Review git history: what has been tried? What worked?
-3. Propose ONE change to the target file
-4. Apply the change
-5. git commit (with descriptive message)
-6. Run evaluation: python evaluate.py > run.log 2>&1
-7. Parse metric from run.log
-8. If metric improved → ADVANCE (keep commit, log "keep")
-9. If metric equal/worse → REVERT (git reset, log "discard")
-10. If crash → attempt fix, if unfixable log "crash" and revert
-11. Update results.tsv
-12. Go to 1
+1. Read program.md for current strategy and constraints
+2. Review git log: what has been tried? What worked? What crashed?
+3. Review results.tsv: current best metric, trend, recent failures
+4. Propose ONE change to the target file
+5. Apply the change
+6. git commit -m "experiment: [short description of what changed]"
+7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
+8. Parse metric from run.log (grep for metric_name: value)
+9. Decision:
+   - Metric improved → KEEP (advance branch, log "keep")
+   - Metric equal or worse → REVERT (git reset --hard, log "discard")
+   - Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
+10. Append result to results.tsv
+11. Go to 1
 ```

-### Rules (from Karpathy's original)
+### Rules

- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted.
- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win.
- **One change per experiment** — don't change 5 things at once. You won't know what worked.
- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on.
- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash.
- **No new dependencies** — only use what's already available.
+- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
+- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
+- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets same results is the best outcome.
+- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
+- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat as crash.
+- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
+- **No new dependencies.** Only use what's already available in the project.

 ---

-## Results Log
+## Evaluators

-`results.tsv` (tab-separated, not git-tracked):
+Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.

+### Free Evaluators (no API cost)
+
+| Evaluator | Metric | Use Case |
+|-----------|--------|----------|
+| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
+| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
+| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
+| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
+| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
+
+### LLM Judge Evaluators (uses your subscription)
+
+| Evaluator | Metric | Use Case |
+|-----------|--------|----------|
+| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
+| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
+| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |
+
+LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
+
+The user's existing subscription covers the cost:
+- Claude Code Max → unlimited Claude calls for evaluation
+- Codex CLI (ChatGPT Pro) → unlimited Codex calls
+- Gemini CLI (free tier) → free evaluation calls
+
+### Custom Evaluators
+
+If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.
+
+```python
+#!/usr/bin/env python3
+# My custom evaluator — DO NOT MODIFY after experiment starts
+import subprocess
+result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
+# Parse and output
+print(f"my_metric: {parse_score(result.stdout)}")
 ```
-commit  metric  status  description
-a1b2c3d 0.9979  keep    baseline
-b2c3d4e 0.9932  keep    increased learning rate
-c3d4e5f 1.0050  discard switched to GeLU (worse)
-d4e5f6g 0.0000  crash   doubled model width (OOM)
-```
-
-Run `python scripts/log_results.py --summary` for a visual summary.

 ---

-## Domain-Specific Configurations
+## Viewing Results

-### ML Training (Karpathy-style)
-```yaml
-target: train.py
-evaluate: uv run prepare.py --eval-only
-metric: val_bpb (lower is better)
-time_budget: 5m
-git_branch: autoresearch/{date}-{tag}
+```bash
+# Single experiment
+python scripts/log_results.py --experiment engineering/api-speed
+
+# All experiments in a domain
+python scripts/log_results.py --domain engineering
+
+# Cross-experiment dashboard
+python scripts/log_results.py --dashboard
+
+# Export formats
+python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
+python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
+python scripts/log_results.py --dashboard --format markdown --output dashboard.md
 ```

-### Prompt Engineering
-```yaml
-target: prompt.md
-evaluate: python evaluate.py --model gpt-4o --test-cases tests/
-metric: eval_score (0-100, higher is better)
-time_budget: 2m
-git_branch: prompt-research/{date}
+### Dashboard Output
+
+```
+DOMAIN          EXPERIMENT          RUNS  KEPT  BEST         Δ FROM START  STATUS
+engineering     api-speed            47    14   185ms        -76.9%        active
+engineering     bundle-size          23     8   412KB        -58.3%        paused
+marketing       medium-ctr           31    11   8.4/10       +68.0%        active
+prompts         support-tone         15     6   82/100       +46.4%        done
 ```

-### Code Performance
-```yaml
-target: src/hot_module.py
-evaluate: python benchmark.py --runs 5 --warmup 1
-metric: p50_ms (lower is better)
-time_budget: 10m
-git_branch: perf-research/{date}
-```
+### Export Formats

-### Agent Skill Optimization
-```yaml
-target: SKILL.md
-evaluate: python scripts/skill_evaluator.py --task-suite tests/
-metric: pass_rate (0-1, higher is better)
-time_budget: 5m
-git_branch: skill-research/{date}
-```
-
-See `references/experiment-domains.md` for full setup guides per domain.
+- **TSV** — default, tab-separated (compatible with spreadsheets)
+- **CSV** — comma-separated, with proper quoting
+- **Markdown** — formatted table, readable in GitHub/docs

 ---

-## Scripts
+## Proactive Triggers

-| Script | Purpose |
-|--------|---------|
-| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run |
-| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) |
-| `log_results.py` | Record results to TSV; `--summary` prints progress table |
+Flag these without being asked:
+
+- **No evaluation command works** → Test it before starting the loop. Run once, verify output.
+- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
+- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
+- **Time budget too short** → If eval takes longer than budget, every run crashes.
+- **Agent modifying evaluate.py** → Hard stop. This invalidates all comparisons.
+- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
+- **No improvement in 20+ runs** → Suggest changing strategy in program.md or trying a different approach.

 ---

@@ -214,7 +253,6 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/

 ### Multi-tool install
 ```bash
-# Clone the repo, then use the convert script for your tool:
 ./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
 ```

@@ -225,22 +263,9 @@ clawhub install autoresearch-agent

 ---

-## Proactive Triggers
-
-Flag these issues without being asked:
-
- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template.
- **Target file has no git history** → `git init` and commit baseline first.
- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting.
- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash.
- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions.
- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
-
---
-
 ## Related Skills

- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics.
- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops.
- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops.
- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function.
+- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops.
+- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
+- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
+- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.
--- a/engineering/autoresearch-agent/evaluators/benchmark_size.py
+++ b/engineering/autoresearch-agent/evaluators/benchmark_size.py
@@ -0,0 +1,56 @@
+#!/usr/bin/env python3
+"""Measure file, bundle, or Docker image size.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import os
+import subprocess
+import sys
+
+# --- CONFIGURE ONE OF THESE ---
+# Option 1: File size
+TARGET_FILE = "dist/main.js"
+
+# Option 2: Directory size (uncomment to use)
+# TARGET_DIR = "dist/"
+
+# Option 3: Docker image (uncomment to use)
+# DOCKER_IMAGE = "myapp:latest"
+# DOCKER_BUILD_CMD = "docker build -t myapp:latest ."
+
+# Option 4: Build first, then measure (uncomment to use)
+# BUILD_CMD = "npm run build"
+# --- END CONFIG ---
+
+# Build if needed
+if "BUILD_CMD" in dir() or "BUILD_CMD" in globals():
+    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)
+    if result.returncode != 0:
+        print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr)
+        sys.exit(1)
+
+# Measure
+if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals():
+    if "DOCKER_BUILD_CMD" in dir():
+        subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
+    result = subprocess.run(
+        f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
+        shell=True, capture_output=True, text=True
+    )
+    size_bytes = int(result.stdout.strip())
+elif "TARGET_DIR" in dir() or "TARGET_DIR" in globals():
+    size_bytes = sum(
+        os.path.getsize(os.path.join(dp, f))
+        for dp, _, fns in os.walk(TARGET_DIR) for f in fns
+    )
+elif os.path.exists(TARGET_FILE):
+    size_bytes = os.path.getsize(TARGET_FILE)
+else:
+    print(f"Target not found: {TARGET_FILE}", file=sys.stderr)
+    sys.exit(1)
+
+size_kb = size_bytes / 1024
+size_mb = size_bytes / (1024 * 1024)
+
+print(f"size_bytes: {size_bytes}")
+print(f"size_kb: {size_kb:.1f}")
+print(f"size_mb: {size_mb:.2f}")
--- a/engineering/autoresearch-agent/evaluators/benchmark_speed.py
+++ b/engineering/autoresearch-agent/evaluators/benchmark_speed.py
@@ -0,0 +1,40 @@
+#!/usr/bin/env python3
+"""Measure execution speed of a target function or command.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import statistics
+import subprocess
+import sys
+import time
+
+# --- CONFIGURE THESE ---
+COMMAND = "python src/module.py"  # Command to benchmark
+RUNS = 5                          # Number of runs
+WARMUP = 1                        # Warmup runs (not counted)
+# --- END CONFIG ---
+
+times = []
+
+# Warmup
+for _ in range(WARMUP):
+    subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
+
+# Benchmark
+for i in range(RUNS):
+    t0 = time.perf_counter()
+    result = subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
+    elapsed = (time.perf_counter() - t0) * 1000  # ms
+
+    if result.returncode != 0:
+        print(f"Run {i+1} failed (exit {result.returncode})", file=sys.stderr)
+        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
+        sys.exit(1)
+
+    times.append(elapsed)
+
+p50 = statistics.median(times)
+p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 5 else max(times)
+
+print(f"p50_ms: {p50:.2f}")
+print(f"p95_ms: {p95:.2f}")
+print(f"runs: {RUNS}")
--- a/engineering/autoresearch-agent/evaluators/build_speed.py
+++ b/engineering/autoresearch-agent/evaluators/build_speed.py
@@ -0,0 +1,39 @@
+#!/usr/bin/env python3
+"""Measure build/compile time.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import subprocess
+import sys
+import time
+
+# --- CONFIGURE THESE ---
+BUILD_CMD = "npm run build"    # or: docker build -t test .
+CLEAN_CMD = ""                 # optional: npm run clean (run before each build)
+RUNS = 3                       # Number of builds to average
+# --- END CONFIG ---
+
+times = []
+
+for i in range(RUNS):
+    # Clean if configured
+    if CLEAN_CMD:
+        subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)
+
+    t0 = time.perf_counter()
+    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
+    elapsed = time.perf_counter() - t0
+
+    if result.returncode != 0:
+        print(f"Build {i+1} failed (exit {result.returncode})", file=sys.stderr)
+        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
+        sys.exit(1)
+
+    times.append(elapsed)
+
+import statistics
+avg = statistics.mean(times)
+median = statistics.median(times)
+
+print(f"build_seconds: {median:.2f}")
+print(f"build_avg: {avg:.2f}")
+print(f"runs: {RUNS}")
--- a/engineering/autoresearch-agent/evaluators/llm_judge_content.py
+++ b/engineering/autoresearch-agent/evaluators/llm_judge_content.py
@@ -0,0 +1,72 @@
+#!/usr/bin/env python3
+"""LLM judge for content quality (headlines, titles, descriptions).
+Uses the user's existing CLI tool (claude, codex, gemini) for evaluation.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import subprocess
+import sys
+from pathlib import Path
+
+# --- CONFIGURE THESE ---
+TARGET_FILE = "content/titles.md"  # File being optimized
+CLI_TOOL = "claude"                # or: codex, gemini
+# --- END CONFIG ---
+
+# The judge prompt is FIXED — the agent cannot change how it's evaluated
+JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly.
+
+Criteria (each scored 1-10):
+
+1. CURIOSITY GAP — Does this make you want to click? Is there an information gap
+   that can only be resolved by reading? Generic titles score 1-3. Specific,
+   intriguing titles score 7-10.
+
+2. SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved
+   performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9.
+
+3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out,
+   or recognition? Flat titles score 1-3. Emotionally charged score 7-10.
+
+4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or
+   search results? Would they pause on this headline? Rate honestly.
+
+5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally?
+   Keyword-stuffed = 3. Natural integration of search terms = 8-10.
+
+Output EXACTLY this format (nothing else):
+curiosity: <score>
+specificity: <score>
+emotional: <score>
+scroll_stop: <score>
+seo: <score>
+ctr_score: <average of all 5 scores>
+
+Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""
+
+content = Path(TARGET_FILE).read_text()
+full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"
+
+# Call the user's CLI tool
+result = subprocess.run(
+    [CLI_TOOL, "-p", full_prompt],
+    capture_output=True, text=True, timeout=120
+)
+
+if result.returncode != 0:
+    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
+    sys.exit(1)
+
+# Parse output — look for ctr_score line
+output = result.stdout
+for line in output.splitlines():
+    line = line.strip()
+    if line.startswith("ctr_score:"):
+        print(line)
+    elif line.startswith(("curiosity:", "specificity:", "emotional:", "scroll_stop:", "seo:")):
+        print(line)
+
+# Verify ctr_score was found
+if "ctr_score:" not in output:
+    print("Could not parse ctr_score from LLM output", file=sys.stderr)
+    print(f"Raw output: {output[:500]}", file=sys.stderr)
+    sys.exit(1)
--- a/engineering/autoresearch-agent/evaluators/llm_judge_copy.py
+++ b/engineering/autoresearch-agent/evaluators/llm_judge_copy.py
@@ -0,0 +1,88 @@
+#!/usr/bin/env python3
+"""LLM judge for marketing copy (social posts, ads, emails).
+Uses the user's existing CLI tool for evaluation.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import subprocess
+import sys
+from pathlib import Path
+
+# --- CONFIGURE THESE ---
+TARGET_FILE = "posts.md"           # Copy being optimized
+CLI_TOOL = "claude"                # or: codex, gemini
+PLATFORM = "twitter"               # twitter, linkedin, instagram, email, ad
+# --- END CONFIG ---
+
+JUDGE_PROMPTS = {
+    "twitter": """Score this Twitter/X post strictly:
+1. HOOK (1-10) — Does the first line stop the scroll?
+2. VALUE (1-10) — Does it provide insight, entertainment, or utility?
+3. ENGAGEMENT (1-10) — Would people reply, retweet, or like?
+4. BREVITY (1-10) — Is every word earning its place? No filler?
+5. CTA (1-10) — Is there a clear next action (even implicit)?""",
+
+    "linkedin": """Score this LinkedIn post strictly:
+1. HOOK (1-10) — Does the first line make you click "see more"?
+2. STORYTELLING (1-10) — Is there a narrative arc or just statements?
+3. CREDIBILITY (1-10) — Does it demonstrate expertise without bragging?
+4. ENGAGEMENT (1-10) — Would professionals comment or share?
+5. CTA (1-10) — Does it invite discussion or action?""",
+
+    "instagram": """Score this Instagram caption strictly:
+1. HOOK (1-10) — Does the first line grab attention?
+2. RELATABILITY (1-10) — Does the audience see themselves in this?
+3. VISUAL MATCH (1-10) — Does the copy complement visual content?
+4. HASHTAG STRATEGY (1-10) — Are hashtags relevant and not spammy?
+5. CTA (1-10) — Does it encourage saves, shares, or comments?""",
+
+    "email": """Score this email subject + preview strictly:
+1. OPEN INCENTIVE (1-10) — Would you open this in a crowded inbox?
+2. SPECIFICITY (1-10) — Is it concrete or vague?
+3. URGENCY (1-10) — Is there a reason to open now vs later?
+4. PERSONALIZATION (1-10) — Does it feel written for someone, not everyone?
+5. PREVIEW SYNC (1-10) — Does the preview text complement the subject?""",
+
+    "ad": """Score this ad copy strictly:
+1. ATTENTION (1-10) — Does it stop someone scrolling past ads?
+2. DESIRE (1-10) — Does it create want for the product/service?
+3. PROOF (1-10) — Is there credibility (numbers, social proof)?
+4. ACTION (1-10) — Is the CTA clear and compelling?
+5. OBJECTION HANDLING (1-10) — Does it preempt "why not"?""",
+}
+
+platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
+
+JUDGE_PROMPT = f"""{platform_prompt}
+
+Output EXACTLY this format:
+criterion_1: <score>
+criterion_2: <score>
+criterion_3: <score>
+criterion_4: <score>
+criterion_5: <score>
+engagement_score: <average of all 5>
+
+Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""
+
+content = Path(TARGET_FILE).read_text()
+full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"
+
+result = subprocess.run(
+    [CLI_TOOL, "-p", full_prompt],
+    capture_output=True, text=True, timeout=120
+)
+
+if result.returncode != 0:
+    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
+    sys.exit(1)
+
+output = result.stdout
+for line in output.splitlines():
+    line = line.strip()
+    if line.startswith("engagement_score:") or line.startswith("criterion_"):
+        print(line)
+
+if "engagement_score:" not in output:
+    print("Could not parse engagement_score from LLM output", file=sys.stderr)
+    print(f"Raw: {output[:500]}", file=sys.stderr)
+    sys.exit(1)
--- a/engineering/autoresearch-agent/evaluators/llm_judge_prompt.py
+++ b/engineering/autoresearch-agent/evaluators/llm_judge_prompt.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python3
+"""LLM judge for prompt/instruction quality.
+Uses the user's existing CLI tool for evaluation.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+# --- CONFIGURE THESE ---
+TARGET_FILE = "prompt.md"          # Prompt being optimized
+TEST_CASES_FILE = "tests/cases.json"  # Test cases: [{"input": "...", "expected": "..."}]
+CLI_TOOL = "claude"                # or: codex, gemini
+# --- END CONFIG ---
+
+JUDGE_PROMPT_TEMPLATE = """You are evaluating a system prompt's effectiveness.
+
+SYSTEM PROMPT BEING TESTED:
+{prompt}
+
+TEST INPUT:
+{input}
+
+EXPECTED OUTPUT (reference):
+{expected}
+
+ACTUAL OUTPUT:
+{actual}
+
+Score the actual output on these criteria (each 1-10):
+1. ACCURACY — Does it match the expected output's intent and facts?
+2. COMPLETENESS — Does it cover all required elements?
+3. CLARITY — Is it well-structured and easy to understand?
+4. INSTRUCTION_FOLLOWING — Does it follow the system prompt's guidelines?
+
+Output EXACTLY: quality_score: <average of all 4>
+Nothing else."""
+
+prompt = Path(TARGET_FILE).read_text()
+test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
+
+scores = []
+
+for i, case in enumerate(test_cases):
+    # Generate output using the prompt
+    gen_prompt = f"{prompt}\n\n{case['input']}"
+    gen_result = subprocess.run(
+        [CLI_TOOL, "-p", gen_prompt],
+        capture_output=True, text=True, timeout=60
+    )
+    if gen_result.returncode != 0:
+        print(f"Generation failed for case {i+1}", file=sys.stderr)
+        scores.append(0)
+        continue
+
+    actual = gen_result.stdout.strip()
+
+    # Judge the output
+    judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
+        prompt=prompt[:500],
+        input=case["input"],
+        expected=case.get("expected", "N/A"),
+        actual=actual[:500]
+    )
+
+    judge_result = subprocess.run(
+        [CLI_TOOL, "-p", judge_prompt],
+        capture_output=True, text=True, timeout=60
+    )
+
+    if judge_result.returncode != 0:
+        scores.append(0)
+        continue
+
+    # Parse score
+    for line in judge_result.stdout.splitlines():
+        if "quality_score:" in line:
+            try:
+                score = float(line.split(":")[-1].strip())
+                scores.append(score)
+            except ValueError:
+                scores.append(0)
+            break
+    else:
+        scores.append(0)
+
+    print(f"  Case {i+1}/{len(test_cases)}: {scores[-1]:.1f}", file=sys.stderr)
+
+if not scores:
+    print("No test cases evaluated", file=sys.stderr)
+    sys.exit(1)
+
+avg = sum(scores) / len(scores)
+quality = avg * 10  # Scale to 0-100
+
+print(f"quality_score: {quality:.2f}")
+print(f"cases_tested: {len(scores)}")
+print(f"avg_per_case: {avg:.2f}")
--- a/engineering/autoresearch-agent/evaluators/memory_usage.py
+++ b/engineering/autoresearch-agent/evaluators/memory_usage.py
@@ -0,0 +1,52 @@
+#!/usr/bin/env python3
+"""Measure peak memory usage of a command.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import os
+import platform
+import subprocess
+import sys
+
+# --- CONFIGURE THESE ---
+COMMAND = "python src/module.py"  # Command to measure
+# --- END CONFIG ---
+
+system = platform.system()
+
+if system == "Linux":
+    # Use /usr/bin/time for peak RSS
+    result = subprocess.run(
+        f"/usr/bin/time -v {COMMAND}",
+        shell=True, capture_output=True, text=True, timeout=300
+    )
+    output = result.stderr
+    for line in output.splitlines():
+        if "Maximum resident set size" in line:
+            kb = int(line.split(":")[-1].strip())
+            mb = kb / 1024
+            print(f"peak_mb: {mb:.1f}")
+            print(f"peak_kb: {kb}")
+            sys.exit(0)
+    print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
+    sys.exit(1)
+
+elif system == "Darwin":
+    # macOS: use /usr/bin/time -l
+    result = subprocess.run(
+        f"/usr/bin/time -l {COMMAND}",
+        shell=True, capture_output=True, text=True, timeout=300
+    )
+    output = result.stderr
+    for line in output.splitlines():
+        if "maximum resident set size" in line.lower():
+            # macOS reports in bytes
+            val = int(line.strip().split()[0])
+            mb = val / (1024 * 1024)
+            print(f"peak_mb: {mb:.1f}")
+            sys.exit(0)
+    print("Could not parse memory from time output", file=sys.stderr)
+    sys.exit(1)
+
+else:
+    print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
+    sys.exit(1)
--- a/engineering/autoresearch-agent/evaluators/test_pass_rate.py
+++ b/engineering/autoresearch-agent/evaluators/test_pass_rate.py
@@ -0,0 +1,55 @@
+#!/usr/bin/env python3
+"""Measure test suite pass rate.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import re
+import subprocess
+import sys
+
+# --- CONFIGURE THESE ---
+TEST_CMD = "pytest tests/ --tb=no -q"  # Test command
+# --- END CONFIG ---
+
+result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
+output = result.stdout + "\n" + result.stderr
+
+# Try to parse pytest output: "X passed, Y failed, Z errors"
+passed = failed = errors = 0
+
+# pytest short format: "5 passed, 2 failed in 1.23s"
+match = re.search(r"(\d+) passed", output)
+if match:
+    passed = int(match.group(1))
+match = re.search(r"(\d+) failed", output)
+if match:
+    failed = int(match.group(1))
+match = re.search(r"(\d+) error", output)
+if match:
+    errors = int(match.group(1))
+
+total = passed + failed + errors
+if total == 0:
+    # Try unittest format: "Ran X tests"
+    match = re.search(r"Ran (\d+) test", output)
+    if match:
+        total = int(match.group(1))
+        if result.returncode == 0:
+            passed = total
+        else:
+            # Count failures from output
+            fail_match = re.search(r"FAILED \(failures=(\d+)", output)
+            if fail_match:
+                failed = int(fail_match.group(1))
+                passed = total - failed
+
+if total == 0:
+    print("Could not parse test results", file=sys.stderr)
+    print(f"Output: {output[:500]}", file=sys.stderr)
+    sys.exit(1)
+
+rate = passed / total
+
+print(f"pass_rate: {rate:.4f}")
+print(f"passed: {passed}")
+print(f"failed: {failed}")
+print(f"total: {total}")
--- a/engineering/autoresearch-agent/references/experiment-domains.md
+++ b/engineering/autoresearch-agent/references/experiment-domains.md
@@ -1,175 +1,255 @@
 # Experiment Domains Guide

-## Domain 1: ML Training (Karpathy-style)
+## Domain: Engineering

-**Best for:** LLM/neural net training optimization on a single GPU
+### Code Speed Optimization

-**Requirements:**
- NVIDIA GPU (H100 recommended, A100/RTX also work)
- CUDA + PyTorch
- `uv` package manager
- ~50GB disk (training data)
-
-**Setup:**
 ```bash
-# Clone autoresearch repo (the ML training environment)
-git clone https://github.com/karpathy/autoresearch my-ml-research
-cd my-ml-research
-uv sync
-uv run prepare.py  # one-time data download + tokenizer (~2 min)
-
-# Initialize autoresearch skill
-cp -r ~/.claude/skills/autoresearch-agent/scripts ./scripts
-
-# Configure
-python scripts/setup_experiment.py --domain ml --tag mar13
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name api-speed \
+  --target src/api/search.py \
+  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
+  --metric p50_ms \
+  --direction lower \
+  --evaluator benchmark_speed
 ```

-**Metric:** `val_bpb` — validation bits per byte. Lower = better model.
+**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.
+**Cost:** Free — just runs benchmarks.
+**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.

-**What the agent can change in `train.py`:**
- Model depth, width, attention heads
- Learning rate, scheduler, warmup
- Optimizer (Muon, AdamW, variants)
- Batch size, gradient accumulation
- Architecture (attention patterns, FFN type)
+### Bundle Size Reduction

-**Tip for smaller GPUs (Mac M-series, RTX 3090 etc):**
-Karpathy recommends forks for non-H100 hardware. Lower `DEPTH` to 4, use TinyStories dataset, lower `MAX_SEQ_LEN` to 256.
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name bundle-size \
+  --target webpack.config.js \
+  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
+  --metric size_bytes \
+  --direction lower \
+  --evaluator benchmark_size
+```
+
+Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.
+
+### Test Pass Rate
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name fix-flaky-tests \
+  --target src/utils/parser.py \
+  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
+  --metric pass_rate \
+  --direction higher \
+  --evaluator test_pass_rate
+```
+
+### Docker Build Speed
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name docker-build \
+  --target Dockerfile \
+  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
+  --metric build_seconds \
+  --direction lower \
+  --evaluator build_speed
+```
+
+### Memory Optimization
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name memory-usage \
+  --target src/processor.py \
+  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
+  --metric peak_mb \
+  --direction lower \
+  --evaluator memory_usage
+```
+
+### ML Training (Karpathy-style)
+
+Requires NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name ml-training \
+  --target train.py \
+  --eval "uv run train.py" \
+  --metric val_bpb \
+  --direction lower \
+  --time-budget 5
+```

 ---

-## Domain 2: Prompt Engineering
+## Domain: Marketing

-**Best for:** Optimizing system prompts for quality/accuracy/tone
+### Medium Article Headlines

-**Requirements:**
- LLM API access (OpenAI, Anthropic, etc.)
- Test cases with expected outputs
- An LLM judge for scoring (can be same model)
-
-**Setup:**
 ```bash
-mkdir my-prompt-research && cd my-prompt-research
-git init
-
-# Create prompt.md (the thing being optimized)
-echo "You are a helpful assistant." > prompt.md
-
-# Create evaluate.py (fixed — never modify)
-cat > evaluate.py << 'EOF'
-#!/usr/bin/env python3
-# Fixed evaluation harness — DO NOT MODIFY
-import json, sys
-from pathlib import Path
-
-PROMPT = Path("prompt.md").read_text()
-# Load test cases
-TEST_CASES = json.loads(Path("tests/cases.json").read_text())
-
-# Run prompt against test cases, score with LLM judge
-# ... (customize for your LLM + scoring logic)
-total = sum(score_case(PROMPT, case) for case in TEST_CASES)
-score = total / len(TEST_CASES) * 100
-print(f"eval_score: {score:.2f}")
-EOF
-
-# Initialize
-python scripts/setup_experiment.py --domain prompt --tag mar13
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name medium-ctr \
+  --target content/titles.md \
+  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content
 ```

-**Metric:** `eval_score` (0-100). Higher = better prompt.
+Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.
+
+**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.
+**Cost:** Uses your CLI subscription (Claude Max = unlimited).
+**Speed:** ~2 min/experiment, ~30/hour.
+
+### Social Media Copy
+
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name twitter-engagement \
+  --target social/tweets.md \
+  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
+  --metric engagement_score \
+  --direction higher \
+  --evaluator llm_judge_copy
+```
+
+Edit `evaluate.py`: set `PLATFORM = "twitter"` (or linkedin, instagram).
+
+### Email Subject Lines
+
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name email-open-rate \
+  --target emails/subjects.md \
+  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
+  --metric engagement_score \
+  --direction higher \
+  --evaluator llm_judge_copy
+```
+
+Edit `evaluate.py`: set `PLATFORM = "email"`.
+
+### Ad Copy
+
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name ad-copy-q2 \
+  --target ads/google-search.md \
+  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
+  --metric engagement_score \
+  --direction higher \
+  --evaluator llm_judge_copy
+```
+
+Edit `evaluate.py`: set `PLATFORM = "ad"`.

 ---

-## Domain 3: Code Performance
+## Domain: Content

-**Best for:** Optimizing a specific hot module for speed
+### Article Structure & Readability

-**Requirements:**
- A Python module with measurable performance
- Existing tests (correctness must not regress)
- A benchmark harness
-
-**Setup:**
 ```bash
-cd my-project
-
-# Create benchmark.py (fixed — never modify)
-cat > benchmark.py << 'EOF'
-#!/usr/bin/env python3
-# Fixed benchmark — DO NOT MODIFY
-import time, statistics
-from src.module import your_function
-from tests.test_module import run_tests
-
-# Correctness check first
-if not run_tests():
-    print("TESTS FAILED")
-    sys.exit(1)
-
-# Benchmark
-data = generate_test_data(n=10000)
-times = []
-for _ in range(10):
-    t0 = time.perf_counter()
-    your_function(data)
-    times.append((time.perf_counter() - t0) * 1000)
-
-p50 = statistics.median(times)
-print(f"p50_ms: {p50:.2f}")
-print(f"p95_ms: {statistics.quantiles(times, n=20)[18]:.2f}")
-EOF
-
-python scripts/setup_experiment.py --domain code \
-  --target src/module.py \
-  --tag mar13
+python scripts/setup_experiment.py \
+  --domain content \
+  --name article-structure \
+  --target drafts/my-article.md \
+  --eval "python .autoresearch/content/article-structure/evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content
 ```

-**Metric:** `p50_ms` — median latency. Lower = faster.
+### SEO Descriptions
+
+```bash
+python scripts/setup_experiment.py \
+  --domain content \
+  --name seo-meta \
+  --target seo/descriptions.md \
+  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content
+```

 ---

-## Domain 4: Agent Skill Optimization
+## Domain: Prompts

-**Best for:** Improving the quality of claude-skills SKILL.md files
+### System Prompt Optimization

-**Requirements:**
- A SKILL.md to optimize
- A task evaluation suite (15-20 standardized tasks)
- An LLM judge for scoring
-
-**Setup:**
 ```bash
-# Create a new skill research project
-mkdir skill-research-{skill-name} && cd skill-research-{skill-name}
-git init
-
-# Copy the skill to optimize
-cp ~/.claude/skills/{skill-name}/SKILL.md .
-
-# Create evaluate.py
-cat > scripts/skill_evaluator.py << 'EOF'
-#!/usr/bin/env python3
-# Fixed evaluator — DO NOT MODIFY
-# Runs SKILL.md against 15 standardized tasks using LLM judge
-# Outputs: pass_rate: 0.80 (etc.)
-EOF
-
-python scripts/setup_experiment.py --domain skill --tag mar13
+python scripts/setup_experiment.py \
+  --domain prompts \
+  --name support-bot \
+  --target prompts/support-system.md \
+  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
+  --metric quality_score \
+  --direction higher \
+  --evaluator llm_judge_prompt
 ```

-**Metric:** `pass_rate` (0-1). Higher = better skill.
+Requires `tests/cases.json` with test inputs and expected outputs:
+
+```json
+[
+  {
+    "input": "I can't log in to my account",
+    "expected": "Ask for email, check account status, offer password reset"
+  },
+  {
+    "input": "How do I cancel my subscription?",
+    "expected": "Empathetic response, explain cancellation steps, offer retention"
+  }
+]
+```
+
+### Agent Skill Optimization
+
+```bash
+python scripts/setup_experiment.py \
+  --domain prompts \
+  --name skill-improvement \
+  --target SKILL.md \
+  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
+  --metric quality_score \
+  --direction higher \
+  --evaluator llm_judge_prompt
+```

 ---

 ## Choosing Your Domain

-| Question | Recommendation |
-|----------|---------------|
-| Do I have a GPU and want to improve an LLM? | ML Training |
-| Do I want to improve a prompt/system message? | Prompt Engineering |
-| Do I have slow Python code I want to speed up? | Code Performance |
-| Do I want to improve one of my claude-skills? | Skill Optimization |
+| I want to... | Domain | Evaluator | Cost |
+|-------------|--------|-----------|------|
+| Speed up my code | engineering | benchmark_speed | Free |
+| Shrink my bundle | engineering | benchmark_size | Free |
+| Fix flaky tests | engineering | test_pass_rate | Free |
+| Speed up Docker builds | engineering | build_speed | Free |
+| Reduce memory usage | engineering | memory_usage | Free |
+| Train ML models | engineering | (custom) | Free + GPU |
+| Write better headlines | marketing | llm_judge_content | Subscription |
+| Improve social posts | marketing | llm_judge_copy | Subscription |
+| Optimize email subjects | marketing | llm_judge_copy | Subscription |
+| Improve ad copy | marketing | llm_judge_copy | Subscription |
+| Optimize article structure | content | llm_judge_content | Subscription |
+| Improve SEO descriptions | content | llm_judge_content | Subscription |
+| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
+| Improve agent skills | prompts | llm_judge_prompt | Subscription |

-**First time?** Start with **Prompt Engineering** — no GPU required, fast experiments (2 min each), immediately applicable results.
+**First time?** Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.
--- a/engineering/autoresearch-agent/scripts/log_results.py
+++ b/engineering/autoresearch-agent/scripts/log_results.py
@@ -1,125 +1,389 @@
 #!/usr/bin/env python3
 """
-autoresearch-agent: Results Logger
+autoresearch-agent: Results Viewer

-View and analyze experiment results from results.tsv.
+View experiment results in multiple formats: terminal, CSV, Markdown.
+Supports single experiment, domain, or cross-experiment dashboard.

 Usage:
-    python scripts/log_results.py --summary          # Print progress table
-    python scripts/log_results.py --best             # Show best result
-    python scripts/log_results.py --history          # Full experiment history
-    python scripts/log_results.py --record commit val status desc  # Add entry manually
+    python scripts/log_results.py --experiment engineering/api-speed
+    python scripts/log_results.py --domain engineering
+    python scripts/log_results.py --dashboard
+    python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
+    python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
+    python scripts/log_results.py --dashboard --format markdown --output dashboard.md
 """

 import argparse
+import csv
+import io
 import sys
 from pathlib import Path


-def load_results(path):
-    tsv = Path(path) / "results.tsv"
+def find_autoresearch_root():
+    """Find .autoresearch/ in project or user home."""
+    project_root = Path(".").resolve() / ".autoresearch"
+    if project_root.exists():
+        return project_root
+    user_root = Path.home() / ".autoresearch"
+    if user_root.exists():
+        return user_root
+    return None
+
+
+def load_config(experiment_dir):
+    """Load config.cfg."""
+    cfg_file = experiment_dir / "config.cfg"
+    config = {}
+    if cfg_file.exists():
+        for line in cfg_file.read_text().splitlines():
+            if ":" in line:
+                k, v = line.split(":", 1)
+                config[k.strip()] = v.strip()
+    return config
+
+
+def load_results(experiment_dir):
+    """Load results.tsv into list of dicts."""
+    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return []
-    lines = tsv.read_text().splitlines()[1:]  # skip header
    results = []
-    for line in lines:
+    for line in tsv.read_text().splitlines()[1:]:
        parts = line.split("\t")
        if len(parts) >= 4:
            try:
-                metric_val = float(parts[1]) if parts[1] != "N/A" else None
+                metric = float(parts[1]) if parts[1] != "N/A" else None
            except ValueError:
-                metric_val = None
+                metric = None
            results.append({
                "commit": parts[0],
-                "metric": metric_val,
+                "metric": metric,
                "status": parts[2],
-                "description": parts[3]
+                "description": parts[3],
            })
    return results


-def print_summary(results, metric_name="metric", direction="lower"):
-    if not results:
-        print("No experiments logged yet.")
-        return
-
+def compute_stats(results, direction):
+    """Compute statistics from results."""
    keeps = [r for r in results if r["status"] == "keep"]
    discards = [r for r in results if r["status"] == "discard"]
    crashes = [r for r in results if r["status"] == "crash"]

-    print(f"\n{'─'*60}")
-    print(f"  autoresearch-agent — Results Summary")
-    print(f"{'─'*60}")
-    print(f"  Total experiments: {len(results)}")
-    print(f"  ✅ Keep:    {len(keeps):3d} ({len(keeps)/max(len(results),1)*100:.0f}%)")
-    print(f"  ❌ Discard: {len(discards):3d} ({len(discards)/max(len(results),1)*100:.0f}%)")
-    print(f"  💥 Crash:   {len(crashes):3d} ({len(crashes)/max(len(results),1)*100:.0f}%)")
+    valid_keeps = [r for r in keeps if r["metric"] is not None]
+    baseline = valid_keeps[0]["metric"] if valid_keeps else None
+    if valid_keeps:
+        best = min(r["metric"] for r in valid_keeps) if direction == "lower" else max(r["metric"] for r in valid_keeps)
+    else:
+        best = None

-    if keeps:
-        valid = [r for r in keeps if r["metric"] is not None]
-        if valid:
-            baseline = valid[0]["metric"]
-            best = min(r["metric"] for r in valid) if direction == "lower" else max(r["metric"] for r in valid)
-            best_run = next(r for r in valid if r["metric"] == best)
-            improvement = ((baseline - best) / baseline * 100) if direction == "lower" else ((best - baseline) / baseline * 100)
+    pct_change = None
+    if baseline and best and baseline != 0:
+        if direction == "lower":
+            pct_change = (baseline - best) / baseline * 100
+        else:
+            pct_change = (best - baseline) / baseline * 100

-            print(f"\n  {metric_name}:")
-            print(f"    Baseline: {baseline:.6f}")
-            print(f"    Best:     {best:.6f}  (commit: {best_run['commit']})")
-            print(f"    Change:   {improvement:+.2f}%")
-
-    print(f"{'─'*60}\n")
+    return {
+        "total": len(results),
+        "keeps": len(keeps),
+        "discards": len(discards),
+        "crashes": len(crashes),
+        "baseline": baseline,
+        "best": best,
+        "pct_change": pct_change,
+    }


-def print_history(results):
+# --- Terminal Output ---
+
+def print_experiment(experiment_dir, experiment_path):
+    """Print single experiment results to terminal."""
+    config = load_config(experiment_dir)
+    results = load_results(experiment_dir)
+    direction = config.get("metric_direction", "lower")
+    metric_name = config.get("metric", "metric")
+
    if not results:
-        print("No experiments logged yet.")
+        print(f"No results for {experiment_path}")
        return

-    print(f"\n{'COMMIT':8} {'METRIC':10} {'STATUS':8} DESCRIPTION")
-    print("─" * 60)
-    for r in results:
-        metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash   "
-        status_icon = {"keep": "✅", "discard": "❌", "crash": "💥"}.get(r["status"], "?")
-        print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}")
+    stats = compute_stats(results, direction)

+    print(f"\n{'─' * 65}")
+    print(f"  {experiment_path}")
+    print(f"  Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
+    print(f"{'─' * 65}")
+    print(f"  Total: {stats['total']} | Keep: {stats['keeps']} | Discard: {stats['discards']} | Crash: {stats['crashes']}")
+
+    if stats["baseline"] is not None and stats["best"] is not None:
+        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
+        print(f"  Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")
+
+    print(f"\n  {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
+    print(f"  {'─' * 60}")
+    for r in results:
+        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A     "
+        icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
+        print(f"  {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
+    print()
+
+
+def print_dashboard(root):
+    """Print cross-experiment dashboard."""
+    experiments = []
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
+                continue
+            config = load_config(exp_dir)
+            results = load_results(exp_dir)
+            direction = config.get("metric_direction", "lower")
+            stats = compute_stats(results, direction)
+
+            # Determine status
+            status = "idle"
+            if stats["total"] > 0:
+                tsv = exp_dir / "results.tsv"
+                if tsv.exists():
+                    import time
+                    age_hours = (time.time() - tsv.stat().st_mtime) / 3600
+                    status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"
+
+            best_str = f"{stats['best']:.4f}" if stats["best"] is not None else "—"
+            pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "—"
+
+            experiments.append({
+                "domain": domain_dir.name,
+                "name": exp_dir.name,
+                "runs": stats["total"],
+                "kept": stats["keeps"],
+                "best": best_str,
+                "change": pct_str,
+                "status": status,
+                "metric": config.get("metric", "?"),
+            })
+
+    if not experiments:
+        print("No experiments found.")
+        return experiments
+
+    print(f"\n{'─' * 90}")
+    print(f"  autoresearch — Dashboard")
+    print(f"{'─' * 90}")
+    print(f"  {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} {'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
+    print(f"  {'─' * 85}")
+    for e in experiments:
+        print(f"  {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} {e['best']:>12} {e['change']:>10} {e['status']:<8}")
+    print()
+    return experiments
+
+
+# --- CSV Export ---
+
+def export_experiment_csv(experiment_dir, experiment_path):
+    """Export single experiment as CSV string."""
+    config = load_config(experiment_dir)
+    results = load_results(experiment_dir)
+    direction = config.get("metric_direction", "lower")
+    stats = compute_stats(results, direction)
+
+    buf = io.StringIO()
+    writer = csv.writer(buf)
+
+    # Header with metadata
+    writer.writerow(["# Experiment", experiment_path])
+    writer.writerow(["# Target", config.get("target", "")])
+    writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
+    if stats["baseline"] is not None:
+        writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
+    if stats["best"] is not None:
+        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
+        writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
+    writer.writerow(["# Total", stats["total"]])
+    writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
+    writer.writerow([])
+
+    writer.writerow(["Commit", "Metric", "Status", "Description"])
+    for r in results:
+        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
+        writer.writerow([r["commit"], m, r["status"], r["description"]])
+
+    return buf.getvalue()
+
+
+def export_dashboard_csv(root):
+    """Export dashboard as CSV string."""
+    experiments = []
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
+                continue
+            config = load_config(exp_dir)
+            results = load_results(exp_dir)
+            direction = config.get("metric_direction", "lower")
+            stats = compute_stats(results, direction)
+            best_str = f"{stats['best']:.6f}" if stats["best"] else ""
+            pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
+            experiments.append([
+                domain_dir.name, exp_dir.name, config.get("metric", ""),
+                stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
+                best_str, pct_str
+            ])
+
+    buf = io.StringIO()
+    writer = csv.writer(buf)
+    writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept", "Discarded", "Crashed", "Best", "Change"])
+    for e in experiments:
+        writer.writerow(e)
+    return buf.getvalue()
+
+
+# --- Markdown Export ---
+
+def export_experiment_markdown(experiment_dir, experiment_path):
+    """Export single experiment as Markdown string."""
+    config = load_config(experiment_dir)
+    results = load_results(experiment_dir)
+    direction = config.get("metric_direction", "lower")
+    metric_name = config.get("metric", "metric")
+    stats = compute_stats(results, direction)
+
+    lines = []
+    lines.append(f"# Autoresearch: {experiment_path}\n")
+    lines.append(f"**Target:** `{config.get('target', '?')}`  ")
+    lines.append(f"**Metric:** `{metric_name}` ({direction} is better)  ")
+    lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")
+
+    if stats["baseline"] is not None and stats["best"] is not None:
+        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
+        lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
+
+    lines.append(f"| Commit | Metric | Status | Description |")
+    lines.append(f"|--------|--------|--------|-------------|")
+    for r in results:
+        m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
+        lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
+    lines.append("")
+
+    return "\n".join(lines)
+
+
+def export_dashboard_markdown(root):
+    """Export dashboard as Markdown string."""
+    lines = []
+    lines.append("# Autoresearch Dashboard\n")
+    lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
+    lines.append("|--------|-----------|--------|------|------|------|--------|--------|")
+
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
+                continue
+            config = load_config(exp_dir)
+            results = load_results(exp_dir)
+            direction = config.get("metric_direction", "lower")
+            stats = compute_stats(results, direction)
+            best = f"`{stats['best']:.4f}`" if stats["best"] else "—"
+            pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else "—"
+
+            import time
+            tsv = exp_dir / "results.tsv"
+            status = "idle"
+            if tsv.exists() and stats["total"] > 0:
+                age_h = (time.time() - tsv.stat().st_mtime) / 3600
+                status = "active" if age_h < 1 else "paused" if age_h < 24 else "done"
+
+            lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | {stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")
+
+    lines.append("")
+    return "\n".join(lines)
+
+
+# --- Main ---

 def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--summary", action="store_true")
-    parser.add_argument("--best", action="store_true")
-    parser.add_argument("--history", action="store_true")
-    parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC"))
-    parser.add_argument("--path", default=".")
-    parser.add_argument("--metric", default="metric")
-    parser.add_argument("--direction", default="lower", choices=["lower", "higher"])
+    parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
+    parser.add_argument("--experiment", help="Show one experiment: domain/name")
+    parser.add_argument("--domain", help="Show all experiments in a domain")
+    parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
+    parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
+                        help="Output format (default: terminal)")
+    parser.add_argument("--output", "-o", help="Write to file instead of stdout")
+    parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
    args = parser.parse_args()

-    path = Path(args.path).resolve()
+    root = find_autoresearch_root()
+    if root is None:
+        print("No .autoresearch/ found. Run setup_experiment.py first.")
+        sys.exit(1)

-    if args.record:
-        commit, metric, status, desc = args.record
-        tsv = path / "results.tsv"
-        if not tsv.exists():
-            tsv.write_text("commit\tmetric\tstatus\tdescription\n")
-        with open(tsv, "a") as f:
-            f.write(f"{commit}\t{metric}\t{status}\t{desc}\n")
-        print(f"✓ Logged: {commit} {metric} {status}")
-        return
+    output_text = None

-    results = load_results(path)
+    # Single experiment
+    if args.experiment:
+        experiment_dir = root / args.experiment
+        if not experiment_dir.exists():
+            print(f"Experiment not found: {args.experiment}")
+            sys.exit(1)

-    if args.history:
-        print_history(results)
-    elif args.best:
-        keeps = [r for r in results if r["status"] == "keep" and r["metric"]]
-        if not keeps:
-            print("No successful experiments yet.")
+        if args.format == "csv":
+            output_text = export_experiment_csv(experiment_dir, args.experiment)
+        elif args.format == "markdown":
+            output_text = export_experiment_markdown(experiment_dir, args.experiment)
+        else:
+            print_experiment(experiment_dir, args.experiment)
            return
-        best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"])
-        print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}")
+
+    # Domain
+    elif args.domain:
+        domain_dir = root / args.domain
+        if not domain_dir.exists():
+            print(f"Domain not found: {args.domain}")
+            sys.exit(1)
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
+                if args.format == "terminal":
+                    print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
+                # For CSV/MD, fall through to dashboard with domain filter
+        if args.format != "terminal":
+            # Use dashboard export filtered to domain
+            output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
+        else:
+            return
+
+    # Dashboard
+    elif args.dashboard or args.all:
+        if args.format == "csv":
+            output_text = export_dashboard_csv(root)
+        elif args.format == "markdown":
+            output_text = export_dashboard_markdown(root)
+        else:
+            print_dashboard(root)
+            return
+
    else:
-        print_summary(results, args.metric, args.direction)
+        # Default: dashboard
+        if args.format == "terminal":
+            print_dashboard(root)
+            return
+        output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
+
+    # Write output
+    if output_text:
+        if args.output:
+            Path(args.output).write_text(output_text)
+            print(f"Written to {args.output}")
+        else:
+            print(output_text)


 if __name__ == "__main__":
--- a/engineering/autoresearch-agent/scripts/run_experiment.py
+++ b/engineering/autoresearch-agent/scripts/run_experiment.py
@@ -2,17 +2,15 @@
 """
 autoresearch-agent: Experiment Runner

-Executes the autonomous experiment loop:
- Reads .autoresearch.cfg for project config
- Runs the target evaluation
- Keeps improvements (git commit) or discards failures (git reset)
- Logs everything to results.tsv
- Loops indefinitely until interrupted
+Executes the autonomous experiment loop for a specific experiment.
+Reads config from .autoresearch/{domain}/{name}/config.cfg.

 Usage:
-    python scripts/run_experiment.py --loop      # Run forever
-    python scripts/run_experiment.py --single    # Run one experiment
-    python scripts/run_experiment.py --dry-run   # Show what would happen
+    python scripts/run_experiment.py --experiment engineering/api-speed --loop
+    python scripts/run_experiment.py --experiment engineering/api-speed --single
+    python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
+    python scripts/run_experiment.py --resume --loop
+    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
 """

 import argparse
@@ -25,11 +23,22 @@ from datetime import datetime
 from pathlib import Path


-def load_config(path):
-    """Load .autoresearch.cfg"""
-    cfg_file = Path(path) / ".autoresearch.cfg"
+def find_autoresearch_root():
+    """Find .autoresearch/ in project or user home."""
+    project_root = Path(".").resolve() / ".autoresearch"
+    if project_root.exists():
+        return project_root
+    user_root = Path.home() / ".autoresearch"
+    if user_root.exists():
+        return user_root
+    return None
+
+
+def load_config(experiment_dir):
+    """Load config.cfg from experiment directory."""
+    cfg_file = experiment_dir / "config.cfg"
    if not cfg_file.exists():
-        print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.")
+        print(f"  Error: no config.cfg in {experiment_dir}")
        sys.exit(1)
    config = {}
    for line in cfg_file.read_text().splitlines():
@@ -49,239 +58,293 @@ def run_cmd(cmd, cwd=None, timeout=None):


 def get_current_commit(path):
+    """Get short hash of current HEAD."""
    _, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
    return commit


-def get_current_metric(path, metric_grep):
-    """Read the last recorded metric from results.tsv."""
-    tsv = Path(path) / "results.tsv"
+def get_best_metric(experiment_dir, direction):
+    """Read the best metric from results.tsv."""
+    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return None
-    lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l]
+    lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
    if not lines:
        return None
-    last = lines[-1].split("\t")
-    try:
-        return float(last[1])
-    except (ValueError, IndexError):
+    metrics = []
+    for line in lines:
+        parts = line.split("\t")
+        try:
+            if parts[1] != "N/A":
+                metrics.append(float(parts[1]))
+        except (ValueError, IndexError):
+            continue
+    if not metrics:
        return None
+    return min(metrics) if direction == "lower" else max(metrics)


-def run_evaluation(path, evaluate_cmd, time_budget_minutes):
-    """Run evaluation with time limit."""
-    hard_limit = time_budget_minutes * 60 * 2.5  # 2.5x as hard timeout
+def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
+    """Run evaluation with time limit. Output goes to log_file."""
+    hard_limit = time_budget_minutes * 60 * 2.5
    t0 = time.time()
    try:
        code, _, _ = run_cmd(
-            f"{evaluate_cmd} > run.log 2>&1",
-            cwd=path,
+            f"{eval_cmd} > {log_file} 2>&1",
+            cwd=str(project_root),
            timeout=hard_limit
        )
        elapsed = time.time() - t0
        return code, elapsed
    except subprocess.TimeoutExpired:
        elapsed = time.time() - t0
-        return -1, elapsed  # -1 = timeout
+        return -1, elapsed


-def extract_metric(path, metric_grep):
-    """Extract metric value from run.log."""
-    code, out, _ = run_cmd(
-        f"grep '{metric_grep}' run.log | tail -1",
-        cwd=path
-    )
-    if not out:
-        return None
-    try:
-        return float(out.split(":")[-1].strip())
-    except ValueError:
+def extract_metric(log_file, metric_grep):
+    """Extract metric value from log file."""
+    log_path = Path(log_file)
+    if not log_path.exists():
        return None
+    for line in reversed(log_path.read_text().splitlines()):
+        stripped = line.strip()
+        if stripped.startswith(metric_grep.lstrip("^")):
+            try:
+                return float(stripped.split(":")[-1].strip())
+            except ValueError:
+                continue
+    return None


 def is_improvement(new_val, old_val, direction):
    """Check if new result is better than old."""
    if old_val is None:
-        return True  # First run always "improves"
+        return True
    if direction == "lower":
        return new_val < old_val
-    else:
-        return new_val > old_val
+    return new_val > old_val


-def log_result(path, commit, metric_val, status, description):
+def log_result(experiment_dir, commit, metric_val, status, description):
    """Append result to results.tsv."""
-    tsv = Path(path) / "results.tsv"
+    tsv = experiment_dir / "results.tsv"
    metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
    with open(tsv, "a") as f:
        f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")


-def get_experiment_count(path):
+def get_experiment_count(experiment_dir):
    """Count experiments run so far."""
-    tsv = Path(path) / "results.tsv"
+    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return 0
-    lines = tsv.read_text().splitlines()
-    return max(0, len(lines) - 1)  # subtract header
+    return max(0, len(tsv.read_text().splitlines()) - 1)


-def run_single_experiment(path, config, exp_num, dry_run=False):
+def get_last_active(root):
+    """Find the most recently modified experiment."""
+    latest = None
+    latest_time = 0
+    for domain_dir in root.iterdir():
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in domain_dir.iterdir():
+            if not exp_dir.is_dir():
+                continue
+            cfg = exp_dir / "config.cfg"
+            if cfg.exists() and cfg.stat().st_mtime > latest_time:
+                latest_time = cfg.stat().st_mtime
+                latest = f"{domain_dir.name}/{exp_dir.name}"
+    return latest
+
+
+def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
    """Run one experiment iteration."""
    direction = config.get("metric_direction", "lower")
    metric_grep = config.get("metric_grep", "^metric:")
-    evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py")
+    eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
    time_budget = int(config.get("time_budget_minutes", 5))
    metric_name = config.get("metric", "metric")
+    log_file = str(experiment_dir / "run.log")

-    best_so_far = get_current_metric(path, metric_grep)
+    best = get_best_metric(experiment_dir, direction)
    ts = datetime.now().strftime("%H:%M:%S")

    print(f"\n[{ts}] Experiment #{exp_num}")
-    print(f"  Best {metric_name} so far: {best_so_far}")
+    print(f"  Best {metric_name}: {best}")

    if dry_run:
        print("  [DRY RUN] Would run evaluation and check metric")
        return "dry_run"

-    # Save pre-experiment state for rollback
-    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path)
+    # Save state for rollback
+    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
    if code != 0:
-        print("  ✗ Can't get git state. Is this a git repo with commits?")
+        print("  Error: can't get git state")
        return "error"

    # Run evaluation
-    print(f"  Running: {evaluate_cmd} (budget: {time_budget} min)")
-    ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget)
+    print(f"  Running: {eval_cmd} (budget: {time_budget}m)")
+    ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)

-    # Handle timeout
+    commit = get_current_commit(str(project_root))
+
+    # Timeout
    if ret_code == -1:
-        print(f"  ✗ TIMEOUT after {elapsed:.0f}s — discarding")
-        run_cmd("git checkout -- .", cwd=path)  # revert uncommitted changes
-        # Commit was already made by the agent before evaluation
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s")
+        print(f"  TIMEOUT after {elapsed:.0f}s — discarding")
+        run_cmd("git checkout -- .", cwd=str(project_root))
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
        return "crash"

-    # Handle non-zero exit
+    # Crash
    if ret_code != 0:
-        # Check if it crashed
-        code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path)
-        print(f"  ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s")
+        _, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
+        print(f"  CRASH (exit {ret_code}) after {elapsed:.0f}s")
        print(f"  Last output: {tail[:200]}")
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
        return "crash"

    # Extract metric
-    metric_val = extract_metric(path, metric_grep)
+    metric_val = extract_metric(log_file, metric_grep)
    if metric_val is None:
-        print(f"  ✗ Could not parse metric from run.log")
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, None, "crash", "metric_parse_failed")
+        print(f"  Could not parse {metric_name} from run.log")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
        return "crash"

-    curr_commit = get_current_commit(path)
    delta = ""
-    if best_so_far is not None:
-        diff = metric_val - best_so_far
-        delta = f" (Δ{diff:+.4f})"
+    if best is not None:
+        diff = metric_val - best
+        delta = f" (delta {diff:+.4f})"

    print(f"  {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")

    # Keep or discard
-    if is_improvement(metric_val, best_so_far, direction):
-        print(f"  ✅ KEEP — improvement confirmed")
-        log_result(path, curr_commit, metric_val, "keep",
-                   f"improvement_{metric_name}_{metric_val:.4f}")
+    if is_improvement(metric_val, best, direction):
+        print(f"  KEEP — improvement")
+        log_result(experiment_dir, commit, metric_val, "keep",
+                   f"improved_{metric_name}_{metric_val:.4f}")
        return "keep"
    else:
-        print(f"  ❌ DISCARD — no improvement")
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, metric_val, "discard",
-                   f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}")
+        print(f"  DISCARD — no improvement")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        best_str = f"{best:.4f}" if best else "?"
+        log_result(experiment_dir, commit, metric_val, "discard",
+                   f"no_improvement_{metric_val:.4f}_vs_{best_str}")
        return "discard"


-def print_summary(path):
-    """Print experiment summary."""
-    tsv = Path(path) / "results.tsv"
+def print_summary(experiment_dir, config):
+    """Print session summary."""
+    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return
-    lines = tsv.read_text().splitlines()[1:]  # skip header
+    lines = tsv.read_text().splitlines()[1:]
    if not lines:
        return

    keeps = [l for l in lines if "\tkeep\t" in l]
    discards = [l for l in lines if "\tdiscard\t" in l]
    crashes = [l for l in lines if "\tcrash\t" in l]
+    metric_name = config.get("metric", "metric")
+    direction = config.get("metric_direction", "lower")

-    print(f"\n{'='*50}")
-    print(f"  Session Summary")
+    print(f"\n{'=' * 55}")
+    print(f"  autoresearch — Session Summary")
    print(f"  Experiments: {len(lines)} total")
-    print(f"  ✅ Keep: {len(keeps)} | ❌ Discard: {len(discards)} | 💥 Crash: {len(crashes)}")
+    print(f"  Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")

    if keeps:
        try:
-            first_metric = float(keeps[0].split("\t")[1])
-            last_metric = float(keeps[-1].split("\t")[1])
-            direction = "↓" if last_metric < first_metric else "↑"
-            print(f"  Best progress: {first_metric:.6f} → {last_metric:.6f} {direction}")
+            valid = []
+            for l in keeps:
+                parts = l.split("\t")
+                if parts[1] != "N/A":
+                    valid.append(float(parts[1]))
+            if len(valid) >= 2:
+                first, last = valid[0], valid[-1]
+                best = min(valid) if direction == "lower" else max(valid)
+                pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
+                print(f"  {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
        except (ValueError, IndexError):
            pass
-    print(f"{'='*50}\n")
+    print(f"{'=' * 55}\n")


 def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent runner")
+    parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
+    parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
    parser.add_argument("--loop", action="store_true", help="Run forever")
    parser.add_argument("--single", action="store_true", help="Run one experiment")
-    parser.add_argument("--dry-run", action="store_true", help="Dry run only")
+    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
+    parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
    parser.add_argument("--path", default=".", help="Project root")
-    parser.add_argument("--max-experiments", type=int, default=0,
-                        help="Max experiments (0 = unlimited)")
    args = parser.parse_args()

-    path = Path(args.path).resolve()
-    config = load_config(path)
+    project_root = Path(args.path).resolve()
+    root = find_autoresearch_root()

-    print(f"\n🔬 autoresearch-agent")
-    print(f"   Project: {path}")
-    print(f"   Target: {config.get('target', '?')}")
-    print(f"   Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
-    print(f"   Budget: {config.get('time_budget_minutes', '?')} min/experiment")
-    print(f"   Mode: {'loop' if args.loop else 'single'}")
+    if root is None:
+        print("No .autoresearch/ found. Run setup_experiment.py first.")
+        sys.exit(1)

-    if args.single:
-        exp_num = get_experiment_count(path) + 1
-        run_single_experiment(path, config, exp_num, args.dry_run)
+    # Resolve experiment
+    experiment_path = args.experiment
+    if args.resume:
+        experiment_path = get_last_active(root)
+        if not experiment_path:
+            print("No experiments found to resume.")
+            sys.exit(1)
+        print(f"Resuming: {experiment_path}")
+
+    if not experiment_path:
+        print("Specify --experiment domain/name or --resume")
+        sys.exit(1)
+
+    experiment_dir = root / experiment_path
+    if not experiment_dir.exists():
+        print(f"Experiment not found: {experiment_dir}")
+        print("Run: python scripts/setup_experiment.py --list")
+        sys.exit(1)
+
+    config = load_config(experiment_dir)
+
+    domain, name = experiment_path.split("/", 1)
+    print(f"\n  autoresearch-agent")
+    print(f"  Experiment: {experiment_path}")
+    print(f"  Target: {config.get('target', '?')}")
+    print(f"  Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
+    print(f"  Budget: {config.get('time_budget_minutes', '?')} min/experiment")
+    print(f"  Mode: {'loop' if args.loop else 'single'}")
+
+    if args.single or args.dry_run:
+        exp_num = get_experiment_count(experiment_dir) + 1
+        run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
        return

-    if not args.loop and not args.dry_run:
+    if not args.loop:
        print("\nSpecify --loop (forever) or --single (one experiment)")
        sys.exit(1)

-    # Setup graceful shutdown
+    # Graceful shutdown
    def handle_interrupt(sig, frame):
-        print_summary(path)
-        print("\n⏹ Stopped by user.")
+        print_summary(experiment_dir, config)
+        print("\nStopped by user.")
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_interrupt)
    signal.signal(signal.SIGTERM, handle_interrupt)

-    # Main loop
    consecutive_crashes = 0
-    exp_num = get_experiment_count(path) + 1
+    exp_num = get_experiment_count(experiment_dir) + 1

-    print(f"\nStarting loop. Ctrl+C to stop and print summary.\n")
+    print(f"\nStarting loop. Ctrl+C to stop.\n")

    while True:
-        result = run_single_experiment(path, config, exp_num, args.dry_run)
+        result = run_single(project_root, experiment_dir, config, exp_num, False)
        exp_num += 1

        if result == "crash":
@@ -289,21 +352,16 @@ def main():
        else:
            consecutive_crashes = 0

-        # Bail if 5 consecutive crashes
        if consecutive_crashes >= 5:
-            print("\n⚠ 5 consecutive crashes. Pausing for investigation.")
-            print("  Check run.log for the last error.")
+            print("\n  5 consecutive crashes. Pausing.")
+            print("  Check .autoresearch/{}/run.log".format(experiment_path))
            break

-        # Check max experiments
-        if args.max_experiments > 0 and exp_num > args.max_experiments:
-            print(f"\n✓ Reached max experiments ({args.max_experiments})")
+        if 0 < args.max_experiments < exp_num:
+            print(f"\n  Reached max experiments ({args.max_experiments})")
            break

-        if args.single:
-            break
-
-    print_summary(path)
+    print_summary(experiment_dir, config)


 if __name__ == "__main__":
--- a/engineering/autoresearch-agent/scripts/setup_experiment.py
+++ b/engineering/autoresearch-agent/scripts/setup_experiment.py
@@ -1,65 +1,52 @@
 #!/usr/bin/env python3
 """
-autoresearch-agent: Setup Wizard
+autoresearch-agent: Setup Experiment

-Initializes a new research run:
-1. Validates the project structure
-2. Creates a git branch
-3. Runs the baseline experiment
-4. Initializes results.tsv
+Initialize a new experiment with domain, target, evaluator, and git branch.
+Creates the .autoresearch/{domain}/{name}/ directory structure.

 Usage:
-    python scripts/setup_experiment.py [--config experiment.yaml]
-    python scripts/setup_experiment.py --domain ml|prompt|code|skill
+    python scripts/setup_experiment.py --domain engineering --name api-speed \
+        --target src/api/search.py --eval "pytest bench.py" \
+        --metric p50_ms --direction lower
+
+    python scripts/setup_experiment.py --domain marketing --name medium-ctr \
+        --target content/titles.md --eval "python evaluate.py" \
+        --metric ctr_score --direction higher --evaluator llm_judge_content
+
+    python scripts/setup_experiment.py --list          # List all experiments
+    python scripts/setup_experiment.py --list-evaluators  # List available evaluators
 """

 import argparse
 import os
+import shutil
 import subprocess
 import sys
 import time
 from datetime import datetime
 from pathlib import Path

+DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]

-DOMAINS = {
-    "ml": {
-        "target": "train.py",
-        "evaluate_cmd": "uv run train.py",
-        "metric": "val_bpb",
-        "metric_direction": "lower",
-        "time_budget_minutes": 5,
-        "metric_grep": "^val_bpb:",
-    },
-    "prompt": {
-        "target": "prompt.md",
-        "evaluate_cmd": "python evaluate.py",
-        "metric": "eval_score",
-        "metric_direction": "higher",
-        "time_budget_minutes": 2,
-        "metric_grep": "^eval_score:",
-    },
-    "code": {
-        "target": "src/module.py",
-        "evaluate_cmd": "python benchmark.py",
-        "metric": "p50_ms",
-        "metric_direction": "lower",
-        "time_budget_minutes": 10,
-        "metric_grep": "^p50_ms:",
-    },
-    "skill": {
-        "target": "SKILL.md",
-        "evaluate_cmd": "python scripts/skill_evaluator.py",
-        "metric": "pass_rate",
-        "metric_direction": "higher",
-        "time_budget_minutes": 5,
-        "metric_grep": "^pass_rate:",
-    },
-}
+EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"
+
+DEFAULT_CONFIG = """# autoresearch global config
+default_time_budget_minutes: 5
+default_scope: project
+dashboard_format: markdown
+"""
+
+GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
+**/results.tsv
+**/run.log
+**/run.*.log
+config.yaml
+"""


 def run_cmd(cmd, cwd=None, timeout=None):
-    """Run a shell command and return (returncode, stdout, stderr)."""
+    """Run shell command, return (returncode, stdout, stderr)."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=cwd, timeout=timeout
@@ -67,188 +54,315 @@ def run_cmd(cmd, cwd=None, timeout=None):
    return result.returncode, result.stdout.strip(), result.stderr.strip()


-def check_git_repo(path):
-    """Verify we're in a git repo."""
-    code, out, err = run_cmd("git rev-parse --is-inside-work-tree", cwd=path)
-    if code != 0:
-        print("✗ Not a git repository. Run: git init && git add . && git commit -m 'initial'")
+def get_autoresearch_root(scope, project_root=None):
+    """Get the .autoresearch root directory based on scope."""
+    if scope == "user":
+        return Path.home() / ".autoresearch"
+    return Path(project_root or ".") / ".autoresearch"
+
+
+def init_root(root):
+    """Initialize .autoresearch root if it doesn't exist."""
+    created = False
+    if not root.exists():
+        root.mkdir(parents=True)
+        created = True
+        print(f"  Created {root}/")
+
+    config_file = root / "config.yaml"
+    if not config_file.exists():
+        config_file.write_text(DEFAULT_CONFIG)
+        print(f"  Created {config_file}")
+
+    gitignore = root / ".gitignore"
+    if not gitignore.exists():
+        gitignore.write_text(GITIGNORE_CONTENT)
+        print(f"  Created {gitignore}")
+
+    return created
+
+
+def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
+    """Generate a program.md template for the experiment."""
+    direction_word = "Minimize" if direction == "lower" else "Maximize"
+    content = f"""# autoresearch — {name}
+
+## Goal
+{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.
+
+## What the Agent Can Change
+- Only `{target}` — this is the single file being optimized.
+- Everything inside that file is fair game unless constrained below.
+
+## What the Agent Cannot Change
+- The evaluation script (`evaluate.py` or the eval command). It is read-only.
+- Dependencies — do not add new packages or imports that aren't already available.
+- Any other files in the project unless explicitly noted here.
+{f"- Additional constraints: {constraints}" if constraints else ""}
+
+## Strategy
+1. First run: establish baseline. Do not change anything.
+2. Profile/analyze the current state — understand why the metric is what it is.
+3. Try the most obvious improvement first (low-hanging fruit).
+4. If that works, push further in the same direction.
+5. If stuck, try something orthogonal or radical.
+6. Read the git log of previous experiments. Don't repeat failed approaches.
+
+## Simplicity Rule
+A small improvement that adds ugly complexity is NOT worth it.
+Equal performance with simpler code IS worth it.
+Removing code that gets same results is the best outcome.
+
+## Stop When
+You don't stop. The human will interrupt you when they're satisfied.
+If no improvement in 20+ consecutive runs, change strategy drastically.
+"""
+    (experiment_dir / "program.md").write_text(content)
+
+
+def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget):
+    """Write experiment config."""
+    content = f"""target: {target}
+evaluate_cmd: {eval_cmd}
+metric: {metric}
+metric_direction: {direction}
+metric_grep: ^{metric}:
+time_budget_minutes: {time_budget}
+created: {datetime.now().strftime('%Y-%m-%d %H:%M')}
+"""
+    (experiment_dir / "config.cfg").write_text(content)
+
+
+def init_results_tsv(experiment_dir):
+    """Create results.tsv with header."""
+    tsv = experiment_dir / "results.tsv"
+    if tsv.exists():
+        print(f"  results.tsv already exists ({tsv.stat().st_size} bytes)")
+        return
+    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
+    print("  Created results.tsv")
+
+
+def copy_evaluator(experiment_dir, evaluator_name):
+    """Copy a built-in evaluator to the experiment directory."""
+    source = EVALUATOR_DIR / f"{evaluator_name}.py"
+    if not source.exists():
+        print(f"  Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}")
+        print(f"  Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}")
        return False
-    print("✓ Git repository found")
+    dest = experiment_dir / "evaluate.py"
+    shutil.copy2(source, dest)
+    print(f"  Copied evaluator: {evaluator_name}.py -> evaluate.py")
    return True


-def check_program_md(path):
-    """Check program.md exists and has content."""
-    pm = Path(path) / "program.md"
-    if not pm.exists():
-        print("⚠ program.md not found. Creating template...")
-        return False
-    content = pm.read_text()
-    if len(content) < 100:
-        print("⚠ program.md looks empty. Fill it out before running experiments.")
-        return False
-    print(f"✓ program.md found ({len(content)} chars)")
-    return True
-
-
-def check_target_file(path, target):
-    """Check target file exists."""
-    tf = Path(path) / target
-    if not tf.exists():
-        print(f"✗ Target file not found: {target}")
-        return False
-    print(f"✓ Target file found: {target}")
-    return True
-
-
-def check_evaluate_script(path):
-    """Check evaluate.py exists."""
-    ev = Path(path) / "evaluate.py"
-    if not ev.exists():
-        print("⚠ evaluate.py not found. You need a fixed evaluation function.")
-        print("  Create evaluate.py that outputs: metric_name: <value>")
-        return False
-    print("✓ evaluate.py found")
-    return True
-
-
-def create_branch(path, tag):
+def create_branch(path, domain, name):
    """Create and checkout the experiment branch."""
-    branch = f"autoresearch/{tag}"
-    code, out, err = run_cmd(f"git checkout -b {branch}", cwd=path)
+    branch = f"autoresearch/{domain}/{name}"
+    code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
    if code != 0:
        if "already exists" in err:
-            print(f"✗ Branch '{branch}' already exists. Use a different tag.")
-        else:
-            print(f"✗ Failed to create branch: {err}")
+            print(f"  Branch '{branch}' already exists. Checking out...")
+            run_cmd(f"git checkout {branch}", cwd=path)
+            return branch
+        print(f"  Warning: could not create branch: {err}")
        return None
-    print(f"✓ Created branch: {branch}")
+    print(f"  Created branch: {branch}")
    return branch


-def init_results_tsv(path):
-    """Create results.tsv with header."""
-    tsv = Path(path) / "results.tsv"
-    if tsv.exists():
-        print(f"✓ results.tsv already exists ({tsv.stat().st_size} bytes)")
+def list_experiments(root):
+    """List all experiments across all domains."""
+    if not root.exists():
+        print("No experiments found. Run setup to create your first experiment.")
        return
-    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
-    print("✓ Created results.tsv")
+
+    experiments = []
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir():
+                continue
+            cfg_file = exp_dir / "config.cfg"
+            if not cfg_file.exists():
+                continue
+            config = {}
+            for line in cfg_file.read_text().splitlines():
+                if ":" in line:
+                    k, v = line.split(":", 1)
+                    config[k.strip()] = v.strip()
+
+            # Count results
+            tsv = exp_dir / "results.tsv"
+            runs = 0
+            if tsv.exists():
+                runs = max(0, len(tsv.read_text().splitlines()) - 1)
+
+            experiments.append({
+                "domain": domain_dir.name,
+                "name": exp_dir.name,
+                "target": config.get("target", "?"),
+                "metric": config.get("metric", "?"),
+                "runs": runs,
+            })
+
+    if not experiments:
+        print("No experiments found.")
+        return
+
+    print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}")
+    print("-" * 95)
+    for e in experiments:
+        print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}")
+    print(f"\nTotal: {len(experiments)} experiments")


-def run_baseline(path, evaluate_cmd, metric_grep, time_budget_minutes):
-    """Run the baseline experiment."""
-    print(f"\nRunning baseline experiment (~{time_budget_minutes} min)...")
-    timeout = time_budget_minutes * 60 * 2.5  # 2.5x budget as hard limit
+def list_evaluators():
+    """List available built-in evaluators."""
+    if not EVALUATOR_DIR.exists():
+        print("No evaluators directory found.")
+        return

-    t0 = time.time()
-    code, out, err = run_cmd(
-        f"{evaluate_cmd} > run.log 2>&1",
-        cwd=path,
-        timeout=timeout
-    )
-    elapsed = time.time() - t0
-
-    if code != 0:
-        print(f"✗ Baseline run failed after {elapsed:.0f}s. Check run.log")
-        return None
-
-    # Extract metric
-    grep_code, grep_out, _ = run_cmd(
-        f"grep '{metric_grep}' run.log | tail -1",
-        cwd=path
-    )
-    if not grep_out:
-        print("✗ Could not extract metric from run.log. Check metric_grep pattern.")
-        return None
-
-    metric_value = grep_out.split(":")[-1].strip()
-    print(f"✓ Baseline complete in {elapsed:.0f}s — metric: {metric_value}")
-    return metric_value
+    print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n")
+    for f in sorted(EVALUATOR_DIR.glob("*.py")):
+        # Read first docstring line
+        desc = ""
+        for line in f.read_text().splitlines():
+            if line.strip().startswith('"""') or line.strip().startswith("'''"):
+                continue
+            if line.strip() and not line.startswith("#!"):
+                desc = line.strip().strip('"').strip("'")
+                break
+        print(f"  {f.stem:<25} {desc}")


 def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent setup")
-    parser.add_argument("--domain", choices=list(DOMAINS.keys()), help="Experiment domain")
+    parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain")
+    parser.add_argument("--name", help="Experiment name (e.g. api-speed, medium-ctr)")
    parser.add_argument("--target", help="Target file to optimize")
-    parser.add_argument("--evaluate-cmd", help="Evaluation command")
-    parser.add_argument("--metric", help="Metric name")
-    parser.add_argument("--direction", choices=["lower", "higher"], default="lower")
-    parser.add_argument("--budget", type=int, default=5, help="Time budget in minutes")
-    parser.add_argument("--tag", help="Run tag (used in branch name)")
+    parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
+    parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
+    parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
+                        help="Is lower or higher better?")
+    parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
+    parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
+    parser.add_argument("--scope", choices=["project", "user"], default="project",
+                        help="Where to store experiments: project (./) or user (~/)")
+    parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
    parser.add_argument("--path", default=".", help="Project root path")
-    parser.add_argument("--skip-baseline", action="store_true")
+    parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
+    parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
+    parser.add_argument("--list", action="store_true", help="List all experiments")
+    parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
    args = parser.parse_args()

-    path = Path(args.path).resolve()
-    print(f"\n🔬 autoresearch-agent setup")
-    print(f"   Project: {path}")
-    print(f"   Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
+    project_root = Path(args.path).resolve()

-    # Get config from domain or args
-    if args.domain:
-        config = DOMAINS[args.domain].copy()
+    # List mode
+    if args.list:
+        root = get_autoresearch_root("project", project_root)
+        list_experiments(root)
+        user_root = get_autoresearch_root("user")
+        if user_root.exists() and user_root != root:
+            print(f"\n--- User-level experiments ({user_root}) ---")
+            list_experiments(user_root)
+        return
+
+    if args.list_evaluators:
+        list_evaluators()
+        return
+
+    # Validate required args for setup
+    if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
+        parser.error("Required: --domain, --name, --target, --eval, --metric")
+
+    root = get_autoresearch_root(args.scope, project_root)
+
+    print(f"\n  autoresearch-agent setup")
+    print(f"  Project: {project_root}")
+    print(f"  Scope: {args.scope}")
+    print(f"  Domain: {args.domain}")
+    print(f"  Experiment: {args.name}")
+    print(f"  Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
+
+    # Check git
+    code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
+    if code != 0:
+        print("  Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
+        sys.exit(1)
+    print("  Git repository found")
+
+    # Check target file
+    target_path = project_root / args.target
+    if not target_path.exists():
+        print(f"  Error: target file not found: {args.target}")
+        sys.exit(1)
+    print(f"  Target file found: {args.target}")
+
+    # Init root
+    init_root(root)
+
+    # Create experiment directory
+    experiment_dir = root / args.domain / args.name
+    if experiment_dir.exists():
+        print(f"  Warning: experiment '{args.domain}/{args.name}' already exists.")
+        print(f"  Use --name with a different name, or delete {experiment_dir}")
+        sys.exit(1)
+    experiment_dir.mkdir(parents=True)
+    print(f"  Created {experiment_dir}/")
+
+    # Create files
+    create_program_md(experiment_dir, args.domain, args.name,
+                      args.target, args.metric, args.direction, args.constraints)
+    print("  Created program.md")
+
+    create_config(experiment_dir, args.target, args.eval_cmd,
+                  args.metric, args.direction, args.time_budget)
+    print("  Created config.cfg")
+
+    init_results_tsv(experiment_dir)
+
+    # Copy evaluator if specified
+    if args.evaluator:
+        copy_evaluator(experiment_dir, args.evaluator)
+
+    # Create git branch
+    if not args.skip_branch:
+        create_branch(str(project_root), args.domain, args.name)
+
+    # Test evaluation command
+    print(f"\n  Testing evaluation: {args.eval_cmd}")
+    code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
+    if code != 0:
+        print(f"  Warning: eval command failed (exit {code})")
+        if err:
+            print(f"  stderr: {err[:200]}")
+        print("  Fix the eval command before running the experiment loop.")
    else:
-        config = {
-            "target": args.target or "target.py",
-            "evaluate_cmd": args.evaluate_cmd or "python evaluate.py",
-            "metric": args.metric or "score",
-            "metric_direction": args.direction,
-            "time_budget_minutes": args.budget,
-            "metric_grep": f"^{args.metric or 'score'}:",
-        }
+        # Check metric is parseable
+        full_output = out + "\n" + err
+        metric_found = False
+        for line in full_output.splitlines():
+            if line.strip().startswith(f"{args.metric}:"):
+                metric_found = True
+                print(f"  Eval works. Baseline: {line.strip()}")
+                break
+        if not metric_found:
+            print(f"  Warning: eval ran but '{args.metric}:' not found in output.")
+            print(f"  Make sure your eval command outputs: {args.metric}: <value>")

-    tag = args.tag or datetime.now().strftime("%b%d").lower()
-
-    # Validation checks
-    checks = [
-        check_git_repo(path),
-        check_program_md(path),
-        check_target_file(path, config["target"]),
-        check_evaluate_script(path),
-    ]
-
-    if not all(checks):
-        print("\n⚠ Fix the above issues before running experiments.")
-        sys.exit(1)
-
-    # Create branch
-    branch = create_branch(path, tag)
-    if not branch:
-        sys.exit(1)
-
-    # Init results TSV
-    init_results_tsv(path)
-
-    # Save config for run_experiment.py
-    config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
-    (path / ".autoresearch.cfg").write_text(config_content + "\n")
-    print("✓ Saved .autoresearch.cfg")
-
-    # Run baseline
-    if not args.skip_baseline:
-        baseline = run_baseline(
-            path,
-            config["evaluate_cmd"],
-            config["metric_grep"],
-            config["time_budget_minutes"]
-        )
-        if baseline:
-            # Log baseline to TSV
-            code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
-            with open(path / "results.tsv", "a") as f:
-                f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
-            print(f"✓ Baseline logged to results.tsv")
-
-    print(f"\n✅ Setup complete!")
-    print(f"   Branch: {branch}")
-    print(f"   Target: {config['target']}")
-    print(f"   Metric: {config['metric']} ({config['metric_direction']} is better)")
-    print(f"   Budget: {config['time_budget_minutes']} min/experiment")
-    print(f"\nTo start the autonomous loop:")
-    print(f"   python scripts/run_experiment.py --loop")
-    print(f"\nOr run a single experiment:")
-    print(f"   python scripts/run_experiment.py --single")
+    # Summary
+    print(f"\n  Setup complete!")
+    print(f"  Experiment: {args.domain}/{args.name}")
+    print(f"  Target: {args.target}")
+    print(f"  Metric: {args.metric} ({args.direction} is better)")
+    print(f"  Budget: {args.time_budget} min/experiment")
+    if not args.skip_branch:
+        print(f"  Branch: autoresearch/{args.domain}/{args.name}")
+    print(f"\n  To start:")
+    print(f"  python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")


 if __name__ == "__main__":