Merge pull request #345 from alirezarezvani/feat/autoresearch-agent-clean

feat: autoresearch-agent — autonomous experiment loop
2026-03-13 07:40:38 +01:00
parent 9cc5d51d4a a799d8bdb8
commit 291cf858a0
6 changed files with 1282 additions and 0 deletions
--- a/engineering/autoresearch-agent/SKILL.md
+++ b/engineering/autoresearch-agent/SKILL.md
@@ -0,0 +1,246 @@
+---
+name: "autoresearch-agent"
+description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo."
+license: MIT
+metadata:
+  version: 1.0.0
+  author: Alireza Rezvani
+  category: engineering
+  updated: 2026-03-13
+---
+
+# Autoresearch Agent
+
+> You sleep. The agent experiments. You wake up to results.
+
+Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop.
+
+**Works for any domain with a measurable metric:**
+- ML training optimization (original use case — optimize `train.py` by `val_bpb`)
+- Prompt engineering (optimize system prompts by LLM-eval quality score)
+- Code performance (optimize a module by benchmark runtime/score)
+- Agent skill improvement (optimize `SKILL.md` by task completion rate)
+
+---
+
+## Before Starting
+
+Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing.
+
+If no `program.md` exists, run the **Setup Wizard** below.
+
+---
+
+## Setup Wizard
+
+Answer these 5 questions to configure the experiment:
+
+### 1. What are we optimizing?
+The **target** — one file the agent is allowed to modify:
+- `train.py` — ML training loop (Karpathy-style)
+- `prompt.md` — system prompt for an LLM
+- `src/module.py` — a specific code module
+- `SKILL.md` — an agent skill definition
+
+### 2. What's the metric?
+A **single number** that defines success. Lower or higher = better:
+- `val_bpb` — validation bits per byte (ML, lower is better)
+- `eval_score` — LLM quality score 0-100 (higher is better)
+- `p50_ms` — median latency in milliseconds (lower is better)
+- `pass_rate` — test pass rate 0-1 (higher is better)
+
+### 3. What's the time budget per experiment?
+How long one experiment run takes:
+- `5m` — fast iteration (Karpathy default, ~12 experiments/hour)
+- `10m` — moderate (6/hour)
+- `30m` — slow but thorough (2/hour)
+
+### 4. What can the agent change?
+Constraints on the target file:
+- Architecture only? Hyperparameters only? Everything?
+- What packages/imports are available?
+- What's explicitly off-limits?
+
+### 5. What's the evaluation function?
+How we score each experiment:
+- Fixed script that outputs the metric (e.g. `python evaluate.py`)
+- API call that returns a score
+- Test suite with a pass rate
+
+Once answered, run: `python scripts/setup_experiment.py` to initialize.
+
+---
+
+## The Three Files
+
+Every autoresearch project has the same structure:
+
+```
+project/
+├── program.md       ← Human writes this: objectives, constraints, strategy
+├── target.*         ← Agent modifies this: the thing being optimized
+├── evaluate.py      ← Fixed: the measurement function (never touch)
+├── results.tsv      ← Auto-generated: experiment log (git-tracked for continuity)
+└── scripts/
+    ├── setup_experiment.py   ← Initialize a new run
+    ├── run_experiment.py     ← Execute one experiment iteration
+    └── log_results.py        ← Record results to TSV
+```
+
+### `program.md` — Your Research Directions
+Write this once. The agent reads it before every experiment. It should contain:
+- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code)
+- **Strategy:** What directions to explore first
+- **Constraints:** What the agent cannot change
+- **Stopping criteria:** When a result is "good enough"
+
+See `references/program-template.md` for domain-specific templates.
+
+### Target File — The Only File the Agent Edits
+Whatever you're optimizing. Strict scope: **one file, one metric**.
+
+### `evaluate.py` — Fixed Evaluation (Never Modified)
+The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured.
+
+---
+
+## The Experiment Loop
+
+Run: `python scripts/run_experiment.py --loop`
+
+```
+LOOP FOREVER:
+
+1. Read program.md for current strategy
+2. Review git history: what has been tried? What worked?
+3. Propose ONE change to the target file
+4. Apply the change
+5. git commit (with descriptive message)
+6. Run evaluation: python evaluate.py > run.log 2>&1
+7. Parse metric from run.log
+8. If metric improved → ADVANCE (keep commit, log "keep")
+9. If metric equal/worse → REVERT (git reset, log "discard")
+10. If crash → attempt fix, if unfixable log "crash" and revert
+11. Update results.tsv
+12. Go to 1
+```
+
+### Rules (from Karpathy's original)
+
+- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted.
+- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win.
+- **One change per experiment** — don't change 5 things at once. You won't know what worked.
+- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on.
+- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash.
+- **No new dependencies** — only use what's already available.
+
+---
+
+## Results Log
+
+`results.tsv` (tab-separated, not git-tracked):
+
+```
+commit  metric  status  description
+a1b2c3d 0.9979  keep    baseline
+b2c3d4e 0.9932  keep    increased learning rate
+c3d4e5f 1.0050  discard switched to GeLU (worse)
+d4e5f6g 0.0000  crash   doubled model width (OOM)
+```
+
+Run `python scripts/log_results.py --summary` for a visual summary.
+
+---
+
+## Domain-Specific Configurations
+
+### ML Training (Karpathy-style)
+```yaml
+target: train.py
+evaluate: uv run prepare.py --eval-only
+metric: val_bpb (lower is better)
+time_budget: 5m
+git_branch: autoresearch/{date}-{tag}
+```
+
+### Prompt Engineering
+```yaml
+target: prompt.md
+evaluate: python evaluate.py --model gpt-4o --test-cases tests/
+metric: eval_score (0-100, higher is better)
+time_budget: 2m
+git_branch: prompt-research/{date}
+```
+
+### Code Performance
+```yaml
+target: src/hot_module.py
+evaluate: python benchmark.py --runs 5 --warmup 1
+metric: p50_ms (lower is better)
+time_budget: 10m
+git_branch: perf-research/{date}
+```
+
+### Agent Skill Optimization
+```yaml
+target: SKILL.md
+evaluate: python scripts/skill_evaluator.py --task-suite tests/
+metric: pass_rate (0-1, higher is better)
+time_budget: 5m
+git_branch: skill-research/{date}
+```
+
+See `references/experiment-domains.md` for full setup guides per domain.
+
+---
+
+## Scripts
+
+| Script | Purpose |
+|--------|---------|
+| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run |
+| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) |
+| `log_results.py` | Record results to TSV; `--summary` prints progress table |
+
+---
+
+## Installation
+
+### One-liner (any tool)
+```bash
+git clone https://github.com/alirezarezvani/claude-skills.git
+cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
+```
+
+### Multi-tool install
+```bash
+# Clone the repo, then use the convert script for your tool:
+./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
+```
+
+### OpenClaw
+```bash
+clawhub install autoresearch-agent
+```
+
+---
+
+## Proactive Triggers
+
+Flag these issues without being asked:
+
+- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template.
+- **Target file has no git history** → `git init` and commit baseline first.
+- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting.
+- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash.
+- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions.
+- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
+
+---
+
+## Related Skills
+
+- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics.
+- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops.
+- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops.
+- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function.
--- a/engineering/autoresearch-agent/references/experiment-domains.md
+++ b/engineering/autoresearch-agent/references/experiment-domains.md
@@ -0,0 +1,175 @@
+# Experiment Domains Guide
+
+## Domain 1: ML Training (Karpathy-style)
+
+**Best for:** LLM/neural net training optimization on a single GPU
+
+**Requirements:**
+- NVIDIA GPU (H100 recommended, A100/RTX also work)
+- CUDA + PyTorch
+- `uv` package manager
+- ~50GB disk (training data)
+
+**Setup:**
+```bash
+# Clone autoresearch repo (the ML training environment)
+git clone https://github.com/karpathy/autoresearch my-ml-research
+cd my-ml-research
+uv sync
+uv run prepare.py  # one-time data download + tokenizer (~2 min)
+
+# Initialize autoresearch skill
+cp -r ~/.claude/skills/autoresearch-agent/scripts ./scripts
+
+# Configure
+python scripts/setup_experiment.py --domain ml --tag mar13
+```
+
+**Metric:** `val_bpb` — validation bits per byte. Lower = better model.
+
+**What the agent can change in `train.py`:**
+- Model depth, width, attention heads
+- Learning rate, scheduler, warmup
+- Optimizer (Muon, AdamW, variants)
+- Batch size, gradient accumulation
+- Architecture (attention patterns, FFN type)
+
+**Tip for smaller GPUs (Mac M-series, RTX 3090 etc):**
+Karpathy recommends forks for non-H100 hardware. Lower `DEPTH` to 4, use TinyStories dataset, lower `MAX_SEQ_LEN` to 256.
+
+---
+
+## Domain 2: Prompt Engineering
+
+**Best for:** Optimizing system prompts for quality/accuracy/tone
+
+**Requirements:**
+- LLM API access (OpenAI, Anthropic, etc.)
+- Test cases with expected outputs
+- An LLM judge for scoring (can be same model)
+
+**Setup:**
+```bash
+mkdir my-prompt-research && cd my-prompt-research
+git init
+
+# Create prompt.md (the thing being optimized)
+echo "You are a helpful assistant." > prompt.md
+
+# Create evaluate.py (fixed — never modify)
+cat > evaluate.py << 'EOF'
+#!/usr/bin/env python3
+# Fixed evaluation harness — DO NOT MODIFY
+import json, sys
+from pathlib import Path
+
+PROMPT = Path("prompt.md").read_text()
+# Load test cases
+TEST_CASES = json.loads(Path("tests/cases.json").read_text())
+
+# Run prompt against test cases, score with LLM judge
+# ... (customize for your LLM + scoring logic)
+total = sum(score_case(PROMPT, case) for case in TEST_CASES)
+score = total / len(TEST_CASES) * 100
+print(f"eval_score: {score:.2f}")
+EOF
+
+# Initialize
+python scripts/setup_experiment.py --domain prompt --tag mar13
+```
+
+**Metric:** `eval_score` (0-100). Higher = better prompt.
+
+---
+
+## Domain 3: Code Performance
+
+**Best for:** Optimizing a specific hot module for speed
+
+**Requirements:**
+- A Python module with measurable performance
+- Existing tests (correctness must not regress)
+- A benchmark harness
+
+**Setup:**
+```bash
+cd my-project
+
+# Create benchmark.py (fixed — never modify)
+cat > benchmark.py << 'EOF'
+#!/usr/bin/env python3
+# Fixed benchmark — DO NOT MODIFY
+import time, statistics
+from src.module import your_function
+from tests.test_module import run_tests
+
+# Correctness check first
+if not run_tests():
+    print("TESTS FAILED")
+    sys.exit(1)
+
+# Benchmark
+data = generate_test_data(n=10000)
+times = []
+for _ in range(10):
+    t0 = time.perf_counter()
+    your_function(data)
+    times.append((time.perf_counter() - t0) * 1000)
+
+p50 = statistics.median(times)
+print(f"p50_ms: {p50:.2f}")
+print(f"p95_ms: {statistics.quantiles(times, n=20)[18]:.2f}")
+EOF
+
+python scripts/setup_experiment.py --domain code \
+  --target src/module.py \
+  --tag mar13
+```
+
+**Metric:** `p50_ms` — median latency. Lower = faster.
+
+---
+
+## Domain 4: Agent Skill Optimization
+
+**Best for:** Improving the quality of claude-skills SKILL.md files
+
+**Requirements:**
+- A SKILL.md to optimize
+- A task evaluation suite (15-20 standardized tasks)
+- An LLM judge for scoring
+
+**Setup:**
+```bash
+# Create a new skill research project
+mkdir skill-research-{skill-name} && cd skill-research-{skill-name}
+git init
+
+# Copy the skill to optimize
+cp ~/.claude/skills/{skill-name}/SKILL.md .
+
+# Create evaluate.py
+cat > scripts/skill_evaluator.py << 'EOF'
+#!/usr/bin/env python3
+# Fixed evaluator — DO NOT MODIFY
+# Runs SKILL.md against 15 standardized tasks using LLM judge
+# Outputs: pass_rate: 0.80 (etc.)
+EOF
+
+python scripts/setup_experiment.py --domain skill --tag mar13
+```
+
+**Metric:** `pass_rate` (0-1). Higher = better skill.
+
+---
+
+## Choosing Your Domain
+
+| Question | Recommendation |
+|----------|---------------|
+| Do I have a GPU and want to improve an LLM? | ML Training |
+| Do I want to improve a prompt/system message? | Prompt Engineering |
+| Do I have slow Python code I want to speed up? | Code Performance |
+| Do I want to improve one of my claude-skills? | Skill Optimization |
+
+**First time?** Start with **Prompt Engineering** — no GPU required, fast experiments (2 min each), immediately applicable results.
--- a/engineering/autoresearch-agent/references/program-template.md
+++ b/engineering/autoresearch-agent/references/program-template.md
@@ -0,0 +1,170 @@
+# program.md Templates
+
+Copy the template for your domain and paste into your project root as `program.md`.
+
+---
+
+## ML Training (Karpathy-style)
+
+```markdown
+# autoresearch — ML Training
+
+## Goal
+Minimize val_bpb on the validation set. Lower is better.
+
+## What You Can Change (train.py only)
+- Model architecture (depth, width, attention heads, FFN ratio)
+- Optimizer (learning rate, warmup, scheduler, weight decay)
+- Training loop (batch size, gradient accumulation, clipping)
+- Regularization (dropout, weight tying, etc.)
+- Any self-contained improvement that doesn't require new packages
+
+## What You Cannot Change
+- prepare.py (fixed — contains evaluation harness)
+- Dependencies (pyproject.toml is locked)
+- Time budget (always 5 minutes, wall clock)
+- Evaluation metric (val_bpb is the ground truth)
+
+## Strategy
+1. First run: establish baseline. Do not change anything.
+2. Explore learning rate range (try 2x and 0.5x current)
+3. Try depth changes (±2 layers)
+4. Try optimizer changes (Muon vs. AdamW variants)
+5. If things improve, double down. If stuck, try something radical.
+
+## Simplicity Rule
+A small improvement with ugly code is NOT worth it.
+Equal performance with simpler code IS worth it.
+Removing code that gets same results is the best outcome.
+
+## Stop When
+val_bpb < 0.95 OR after 100 experiments, whichever comes first.
+```
+
+---
+
+## Prompt Engineering
+
+```markdown
+# autoresearch — Prompt Optimization
+
+## Goal
+Maximize eval_score on the test suite. Higher is better (0-100).
+
+## What You Can Change (prompt.md only)
+- System prompt instructions
+- Examples and few-shot demonstrations
+- Output format specifications
+- Chain-of-thought instructions
+- Persona and tone
+- Task decomposition strategies
+
+## What You Cannot Change
+- evaluate.py (fixed evaluation harness)
+- Test cases in tests/ (ground truth)
+- Model being evaluated (specified in evaluate.py)
+- Scoring criteria (defined in evaluate.py)
+
+## Strategy
+1. First run: baseline with current prompt (or empty)
+2. Add clear role/persona definition
+3. Add output format specification
+4. Add chain-of-thought instruction
+5. Add 2-3 diverse examples
+6. Refine based on failure modes from run.log
+
+## Evaluation
+- evaluate.py runs the prompt against 20 test cases
+- Each test case is scored 0-5 by GPT-4o
+- eval_score = average * 20 (maps to 0-100)
+- Run log shows which test cases failed
+
+## Stop When
+eval_score >= 85 OR after 50 experiments.
+```
+
+---
+
+## Code Performance
+
+```markdown
+# autoresearch — Performance Optimization
+
+## Goal
+Minimize p50_ms (median latency). Lower is better.
+
+## What You Can Change (src/module.py only)
+- Algorithm implementation
+- Data structures (use faster alternatives)
+- Caching and memoization
+- Vectorization (NumPy, etc.)
+- Loop optimization
+- I/O patterns
+- Memory allocation patterns
+
+## What You Cannot Change
+- benchmark.py (fixed benchmark harness)
+- Public API (function signatures must stay the same)
+- External dependencies (add nothing new)
+- Correctness tests (tests/ must still pass)
+
+## Constraints
+- Correctness is non-negotiable. benchmark.py runs tests first.
+- If tests fail → immediate crash status, no metric recorded.
+- Memory usage: p99 < 2x baseline acceptable, hard limit at 4x.
+
+## Strategy
+1. Baseline: profile first, don't guess
+2. Check if there's any O(n²) → O(n log n) opportunity
+3. Try caching repeated computations
+4. Try NumPy vectorization for loops
+5. Try algorithm-level changes last (higher risk)
+
+## Stop When
+p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments.
+```
+
+---
+
+## Agent Skill Optimization
+
+```markdown
+# autoresearch — Skill Optimization
+
+## Goal
+Maximize pass_rate on the task evaluation suite. Higher is better (0-1).
+
+## What You Can Change (SKILL.md only)
+- Skill description and trigger phrases
+- Core workflow steps and ordering
+- Decision frameworks and rules
+- Output format specifications
+- Example inputs/outputs
+- Related skills disambiguation
+- Proactive trigger conditions
+
+## What You Cannot Change
+- scripts/skill_evaluator.py (fixed evaluation)
+- Test tasks in tests/ (ground truth benchmark)
+- Skill name (used for routing)
+- License or metadata
+
+## Evaluation
+- skill_evaluator.py runs SKILL.md against 15 standardized tasks
+- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass)
+- pass_rate = sum(scores) / 15
+
+## Strategy
+1. Baseline: run as-is
+2. Improve trigger description (better routing = more passes)
+3. Sharpen the core workflow (clearer = better execution)
+4. Add missing edge cases to the rules section
+5. Improve disambiguation (reduce false-positive routing)
+
+## Simplicity Rule
+A shorter SKILL.md that achieves the same score is better.
+Aim for 200-400 lines total.
+
+## Stop When
+pass_rate >= 0.90 OR after 30 experiments.
+```
--- a/engineering/autoresearch-agent/scripts/log_results.py
+++ b/engineering/autoresearch-agent/scripts/log_results.py
@@ -0,0 +1,126 @@
+#!/usr/bin/env python3
+"""
+autoresearch-agent: Results Logger
+
+View and analyze experiment results from results.tsv.
+
+Usage:
+    python scripts/log_results.py --summary          # Print progress table
+    python scripts/log_results.py --best             # Show best result
+    python scripts/log_results.py --history          # Full experiment history
+    python scripts/log_results.py --record commit val status desc  # Add entry manually
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+
+def load_results(path):
+    tsv = Path(path) / "results.tsv"
+    if not tsv.exists():
+        return []
+    lines = tsv.read_text().splitlines()[1:]  # skip header
+    results = []
+    for line in lines:
+        parts = line.split("\t")
+        if len(parts) >= 4:
+            try:
+                metric_val = float(parts[1]) if parts[1] != "N/A" else None
+            except ValueError:
+                metric_val = None
+            results.append({
+                "commit": parts[0],
+                "metric": metric_val,
+                "status": parts[2],
+                "description": parts[3]
+            })
+    return results
+
+
+def print_summary(results, metric_name="metric", direction="lower"):
+    if not results:
+        print("No experiments logged yet.")
+        return
+
+    keeps = [r for r in results if r["status"] == "keep"]
+    discards = [r for r in results if r["status"] == "discard"]
+    crashes = [r for r in results if r["status"] == "crash"]
+
+    print(f"\n{'─'*60}")
+    print(f"  autoresearch-agent — Results Summary")
+    print(f"{'─'*60}")
+    print(f"  Total experiments: {len(results)}")
+    print(f"  ✅ Keep:    {len(keeps):3d} ({len(keeps)/max(len(results),1)*100:.0f}%)")
+    print(f"  ❌ Discard: {len(discards):3d} ({len(discards)/max(len(results),1)*100:.0f}%)")
+    print(f"  💥 Crash:   {len(crashes):3d} ({len(crashes)/max(len(results),1)*100:.0f}%)")
+
+    if keeps:
+        valid = [r for r in keeps if r["metric"] is not None]
+        if valid:
+            baseline = valid[0]["metric"]
+            best = min(r["metric"] for r in valid) if direction == "lower" else max(r["metric"] for r in valid)
+            best_run = next(r for r in valid if r["metric"] == best)
+            improvement = ((baseline - best) / baseline * 100) if direction == "lower" else ((best - baseline) / baseline * 100)
+
+            print(f"\n  {metric_name}:")
+            print(f"    Baseline: {baseline:.6f}")
+            print(f"    Best:     {best:.6f}  (commit: {best_run['commit']})")
+            print(f"    Change:   {improvement:+.2f}%")
+
+    print(f"{'─'*60}\n")
+
+
+def print_history(results):
+    if not results:
+        print("No experiments logged yet.")
+        return
+
+    print(f"\n{'COMMIT':8} {'METRIC':10} {'STATUS':8} DESCRIPTION")
+    print("─" * 60)
+    for r in results:
+        metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash   "
+        status_icon = {"keep": "✅", "discard": "❌", "crash": "💥"}.get(r["status"], "?")
+        print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}")
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--summary", action="store_true")
+    parser.add_argument("--best", action="store_true")
+    parser.add_argument("--history", action="store_true")
+    parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC"))
+    parser.add_argument("--path", default=".")
+    parser.add_argument("--metric", default="metric")
+    parser.add_argument("--direction", default="lower", choices=["lower", "higher"])
+    args = parser.parse_args()
+
+    path = Path(args.path).resolve()
+
+    if args.record:
+        commit, metric, status, desc = args.record
+        tsv = path / "results.tsv"
+        if not tsv.exists():
+            tsv.write_text("commit\tmetric\tstatus\tdescription\n")
+        with open(tsv, "a") as f:
+            f.write(f"{commit}\t{metric}\t{status}\t{desc}\n")
+        print(f"✓ Logged: {commit} {metric} {status}")
+        return
+
+    results = load_results(path)
+
+    if args.history:
+        print_history(results)
+    elif args.best:
+        keeps = [r for r in results if r["status"] == "keep" and r["metric"]]
+        if not keeps:
+            print("No successful experiments yet.")
+            return
+        best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"])
+        print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}")
+    else:
+        print_summary(results, args.metric, args.direction)
+
+
+if __name__ == "__main__":
+    main()
--- a/engineering/autoresearch-agent/scripts/run_experiment.py
+++ b/engineering/autoresearch-agent/scripts/run_experiment.py
@@ -0,0 +1,310 @@
+#!/usr/bin/env python3
+"""
+autoresearch-agent: Experiment Runner
+
+Executes the autonomous experiment loop:
+- Reads .autoresearch.cfg for project config
+- Runs the target evaluation
+- Keeps improvements (git commit) or discards failures (git reset)
+- Logs everything to results.tsv
+- Loops indefinitely until interrupted
+
+Usage:
+    python scripts/run_experiment.py --loop      # Run forever
+    python scripts/run_experiment.py --single    # Run one experiment
+    python scripts/run_experiment.py --dry-run   # Show what would happen
+"""
+
+import argparse
+import os
+import signal
+import subprocess
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+
+def load_config(path):
+    """Load .autoresearch.cfg"""
+    cfg_file = Path(path) / ".autoresearch.cfg"
+    if not cfg_file.exists():
+        print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.")
+        sys.exit(1)
+    config = {}
+    for line in cfg_file.read_text().splitlines():
+        if ":" in line:
+            k, v = line.split(":", 1)
+            config[k.strip()] = v.strip()
+    return config
+
+
+def run_cmd(cmd, cwd=None, timeout=None):
+    """Run shell command, return (returncode, stdout, stderr)."""
+    result = subprocess.run(
+        cmd, shell=True, capture_output=True, text=True,
+        cwd=cwd, timeout=timeout
+    )
+    return result.returncode, result.stdout.strip(), result.stderr.strip()
+
+
+def get_current_commit(path):
+    _, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
+    return commit
+
+
+def get_current_metric(path, metric_grep):
+    """Read the last recorded metric from results.tsv."""
+    tsv = Path(path) / "results.tsv"
+    if not tsv.exists():
+        return None
+    lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l]
+    if not lines:
+        return None
+    last = lines[-1].split("\t")
+    try:
+        return float(last[1])
+    except (ValueError, IndexError):
+        return None
+
+
+def run_evaluation(path, evaluate_cmd, time_budget_minutes):
+    """Run evaluation with time limit."""
+    hard_limit = time_budget_minutes * 60 * 2.5  # 2.5x as hard timeout
+    t0 = time.time()
+    try:
+        code, _, _ = run_cmd(
+            f"{evaluate_cmd} > run.log 2>&1",
+            cwd=path,
+            timeout=hard_limit
+        )
+        elapsed = time.time() - t0
+        return code, elapsed
+    except subprocess.TimeoutExpired:
+        elapsed = time.time() - t0
+        return -1, elapsed  # -1 = timeout
+
+
+def extract_metric(path, metric_grep):
+    """Extract metric value from run.log."""
+    code, out, _ = run_cmd(
+        f"grep '{metric_grep}' run.log | tail -1",
+        cwd=path
+    )
+    if not out:
+        return None
+    try:
+        return float(out.split(":")[-1].strip())
+    except ValueError:
+        return None
+
+
+def is_improvement(new_val, old_val, direction):
+    """Check if new result is better than old."""
+    if old_val is None:
+        return True  # First run always "improves"
+    if direction == "lower":
+        return new_val < old_val
+    else:
+        return new_val > old_val
+
+
+def log_result(path, commit, metric_val, status, description):
+    """Append result to results.tsv."""
+    tsv = Path(path) / "results.tsv"
+    metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
+    with open(tsv, "a") as f:
+        f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")
+
+
+def get_experiment_count(path):
+    """Count experiments run so far."""
+    tsv = Path(path) / "results.tsv"
+    if not tsv.exists():
+        return 0
+    lines = tsv.read_text().splitlines()
+    return max(0, len(lines) - 1)  # subtract header
+
+
+def run_single_experiment(path, config, exp_num, dry_run=False):
+    """Run one experiment iteration."""
+    direction = config.get("metric_direction", "lower")
+    metric_grep = config.get("metric_grep", "^metric:")
+    evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py")
+    time_budget = int(config.get("time_budget_minutes", 5))
+    metric_name = config.get("metric", "metric")
+
+    best_so_far = get_current_metric(path, metric_grep)
+    ts = datetime.now().strftime("%H:%M:%S")
+
+    print(f"\n[{ts}] Experiment #{exp_num}")
+    print(f"  Best {metric_name} so far: {best_so_far}")
+
+    if dry_run:
+        print("  [DRY RUN] Would run evaluation and check metric")
+        return "dry_run"
+
+    # Save pre-experiment state for rollback
+    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path)
+    if code != 0:
+        print("  ✗ Can't get git state. Is this a git repo with commits?")
+        return "error"
+
+    # Run evaluation
+    print(f"  Running: {evaluate_cmd} (budget: {time_budget} min)")
+    ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget)
+
+    # Handle timeout
+    if ret_code == -1:
+        print(f"  ✗ TIMEOUT after {elapsed:.0f}s — discarding")
+        run_cmd("git checkout -- .", cwd=path)  # revert uncommitted changes
+        # Commit was already made by the agent before evaluation
+        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
+        curr_commit = get_current_commit(path)
+        log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s")
+        return "crash"
+
+    # Handle non-zero exit
+    if ret_code != 0:
+        # Check if it crashed
+        code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path)
+        print(f"  ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s")
+        print(f"  Last output: {tail[:200]}")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
+        curr_commit = get_current_commit(path)
+        log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}")
+        return "crash"
+
+    # Extract metric
+    metric_val = extract_metric(path, metric_grep)
+    if metric_val is None:
+        print(f"  ✗ Could not parse metric from run.log")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
+        curr_commit = get_current_commit(path)
+        log_result(path, curr_commit, None, "crash", "metric_parse_failed")
+        return "crash"
+
+    curr_commit = get_current_commit(path)
+    delta = ""
+    if best_so_far is not None:
+        diff = metric_val - best_so_far
+        delta = f" (Δ{diff:+.4f})"
+
+    print(f"  {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")
+
+    # Keep or discard
+    if is_improvement(metric_val, best_so_far, direction):
+        print(f"  ✅ KEEP — improvement confirmed")
+        log_result(path, curr_commit, metric_val, "keep",
+                   f"improvement_{metric_name}_{metric_val:.4f}")
+        return "keep"
+    else:
+        print(f"  ❌ DISCARD — no improvement")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
+        curr_commit = get_current_commit(path)
+        log_result(path, curr_commit, metric_val, "discard",
+                   f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}")
+        return "discard"
+
+
+def print_summary(path):
+    """Print experiment summary."""
+    tsv = Path(path) / "results.tsv"
+    if not tsv.exists():
+        return
+    lines = tsv.read_text().splitlines()[1:]  # skip header
+    if not lines:
+        return
+
+    keeps = [l for l in lines if "\tkeep\t" in l]
+    discards = [l for l in lines if "\tdiscard\t" in l]
+    crashes = [l for l in lines if "\tcrash\t" in l]
+
+    print(f"\n{'='*50}")
+    print(f"  Session Summary")
+    print(f"  Experiments: {len(lines)} total")
+    print(f"  ✅ Keep: {len(keeps)} | ❌ Discard: {len(discards)} | 💥 Crash: {len(crashes)}")
+
+    if keeps:
+        try:
+            first_metric = float(keeps[0].split("\t")[1])
+            last_metric = float(keeps[-1].split("\t")[1])
+            direction = "↓" if last_metric < first_metric else "↑"
+            print(f"  Best progress: {first_metric:.6f} → {last_metric:.6f} {direction}")
+        except (ValueError, IndexError):
+            pass
+    print(f"{'='*50}\n")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="autoresearch-agent runner")
+    parser.add_argument("--loop", action="store_true", help="Run forever")
+    parser.add_argument("--single", action="store_true", help="Run one experiment")
+    parser.add_argument("--dry-run", action="store_true", help="Dry run only")
+    parser.add_argument("--path", default=".", help="Project root")
+    parser.add_argument("--max-experiments", type=int, default=0,
+                        help="Max experiments (0 = unlimited)")
+    args = parser.parse_args()
+
+    path = Path(args.path).resolve()
+    config = load_config(path)
+
+    print(f"\n🔬 autoresearch-agent")
+    print(f"   Project: {path}")
+    print(f"   Target: {config.get('target', '?')}")
+    print(f"   Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
+    print(f"   Budget: {config.get('time_budget_minutes', '?')} min/experiment")
+    print(f"   Mode: {'loop' if args.loop else 'single'}")
+
+    if args.single:
+        exp_num = get_experiment_count(path) + 1
+        run_single_experiment(path, config, exp_num, args.dry_run)
+        return
+
+    if not args.loop and not args.dry_run:
+        print("\nSpecify --loop (forever) or --single (one experiment)")
+        sys.exit(1)
+
+    # Setup graceful shutdown
+    def handle_interrupt(sig, frame):
+        print_summary(path)
+        print("\n⏹ Stopped by user.")
+        sys.exit(0)
+
+    signal.signal(signal.SIGINT, handle_interrupt)
+    signal.signal(signal.SIGTERM, handle_interrupt)
+
+    # Main loop
+    consecutive_crashes = 0
+    exp_num = get_experiment_count(path) + 1
+
+    print(f"\nStarting loop. Ctrl+C to stop and print summary.\n")
+
+    while True:
+        result = run_single_experiment(path, config, exp_num, args.dry_run)
+        exp_num += 1
+
+        if result == "crash":
+            consecutive_crashes += 1
+        else:
+            consecutive_crashes = 0
+
+        # Bail if 5 consecutive crashes
+        if consecutive_crashes >= 5:
+            print("\n⚠ 5 consecutive crashes. Pausing for investigation.")
+            print("  Check run.log for the last error.")
+            break
+
+        # Check max experiments
+        if args.max_experiments > 0 and exp_num > args.max_experiments:
+            print(f"\n✓ Reached max experiments ({args.max_experiments})")
+            break
+
+        if args.single:
+            break
+
+    print_summary(path)
+
+
+if __name__ == "__main__":
+    main()
--- a/engineering/autoresearch-agent/scripts/setup_experiment.py
+++ b/engineering/autoresearch-agent/scripts/setup_experiment.py
@@ -0,0 +1,255 @@
+#!/usr/bin/env python3
+"""
+autoresearch-agent: Setup Wizard
+
+Initializes a new research run:
+1. Validates the project structure
+2. Creates a git branch
+3. Runs the baseline experiment
+4. Initializes results.tsv
+
+Usage:
+    python scripts/setup_experiment.py [--config experiment.yaml]
+    python scripts/setup_experiment.py --domain ml|prompt|code|skill
+"""
+
+import argparse
+import os
+import subprocess
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+
+DOMAINS = {
+    "ml": {
+        "target": "train.py",
+        "evaluate_cmd": "uv run train.py",
+        "metric": "val_bpb",
+        "metric_direction": "lower",
+        "time_budget_minutes": 5,
+        "metric_grep": "^val_bpb:",
+    },
+    "prompt": {
+        "target": "prompt.md",
+        "evaluate_cmd": "python evaluate.py",
+        "metric": "eval_score",
+        "metric_direction": "higher",
+        "time_budget_minutes": 2,
+        "metric_grep": "^eval_score:",
+    },
+    "code": {
+        "target": "src/module.py",
+        "evaluate_cmd": "python benchmark.py",
+        "metric": "p50_ms",
+        "metric_direction": "lower",
+        "time_budget_minutes": 10,
+        "metric_grep": "^p50_ms:",
+    },
+    "skill": {
+        "target": "SKILL.md",
+        "evaluate_cmd": "python scripts/skill_evaluator.py",
+        "metric": "pass_rate",
+        "metric_direction": "higher",
+        "time_budget_minutes": 5,
+        "metric_grep": "^pass_rate:",
+    },
+}
+
+
+def run_cmd(cmd, cwd=None, timeout=None):
+    """Run a shell command and return (returncode, stdout, stderr)."""
+    result = subprocess.run(
+        cmd, shell=True, capture_output=True, text=True,
+        cwd=cwd, timeout=timeout
+    )
+    return result.returncode, result.stdout.strip(), result.stderr.strip()
+
+
+def check_git_repo(path):
+    """Verify we're in a git repo."""
+    code, out, err = run_cmd("git rev-parse --is-inside-work-tree", cwd=path)
+    if code != 0:
+        print("✗ Not a git repository. Run: git init && git add . && git commit -m 'initial'")
+        return False
+    print("✓ Git repository found")
+    return True
+
+
+def check_program_md(path):
+    """Check program.md exists and has content."""
+    pm = Path(path) / "program.md"
+    if not pm.exists():
+        print("⚠ program.md not found. Creating template...")
+        return False
+    content = pm.read_text()
+    if len(content) < 100:
+        print("⚠ program.md looks empty. Fill it out before running experiments.")
+        return False
+    print(f"✓ program.md found ({len(content)} chars)")
+    return True
+
+
+def check_target_file(path, target):
+    """Check target file exists."""
+    tf = Path(path) / target
+    if not tf.exists():
+        print(f"✗ Target file not found: {target}")
+        return False
+    print(f"✓ Target file found: {target}")
+    return True
+
+
+def check_evaluate_script(path):
+    """Check evaluate.py exists."""
+    ev = Path(path) / "evaluate.py"
+    if not ev.exists():
+        print("⚠ evaluate.py not found. You need a fixed evaluation function.")
+        print("  Create evaluate.py that outputs: metric_name: <value>")
+        return False
+    print("✓ evaluate.py found")
+    return True
+
+
+def create_branch(path, tag):
+    """Create and checkout the experiment branch."""
+    branch = f"autoresearch/{tag}"
+    code, out, err = run_cmd(f"git checkout -b {branch}", cwd=path)
+    if code != 0:
+        if "already exists" in err:
+            print(f"✗ Branch '{branch}' already exists. Use a different tag.")
+        else:
+            print(f"✗ Failed to create branch: {err}")
+        return None
+    print(f"✓ Created branch: {branch}")
+    return branch
+
+
+def init_results_tsv(path):
+    """Create results.tsv with header."""
+    tsv = Path(path) / "results.tsv"
+    if tsv.exists():
+        print(f"✓ results.tsv already exists ({tsv.stat().st_size} bytes)")
+        return
+    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
+    print("✓ Created results.tsv")
+
+
+def run_baseline(path, evaluate_cmd, metric_grep, time_budget_minutes):
+    """Run the baseline experiment."""
+    print(f"\nRunning baseline experiment (~{time_budget_minutes} min)...")
+    timeout = time_budget_minutes * 60 * 2.5  # 2.5x budget as hard limit
+
+    t0 = time.time()
+    code, out, err = run_cmd(
+        f"{evaluate_cmd} > run.log 2>&1",
+        cwd=path,
+        timeout=timeout
+    )
+    elapsed = time.time() - t0
+
+    if code != 0:
+        print(f"✗ Baseline run failed after {elapsed:.0f}s. Check run.log")
+        return None
+
+    # Extract metric
+    grep_code, grep_out, _ = run_cmd(
+        f"grep '{metric_grep}' run.log | tail -1",
+        cwd=path
+    )
+    if not grep_out:
+        print("✗ Could not extract metric from run.log. Check metric_grep pattern.")
+        return None
+
+    metric_value = grep_out.split(":")[-1].strip()
+    print(f"✓ Baseline complete in {elapsed:.0f}s — metric: {metric_value}")
+    return metric_value
+
+
+def main():
+    parser = argparse.ArgumentParser(description="autoresearch-agent setup")
+    parser.add_argument("--domain", choices=list(DOMAINS.keys()), help="Experiment domain")
+    parser.add_argument("--target", help="Target file to optimize")
+    parser.add_argument("--evaluate-cmd", help="Evaluation command")
+    parser.add_argument("--metric", help="Metric name")
+    parser.add_argument("--direction", choices=["lower", "higher"], default="lower")
+    parser.add_argument("--budget", type=int, default=5, help="Time budget in minutes")
+    parser.add_argument("--tag", help="Run tag (used in branch name)")
+    parser.add_argument("--path", default=".", help="Project root path")
+    parser.add_argument("--skip-baseline", action="store_true")
+    args = parser.parse_args()
+
+    path = Path(args.path).resolve()
+    print(f"\n🔬 autoresearch-agent setup")
+    print(f"   Project: {path}")
+    print(f"   Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
+
+    # Get config from domain or args
+    if args.domain:
+        config = DOMAINS[args.domain].copy()
+    else:
+        config = {
+            "target": args.target or "target.py",
+            "evaluate_cmd": args.evaluate_cmd or "python evaluate.py",
+            "metric": args.metric or "score",
+            "metric_direction": args.direction,
+            "time_budget_minutes": args.budget,
+            "metric_grep": f"^{args.metric or 'score'}:",
+        }
+
+    tag = args.tag or datetime.now().strftime("%b%d").lower()
+
+    # Validation checks
+    checks = [
+        check_git_repo(path),
+        check_program_md(path),
+        check_target_file(path, config["target"]),
+        check_evaluate_script(path),
+    ]
+
+    if not all(checks):
+        print("\n⚠ Fix the above issues before running experiments.")
+        sys.exit(1)
+
+    # Create branch
+    branch = create_branch(path, tag)
+    if not branch:
+        sys.exit(1)
+
+    # Init results TSV
+    init_results_tsv(path)
+
+    # Save config for run_experiment.py
+    config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
+    (path / ".autoresearch.cfg").write_text(config_content + "\n")
+    print("✓ Saved .autoresearch.cfg")
+
+    # Run baseline
+    if not args.skip_baseline:
+        baseline = run_baseline(
+            path,
+            config["evaluate_cmd"],
+            config["metric_grep"],
+            config["time_budget_minutes"]
+        )
+        if baseline:
+            # Log baseline to TSV
+            code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
+            with open(path / "results.tsv", "a") as f:
+                f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
+            print(f"✓ Baseline logged to results.tsv")
+
+    print(f"\n✅ Setup complete!")
+    print(f"   Branch: {branch}")
+    print(f"   Target: {config['target']}")
+    print(f"   Metric: {config['metric']} ({config['metric_direction']} is better)")
+    print(f"   Budget: {config['time_budget_minutes']} min/experiment")
+    print(f"\nTo start the autonomous loop:")
+    print(f"   python scripts/run_experiment.py --loop")
+    print(f"\nOr run a single experiment:")
+    print(f"   python scripts/run_experiment.py --single")
+
+
+if __name__ == "__main__":
+    main()