Merge pull request #345 from alirezarezvani/feat/autoresearch-agent-clean

feat: autoresearch-agent — autonomous experiment loop
This commit is contained in:
Alireza Rezvani
2026-03-13 07:40:38 +01:00
committed by GitHub
6 changed files with 1282 additions and 0 deletions

View File

@@ -0,0 +1,246 @@
---
name: "autoresearch-agent"
description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo."
license: MIT
metadata:
version: 1.0.0
author: Alireza Rezvani
category: engineering
updated: 2026-03-13
---
# Autoresearch Agent
> You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop.
**Works for any domain with a measurable metric:**
- ML training optimization (original use case — optimize `train.py` by `val_bpb`)
- Prompt engineering (optimize system prompts by LLM-eval quality score)
- Code performance (optimize a module by benchmark runtime/score)
- Agent skill improvement (optimize `SKILL.md` by task completion rate)
---
## Before Starting
Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing.
If no `program.md` exists, run the **Setup Wizard** below.
---
## Setup Wizard
Answer these 5 questions to configure the experiment:
### 1. What are we optimizing?
The **target** — one file the agent is allowed to modify:
- `train.py` — ML training loop (Karpathy-style)
- `prompt.md` — system prompt for an LLM
- `src/module.py` — a specific code module
- `SKILL.md` — an agent skill definition
### 2. What's the metric?
A **single number** that defines success. Lower or higher = better:
- `val_bpb` — validation bits per byte (ML, lower is better)
- `eval_score` — LLM quality score 0-100 (higher is better)
- `p50_ms` — median latency in milliseconds (lower is better)
- `pass_rate` — test pass rate 0-1 (higher is better)
### 3. What's the time budget per experiment?
How long one experiment run takes:
- `5m` — fast iteration (Karpathy default, ~12 experiments/hour)
- `10m` — moderate (6/hour)
- `30m` — slow but thorough (2/hour)
### 4. What can the agent change?
Constraints on the target file:
- Architecture only? Hyperparameters only? Everything?
- What packages/imports are available?
- What's explicitly off-limits?
### 5. What's the evaluation function?
How we score each experiment:
- Fixed script that outputs the metric (e.g. `python evaluate.py`)
- API call that returns a score
- Test suite with a pass rate
Once answered, run: `python scripts/setup_experiment.py` to initialize.
---
## The Three Files
Every autoresearch project has the same structure:
```
project/
├── program.md ← Human writes this: objectives, constraints, strategy
├── target.* ← Agent modifies this: the thing being optimized
├── evaluate.py ← Fixed: the measurement function (never touch)
├── results.tsv ← Auto-generated: experiment log (git-tracked for continuity)
└── scripts/
├── setup_experiment.py ← Initialize a new run
├── run_experiment.py ← Execute one experiment iteration
└── log_results.py ← Record results to TSV
```
### `program.md` — Your Research Directions
Write this once. The agent reads it before every experiment. It should contain:
- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code)
- **Strategy:** What directions to explore first
- **Constraints:** What the agent cannot change
- **Stopping criteria:** When a result is "good enough"
See `references/program-template.md` for domain-specific templates.
### Target File — The Only File the Agent Edits
Whatever you're optimizing. Strict scope: **one file, one metric**.
### `evaluate.py` — Fixed Evaluation (Never Modified)
The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured.
---
## The Experiment Loop
Run: `python scripts/run_experiment.py --loop`
```
LOOP FOREVER:
1. Read program.md for current strategy
2. Review git history: what has been tried? What worked?
3. Propose ONE change to the target file
4. Apply the change
5. git commit (with descriptive message)
6. Run evaluation: python evaluate.py > run.log 2>&1
7. Parse metric from run.log
8. If metric improved → ADVANCE (keep commit, log "keep")
9. If metric equal/worse → REVERT (git reset, log "discard")
10. If crash → attempt fix, if unfixable log "crash" and revert
11. Update results.tsv
12. Go to 1
```
### Rules (from Karpathy's original)
- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted.
- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win.
- **One change per experiment** — don't change 5 things at once. You won't know what worked.
- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on.
- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash.
- **No new dependencies** — only use what's already available.
---
## Results Log
`results.tsv` (tab-separated, not git-tracked):
```
commit metric status description
a1b2c3d 0.9979 keep baseline
b2c3d4e 0.9932 keep increased learning rate
c3d4e5f 1.0050 discard switched to GeLU (worse)
d4e5f6g 0.0000 crash doubled model width (OOM)
```
Run `python scripts/log_results.py --summary` for a visual summary.
---
## Domain-Specific Configurations
### ML Training (Karpathy-style)
```yaml
target: train.py
evaluate: uv run prepare.py --eval-only
metric: val_bpb (lower is better)
time_budget: 5m
git_branch: autoresearch/{date}-{tag}
```
### Prompt Engineering
```yaml
target: prompt.md
evaluate: python evaluate.py --model gpt-4o --test-cases tests/
metric: eval_score (0-100, higher is better)
time_budget: 2m
git_branch: prompt-research/{date}
```
### Code Performance
```yaml
target: src/hot_module.py
evaluate: python benchmark.py --runs 5 --warmup 1
metric: p50_ms (lower is better)
time_budget: 10m
git_branch: perf-research/{date}
```
### Agent Skill Optimization
```yaml
target: SKILL.md
evaluate: python scripts/skill_evaluator.py --task-suite tests/
metric: pass_rate (0-1, higher is better)
time_budget: 5m
git_branch: skill-research/{date}
```
See `references/experiment-domains.md` for full setup guides per domain.
---
## Scripts
| Script | Purpose |
|--------|---------|
| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run |
| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) |
| `log_results.py` | Record results to TSV; `--summary` prints progress table |
---
## Installation
### One-liner (any tool)
```bash
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
```
### Multi-tool install
```bash
# Clone the repo, then use the convert script for your tool:
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```
### OpenClaw
```bash
clawhub install autoresearch-agent
```
---
## Proactive Triggers
Flag these issues without being asked:
- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template.
- **Target file has no git history** → `git init` and commit baseline first.
- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting.
- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash.
- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions.
- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
---
## Related Skills
- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics.
- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops.
- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops.
- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function.

View File

@@ -0,0 +1,175 @@
# Experiment Domains Guide
## Domain 1: ML Training (Karpathy-style)
**Best for:** LLM/neural net training optimization on a single GPU
**Requirements:**
- NVIDIA GPU (H100 recommended, A100/RTX also work)
- CUDA + PyTorch
- `uv` package manager
- ~50GB disk (training data)
**Setup:**
```bash
# Clone autoresearch repo (the ML training environment)
git clone https://github.com/karpathy/autoresearch my-ml-research
cd my-ml-research
uv sync
uv run prepare.py # one-time data download + tokenizer (~2 min)
# Initialize autoresearch skill
cp -r ~/.claude/skills/autoresearch-agent/scripts ./scripts
# Configure
python scripts/setup_experiment.py --domain ml --tag mar13
```
**Metric:** `val_bpb` — validation bits per byte. Lower = better model.
**What the agent can change in `train.py`:**
- Model depth, width, attention heads
- Learning rate, scheduler, warmup
- Optimizer (Muon, AdamW, variants)
- Batch size, gradient accumulation
- Architecture (attention patterns, FFN type)
**Tip for smaller GPUs (Mac M-series, RTX 3090 etc):**
Karpathy recommends forks for non-H100 hardware. Lower `DEPTH` to 4, use TinyStories dataset, lower `MAX_SEQ_LEN` to 256.
---
## Domain 2: Prompt Engineering
**Best for:** Optimizing system prompts for quality/accuracy/tone
**Requirements:**
- LLM API access (OpenAI, Anthropic, etc.)
- Test cases with expected outputs
- An LLM judge for scoring (can be same model)
**Setup:**
```bash
mkdir my-prompt-research && cd my-prompt-research
git init
# Create prompt.md (the thing being optimized)
echo "You are a helpful assistant." > prompt.md
# Create evaluate.py (fixed — never modify)
cat > evaluate.py << 'EOF'
#!/usr/bin/env python3
# Fixed evaluation harness — DO NOT MODIFY
import json, sys
from pathlib import Path
PROMPT = Path("prompt.md").read_text()
# Load test cases
TEST_CASES = json.loads(Path("tests/cases.json").read_text())
# Run prompt against test cases, score with LLM judge
# ... (customize for your LLM + scoring logic)
total = sum(score_case(PROMPT, case) for case in TEST_CASES)
score = total / len(TEST_CASES) * 100
print(f"eval_score: {score:.2f}")
EOF
# Initialize
python scripts/setup_experiment.py --domain prompt --tag mar13
```
**Metric:** `eval_score` (0-100). Higher = better prompt.
---
## Domain 3: Code Performance
**Best for:** Optimizing a specific hot module for speed
**Requirements:**
- A Python module with measurable performance
- Existing tests (correctness must not regress)
- A benchmark harness
**Setup:**
```bash
cd my-project
# Create benchmark.py (fixed — never modify)
cat > benchmark.py << 'EOF'
#!/usr/bin/env python3
# Fixed benchmark — DO NOT MODIFY
import time, statistics
from src.module import your_function
from tests.test_module import run_tests
# Correctness check first
if not run_tests():
print("TESTS FAILED")
sys.exit(1)
# Benchmark
data = generate_test_data(n=10000)
times = []
for _ in range(10):
t0 = time.perf_counter()
your_function(data)
times.append((time.perf_counter() - t0) * 1000)
p50 = statistics.median(times)
print(f"p50_ms: {p50:.2f}")
print(f"p95_ms: {statistics.quantiles(times, n=20)[18]:.2f}")
EOF
python scripts/setup_experiment.py --domain code \
--target src/module.py \
--tag mar13
```
**Metric:** `p50_ms` — median latency. Lower = faster.
---
## Domain 4: Agent Skill Optimization
**Best for:** Improving the quality of claude-skills SKILL.md files
**Requirements:**
- A SKILL.md to optimize
- A task evaluation suite (15-20 standardized tasks)
- An LLM judge for scoring
**Setup:**
```bash
# Create a new skill research project
mkdir skill-research-{skill-name} && cd skill-research-{skill-name}
git init
# Copy the skill to optimize
cp ~/.claude/skills/{skill-name}/SKILL.md .
# Create evaluate.py
cat > scripts/skill_evaluator.py << 'EOF'
#!/usr/bin/env python3
# Fixed evaluator — DO NOT MODIFY
# Runs SKILL.md against 15 standardized tasks using LLM judge
# Outputs: pass_rate: 0.80 (etc.)
EOF
python scripts/setup_experiment.py --domain skill --tag mar13
```
**Metric:** `pass_rate` (0-1). Higher = better skill.
---
## Choosing Your Domain
| Question | Recommendation |
|----------|---------------|
| Do I have a GPU and want to improve an LLM? | ML Training |
| Do I want to improve a prompt/system message? | Prompt Engineering |
| Do I have slow Python code I want to speed up? | Code Performance |
| Do I want to improve one of my claude-skills? | Skill Optimization |
**First time?** Start with **Prompt Engineering** — no GPU required, fast experiments (2 min each), immediately applicable results.

View File

@@ -0,0 +1,170 @@
# program.md Templates
Copy the template for your domain and paste into your project root as `program.md`.
---
## ML Training (Karpathy-style)
```markdown
# autoresearch — ML Training
## Goal
Minimize val_bpb on the validation set. Lower is better.
## What You Can Change (train.py only)
- Model architecture (depth, width, attention heads, FFN ratio)
- Optimizer (learning rate, warmup, scheduler, weight decay)
- Training loop (batch size, gradient accumulation, clipping)
- Regularization (dropout, weight tying, etc.)
- Any self-contained improvement that doesn't require new packages
## What You Cannot Change
- prepare.py (fixed — contains evaluation harness)
- Dependencies (pyproject.toml is locked)
- Time budget (always 5 minutes, wall clock)
- Evaluation metric (val_bpb is the ground truth)
## Strategy
1. First run: establish baseline. Do not change anything.
2. Explore learning rate range (try 2x and 0.5x current)
3. Try depth changes (±2 layers)
4. Try optimizer changes (Muon vs. AdamW variants)
5. If things improve, double down. If stuck, try something radical.
## Simplicity Rule
A small improvement with ugly code is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.
## Stop When
val_bpb < 0.95 OR after 100 experiments, whichever comes first.
```
---
## Prompt Engineering
```markdown
# autoresearch — Prompt Optimization
## Goal
Maximize eval_score on the test suite. Higher is better (0-100).
## What You Can Change (prompt.md only)
- System prompt instructions
- Examples and few-shot demonstrations
- Output format specifications
- Chain-of-thought instructions
- Persona and tone
- Task decomposition strategies
## What You Cannot Change
- evaluate.py (fixed evaluation harness)
- Test cases in tests/ (ground truth)
- Model being evaluated (specified in evaluate.py)
- Scoring criteria (defined in evaluate.py)
## Strategy
1. First run: baseline with current prompt (or empty)
2. Add clear role/persona definition
3. Add output format specification
4. Add chain-of-thought instruction
5. Add 2-3 diverse examples
6. Refine based on failure modes from run.log
## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 0-5 by GPT-4o
- eval_score = average * 20 (maps to 0-100)
- Run log shows which test cases failed
## Stop When
eval_score >= 85 OR after 50 experiments.
```
---
## Code Performance
```markdown
# autoresearch — Performance Optimization
## Goal
Minimize p50_ms (median latency). Lower is better.
## What You Can Change (src/module.py only)
- Algorithm implementation
- Data structures (use faster alternatives)
- Caching and memoization
- Vectorization (NumPy, etc.)
- Loop optimization
- I/O patterns
- Memory allocation patterns
## What You Cannot Change
- benchmark.py (fixed benchmark harness)
- Public API (function signatures must stay the same)
- External dependencies (add nothing new)
- Correctness tests (tests/ must still pass)
## Constraints
- Correctness is non-negotiable. benchmark.py runs tests first.
- If tests fail → immediate crash status, no metric recorded.
- Memory usage: p99 < 2x baseline acceptable, hard limit at 4x.
## Strategy
1. Baseline: profile first, don't guess
2. Check if there's any O(n²) → O(n log n) opportunity
3. Try caching repeated computations
4. Try NumPy vectorization for loops
5. Try algorithm-level changes last (higher risk)
## Stop When
p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments.
```
---
## Agent Skill Optimization
```markdown
# autoresearch — Skill Optimization
## Goal
Maximize pass_rate on the task evaluation suite. Higher is better (0-1).
## What You Can Change (SKILL.md only)
- Skill description and trigger phrases
- Core workflow steps and ordering
- Decision frameworks and rules
- Output format specifications
- Example inputs/outputs
- Related skills disambiguation
- Proactive trigger conditions
## What You Cannot Change
- scripts/skill_evaluator.py (fixed evaluation)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata
## Evaluation
- skill_evaluator.py runs SKILL.md against 15 standardized tasks
- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15
## Strategy
1. Baseline: run as-is
2. Improve trigger description (better routing = more passes)
3. Sharpen the core workflow (clearer = better execution)
4. Add missing edge cases to the rules section
5. Improve disambiguation (reduce false-positive routing)
## Simplicity Rule
A shorter SKILL.md that achieves the same score is better.
Aim for 200-400 lines total.
## Stop When
pass_rate >= 0.90 OR after 30 experiments.
```

View File

@@ -0,0 +1,126 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Results Logger
View and analyze experiment results from results.tsv.
Usage:
python scripts/log_results.py --summary # Print progress table
python scripts/log_results.py --best # Show best result
python scripts/log_results.py --history # Full experiment history
python scripts/log_results.py --record commit val status desc # Add entry manually
"""
import argparse
import sys
from pathlib import Path
def load_results(path):
tsv = Path(path) / "results.tsv"
if not tsv.exists():
return []
lines = tsv.read_text().splitlines()[1:] # skip header
results = []
for line in lines:
parts = line.split("\t")
if len(parts) >= 4:
try:
metric_val = float(parts[1]) if parts[1] != "N/A" else None
except ValueError:
metric_val = None
results.append({
"commit": parts[0],
"metric": metric_val,
"status": parts[2],
"description": parts[3]
})
return results
def print_summary(results, metric_name="metric", direction="lower"):
if not results:
print("No experiments logged yet.")
return
keeps = [r for r in results if r["status"] == "keep"]
discards = [r for r in results if r["status"] == "discard"]
crashes = [r for r in results if r["status"] == "crash"]
print(f"\n{''*60}")
print(f" autoresearch-agent — Results Summary")
print(f"{''*60}")
print(f" Total experiments: {len(results)}")
print(f" ✅ Keep: {len(keeps):3d} ({len(keeps)/max(len(results),1)*100:.0f}%)")
print(f" ❌ Discard: {len(discards):3d} ({len(discards)/max(len(results),1)*100:.0f}%)")
print(f" 💥 Crash: {len(crashes):3d} ({len(crashes)/max(len(results),1)*100:.0f}%)")
if keeps:
valid = [r for r in keeps if r["metric"] is not None]
if valid:
baseline = valid[0]["metric"]
best = min(r["metric"] for r in valid) if direction == "lower" else max(r["metric"] for r in valid)
best_run = next(r for r in valid if r["metric"] == best)
improvement = ((baseline - best) / baseline * 100) if direction == "lower" else ((best - baseline) / baseline * 100)
print(f"\n {metric_name}:")
print(f" Baseline: {baseline:.6f}")
print(f" Best: {best:.6f} (commit: {best_run['commit']})")
print(f" Change: {improvement:+.2f}%")
print(f"{''*60}\n")
def print_history(results):
if not results:
print("No experiments logged yet.")
return
print(f"\n{'COMMIT':8} {'METRIC':10} {'STATUS':8} DESCRIPTION")
print("" * 60)
for r in results:
metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash "
status_icon = {"keep": "", "discard": "", "crash": "💥"}.get(r["status"], "?")
print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--summary", action="store_true")
parser.add_argument("--best", action="store_true")
parser.add_argument("--history", action="store_true")
parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC"))
parser.add_argument("--path", default=".")
parser.add_argument("--metric", default="metric")
parser.add_argument("--direction", default="lower", choices=["lower", "higher"])
args = parser.parse_args()
path = Path(args.path).resolve()
if args.record:
commit, metric, status, desc = args.record
tsv = path / "results.tsv"
if not tsv.exists():
tsv.write_text("commit\tmetric\tstatus\tdescription\n")
with open(tsv, "a") as f:
f.write(f"{commit}\t{metric}\t{status}\t{desc}\n")
print(f"✓ Logged: {commit} {metric} {status}")
return
results = load_results(path)
if args.history:
print_history(results)
elif args.best:
keeps = [r for r in results if r["status"] == "keep" and r["metric"]]
if not keeps:
print("No successful experiments yet.")
return
best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"])
print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}")
else:
print_summary(results, args.metric, args.direction)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Experiment Runner
Executes the autonomous experiment loop:
- Reads .autoresearch.cfg for project config
- Runs the target evaluation
- Keeps improvements (git commit) or discards failures (git reset)
- Logs everything to results.tsv
- Loops indefinitely until interrupted
Usage:
python scripts/run_experiment.py --loop # Run forever
python scripts/run_experiment.py --single # Run one experiment
python scripts/run_experiment.py --dry-run # Show what would happen
"""
import argparse
import os
import signal
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
def load_config(path):
"""Load .autoresearch.cfg"""
cfg_file = Path(path) / ".autoresearch.cfg"
if not cfg_file.exists():
print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.")
sys.exit(1)
config = {}
for line in cfg_file.read_text().splitlines():
if ":" in line:
k, v = line.split(":", 1)
config[k.strip()] = v.strip()
return config
def run_cmd(cmd, cwd=None, timeout=None):
"""Run shell command, return (returncode, stdout, stderr)."""
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True,
cwd=cwd, timeout=timeout
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
def get_current_commit(path):
_, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
return commit
def get_current_metric(path, metric_grep):
"""Read the last recorded metric from results.tsv."""
tsv = Path(path) / "results.tsv"
if not tsv.exists():
return None
lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l]
if not lines:
return None
last = lines[-1].split("\t")
try:
return float(last[1])
except (ValueError, IndexError):
return None
def run_evaluation(path, evaluate_cmd, time_budget_minutes):
"""Run evaluation with time limit."""
hard_limit = time_budget_minutes * 60 * 2.5 # 2.5x as hard timeout
t0 = time.time()
try:
code, _, _ = run_cmd(
f"{evaluate_cmd} > run.log 2>&1",
cwd=path,
timeout=hard_limit
)
elapsed = time.time() - t0
return code, elapsed
except subprocess.TimeoutExpired:
elapsed = time.time() - t0
return -1, elapsed # -1 = timeout
def extract_metric(path, metric_grep):
"""Extract metric value from run.log."""
code, out, _ = run_cmd(
f"grep '{metric_grep}' run.log | tail -1",
cwd=path
)
if not out:
return None
try:
return float(out.split(":")[-1].strip())
except ValueError:
return None
def is_improvement(new_val, old_val, direction):
"""Check if new result is better than old."""
if old_val is None:
return True # First run always "improves"
if direction == "lower":
return new_val < old_val
else:
return new_val > old_val
def log_result(path, commit, metric_val, status, description):
"""Append result to results.tsv."""
tsv = Path(path) / "results.tsv"
metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
with open(tsv, "a") as f:
f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")
def get_experiment_count(path):
"""Count experiments run so far."""
tsv = Path(path) / "results.tsv"
if not tsv.exists():
return 0
lines = tsv.read_text().splitlines()
return max(0, len(lines) - 1) # subtract header
def run_single_experiment(path, config, exp_num, dry_run=False):
"""Run one experiment iteration."""
direction = config.get("metric_direction", "lower")
metric_grep = config.get("metric_grep", "^metric:")
evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py")
time_budget = int(config.get("time_budget_minutes", 5))
metric_name = config.get("metric", "metric")
best_so_far = get_current_metric(path, metric_grep)
ts = datetime.now().strftime("%H:%M:%S")
print(f"\n[{ts}] Experiment #{exp_num}")
print(f" Best {metric_name} so far: {best_so_far}")
if dry_run:
print(" [DRY RUN] Would run evaluation and check metric")
return "dry_run"
# Save pre-experiment state for rollback
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path)
if code != 0:
print(" ✗ Can't get git state. Is this a git repo with commits?")
return "error"
# Run evaluation
print(f" Running: {evaluate_cmd} (budget: {time_budget} min)")
ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget)
# Handle timeout
if ret_code == -1:
print(f" ✗ TIMEOUT after {elapsed:.0f}s — discarding")
run_cmd("git checkout -- .", cwd=path) # revert uncommitted changes
# Commit was already made by the agent before evaluation
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s")
return "crash"
# Handle non-zero exit
if ret_code != 0:
# Check if it crashed
code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path)
print(f" ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s")
print(f" Last output: {tail[:200]}")
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}")
return "crash"
# Extract metric
metric_val = extract_metric(path, metric_grep)
if metric_val is None:
print(f" ✗ Could not parse metric from run.log")
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, None, "crash", "metric_parse_failed")
return "crash"
curr_commit = get_current_commit(path)
delta = ""
if best_so_far is not None:
diff = metric_val - best_so_far
delta = f"{diff:+.4f})"
print(f" {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")
# Keep or discard
if is_improvement(metric_val, best_so_far, direction):
print(f" ✅ KEEP — improvement confirmed")
log_result(path, curr_commit, metric_val, "keep",
f"improvement_{metric_name}_{metric_val:.4f}")
return "keep"
else:
print(f" ❌ DISCARD — no improvement")
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, metric_val, "discard",
f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}")
return "discard"
def print_summary(path):
"""Print experiment summary."""
tsv = Path(path) / "results.tsv"
if not tsv.exists():
return
lines = tsv.read_text().splitlines()[1:] # skip header
if not lines:
return
keeps = [l for l in lines if "\tkeep\t" in l]
discards = [l for l in lines if "\tdiscard\t" in l]
crashes = [l for l in lines if "\tcrash\t" in l]
print(f"\n{'='*50}")
print(f" Session Summary")
print(f" Experiments: {len(lines)} total")
print(f" ✅ Keep: {len(keeps)} | ❌ Discard: {len(discards)} | 💥 Crash: {len(crashes)}")
if keeps:
try:
first_metric = float(keeps[0].split("\t")[1])
last_metric = float(keeps[-1].split("\t")[1])
direction = "" if last_metric < first_metric else ""
print(f" Best progress: {first_metric:.6f}{last_metric:.6f} {direction}")
except (ValueError, IndexError):
pass
print(f"{'='*50}\n")
def main():
parser = argparse.ArgumentParser(description="autoresearch-agent runner")
parser.add_argument("--loop", action="store_true", help="Run forever")
parser.add_argument("--single", action="store_true", help="Run one experiment")
parser.add_argument("--dry-run", action="store_true", help="Dry run only")
parser.add_argument("--path", default=".", help="Project root")
parser.add_argument("--max-experiments", type=int, default=0,
help="Max experiments (0 = unlimited)")
args = parser.parse_args()
path = Path(args.path).resolve()
config = load_config(path)
print(f"\n🔬 autoresearch-agent")
print(f" Project: {path}")
print(f" Target: {config.get('target', '?')}")
print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment")
print(f" Mode: {'loop' if args.loop else 'single'}")
if args.single:
exp_num = get_experiment_count(path) + 1
run_single_experiment(path, config, exp_num, args.dry_run)
return
if not args.loop and not args.dry_run:
print("\nSpecify --loop (forever) or --single (one experiment)")
sys.exit(1)
# Setup graceful shutdown
def handle_interrupt(sig, frame):
print_summary(path)
print("\n⏹ Stopped by user.")
sys.exit(0)
signal.signal(signal.SIGINT, handle_interrupt)
signal.signal(signal.SIGTERM, handle_interrupt)
# Main loop
consecutive_crashes = 0
exp_num = get_experiment_count(path) + 1
print(f"\nStarting loop. Ctrl+C to stop and print summary.\n")
while True:
result = run_single_experiment(path, config, exp_num, args.dry_run)
exp_num += 1
if result == "crash":
consecutive_crashes += 1
else:
consecutive_crashes = 0
# Bail if 5 consecutive crashes
if consecutive_crashes >= 5:
print("\n⚠ 5 consecutive crashes. Pausing for investigation.")
print(" Check run.log for the last error.")
break
# Check max experiments
if args.max_experiments > 0 and exp_num > args.max_experiments:
print(f"\n✓ Reached max experiments ({args.max_experiments})")
break
if args.single:
break
print_summary(path)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Setup Wizard
Initializes a new research run:
1. Validates the project structure
2. Creates a git branch
3. Runs the baseline experiment
4. Initializes results.tsv
Usage:
python scripts/setup_experiment.py [--config experiment.yaml]
python scripts/setup_experiment.py --domain ml|prompt|code|skill
"""
import argparse
import os
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
DOMAINS = {
"ml": {
"target": "train.py",
"evaluate_cmd": "uv run train.py",
"metric": "val_bpb",
"metric_direction": "lower",
"time_budget_minutes": 5,
"metric_grep": "^val_bpb:",
},
"prompt": {
"target": "prompt.md",
"evaluate_cmd": "python evaluate.py",
"metric": "eval_score",
"metric_direction": "higher",
"time_budget_minutes": 2,
"metric_grep": "^eval_score:",
},
"code": {
"target": "src/module.py",
"evaluate_cmd": "python benchmark.py",
"metric": "p50_ms",
"metric_direction": "lower",
"time_budget_minutes": 10,
"metric_grep": "^p50_ms:",
},
"skill": {
"target": "SKILL.md",
"evaluate_cmd": "python scripts/skill_evaluator.py",
"metric": "pass_rate",
"metric_direction": "higher",
"time_budget_minutes": 5,
"metric_grep": "^pass_rate:",
},
}
def run_cmd(cmd, cwd=None, timeout=None):
"""Run a shell command and return (returncode, stdout, stderr)."""
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True,
cwd=cwd, timeout=timeout
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
def check_git_repo(path):
"""Verify we're in a git repo."""
code, out, err = run_cmd("git rev-parse --is-inside-work-tree", cwd=path)
if code != 0:
print("✗ Not a git repository. Run: git init && git add . && git commit -m 'initial'")
return False
print("✓ Git repository found")
return True
def check_program_md(path):
"""Check program.md exists and has content."""
pm = Path(path) / "program.md"
if not pm.exists():
print("⚠ program.md not found. Creating template...")
return False
content = pm.read_text()
if len(content) < 100:
print("⚠ program.md looks empty. Fill it out before running experiments.")
return False
print(f"✓ program.md found ({len(content)} chars)")
return True
def check_target_file(path, target):
"""Check target file exists."""
tf = Path(path) / target
if not tf.exists():
print(f"✗ Target file not found: {target}")
return False
print(f"✓ Target file found: {target}")
return True
def check_evaluate_script(path):
"""Check evaluate.py exists."""
ev = Path(path) / "evaluate.py"
if not ev.exists():
print("⚠ evaluate.py not found. You need a fixed evaluation function.")
print(" Create evaluate.py that outputs: metric_name: <value>")
return False
print("✓ evaluate.py found")
return True
def create_branch(path, tag):
"""Create and checkout the experiment branch."""
branch = f"autoresearch/{tag}"
code, out, err = run_cmd(f"git checkout -b {branch}", cwd=path)
if code != 0:
if "already exists" in err:
print(f"✗ Branch '{branch}' already exists. Use a different tag.")
else:
print(f"✗ Failed to create branch: {err}")
return None
print(f"✓ Created branch: {branch}")
return branch
def init_results_tsv(path):
"""Create results.tsv with header."""
tsv = Path(path) / "results.tsv"
if tsv.exists():
print(f"✓ results.tsv already exists ({tsv.stat().st_size} bytes)")
return
tsv.write_text("commit\tmetric\tstatus\tdescription\n")
print("✓ Created results.tsv")
def run_baseline(path, evaluate_cmd, metric_grep, time_budget_minutes):
"""Run the baseline experiment."""
print(f"\nRunning baseline experiment (~{time_budget_minutes} min)...")
timeout = time_budget_minutes * 60 * 2.5 # 2.5x budget as hard limit
t0 = time.time()
code, out, err = run_cmd(
f"{evaluate_cmd} > run.log 2>&1",
cwd=path,
timeout=timeout
)
elapsed = time.time() - t0
if code != 0:
print(f"✗ Baseline run failed after {elapsed:.0f}s. Check run.log")
return None
# Extract metric
grep_code, grep_out, _ = run_cmd(
f"grep '{metric_grep}' run.log | tail -1",
cwd=path
)
if not grep_out:
print("✗ Could not extract metric from run.log. Check metric_grep pattern.")
return None
metric_value = grep_out.split(":")[-1].strip()
print(f"✓ Baseline complete in {elapsed:.0f}s — metric: {metric_value}")
return metric_value
def main():
parser = argparse.ArgumentParser(description="autoresearch-agent setup")
parser.add_argument("--domain", choices=list(DOMAINS.keys()), help="Experiment domain")
parser.add_argument("--target", help="Target file to optimize")
parser.add_argument("--evaluate-cmd", help="Evaluation command")
parser.add_argument("--metric", help="Metric name")
parser.add_argument("--direction", choices=["lower", "higher"], default="lower")
parser.add_argument("--budget", type=int, default=5, help="Time budget in minutes")
parser.add_argument("--tag", help="Run tag (used in branch name)")
parser.add_argument("--path", default=".", help="Project root path")
parser.add_argument("--skip-baseline", action="store_true")
args = parser.parse_args()
path = Path(args.path).resolve()
print(f"\n🔬 autoresearch-agent setup")
print(f" Project: {path}")
print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
# Get config from domain or args
if args.domain:
config = DOMAINS[args.domain].copy()
else:
config = {
"target": args.target or "target.py",
"evaluate_cmd": args.evaluate_cmd or "python evaluate.py",
"metric": args.metric or "score",
"metric_direction": args.direction,
"time_budget_minutes": args.budget,
"metric_grep": f"^{args.metric or 'score'}:",
}
tag = args.tag or datetime.now().strftime("%b%d").lower()
# Validation checks
checks = [
check_git_repo(path),
check_program_md(path),
check_target_file(path, config["target"]),
check_evaluate_script(path),
]
if not all(checks):
print("\n⚠ Fix the above issues before running experiments.")
sys.exit(1)
# Create branch
branch = create_branch(path, tag)
if not branch:
sys.exit(1)
# Init results TSV
init_results_tsv(path)
# Save config for run_experiment.py
config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
(path / ".autoresearch.cfg").write_text(config_content + "\n")
print("✓ Saved .autoresearch.cfg")
# Run baseline
if not args.skip_baseline:
baseline = run_baseline(
path,
config["evaluate_cmd"],
config["metric_grep"],
config["time_budget_minutes"]
)
if baseline:
# Log baseline to TSV
code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
with open(path / "results.tsv", "a") as f:
f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
print(f"✓ Baseline logged to results.tsv")
print(f"\n✅ Setup complete!")
print(f" Branch: {branch}")
print(f" Target: {config['target']}")
print(f" Metric: {config['metric']} ({config['metric_direction']} is better)")
print(f" Budget: {config['time_budget_minutes']} min/experiment")
print(f"\nTo start the autonomous loop:")
print(f" python scripts/run_experiment.py --loop")
print(f"\nOr run a single experiment:")
print(f" python scripts/run_experiment.py --single")
if __name__ == "__main__":
main()