refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators

Major rewrite based on deep study of Karpathy's autoresearch repo.

Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation

New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed

Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output

Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view

SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
Leo
2026-03-13 08:22:14 +01:00
parent c834d71a44
commit 12591282da
13 changed files with 1744 additions and 702 deletions


@@ -1,9 +1,9 @@
---
name: "autoresearch-agent"
description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
license: MIT
metadata:
  version: 2.0.0
  author: Alireza Rezvani
  category: engineering
  updated: 2026-03-13
@@ -13,194 +13,233 @@ metadata:
> You sleep. The agent experiments. You wake up to results.

Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

Not one guess — fifty measured attempts, compounding.

---
## When This Skill Activates

Recognize these patterns from the user:

- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.

---
## Setup

### First Time — Create the Experiment

Run the setup script. The user decides where experiments live:

**Project-level** (inside repo, git-tracked, shareable with team):

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```

**User-level** (personal, in `~/.autoresearch/`):

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```

The `--scope` flag determines where `.autoresearch/` lives:

- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- `user` → `~/.autoresearch/` in the home directory. Everything is personal.

### What Setup Creates

```
.autoresearch/
├── config.yaml          ← Global settings
├── .gitignore           ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md       ← Objectives, constraints, strategy
    ├── config.cfg       ← Target, eval cmd, metric, direction
    ├── results.tsv      ← Experiment log (gitignored)
    └── evaluate.py      ← Evaluation script (if --evaluator used)
```

### Domains

| Domain | Use Cases |
|--------|-----------|
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |

### If `program.md` Already Exists

The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
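What goes into `config.cfg` follows directly from the setup flags; a minimal sketch of a plausible layout (the key names and section are illustrative assumptions, not the script's guaranteed format):

```ini
; .autoresearch/engineering/api-speed/config.cfg (hypothetical layout)
[experiment]
target = src/api/search.py
eval = pytest bench.py --tb=no -q
metric = p50_ms
direction = lower
time_budget = 10m
```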
---
## The Experiment Loop

### Starting an Experiment

```bash
# Run specific experiment
python scripts/run_experiment.py --experiment engineering/api-speed --loop

# Single iteration (test setup)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Resume last active experiment
python scripts/run_experiment.py --resume --loop

# Dry run (show what would happen)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```

### The Loop Protocol

```
LOOP FOREVER:
  1. Read program.md for current strategy and constraints
  2. Review git log: what has been tried? What worked? What crashed?
  3. Review results.tsv: current best metric, trend, recent failures
  4. Propose ONE change to the target file
  5. Apply the change
  6. git commit -m "experiment: [short description of what changed]"
  7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
  8. Parse metric from run.log (grep for metric_name: value)
  9. Decision:
     - Metric improved → KEEP (advance branch, log "keep")
     - Metric equal or worse → REVERT (git reset --hard, log "discard")
     - Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
  10. Append result to results.tsv
  11. Go to 1
```

### Rules

- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code and getting the same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- **No new dependencies.** Only use what's already available in the project.
---
## Evaluators

Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.

### Free Evaluators (no API cost)

| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |

### LLM Judge Evaluators (uses your subscription)

| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.

The user's existing subscription covers the cost:

- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls

### Custom Evaluators

If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.

```python
#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import json
import subprocess

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
# Parse the benchmark's JSON output and print the metric
score = json.loads(result.stdout)["score"]  # adapt the key to your tool's output
print(f"my_metric: {score}")
```
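The `metric_name: value` contract is cheap to verify before committing to an overnight run; a minimal sketch of such a pre-flight check (not part of the shipped scripts):

```python
import re
import subprocess

def check_evaluator(cmd: str, metric_name: str) -> bool:
    """Run the evaluator once; confirm it exits 0 and prints 'metric_name: <number>'."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=600)
    if result.returncode != 0:
        return False
    pattern = rf"^{re.escape(metric_name)}:\s*-?\d+(\.\d+)?\s*$"
    return any(re.match(pattern, line.strip()) for line in result.stdout.splitlines())

# Example with a trivially passing stand-in evaluator
print(check_evaluator("echo 'my_metric: 3.14'", "my_metric"))
```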
---
## Viewing Results

```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```

### Dashboard Output

```
DOMAIN       EXPERIMENT    RUNS  KEPT  BEST    Δ FROM START  STATUS
engineering  api-speed     47    14    185ms   -76.9%        active
engineering  bundle-size   23    8     412KB   -58.3%        paused
marketing    medium-ctr    31    11    8.4/10  +68.0%        active
prompts      support-tone  15    6     82/100  +46.4%        done
```

### Export Formats

- **TSV** — default, tab-separated (compatible with spreadsheets)
- **CSV** — comma-separated, with proper quoting
- **Markdown** — formatted table, readable in GitHub/docs

---

## Proactive Triggers

Flag these without being asked:

- **Evaluation command untested** → Run it once before starting the loop and verify the output.
- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
- **Time budget too short** → If eval takes longer than budget, every run crashes.
- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
- **No improvement in 20+ runs** → Suggest changing strategy in `program.md` or trying a different approach.
---
@@ -214,7 +253,6 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
### Multi-tool install

```bash
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```
@@ -225,22 +263,9 @@ clawhub install autoresearch-agent
---

## Related Skills

- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops.
- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.


@@ -0,0 +1,56 @@
#!/usr/bin/env python3
"""Measure file, bundle, or Docker image size.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import subprocess
import sys

# --- CONFIGURE ONE OF THESE ---
# Option 1: File size
TARGET_FILE = "dist/main.js"
# Option 2: Directory size (uncomment to use)
# TARGET_DIR = "dist/"
# Option 3: Docker image (uncomment to use)
# DOCKER_IMAGE = "myapp:latest"
# DOCKER_BUILD_CMD = "docker build -t myapp:latest ."
# Option 4: Build first, then measure (uncomment to use)
# BUILD_CMD = "npm run build"
# --- END CONFIG ---

# Build if needed (BUILD_CMD exists only when Option 4 is uncommented)
if "BUILD_CMD" in globals():
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)
    if result.returncode != 0:
        print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)

# Measure
if "DOCKER_IMAGE" in globals():
    if "DOCKER_BUILD_CMD" in globals():
        subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
    result = subprocess.run(
        f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
        shell=True, capture_output=True, text=True
    )
    size_bytes = int(result.stdout.strip())
elif "TARGET_DIR" in globals():
    size_bytes = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, fns in os.walk(TARGET_DIR) for f in fns
    )
elif os.path.exists(TARGET_FILE):
    size_bytes = os.path.getsize(TARGET_FILE)
else:
    print(f"Target not found: {TARGET_FILE}", file=sys.stderr)
    sys.exit(1)

size_kb = size_bytes / 1024
size_mb = size_bytes / (1024 * 1024)
print(f"size_bytes: {size_bytes}")
print(f"size_kb: {size_kb:.1f}")
print(f"size_mb: {size_mb:.2f}")


@@ -0,0 +1,40 @@
#!/usr/bin/env python3
"""Measure execution speed of a target function or command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import statistics
import subprocess
import sys
import time

# --- CONFIGURE THESE ---
COMMAND = "python src/module.py"  # Command to benchmark
RUNS = 5                          # Number of runs
WARMUP = 1                        # Warmup runs (not counted)
# --- END CONFIG ---

times = []

# Warmup
for _ in range(WARMUP):
    subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)

# Benchmark
for i in range(RUNS):
    t0 = time.perf_counter()
    result = subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
    elapsed = (time.perf_counter() - t0) * 1000  # ms
    if result.returncode != 0:
        print(f"Run {i+1} failed (exit {result.returncode})", file=sys.stderr)
        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)
    times.append(elapsed)

p50 = statistics.median(times)
p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 5 else max(times)
print(f"p50_ms: {p50:.2f}")
print(f"p95_ms: {p95:.2f}")
print(f"runs: {RUNS}")


@@ -0,0 +1,39 @@
#!/usr/bin/env python3
"""Measure build/compile time.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import statistics
import subprocess
import sys
import time

# --- CONFIGURE THESE ---
BUILD_CMD = "npm run build"  # or: docker build -t test .
CLEAN_CMD = ""               # optional: npm run clean (run before each build)
RUNS = 3                     # Number of builds to average
# --- END CONFIG ---

times = []
for i in range(RUNS):
    # Clean if configured
    if CLEAN_CMD:
        subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)
    t0 = time.perf_counter()
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
    elapsed = time.perf_counter() - t0
    if result.returncode != 0:
        print(f"Build {i+1} failed (exit {result.returncode})", file=sys.stderr)
        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)
    times.append(elapsed)

avg = statistics.mean(times)
median = statistics.median(times)
print(f"build_seconds: {median:.2f}")
print(f"build_avg: {avg:.2f}")
print(f"runs: {RUNS}")


@@ -0,0 +1,72 @@
#!/usr/bin/env python3
"""LLM judge for content quality (headlines, titles, descriptions).
Uses the user's existing CLI tool (claude, codex, gemini) for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "content/titles.md"  # File being optimized
CLI_TOOL = "claude"                # or: codex, gemini
# --- END CONFIG ---

# The judge prompt is FIXED — the agent cannot change how it's evaluated
JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly.

Criteria (each scored 1-10):

1. CURIOSITY GAP — Does this make you want to click? Is there an information gap
   that can only be resolved by reading? Generic titles score 1-3. Specific,
   intriguing titles score 7-10.
2. SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved
   performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9.
3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out,
   or recognition? Flat titles score 1-3. Emotionally charged score 7-10.
4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or
   search results? Would they pause on this headline? Rate honestly.
5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally?
   Keyword-stuffed = 3. Natural integration of search terms = 8-10.

Output EXACTLY this format (nothing else):
curiosity: <score>
specificity: <score>
emotional: <score>
scroll_stop: <score>
seo: <score>
ctr_score: <average of all 5 scores>

Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""

content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"

# Call the user's CLI tool
result = subprocess.run(
    [CLI_TOOL, "-p", full_prompt],
    capture_output=True, text=True, timeout=120
)
if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

# Parse output — echo the score lines
output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith(("ctr_score:", "curiosity:", "specificity:",
                        "emotional:", "scroll_stop:", "seo:")):
        print(line)

# Verify ctr_score was found
if "ctr_score:" not in output:
    print("Could not parse ctr_score from LLM output", file=sys.stderr)
    print(f"Raw output: {output[:500]}", file=sys.stderr)
    sys.exit(1)


@@ -0,0 +1,88 @@
#!/usr/bin/env python3
"""LLM judge for marketing copy (social posts, ads, emails).
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "posts.md"  # Copy being optimized
CLI_TOOL = "claude"       # or: codex, gemini
PLATFORM = "twitter"      # twitter, linkedin, instagram, email, ad
# --- END CONFIG ---

JUDGE_PROMPTS = {
    "twitter": """Score this Twitter/X post strictly:
1. HOOK (1-10) — Does the first line stop the scroll?
2. VALUE (1-10) — Does it provide insight, entertainment, or utility?
3. ENGAGEMENT (1-10) — Would people reply, retweet, or like?
4. BREVITY (1-10) — Is every word earning its place? No filler?
5. CTA (1-10) — Is there a clear next action (even implicit)?""",
    "linkedin": """Score this LinkedIn post strictly:
1. HOOK (1-10) — Does the first line make you click "see more"?
2. STORYTELLING (1-10) — Is there a narrative arc or just statements?
3. CREDIBILITY (1-10) — Does it demonstrate expertise without bragging?
4. ENGAGEMENT (1-10) — Would professionals comment or share?
5. CTA (1-10) — Does it invite discussion or action?""",
    "instagram": """Score this Instagram caption strictly:
1. HOOK (1-10) — Does the first line grab attention?
2. RELATABILITY (1-10) — Does the audience see themselves in this?
3. VISUAL MATCH (1-10) — Does the copy complement visual content?
4. HASHTAG STRATEGY (1-10) — Are hashtags relevant and not spammy?
5. CTA (1-10) — Does it encourage saves, shares, or comments?""",
    "email": """Score this email subject + preview strictly:
1. OPEN INCENTIVE (1-10) — Would you open this in a crowded inbox?
2. SPECIFICITY (1-10) — Is it concrete or vague?
3. URGENCY (1-10) — Is there a reason to open now vs later?
4. PERSONALIZATION (1-10) — Does it feel written for someone, not everyone?
5. PREVIEW SYNC (1-10) — Does the preview text complement the subject?""",
    "ad": """Score this ad copy strictly:
1. ATTENTION (1-10) — Does it stop someone scrolling past ads?
2. DESIRE (1-10) — Does it create want for the product/service?
3. PROOF (1-10) — Is there credibility (numbers, social proof)?
4. ACTION (1-10) — Is the CTA clear and compelling?
5. OBJECTION HANDLING (1-10) — Does it preempt "why not"?""",
}

platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
JUDGE_PROMPT = f"""{platform_prompt}

Output EXACTLY this format:
criterion_1: <score>
criterion_2: <score>
criterion_3: <score>
criterion_4: <score>
criterion_5: <score>
engagement_score: <average of all 5>

Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""

content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"

result = subprocess.run(
    [CLI_TOOL, "-p", full_prompt],
    capture_output=True, text=True, timeout=120
)
if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith("engagement_score:") or line.startswith("criterion_"):
        print(line)

if "engagement_score:" not in output:
    print("Could not parse engagement_score from LLM output", file=sys.stderr)
    print(f"Raw: {output[:500]}", file=sys.stderr)
    sys.exit(1)


@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""LLM judge for prompt/instruction quality.
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import json
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "prompt.md"             # Prompt being optimized
TEST_CASES_FILE = "tests/cases.json"  # Test cases: [{"input": "...", "expected": "..."}]
CLI_TOOL = "claude"                   # or: codex, gemini
# --- END CONFIG ---

JUDGE_PROMPT_TEMPLATE = """You are evaluating a system prompt's effectiveness.

SYSTEM PROMPT BEING TESTED:
{prompt}

TEST INPUT:
{input}

EXPECTED OUTPUT (reference):
{expected}

ACTUAL OUTPUT:
{actual}

Score the actual output on these criteria (each 1-10):
1. ACCURACY — Does it match the expected output's intent and facts?
2. COMPLETENESS — Does it cover all required elements?
3. CLARITY — Is it well-structured and easy to understand?
4. INSTRUCTION_FOLLOWING — Does it follow the system prompt's guidelines?

Output EXACTLY: quality_score: <average of all 4>
Nothing else."""

prompt = Path(TARGET_FILE).read_text()
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())

scores = []
for i, case in enumerate(test_cases):
    # Generate output using the prompt
    gen_prompt = f"{prompt}\n\n{case['input']}"
    gen_result = subprocess.run(
        [CLI_TOOL, "-p", gen_prompt],
        capture_output=True, text=True, timeout=60
    )
    if gen_result.returncode != 0:
        print(f"Generation failed for case {i+1}", file=sys.stderr)
        scores.append(0)
        continue
    actual = gen_result.stdout.strip()

    # Judge the output
    judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
        prompt=prompt[:500],
        input=case["input"],
        expected=case.get("expected", "N/A"),
        actual=actual[:500]
    )
    judge_result = subprocess.run(
        [CLI_TOOL, "-p", judge_prompt],
        capture_output=True, text=True, timeout=60
    )
    if judge_result.returncode != 0:
        scores.append(0)
        continue

    # Parse score; the for-else records 0 when no quality_score line is found
    for line in judge_result.stdout.splitlines():
        if "quality_score:" in line:
            try:
                score = float(line.split(":")[-1].strip())
                scores.append(score)
            except ValueError:
                scores.append(0)
            break
    else:
        scores.append(0)
    print(f"  Case {i+1}/{len(test_cases)}: {scores[-1]:.1f}", file=sys.stderr)

if not scores:
    print("No test cases evaluated", file=sys.stderr)
    sys.exit(1)

avg = sum(scores) / len(scores)
quality = avg * 10  # Scale the 1-10 per-case average to 0-100
print(f"quality_score: {quality:.2f}")
print(f"cases_tested: {len(scores)}")
print(f"avg_per_case: {avg:.2f}")


@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import platform
import subprocess
import sys
# --- CONFIGURE THESE ---
COMMAND = "python src/module.py" # Command to measure
# --- END CONFIG ---
system = platform.system()
if system == "Linux":
# Use /usr/bin/time for peak RSS
result = subprocess.run(
f"/usr/bin/time -v {COMMAND}",
shell=True, capture_output=True, text=True, timeout=300
)
output = result.stderr
for line in output.splitlines():
if "Maximum resident set size" in line:
kb = int(line.split(":")[-1].strip())
mb = kb / 1024
print(f"peak_mb: {mb:.1f}")
print(f"peak_kb: {kb}")
sys.exit(0)
print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
sys.exit(1)
elif system == "Darwin":
# macOS: use /usr/bin/time -l
result = subprocess.run(
f"/usr/bin/time -l {COMMAND}",
shell=True, capture_output=True, text=True, timeout=300
)
output = result.stderr
for line in output.splitlines():
if "maximum resident set size" in line.lower():
# macOS reports in bytes
val = int(line.strip().split()[0])
mb = val / (1024 * 1024)
print(f"peak_mb: {mb:.1f}")
sys.exit(0)
print("Could not parse memory from time output", file=sys.stderr)
sys.exit(1)
else:
print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
sys.exit(1)
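Where `/usr/bin/time` is unavailable (e.g. minimal containers), a rough Unix-only fallback is to run the command as a child process and read its peak RSS via Python's `resource` module. A sketch, not part of the evaluator above; note that `ru_maxrss` is kilobytes on Linux but bytes on macOS, and `RUSAGE_CHILDREN` reports the maximum across all waited-for children, so run it in a fresh process:

```python
import platform
import resource
import subprocess

def peak_child_mb(command: str) -> float:
    """Run a command and report peak RSS of waited-for children, in MiB."""
    subprocess.run(command, shell=True, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss units differ by platform: kilobytes (Linux) vs bytes (macOS)
    divisor = 1024 * 1024 if platform.system() == "Darwin" else 1024
    return usage.ru_maxrss / divisor

print(f"peak_mb: {peak_child_mb('echo hi'):.1f}")
```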


@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""Measure test suite pass rate.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import re
import subprocess
import sys
# --- CONFIGURE THESE ---
TEST_CMD = "pytest tests/ --tb=no -q" # Test command
# --- END CONFIG ---
result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
output = result.stdout + "\n" + result.stderr
# Try to parse pytest output: "X passed, Y failed, Z errors"
passed = failed = errors = 0
# pytest short format: "5 passed, 2 failed in 1.23s"
match = re.search(r"(\d+) passed", output)
if match:
passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
errors = int(match.group(1))
total = passed + failed + errors
if total == 0:
# Try unittest format: "Ran X tests"
match = re.search(r"Ran (\d+) test", output)
if match:
total = int(match.group(1))
if result.returncode == 0:
passed = total
else:
# Count failures from output
fail_match = re.search(r"FAILED \(failures=(\d+)", output)
if fail_match:
failed = int(fail_match.group(1))
passed = total - failed
if total == 0:
print("Could not parse test results", file=sys.stderr)
print(f"Output: {output[:500]}", file=sys.stderr)
sys.exit(1)
rate = passed / total
print(f"pass_rate: {rate:.4f}")
print(f"passed: {passed}")
print(f"failed: {failed}")
print(f"total: {total}")
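Regex-scraping pytest's summary line is fragile across versions. When pytest is the runner, a sturdier variant asks for a machine-readable report via `--junitxml` and reads the counts from the XML attributes. A sketch of the parsing step (the report layout shown is the standard pytest JUnit shape; the helper is an assumption, not part of the evaluator above):

```python
import xml.etree.ElementTree as ET

def junit_pass_rate(xml_text: str) -> float:
    """Compute pass rate from a pytest --junitxml report."""
    root = ET.fromstring(xml_text)
    # pytest wraps a single <testsuite> in <testsuites>
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    total = int(suite.get("tests", "0"))
    bad = int(suite.get("failures", "0")) + int(suite.get("errors", "0"))
    return (total - bad) / total if total else 0.0

report = '<testsuites><testsuite tests="8" failures="1" errors="1" skipped="0"/></testsuites>'
print(f"pass_rate: {junit_pass_rate(report):.4f}")  # → pass_rate: 0.7500
```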


@@ -1,175 +1,255 @@
# Experiment Domains Guide

## Domain: Engineering

### Code Speed Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --evaluator benchmark_speed
```

**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.
**Cost:** Free — just runs benchmarks.
**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.

### Bundle Size Reduction

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name bundle-size \
  --target webpack.config.js \
  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
  --metric size_bytes \
  --direction lower \
  --evaluator benchmark_size
```

Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.

### Test Pass Rate

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name fix-flaky-tests \
  --target src/utils/parser.py \
  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
  --metric pass_rate \
  --direction higher \
  --evaluator test_pass_rate
```

### Docker Build Speed

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name docker-build \
  --target Dockerfile \
  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
  --metric build_seconds \
  --direction lower \
  --evaluator build_speed
```

### Memory Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name memory-usage \
  --target src/processor.py \
  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
  --metric peak_mb \
  --direction lower \
  --evaluator memory_usage
```

### ML Training (Karpathy-style)

Requires an NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name ml-training \
  --target train.py \
  --eval "uv run train.py" \
  --metric val_bpb \
  --direction lower \
  --time-budget 5
```

---

## Domain: Marketing

### Medium Article Headlines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.

**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.
**Cost:** Uses your CLI subscription (Claude Max = unlimited).
**Speed:** ~2 min/experiment, ~30/hour.

### Social Media Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name twitter-engagement \
  --target social/tweets.md \
  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "twitter"` (or `linkedin`, `instagram`).

### Email Subject Lines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name email-open-rate \
  --target emails/subjects.md \
  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "email"`.

### Ad Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name ad-copy-q2 \
  --target ads/google-search.md \
  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "ad"`.

---

## Domain: Content

### Article Structure & Readability

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name article-structure \
  --target drafts/my-article.md \
  --eval "python .autoresearch/content/article-structure/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

### SEO Descriptions

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name seo-meta \
  --target seo/descriptions.md \
  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

---

## Domain: Prompts

### System Prompt Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name support-bot \
  --target prompts/support-system.md \
  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```

Requires `tests/cases.json` with test inputs and expected outputs:

```json
[
  {
    "input": "I can't log in to my account",
    "expected": "Ask for email, check account status, offer password reset"
  },
  {
    "input": "How do I cancel my subscription?",
    "expected": "Empathetic response, explain cancellation steps, offer retention"
  }
]
```

### Agent Skill Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name skill-improvement \
  --target SKILL.md \
  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```

---

## Choosing Your Domain

| I want to... | Domain | Evaluator | Cost |
|--------------|--------|-----------|------|
| Speed up my code | engineering | benchmark_speed | Free |
| Shrink my bundle | engineering | benchmark_size | Free |
| Fix flaky tests | engineering | test_pass_rate | Free |
| Speed up Docker builds | engineering | build_speed | Free |
| Reduce memory usage | engineering | memory_usage | Free |
| Train ML models | engineering | (custom) | Free + GPU |
| Write better headlines | marketing | llm_judge_content | Subscription |
| Improve social posts | marketing | llm_judge_copy | Subscription |
| Optimize email subjects | marketing | llm_judge_copy | Subscription |
| Improve ad copy | marketing | llm_judge_copy | Subscription |
| Optimize article structure | content | llm_judge_content | Subscription |
| Improve SEO descriptions | content | llm_judge_content | Subscription |
| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
| Improve agent skills | prompts | llm_judge_prompt | Subscription |

**First time?** Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.
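For a `custom` domain, the only contract an evaluator must honor is the one the built-ins follow: print `<metric_name>: <number>` on stdout and exit non-zero on failure. A minimal skeleton (the `measure` function and `my_metric` name are placeholders to replace):

```python
#!/usr/bin/env python3
"""Custom evaluator skeleton.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import sys

def measure() -> float:
    # Replace with your real measurement (benchmark run, judge call, file size...)
    return 42.0

try:
    value = measure()
except Exception as exc:
    # Non-zero exit -> the loop logs a crash and discards the change
    print(f"Evaluation failed: {exc}", file=sys.stderr)
    sys.exit(1)

print(f"my_metric: {value:.4f}")
```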


@@ -1,125 +1,389 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Results Viewer

View experiment results in multiple formats: terminal, CSV, Markdown.
Supports single experiment, domain, or cross-experiment dashboard.

Usage:
    python scripts/log_results.py --experiment engineering/api-speed
    python scripts/log_results.py --domain engineering
    python scripts/log_results.py --dashboard
    python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
    python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
    python scripts/log_results.py --dashboard --format markdown --output dashboard.md
"""
import argparse
import csv
import io
import sys
import time
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg."""
    cfg_file = experiment_dir / "config.cfg"
    config = {}
    if cfg_file.exists():
        for line in cfg_file.read_text().splitlines():
            if ":" in line:
                k, v = line.split(":", 1)
                config[k.strip()] = v.strip()
    return config


def load_results(experiment_dir):
    """Load results.tsv into a list of dicts."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return []
    results = []
    for line in tsv.read_text().splitlines()[1:]:  # skip header
        parts = line.split("\t")
        if len(parts) >= 4:
            try:
                metric = float(parts[1]) if parts[1] != "N/A" else None
            except ValueError:
                metric = None
            results.append({
                "commit": parts[0],
                "metric": metric,
                "status": parts[2],
                "description": parts[3],
            })
    return results


def compute_stats(results, direction):
    """Compute summary statistics from results."""
    keeps = [r for r in results if r["status"] == "keep"]
    discards = [r for r in results if r["status"] == "discard"]
    crashes = [r for r in results if r["status"] == "crash"]
    valid_keeps = [r for r in keeps if r["metric"] is not None]
    baseline = valid_keeps[0]["metric"] if valid_keeps else None
    if valid_keeps:
        best = (min if direction == "lower" else max)(r["metric"] for r in valid_keeps)
    else:
        best = None
    pct_change = None
    if baseline is not None and best is not None and baseline != 0:
        if direction == "lower":
            pct_change = (baseline - best) / baseline * 100
        else:
            pct_change = (best - baseline) / baseline * 100
    return {
        "total": len(results),
        "keeps": len(keeps),
        "discards": len(discards),
        "crashes": len(crashes),
        "baseline": baseline,
        "best": best,
        "pct_change": pct_change,
    }


def experiment_status(exp_dir, stats):
    """Classify an experiment as active/paused/done/idle by results.tsv age."""
    tsv = exp_dir / "results.tsv"
    if not tsv.exists() or stats["total"] == 0:
        return "idle"
    age_hours = (time.time() - tsv.stat().st_mtime) / 3600
    return "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"


def iter_experiments(root, domain=None):
    """Yield (domain_dir, exp_dir) pairs, optionally filtered to one domain."""
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        if domain is not None and domain_dir.name != domain:
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
                yield domain_dir, exp_dir


# --- Terminal Output ---

def print_experiment(experiment_dir, experiment_path):
    """Print single experiment results to terminal."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")
    if not results:
        print(f"No results for {experiment_path}")
        return
    stats = compute_stats(results, direction)
    print(f"\n{'=' * 65}")
    print(f"  {experiment_path}")
    print(f"  Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
    print(f"{'=' * 65}")
    print(f"  Total: {stats['total']} | Keep: {stats['keeps']} | "
          f"Discard: {stats['discards']} | Crash: {stats['crashes']}")
    if stats["baseline"] is not None and stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        print(f"  Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")
    print(f"\n  {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
    print(f"  {'-' * 60}")
    for r in results:
        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
        icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
        print(f"  {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
    print()


def print_dashboard(root, domain=None):
    """Print cross-experiment dashboard."""
    rows = []
    for domain_dir, exp_dir in iter_experiments(root, domain):
        config = load_config(exp_dir)
        results = load_results(exp_dir)
        stats = compute_stats(results, config.get("metric_direction", "lower"))
        rows.append({
            "domain": domain_dir.name,
            "name": exp_dir.name,
            "runs": stats["total"],
            "kept": stats["keeps"],
            "best": f"{stats['best']:.4f}" if stats["best"] is not None else "",
            "change": f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "",
            "status": experiment_status(exp_dir, stats),
        })
    if not rows:
        print("No experiments found.")
        return rows
    print(f"\n{'=' * 90}")
    print("  autoresearch — Dashboard")
    print(f"{'=' * 90}")
    print(f"  {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} "
          f"{'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
    print(f"  {'-' * 85}")
    for e in rows:
        print(f"  {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} "
              f"{e['best']:>12} {e['change']:>10} {e['status']:<8}")
    print()
    return rows


# --- CSV Export ---

def export_experiment_csv(experiment_dir, experiment_path):
    """Export single experiment as a CSV string."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    stats = compute_stats(results, direction)
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Header with metadata
    writer.writerow(["# Experiment", experiment_path])
    writer.writerow(["# Target", config.get("target", "")])
    writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
    if stats["baseline"] is not None:
        writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
    if stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
    writer.writerow(["# Total", stats["total"]])
    writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
    writer.writerow([])
    writer.writerow(["Commit", "Metric", "Status", "Description"])
    for r in results:
        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
        writer.writerow([r["commit"], m, r["status"], r["description"]])
    return buf.getvalue()


def export_dashboard_csv(root, domain=None):
    """Export dashboard as a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept",
                     "Discarded", "Crashed", "Best", "Change"])
    for domain_dir, exp_dir in iter_experiments(root, domain):
        config = load_config(exp_dir)
        results = load_results(exp_dir)
        stats = compute_stats(results, config.get("metric_direction", "lower"))
        writer.writerow([
            domain_dir.name, exp_dir.name, config.get("metric", ""),
            stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
            f"{stats['best']:.6f}" if stats["best"] is not None else "",
            f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "",
        ])
    return buf.getvalue()


# --- Markdown Export ---

def export_experiment_markdown(experiment_dir, experiment_path):
    """Export single experiment as a Markdown string."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")
    stats = compute_stats(results, direction)
    lines = [f"# Autoresearch: {experiment_path}\n"]
    lines.append(f"**Target:** `{config.get('target', '?')}`  ")
    lines.append(f"**Metric:** `{metric_name}` ({direction} is better)  ")
    lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, "
                 f"{stats['discards']} discarded, {stats['crashes']} crashed\n")
    if stats["baseline"] is not None and stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
    lines.append("| Commit | Metric | Status | Description |")
    lines.append("|--------|--------|--------|-------------|")
    for r in results:
        m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
        lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
    lines.append("")
    return "\n".join(lines)


def export_dashboard_markdown(root, domain=None):
    """Export dashboard as a Markdown string."""
    lines = ["# Autoresearch Dashboard\n"]
    lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
    lines.append("|--------|-----------|--------|------|------|------|--------|--------|")
    for domain_dir, exp_dir in iter_experiments(root, domain):
        config = load_config(exp_dir)
        results = load_results(exp_dir)
        stats = compute_stats(results, config.get("metric_direction", "lower"))
        best = f"`{stats['best']:.4f}`" if stats["best"] is not None else ""
        pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
        status = experiment_status(exp_dir, stats)
        lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | "
                     f"{stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")
    lines.append("")
    return "\n".join(lines)


# --- Main ---

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
    parser.add_argument("--experiment", help="Show one experiment: domain/name")
    parser.add_argument("--domain", help="Show all experiments in a domain")
    parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
    parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
    parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
                        help="Output format (default: terminal)")
    parser.add_argument("--output", "-o", help="Write to file instead of stdout")
    args = parser.parse_args()

    root = find_autoresearch_root()
    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    output_text = None

    # Single experiment
    if args.experiment:
        experiment_dir = root / args.experiment
        if not experiment_dir.exists():
            print(f"Experiment not found: {args.experiment}")
            sys.exit(1)
        if args.format == "csv":
            output_text = export_experiment_csv(experiment_dir, args.experiment)
        elif args.format == "markdown":
            output_text = export_experiment_markdown(experiment_dir, args.experiment)
        else:
            print_experiment(experiment_dir, args.experiment)
            return

    # Domain
    elif args.domain:
        if not (root / args.domain).exists():
            print(f"Domain not found: {args.domain}")
            sys.exit(1)
        if args.format == "csv":
            output_text = export_dashboard_csv(root, domain=args.domain)
        elif args.format == "markdown":
            output_text = export_dashboard_markdown(root, domain=args.domain)
        else:
            for _, exp_dir in iter_experiments(root, args.domain):
                print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
            return

    # Dashboard (default, --dashboard, or --all)
    else:
        if args.format == "csv":
            output_text = export_dashboard_csv(root)
        elif args.format == "markdown":
            output_text = export_dashboard_markdown(root)
        else:
            print_dashboard(root)
            return

    # Write output
    if args.output:
        Path(args.output).write_text(output_text)
        print(f"Written to {args.output}")
    else:
        print(output_text)


if __name__ == "__main__":
    main()


@@ -2,17 +2,15 @@
"""
autoresearch-agent: Experiment Runner

Executes the autonomous experiment loop for a specific experiment.
Reads config from .autoresearch/{domain}/{name}/config.cfg.

Usage:
    python scripts/run_experiment.py --experiment engineering/api-speed --loop
    python scripts/run_experiment.py --experiment engineering/api-speed --single
    python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
    python scripts/run_experiment.py --resume --loop
    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
"""
import argparse
@@ -25,11 +23,22 @@ from datetime import datetime
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg from experiment directory."""
    cfg_file = experiment_dir / "config.cfg"
    if not cfg_file.exists():
        print(f"  Error: no config.cfg in {experiment_dir}")
        sys.exit(1)
    config = {}
    for line in cfg_file.read_text().splitlines():
@@ -49,239 +58,293 @@ def run_cmd(cmd, cwd=None, timeout=None):
def get_current_commit(path):
    """Get short hash of current HEAD."""
    _, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
    return commit


def get_best_metric(experiment_dir, direction):
    """Read the best metric from results.tsv."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return None
    lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
    if not lines:
        return None
    metrics = []
    for line in lines:
        parts = line.split("\t")
        try:
            if parts[1] != "N/A":
                metrics.append(float(parts[1]))
        except (ValueError, IndexError):
            continue
    if not metrics:
        return None
    return min(metrics) if direction == "lower" else max(metrics)
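The selection rule above (only `keep` rows count, `N/A` metrics are skipped, `min` for "lower", `max` for "higher") can be exercised against an in-memory sample. A standalone sketch; the rows below are invented for illustration, not taken from a real run:

```python
# Sample results.tsv content (made-up rows, same columns the runner writes).
SAMPLE_TSV = """commit\tmetric\tstatus\tdescription
a1b2c3d\t412.500000\tkeep\tbaseline
d4e5f6a\t398.100000\tkeep\timproved_p50_ms_398.1000
b7c8d9e\t405.000000\tdiscard\tno_improvement_405.0000_vs_398.1000
e0f1a2b\tN/A\tcrash\ttimeout_750s
"""

def best_metric(tsv_text, direction):
    """Mirror of get_best_metric's selection logic, operating on a string."""
    rows = [l for l in tsv_text.splitlines()[1:] if "\tkeep\t" in l]
    metrics = []
    for row in rows:
        parts = row.split("\t")
        if len(parts) > 1 and parts[1] != "N/A":
            metrics.append(float(parts[1]))
    if not metrics:
        return None
    return min(metrics) if direction == "lower" else max(metrics)

print(best_metric(SAMPLE_TSV, "lower"))  # prints 398.1
```

Discard and crash rows are ignored entirely, so a regression or a failed run never moves the frontier the next iteration is compared against.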

def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
    """Run evaluation with time limit. Output goes to log_file."""
    hard_limit = time_budget_minutes * 60 * 2.5
    t0 = time.time()
    try:
        code, _, _ = run_cmd(
            f"{eval_cmd} > {log_file} 2>&1",
            cwd=str(project_root),
            timeout=hard_limit
        )
        elapsed = time.time() - t0
        return code, elapsed
    except subprocess.TimeoutExpired:
        elapsed = time.time() - t0
        return -1, elapsed

def extract_metric(log_file, metric_grep):
    """Extract metric value from log file."""
    log_path = Path(log_file)
    if not log_path.exists():
        return None
    for line in reversed(log_path.read_text().splitlines()):
        stripped = line.strip()
        if stripped.startswith(metric_grep.lstrip("^")):
            try:
                return float(stripped.split(":")[-1].strip())
            except ValueError:
                continue
    return None
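extract_metric only looks at what the evaluation command printed, so any script qualifies as an evaluator if it emits a single `name: value` line. A minimal sketch of such a script; the metric name `p50_ms` and the throwaway workload are illustrative assumptions, not files from the repo:

```python
# Minimal evaluator sketch: the whole contract is one printed line of the
# form "<metric>: <float>", matched later by metric_grep (here "^p50_ms:").
import statistics
import time

def measure_p50(trials=50):
    """Median wall-clock time (ms) of a stand-in workload."""
    samples_ms = []
    for _ in range(trials):
        t0 = time.perf_counter()
        sum(i * i for i in range(1000))  # replace with the real workload
        samples_ms.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples_ms)

if __name__ == "__main__":
    print(f"p50_ms: {measure_p50():.4f}")
```

Anything else the script prints is ignored, since the parser scans the log bottom-up for the first line starting with the metric name, so noisy test output does not confuse it as long as the final metric line is well formed.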

def is_improvement(new_val, old_val, direction):
    """Check if new result is better than old."""
    if old_val is None:
        return True
    if direction == "lower":
        return new_val < old_val
    return new_val > old_val


def log_result(experiment_dir, commit, metric_val, status, description):
    """Append result to results.tsv."""
    tsv = experiment_dir / "results.tsv"
    metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
    with open(tsv, "a") as f:
        f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")


def get_experiment_count(experiment_dir):
    """Count experiments run so far."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return 0
    return max(0, len(tsv.read_text().splitlines()) - 1)  # subtract header

def get_last_active(root):
    """Find the most recently modified experiment."""
    latest = None
    latest_time = 0
    for domain_dir in root.iterdir():
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in domain_dir.iterdir():
            if not exp_dir.is_dir():
                continue
            cfg = exp_dir / "config.cfg"
            if cfg.exists() and cfg.stat().st_mtime > latest_time:
                latest_time = cfg.stat().st_mtime
                latest = f"{domain_dir.name}/{exp_dir.name}"
    return latest

def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
    """Run one experiment iteration."""
    direction = config.get("metric_direction", "lower")
    metric_grep = config.get("metric_grep", "^metric:")
    eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
    time_budget = int(config.get("time_budget_minutes", 5))
    metric_name = config.get("metric", "metric")
    log_file = str(experiment_dir / "run.log")

    best = get_best_metric(experiment_dir, direction)
    ts = datetime.now().strftime("%H:%M:%S")
    print(f"\n[{ts}] Experiment #{exp_num}")
    print(f"  Best {metric_name}: {best}")

    if dry_run:
        print("  [DRY RUN] Would run evaluation and check metric")
        return "dry_run"

    # Save state for rollback
    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
    if code != 0:
        print("  Error: can't get git state")
        return "error"

    # Run evaluation
    print(f"  Running: {eval_cmd} (budget: {time_budget}m)")
    ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)
    commit = get_current_commit(str(project_root))

    # Timeout
    if ret_code == -1:
        print(f"  TIMEOUT after {elapsed:.0f}s — discarding")
        run_cmd("git checkout -- .", cwd=str(project_root))
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
        return "crash"

    # Crash
    if ret_code != 0:
        _, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
        print(f"  CRASH (exit {ret_code}) after {elapsed:.0f}s")
        print(f"  Last output: {tail[:200]}")
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
        return "crash"

    # Extract metric
    metric_val = extract_metric(log_file, metric_grep)
    if metric_val is None:
        print(f"  Could not parse {metric_name} from run.log")
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
        return "crash"

    delta = ""
    if best is not None:
        diff = metric_val - best
        delta = f" (delta {diff:+.4f})"
    print(f"  {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")

    # Keep or discard
    if is_improvement(metric_val, best, direction):
        print("  KEEP — improvement")
        log_result(experiment_dir, commit, metric_val, "keep",
                   f"improved_{metric_name}_{metric_val:.4f}")
        return "keep"
    else:
        print("  DISCARD — no improvement")
        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
        best_str = f"{best:.4f}" if best else "?"
        log_result(experiment_dir, commit, metric_val, "discard",
                   f"no_improvement_{metric_val:.4f}_vs_{best_str}")
        return "discard"

def print_summary(experiment_dir, config):
    """Print session summary."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return
    lines = tsv.read_text().splitlines()[1:]
    if not lines:
        return
    keeps = [l for l in lines if "\tkeep\t" in l]
    discards = [l for l in lines if "\tdiscard\t" in l]
    crashes = [l for l in lines if "\tcrash\t" in l]
    metric_name = config.get("metric", "metric")
    direction = config.get("metric_direction", "lower")

    print(f"\n{'=' * 55}")
    print("  autoresearch — Session Summary")
    print(f"  Experiments: {len(lines)} total")
    print(f"  Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")
    if keeps:
        try:
            valid = []
            for l in keeps:
                parts = l.split("\t")
                if parts[1] != "N/A":
                    valid.append(float(parts[1]))
            if len(valid) >= 2:
                first = valid[0]
                best = min(valid) if direction == "lower" else max(valid)
                pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
                print(f"  {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
        except (ValueError, IndexError):
            pass
    print(f"{'=' * 55}\n")

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent runner")
    parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
    parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
    parser.add_argument("--loop", action="store_true", help="Run forever")
    parser.add_argument("--single", action="store_true", help="Run one experiment")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
    parser.add_argument("--path", default=".", help="Project root")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()
    root = find_autoresearch_root()
    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    # Resolve experiment
    experiment_path = args.experiment
    if args.resume:
        experiment_path = get_last_active(root)
        if not experiment_path:
            print("No experiments found to resume.")
            sys.exit(1)
        print(f"Resuming: {experiment_path}")
    if not experiment_path:
        print("Specify --experiment domain/name or --resume")
        sys.exit(1)

    experiment_dir = root / experiment_path
    if not experiment_dir.exists():
        print(f"Experiment not found: {experiment_dir}")
        print("Run: python scripts/setup_experiment.py --list")
        sys.exit(1)

    config = load_config(experiment_dir)
    domain, name = experiment_path.split("/", 1)

    print(f"\n  autoresearch-agent")
    print(f"  Experiment: {experiment_path}")
    print(f"  Target: {config.get('target', '?')}")
    print(f"  Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
    print(f"  Budget: {config.get('time_budget_minutes', '?')} min/experiment")
    print(f"  Mode: {'loop' if args.loop else 'single'}")

    if args.single or args.dry_run:
        exp_num = get_experiment_count(experiment_dir) + 1
        run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
        return

    if not args.loop:
        print("\nSpecify --loop (forever) or --single (one experiment)")
        sys.exit(1)

    # Graceful shutdown
    def handle_interrupt(sig, frame):
        print_summary(experiment_dir, config)
        print("\nStopped by user.")
        sys.exit(0)
    signal.signal(signal.SIGINT, handle_interrupt)
    signal.signal(signal.SIGTERM, handle_interrupt)

    consecutive_crashes = 0
    exp_num = get_experiment_count(experiment_dir) + 1
    print("\nStarting loop. Ctrl+C to stop.\n")

    while True:
        result = run_single(project_root, experiment_dir, config, exp_num, False)
        exp_num += 1
        if result == "crash":
@@ -289,21 +352,16 @@ def main():
        else:
            consecutive_crashes = 0

        if consecutive_crashes >= 5:
            print("\n  5 consecutive crashes. Pausing.")
            print("  Check .autoresearch/{}/run.log".format(experiment_path))
            break

        if 0 < args.max_experiments < exp_num:
            print(f"\n  Reached max experiments ({args.max_experiments})")
            break

    print_summary(experiment_dir, config)
if __name__ == "__main__":

View File

@@ -1,65 +1,52 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Setup Experiment

Initialize a new experiment with domain, target, evaluator, and git branch.
Creates the .autoresearch/{domain}/{name}/ directory structure.

Usage:
    python scripts/setup_experiment.py --domain engineering --name api-speed \
        --target src/api/search.py --eval "pytest bench.py" \
        --metric p50_ms --direction lower

    python scripts/setup_experiment.py --domain marketing --name medium-ctr \
        --target content/titles.md --eval "python evaluate.py" \
        --metric ctr_score --direction higher --evaluator llm_judge_content

    python scripts/setup_experiment.py --list              # List all experiments
    python scripts/setup_experiment.py --list-evaluators   # List available evaluators
"""
import argparse
import os
import shutil
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]
DOMAINS = { EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"
"ml": {
"target": "train.py", DEFAULT_CONFIG = """# autoresearch global config
"evaluate_cmd": "uv run train.py", default_time_budget_minutes: 5
"metric": "val_bpb", default_scope: project
"metric_direction": "lower", dashboard_format: markdown
"time_budget_minutes": 5, """
"metric_grep": "^val_bpb:",
}, GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
"prompt": { **/results.tsv
"target": "prompt.md", **/run.log
"evaluate_cmd": "python evaluate.py", **/run.*.log
"metric": "eval_score", config.yaml
"metric_direction": "higher", """
"time_budget_minutes": 2,
"metric_grep": "^eval_score:",
},
"code": {
"target": "src/module.py",
"evaluate_cmd": "python benchmark.py",
"metric": "p50_ms",
"metric_direction": "lower",
"time_budget_minutes": 10,
"metric_grep": "^p50_ms:",
},
"skill": {
"target": "SKILL.md",
"evaluate_cmd": "python scripts/skill_evaluator.py",
"metric": "pass_rate",
"metric_direction": "higher",
"time_budget_minutes": 5,
"metric_grep": "^pass_rate:",
},
}

def run_cmd(cmd, cwd=None, timeout=None):
    """Run shell command, return (returncode, stdout, stderr)."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=cwd, timeout=timeout
@@ -67,188 +54,315 @@ def run_cmd(cmd, cwd=None, timeout=None):
    return result.returncode, result.stdout.strip(), result.stderr.strip()

def get_autoresearch_root(scope, project_root=None):
    """Get the .autoresearch root directory based on scope."""
    if scope == "user":
        return Path.home() / ".autoresearch"
    return Path(project_root or ".") / ".autoresearch"


def init_root(root):
    """Initialize .autoresearch root if it doesn't exist."""
    created = False
    if not root.exists():
        root.mkdir(parents=True)
        created = True
        print(f"  Created {root}/")
    config_file = root / "config.yaml"
    if not config_file.exists():
        config_file.write_text(DEFAULT_CONFIG)
        print(f"  Created {config_file}")
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text(GITIGNORE_CONTENT)
        print(f"  Created {gitignore}")
    return created


def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
    """Generate a program.md template for the experiment."""
    direction_word = "Minimize" if direction == "lower" else "Maximize"
    content = f"""# autoresearch — {name}

## Goal
{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.

## What the Agent Can Change
- Only `{target}` — this is the single file being optimized.
- Everything inside that file is fair game unless constrained below.

## What the Agent Cannot Change
- The evaluation script (`evaluate.py` or the eval command). It is read-only.
- Dependencies — do not add new packages or imports that aren't already available.
- Any other files in the project unless explicitly noted here.
{f"- Additional constraints: {constraints}" if constraints else ""}

## Strategy
1. First run: establish baseline. Do not change anything.
2. Profile/analyze the current state — understand why the metric is what it is.
3. Try the most obvious improvement first (low-hanging fruit).
4. If that works, push further in the same direction.
5. If stuck, try something orthogonal or radical.
6. Read the git log of previous experiments. Don't repeat failed approaches.

## Simplicity Rule
A small improvement that adds ugly complexity is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.

## Stop When
You don't stop. The human will interrupt you when they're satisfied.
If no improvement in 20+ consecutive runs, change strategy drastically.
"""
    (experiment_dir / "program.md").write_text(content)


def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget):
    """Write experiment config."""
    content = f"""target: {target}
evaluate_cmd: {eval_cmd}
metric: {metric}
metric_direction: {direction}
metric_grep: ^{metric}:
time_budget_minutes: {time_budget}
created: {datetime.now().strftime('%Y-%m-%d %H:%M')}
"""
    (experiment_dir / "config.cfg").write_text(content)


def init_results_tsv(experiment_dir):
    """Create results.tsv with header."""
    tsv = experiment_dir / "results.tsv"
    if tsv.exists():
        print(f"  results.tsv already exists ({tsv.stat().st_size} bytes)")
        return
    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
    print("  Created results.tsv")

def copy_evaluator(experiment_dir, evaluator_name):
    """Copy a built-in evaluator to the experiment directory."""
    source = EVALUATOR_DIR / f"{evaluator_name}.py"
    if not source.exists():
        print(f"  Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}")
        print(f"  Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}")
        return False
    dest = experiment_dir / "evaluate.py"
    shutil.copy2(source, dest)
    print(f"  Copied evaluator: {evaluator_name}.py -> evaluate.py")
    return True

def create_branch(path, domain, name):
    """Create and checkout the experiment branch."""
    branch = f"autoresearch/{domain}/{name}"
    code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
    if code != 0:
        if "already exists" in err:
            print(f"  Branch '{branch}' already exists. Checking out...")
            run_cmd(f"git checkout {branch}", cwd=path)
            return branch
        print(f"  Warning: could not create branch: {err}")
        return None
    print(f"  Created branch: {branch}")
    return branch

def list_experiments(root):
    """List all experiments across all domains."""
    if not root.exists():
        print("No experiments found. Run setup to create your first experiment.")
        return

    experiments = []
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir():
                continue
            cfg_file = exp_dir / "config.cfg"
            if not cfg_file.exists():
                continue
            config = {}
            for line in cfg_file.read_text().splitlines():
                if ":" in line:
                    k, v = line.split(":", 1)
                    config[k.strip()] = v.strip()
            # Count results
            tsv = exp_dir / "results.tsv"
            runs = 0
            if tsv.exists():
                runs = max(0, len(tsv.read_text().splitlines()) - 1)
            experiments.append({
                "domain": domain_dir.name,
                "name": exp_dir.name,
                "target": config.get("target", "?"),
                "metric": config.get("metric", "?"),
                "runs": runs,
            })

    if not experiments:
        print("No experiments found.")
        return

    print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}")
    print("-" * 95)
    for e in experiments:
        print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}")
    print(f"\nTotal: {len(experiments)} experiments")

def list_evaluators():
    """List available built-in evaluators."""
    if not EVALUATOR_DIR.exists():
        print("No evaluators directory found.")
        return
    print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n")
    for f in sorted(EVALUATOR_DIR.glob("*.py")):
        # Read first docstring line
        desc = ""
        for line in f.read_text().splitlines():
            if line.strip().startswith('"""') or line.strip().startswith("'''"):
                continue
            if line.strip() and not line.startswith("#!"):
                desc = line.strip().strip('"').strip("'")
                break
        print(f"  {f.stem:<25} {desc}")

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent setup")
    parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain")
    parser.add_argument("--name", help="Experiment name (e.g. api-speed, medium-ctr)")
    parser.add_argument("--target", help="Target file to optimize")
    parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
    parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
    parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
                        help="Is lower or higher better?")
    parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
    parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
    parser.add_argument("--scope", choices=["project", "user"], default="project",
                        help="Where to store experiments: project (./) or user (~/)")
    parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
    parser.add_argument("--path", default=".", help="Project root path")
    parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
    parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
    parser.add_argument("--list", action="store_true", help="List all experiments")
    parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()

    # List mode
    if args.list:
        root = get_autoresearch_root("project", project_root)
        list_experiments(root)
        user_root = get_autoresearch_root("user")
        if user_root.exists() and user_root != root:
            print(f"\n--- User-level experiments ({user_root}) ---")
            list_experiments(user_root)
        return

    if args.list_evaluators:
        list_evaluators()
        return

    # Validate required args for setup
    if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
        parser.error("Required: --domain, --name, --target, --eval, --metric")

    root = get_autoresearch_root(args.scope, project_root)

    print(f"\n  autoresearch-agent setup")
    print(f"  Project: {project_root}")
    print(f"  Scope: {args.scope}")
    print(f"  Domain: {args.domain}")
    print(f"  Experiment: {args.name}")
    print(f"  Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")

    # Check git
    code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
    if code != 0:
        print("  Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
        sys.exit(1)
    print("  Git repository found")

    # Check target file
    target_path = project_root / args.target
    if not target_path.exists():
        print(f"  Error: target file not found: {args.target}")
        sys.exit(1)
    print(f"  Target file found: {args.target}")

    # Init root
    init_root(root)

    # Create experiment directory
    experiment_dir = root / args.domain / args.name
    if experiment_dir.exists():
        print(f"  Warning: experiment '{args.domain}/{args.name}' already exists.")
        print(f"  Use --name with a different name, or delete {experiment_dir}")
        sys.exit(1)
    experiment_dir.mkdir(parents=True)
    print(f"  Created {experiment_dir}/")

    # Create files
    create_program_md(experiment_dir, args.domain, args.name,
                      args.target, args.metric, args.direction, args.constraints)
    print("  Created program.md")
    create_config(experiment_dir, args.target, args.eval_cmd,
                  args.metric, args.direction, args.time_budget)
    print("  Created config.cfg")
    init_results_tsv(experiment_dir)

    # Copy evaluator if specified
    if args.evaluator:
        copy_evaluator(experiment_dir, args.evaluator)

    # Create git branch
    if not args.skip_branch:
        create_branch(str(project_root), args.domain, args.name)

    # Test evaluation command
    print(f"\n  Testing evaluation: {args.eval_cmd}")
    code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
    if code != 0:
        print(f"  Warning: eval command failed (exit {code})")
        if err:
            print(f"  stderr: {err[:200]}")
        print("  Fix the eval command before running the experiment loop.")
    else:
        # Check metric is parseable
        full_output = out + "\n" + err
        metric_found = False
        for line in full_output.splitlines():
            if line.strip().startswith(f"{args.metric}:"):
                metric_found = True
                print(f"  Eval works. Baseline: {line.strip()}")
                break
        if not metric_found:
            print(f"  Warning: eval ran but '{args.metric}:' not found in output.")
            print(f"  Make sure your eval command outputs: {args.metric}: <value>")

    # Summary
print(f"\n Setup complete!")
# Validation checks print(f" Experiment: {args.domain}/{args.name}")
checks = [ print(f" Target: {args.target}")
check_git_repo(path), print(f" Metric: {args.metric} ({args.direction} is better)")
check_program_md(path), print(f" Budget: {args.time_budget} min/experiment")
check_target_file(path, config["target"]), if not args.skip_branch:
check_evaluate_script(path), print(f" Branch: autoresearch/{args.domain}/{args.name}")
] print(f"\n To start:")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")
if not all(checks):
print("\n⚠ Fix the above issues before running experiments.")
sys.exit(1)
# Create branch
branch = create_branch(path, tag)
if not branch:
sys.exit(1)
# Init results TSV
init_results_tsv(path)
# Save config for run_experiment.py
config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
(path / ".autoresearch.cfg").write_text(config_content + "\n")
print("✓ Saved .autoresearch.cfg")
# Run baseline
if not args.skip_baseline:
baseline = run_baseline(
path,
config["evaluate_cmd"],
config["metric_grep"],
config["time_budget_minutes"]
)
if baseline:
# Log baseline to TSV
code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
with open(path / "results.tsv", "a") as f:
f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
print(f"✓ Baseline logged to results.tsv")
print(f"\n✅ Setup complete!")
print(f" Branch: {branch}")
print(f" Target: {config['target']}")
print(f" Metric: {config['metric']} ({config['metric_direction']} is better)")
print(f" Budget: {config['time_budget_minutes']} min/experiment")
print(f"\nTo start the autonomous loop:")
print(f" python scripts/run_experiment.py --loop")
print(f"\nOr run a single experiment:")
print(f" python scripts/run_experiment.py --single")
if __name__ == "__main__": if __name__ == "__main__":
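
The metric check in the new eval-test branch relies on a simple output contract: the evaluation command must print `<metric>: <value>` on a line of its own, and the setup script takes the first matching line as the baseline. A minimal sketch of that contract follows; the helper name `parse_metric` is illustrative and not part of the repo.

```python
def parse_metric(output: str, metric: str):
    """Return the first '<metric>: <value>' found in eval output, else None."""
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith(f"{metric}:"):
            try:
                # Everything after the first colon is the metric value.
                return float(stripped.split(":", 1)[1].strip())
            except ValueError:
                # Line matched the prefix but the value is not numeric.
                return None
    return None

if __name__ == "__main__":
    sample = "loading model...\nspeed_ms: 42.7\ndone"
    print(parse_metric(sample, "speed_ms"))  # 42.7
```

Any eval script that emits such a line (e.g. `print(f"speed_ms: {elapsed * 1000:.1f}")`) satisfies both the setup-time check and, presumably, the same parsing in the run loop.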