refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators

Major rewrite based on deep study of Karpathy's autoresearch repo.

Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation

New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed

Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output

Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view

SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
This commit is contained in:
Leo
2026-03-13 08:22:14 +01:00
parent c834d71a44
commit 12591282da
13 changed files with 1744 additions and 702 deletions

View File

@@ -1,9 +1,9 @@
---
name: "autoresearch-agent"
description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo."
description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
license: MIT
metadata:
version: 1.0.0
version: 2.0.0
author: Alireza Rezvani
category: engineering
updated: 2026-03-13
@@ -13,194 +13,233 @@ metadata:
> You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop.
Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
**Works for any domain with a measurable metric:**
- ML training optimization (original use case — optimize `train.py` by `val_bpb`)
- Prompt engineering (optimize system prompts by LLM-eval quality score)
- Code performance (optimize a module by benchmark runtime/score)
- Agent skill improvement (optimize `SKILL.md` by task completion rate)
Not one guess — fifty measured attempts, compounding.
---
## Before Starting
## When This Skill Activates
Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing.
Recognize these patterns from the user:
If no `program.md` exists, run the **Setup Wizard** below.
- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
If the user describes a target file + a way to measure success → this skill applies.
---
## Setup Wizard
## Setup
Answer these 5 questions to configure the experiment:
### First Time — Create the Experiment
### 1. What are we optimizing?
The **target** — one file the agent is allowed to modify:
- `train.py` — ML training loop (Karpathy-style)
- `prompt.md` — system prompt for an LLM
- `src/module.py` — a specific code module
- `SKILL.md` — an agent skill definition
Run the setup script. The user decides where experiments live:
### 2. What's the metric?
A **single number** that defines success. Lower or higher = better:
- `val_bpb` — validation bits per byte (ML, lower is better)
- `eval_score` — LLM quality score 0-100 (higher is better)
- `p50_ms` — median latency in milliseconds (lower is better)
- `pass_rate` — test pass rate 0-1 (higher is better)
### 3. What's the time budget per experiment?
How long one experiment run takes:
- `5m` — fast iteration (Karpathy default, ~12 experiments/hour)
- `10m` — moderate (6/hour)
- `30m` — slow but thorough (2/hour)
### 4. What can the agent change?
Constraints on the target file:
- Architecture only? Hyperparameters only? Everything?
- What packages/imports are available?
- What's explicitly off-limits?
### 5. What's the evaluation function?
How we score each experiment:
- Fixed script that outputs the metric (e.g. `python evaluate.py`)
- API call that returns a score
- Test suite with a pass rate
Once answered, run: `python scripts/setup_experiment.py` to initialize.
---
## The Three Files
Every autoresearch project has the same structure:
```
project/
├── program.md ← Human writes this: objectives, constraints, strategy
├── target.* ← Agent modifies this: the thing being optimized
├── evaluate.py ← Fixed: the measurement function (never touch)
├── results.tsv ← Auto-generated: experiment log (git-tracked for continuity)
└── scripts/
├── setup_experiment.py ← Initialize a new run
├── run_experiment.py ← Execute one experiment iteration
└── log_results.py ← Record results to TSV
**Project-level** (inside repo, git-tracked, shareable with team):
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name api-speed \
--target src/api/search.py \
--eval "pytest bench.py --tb=no -q" \
--metric p50_ms \
--direction lower \
--scope project
```
### `program.md` — Your Research Directions
Write this once. The agent reads it before every experiment. It should contain:
- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code)
- **Strategy:** What directions to explore first
- **Constraints:** What the agent cannot change
- **Stopping criteria:** When a result is "good enough"
**User-level** (personal, in `~/.autoresearch/`):
```bash
python scripts/setup_experiment.py \
--domain marketing \
--name medium-ctr \
--target content/titles.md \
--eval "python evaluate.py" \
--metric ctr_score \
--direction higher \
--evaluator llm_judge_content \
--scope user
```
See `references/program-template.md` for domain-specific templates.
The `--scope` flag determines where `.autoresearch/` lives:
- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- `user``~/.autoresearch/` in the home directory. Everything is personal.
### Target File — The Only File the Agent Edits
Whatever you're optimizing. Strict scope: **one file, one metric**.
### What Setup Creates
### `evaluate.py` — Fixed Evaluation (Never Modified)
The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured.
```
.autoresearch/
├── config.yaml ← Global settings
├── .gitignore ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
├── program.md ← Objectives, constraints, strategy
├── config.cfg ← Target, eval cmd, metric, direction
├── results.tsv ← Experiment log (gitignored)
└── evaluate.py ← Evaluation script (if --evaluator used)
```
### Domains
| Domain | Use Cases |
|--------|-----------|
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |
### If `program.md` Already Exists
The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
---
## The Experiment Loop
Run: `python scripts/run_experiment.py --loop`
### Starting an Experiment
```bash
# Run specific experiment
python scripts/run_experiment.py --experiment engineering/api-speed --loop
# Single iteration (test setup)
python scripts/run_experiment.py --experiment engineering/api-speed --single
# Resume last active experiment
python scripts/run_experiment.py --resume --loop
# Dry run (show what would happen)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```
### The Loop Protocol
```
LOOP FOREVER:
1. Read program.md for current strategy
2. Review git history: what has been tried? What worked?
3. Propose ONE change to the target file
4. Apply the change
5. git commit (with descriptive message)
6. Run evaluation: python evaluate.py > run.log 2>&1
7. Parse metric from run.log
8. If metric improved → ADVANCE (keep commit, log "keep")
9. If metric equal/worse → REVERT (git reset, log "discard")
10. If crash → attempt fix, if unfixable log "crash" and revert
11. Update results.tsv
12. Go to 1
1. Read program.md for current strategy and constraints
2. Review git log: what has been tried? What worked? What crashed?
3. Review results.tsv: current best metric, trend, recent failures
4. Propose ONE change to the target file
5. Apply the change
6. git commit -m "experiment: [short description of what changed]"
7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
8. Parse metric from run.log (grep for metric_name: value)
9. Decision:
- Metric improved → KEEP (advance branch, log "keep")
- Metric equal or worse → REVERT (git reset --hard, log "discard")
- Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
10. Append result to results.tsv
11. Go to 1
```
### Rules (from Karpathy's original)
### Rules
- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted.
- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win.
- **One change per experiment** — don't change 5 things at once. You won't know what worked.
- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on.
- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash.
- **No new dependencies** — only use what's already available.
- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat as crash.
- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- **No new dependencies.** Only use what's already available in the project.
---
## Results Log
## Evaluators
`results.tsv` (tab-separated, not git-tracked):
Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.
### Free Evaluators (no API cost)
| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
### LLM Judge Evaluators (uses your subscription)
| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |
LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
The user's existing subscription covers the cost:
- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls
### Custom Evaluators
If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.
```python
#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import subprocess
result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
# Parse and output
print(f"my_metric: {parse_score(result.stdout)}")
```
commit metric status description
a1b2c3d 0.9979 keep baseline
b2c3d4e 0.9932 keep increased learning rate
c3d4e5f 1.0050 discard switched to GeLU (worse)
d4e5f6g 0.0000 crash doubled model width (OOM)
```
Run `python scripts/log_results.py --summary` for a visual summary.
---
## Domain-Specific Configurations
## Viewing Results
### ML Training (Karpathy-style)
```yaml
target: train.py
evaluate: uv run prepare.py --eval-only
metric: val_bpb (lower is better)
time_budget: 5m
git_branch: autoresearch/{date}-{tag}
```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed
# All experiments in a domain
python scripts/log_results.py --domain engineering
# Cross-experiment dashboard
python scripts/log_results.py --dashboard
# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```
### Prompt Engineering
```yaml
target: prompt.md
evaluate: python evaluate.py --model gpt-4o --test-cases tests/
metric: eval_score (0-100, higher is better)
time_budget: 2m
git_branch: prompt-research/{date}
### Dashboard Output
```
DOMAIN EXPERIMENT RUNS KEPT BEST Δ FROM START STATUS
engineering api-speed 47 14 185ms -76.9% active
engineering bundle-size 23 8 412KB -58.3% paused
marketing medium-ctr 31 11 8.4/10 +68.0% active
prompts support-tone 15 6 82/100 +46.4% done
```
### Code Performance
```yaml
target: src/hot_module.py
evaluate: python benchmark.py --runs 5 --warmup 1
metric: p50_ms (lower is better)
time_budget: 10m
git_branch: perf-research/{date}
```
### Export Formats
### Agent Skill Optimization
```yaml
target: SKILL.md
evaluate: python scripts/skill_evaluator.py --task-suite tests/
metric: pass_rate (0-1, higher is better)
time_budget: 5m
git_branch: skill-research/{date}
```
See `references/experiment-domains.md` for full setup guides per domain.
- **TSV** — default, tab-separated (compatible with spreadsheets)
- **CSV** — comma-separated, with proper quoting
- **Markdown** — formatted table, readable in GitHub/docs
---
## Scripts
## Proactive Triggers
| Script | Purpose |
|--------|---------|
| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run |
| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) |
| `log_results.py` | Record results to TSV; `--summary` prints progress table |
Flag these without being asked:
- **No evaluation command works** → Test it before starting the loop. Run once, verify output.
- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
- **Time budget too short** → If eval takes longer than budget, every run crashes.
- **Agent modifying evaluate.py** → Hard stop. This invalidates all comparisons.
- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
- **No improvement in 20+ runs** → Suggest changing strategy in program.md or trying a different approach.
---
@@ -214,7 +253,6 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
### Multi-tool install
```bash
# Clone the repo, then use the convert script for your tool:
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```
@@ -225,22 +263,9 @@ clawhub install autoresearch-agent
---
## Proactive Triggers
Flag these issues without being asked:
- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template.
- **Target file has no git history** → `git init` and commit baseline first.
- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting.
- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash.
- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions.
- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
---
## Related Skills
- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics.
- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops.
- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops.
- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function.
- **self-improving-agent** improves an agent's own memory/rules over time. NOT for structured experiment loops.
- **senior-ml-engineer** ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.

View File

@@ -0,0 +1,56 @@
#!/usr/bin/env python3
"""Measure file, bundle, or Docker image size.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import subprocess
import sys
# --- CONFIGURE ONE OF THESE ---
# Option 1: File size
TARGET_FILE = "dist/main.js"
# Option 2: Directory size (uncomment to use)
# TARGET_DIR = "dist/"
# Option 3: Docker image (uncomment to use)
# DOCKER_IMAGE = "myapp:latest"
# DOCKER_BUILD_CMD = "docker build -t myapp:latest ."
# Option 4: Build first, then measure (uncomment to use)
# BUILD_CMD = "npm run build"
# --- END CONFIG ---
# Build if needed
if "BUILD_CMD" in dir() or "BUILD_CMD" in globals():
result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)
if result.returncode != 0:
print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr)
sys.exit(1)
# Measure
if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals():
if "DOCKER_BUILD_CMD" in dir():
subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
result = subprocess.run(
f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
shell=True, capture_output=True, text=True
)
size_bytes = int(result.stdout.strip())
elif "TARGET_DIR" in dir() or "TARGET_DIR" in globals():
size_bytes = sum(
os.path.getsize(os.path.join(dp, f))
for dp, _, fns in os.walk(TARGET_DIR) for f in fns
)
elif os.path.exists(TARGET_FILE):
size_bytes = os.path.getsize(TARGET_FILE)
else:
print(f"Target not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
size_kb = size_bytes / 1024
size_mb = size_bytes / (1024 * 1024)
print(f"size_bytes: {size_bytes}")
print(f"size_kb: {size_kb:.1f}")
print(f"size_mb: {size_mb:.2f}")

View File

@@ -0,0 +1,40 @@
#!/usr/bin/env python3
"""Measure execution speed of a target function or command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import statistics
import subprocess
import sys
import time
# --- CONFIGURE THESE ---
COMMAND = "python src/module.py" # Command to benchmark
RUNS = 5 # Number of runs
WARMUP = 1 # Warmup runs (not counted)
# --- END CONFIG ---
times = []
# Warmup
for _ in range(WARMUP):
subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
# Benchmark
for i in range(RUNS):
t0 = time.perf_counter()
result = subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
elapsed = (time.perf_counter() - t0) * 1000 # ms
if result.returncode != 0:
print(f"Run {i+1} failed (exit {result.returncode})", file=sys.stderr)
print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
sys.exit(1)
times.append(elapsed)
p50 = statistics.median(times)
p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 5 else max(times)
print(f"p50_ms: {p50:.2f}")
print(f"p95_ms: {p95:.2f}")
print(f"runs: {RUNS}")

View File

@@ -0,0 +1,39 @@
#!/usr/bin/env python3
"""Measure build/compile time.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
import time
# --- CONFIGURE THESE ---
BUILD_CMD = "npm run build" # or: docker build -t test .
CLEAN_CMD = "" # optional: npm run clean (run before each build)
RUNS = 3 # Number of builds to average
# --- END CONFIG ---
times = []
for i in range(RUNS):
# Clean if configured
if CLEAN_CMD:
subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)
t0 = time.perf_counter()
result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
elapsed = time.perf_counter() - t0
if result.returncode != 0:
print(f"Build {i+1} failed (exit {result.returncode})", file=sys.stderr)
print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
sys.exit(1)
times.append(elapsed)
import statistics
avg = statistics.mean(times)
median = statistics.median(times)
print(f"build_seconds: {median:.2f}")
print(f"build_avg: {avg:.2f}")
print(f"runs: {RUNS}")

View File

@@ -0,0 +1,72 @@
#!/usr/bin/env python3
"""LLM judge for content quality (headlines, titles, descriptions).
Uses the user's existing CLI tool (claude, codex, gemini) for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
from pathlib import Path
# --- CONFIGURE THESE ---
TARGET_FILE = "content/titles.md" # File being optimized
CLI_TOOL = "claude" # or: codex, gemini
# --- END CONFIG ---
# The judge prompt is FIXED — the agent cannot change how it's evaluated
JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly.
Criteria (each scored 1-10):
1. CURIOSITY GAP — Does this make you want to click? Is there an information gap
that can only be resolved by reading? Generic titles score 1-3. Specific,
intriguing titles score 7-10.
2. SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved
performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9.
3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out,
or recognition? Flat titles score 1-3. Emotionally charged score 7-10.
4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or
search results? Would they pause on this headline? Rate honestly.
5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally?
Keyword-stuffed = 3. Natural integration of search terms = 8-10.
Output EXACTLY this format (nothing else):
curiosity: <score>
specificity: <score>
emotional: <score>
scroll_stop: <score>
seo: <score>
ctr_score: <average of all 5 scores>
Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""
content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"
# Call the user's CLI tool
result = subprocess.run(
[CLI_TOOL, "-p", full_prompt],
capture_output=True, text=True, timeout=120
)
if result.returncode != 0:
print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
sys.exit(1)
# Parse output — look for ctr_score line
output = result.stdout
for line in output.splitlines():
line = line.strip()
if line.startswith("ctr_score:"):
print(line)
elif line.startswith(("curiosity:", "specificity:", "emotional:", "scroll_stop:", "seo:")):
print(line)
# Verify ctr_score was found
if "ctr_score:" not in output:
print("Could not parse ctr_score from LLM output", file=sys.stderr)
print(f"Raw output: {output[:500]}", file=sys.stderr)
sys.exit(1)

View File

@@ -0,0 +1,88 @@
#!/usr/bin/env python3
"""LLM judge for marketing copy (social posts, ads, emails).
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import subprocess
import sys
from pathlib import Path
# --- CONFIGURE THESE ---
TARGET_FILE = "posts.md" # Copy being optimized
CLI_TOOL = "claude" # or: codex, gemini
PLATFORM = "twitter" # twitter, linkedin, instagram, email, ad
# --- END CONFIG ---
JUDGE_PROMPTS = {
"twitter": """Score this Twitter/X post strictly:
1. HOOK (1-10) — Does the first line stop the scroll?
2. VALUE (1-10) — Does it provide insight, entertainment, or utility?
3. ENGAGEMENT (1-10) — Would people reply, retweet, or like?
4. BREVITY (1-10) — Is every word earning its place? No filler?
5. CTA (1-10) — Is there a clear next action (even implicit)?""",
"linkedin": """Score this LinkedIn post strictly:
1. HOOK (1-10) — Does the first line make you click "see more"?
2. STORYTELLING (1-10) — Is there a narrative arc or just statements?
3. CREDIBILITY (1-10) — Does it demonstrate expertise without bragging?
4. ENGAGEMENT (1-10) — Would professionals comment or share?
5. CTA (1-10) — Does it invite discussion or action?""",
"instagram": """Score this Instagram caption strictly:
1. HOOK (1-10) — Does the first line grab attention?
2. RELATABILITY (1-10) — Does the audience see themselves in this?
3. VISUAL MATCH (1-10) — Does the copy complement visual content?
4. HASHTAG STRATEGY (1-10) — Are hashtags relevant and not spammy?
5. CTA (1-10) — Does it encourage saves, shares, or comments?""",
"email": """Score this email subject + preview strictly:
1. OPEN INCENTIVE (1-10) — Would you open this in a crowded inbox?
2. SPECIFICITY (1-10) — Is it concrete or vague?
3. URGENCY (1-10) — Is there a reason to open now vs later?
4. PERSONALIZATION (1-10) — Does it feel written for someone, not everyone?
5. PREVIEW SYNC (1-10) — Does the preview text complement the subject?""",
"ad": """Score this ad copy strictly:
1. ATTENTION (1-10) — Does it stop someone scrolling past ads?
2. DESIRE (1-10) — Does it create want for the product/service?
3. PROOF (1-10) — Is there credibility (numbers, social proof)?
4. ACTION (1-10) — Is the CTA clear and compelling?
5. OBJECTION HANDLING (1-10) — Does it preempt "why not"?""",
}
platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
JUDGE_PROMPT = f"""{platform_prompt}
Output EXACTLY this format:
criterion_1: <score>
criterion_2: <score>
criterion_3: <score>
criterion_4: <score>
criterion_5: <score>
engagement_score: <average of all 5>
Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""
content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"
result = subprocess.run(
[CLI_TOOL, "-p", full_prompt],
capture_output=True, text=True, timeout=120
)
if result.returncode != 0:
print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
sys.exit(1)
output = result.stdout
for line in output.splitlines():
line = line.strip()
if line.startswith("engagement_score:") or line.startswith("criterion_"):
print(line)
if "engagement_score:" not in output:
print("Could not parse engagement_score from LLM output", file=sys.stderr)
print(f"Raw: {output[:500]}", file=sys.stderr)
sys.exit(1)

View File

@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""LLM judge for prompt/instruction quality.
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import json
import subprocess
import sys
from pathlib import Path
# --- CONFIGURE THESE ---
TARGET_FILE = "prompt.md" # Prompt being optimized
TEST_CASES_FILE = "tests/cases.json" # Test cases: [{"input": "...", "expected": "..."}]
CLI_TOOL = "claude" # or: codex, gemini
# --- END CONFIG ---
JUDGE_PROMPT_TEMPLATE = """You are evaluating a system prompt's effectiveness.
SYSTEM PROMPT BEING TESTED:
{prompt}
TEST INPUT:
{input}
EXPECTED OUTPUT (reference):
{expected}
ACTUAL OUTPUT:
{actual}
Score the actual output on these criteria (each 1-10):
1. ACCURACY — Does it match the expected output's intent and facts?
2. COMPLETENESS — Does it cover all required elements?
3. CLARITY — Is it well-structured and easy to understand?
4. INSTRUCTION_FOLLOWING — Does it follow the system prompt's guidelines?
Output EXACTLY: quality_score: <average of all 4>
Nothing else."""
prompt = Path(TARGET_FILE).read_text()
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
scores = []
for i, case in enumerate(test_cases):
# Generate output using the prompt
gen_prompt = f"{prompt}\n\n{case['input']}"
gen_result = subprocess.run(
[CLI_TOOL, "-p", gen_prompt],
capture_output=True, text=True, timeout=60
)
if gen_result.returncode != 0:
print(f"Generation failed for case {i+1}", file=sys.stderr)
scores.append(0)
continue
actual = gen_result.stdout.strip()
# Judge the output
judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
prompt=prompt[:500],
input=case["input"],
expected=case.get("expected", "N/A"),
actual=actual[:500]
)
judge_result = subprocess.run(
[CLI_TOOL, "-p", judge_prompt],
capture_output=True, text=True, timeout=60
)
if judge_result.returncode != 0:
scores.append(0)
continue
# Parse score
for line in judge_result.stdout.splitlines():
if "quality_score:" in line:
try:
score = float(line.split(":")[-1].strip())
scores.append(score)
except ValueError:
scores.append(0)
break
else:
scores.append(0)
print(f" Case {i+1}/{len(test_cases)}: {scores[-1]:.1f}", file=sys.stderr)
if not scores:
print("No test cases evaluated", file=sys.stderr)
sys.exit(1)
avg = sum(scores) / len(scores)
quality = avg * 10 # Scale to 0-100
print(f"quality_score: {quality:.2f}")
print(f"cases_tested: {len(scores)}")
print(f"avg_per_case: {avg:.2f}")

View File

@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import platform
import subprocess
import sys
# --- CONFIGURE THESE ---
COMMAND = "python src/module.py" # Command to measure
# --- END CONFIG ---
system = platform.system()
if system == "Linux":
# Use /usr/bin/time for peak RSS
result = subprocess.run(
f"/usr/bin/time -v {COMMAND}",
shell=True, capture_output=True, text=True, timeout=300
)
output = result.stderr
for line in output.splitlines():
if "Maximum resident set size" in line:
kb = int(line.split(":")[-1].strip())
mb = kb / 1024
print(f"peak_mb: {mb:.1f}")
print(f"peak_kb: {kb}")
sys.exit(0)
print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
sys.exit(1)
elif system == "Darwin":
# macOS: use /usr/bin/time -l
result = subprocess.run(
f"/usr/bin/time -l {COMMAND}",
shell=True, capture_output=True, text=True, timeout=300
)
output = result.stderr
for line in output.splitlines():
if "maximum resident set size" in line.lower():
# macOS reports in bytes
val = int(line.strip().split()[0])
mb = val / (1024 * 1024)
print(f"peak_mb: {mb:.1f}")
sys.exit(0)
print("Could not parse memory from time output", file=sys.stderr)
sys.exit(1)
else:
print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
sys.exit(1)

View File

@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""Measure test suite pass rate.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import re
import subprocess
import sys
# --- CONFIGURE THESE ---
TEST_CMD = "pytest tests/ --tb=no -q" # Test command
# --- END CONFIG ---
result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
output = result.stdout + "\n" + result.stderr
# Try to parse pytest output: "X passed, Y failed, Z errors"
passed = failed = errors = 0
# pytest short format: "5 passed, 2 failed in 1.23s"
match = re.search(r"(\d+) passed", output)
if match:
passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
errors = int(match.group(1))
total = passed + failed + errors
if total == 0:
# Try unittest format: "Ran X tests"
match = re.search(r"Ran (\d+) test", output)
if match:
total = int(match.group(1))
if result.returncode == 0:
passed = total
else:
# Count failures from output
fail_match = re.search(r"FAILED \(failures=(\d+)", output)
if fail_match:
failed = int(fail_match.group(1))
passed = total - failed
if total == 0:
print("Could not parse test results", file=sys.stderr)
print(f"Output: {output[:500]}", file=sys.stderr)
sys.exit(1)
rate = passed / total
print(f"pass_rate: {rate:.4f}")
print(f"passed: {passed}")
print(f"failed: {failed}")
print(f"total: {total}")

View File

@@ -1,175 +1,255 @@
# Experiment Domains Guide
## Domain 1: ML Training (Karpathy-style)
## Domain: Engineering
**Best for:** LLM/neural net training optimization on a single GPU
### Code Speed Optimization
**Requirements:**
- NVIDIA GPU (H100 recommended, A100/RTX also work)
- CUDA + PyTorch
- `uv` package manager
- ~50GB disk (training data)
**Setup:**
```bash
# Clone autoresearch repo (the ML training environment)
git clone https://github.com/karpathy/autoresearch my-ml-research
cd my-ml-research
uv sync
uv run prepare.py # one-time data download + tokenizer (~2 min)
# Initialize autoresearch skill
cp -r ~/.claude/skills/autoresearch-agent/scripts ./scripts
# Configure
python scripts/setup_experiment.py --domain ml --tag mar13
python scripts/setup_experiment.py \
--domain engineering \
--name api-speed \
--target src/api/search.py \
--eval "python -m pytest tests/bench_search.py --tb=no -q" \
--metric p50_ms \
--direction lower \
--evaluator benchmark_speed
```
**Metric:** `val_bpb` — validation bits per byte. Lower = better model.
**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.
**Cost:** Free — just runs benchmarks.
**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.
**What the agent can change in `train.py`:**
- Model depth, width, attention heads
- Learning rate, scheduler, warmup
- Optimizer (Muon, AdamW, variants)
- Batch size, gradient accumulation
- Architecture (attention patterns, FFN type)
### Bundle Size Reduction
**Tip for smaller GPUs (Mac M-series, RTX 3090 etc):**
Karpathy recommends forks for non-H100 hardware. Lower `DEPTH` to 4, use TinyStories dataset, lower `MAX_SEQ_LEN` to 256.
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name bundle-size \
--target webpack.config.js \
--eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
--metric size_bytes \
--direction lower \
--evaluator benchmark_size
```
Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.
### Test Pass Rate
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name fix-flaky-tests \
--target src/utils/parser.py \
--eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
--metric pass_rate \
--direction higher \
--evaluator test_pass_rate
```
### Docker Build Speed
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name docker-build \
--target Dockerfile \
--eval "python .autoresearch/engineering/docker-build/evaluate.py" \
--metric build_seconds \
--direction lower \
--evaluator build_speed
```
### Memory Optimization
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name memory-usage \
--target src/processor.py \
--eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
--metric peak_mb \
--direction lower \
--evaluator memory_usage
```
### ML Training (Karpathy-style)
Requires NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name ml-training \
--target train.py \
--eval "uv run train.py" \
--metric val_bpb \
--direction lower \
--time-budget 5
```
---
## Domain 2: Prompt Engineering
## Domain: Marketing
**Best for:** Optimizing system prompts for quality/accuracy/tone
### Medium Article Headlines
**Requirements:**
- LLM API access (OpenAI, Anthropic, etc.)
- Test cases with expected outputs
- An LLM judge for scoring (can be same model)
**Setup:**
```bash
mkdir my-prompt-research && cd my-prompt-research
git init
# Create prompt.md (the thing being optimized)
echo "You are a helpful assistant." > prompt.md
# Create evaluate.py (fixed — never modify)
cat > evaluate.py << 'EOF'
#!/usr/bin/env python3
# Fixed evaluation harness — DO NOT MODIFY
import json, sys
from pathlib import Path
PROMPT = Path("prompt.md").read_text()
# Load test cases
TEST_CASES = json.loads(Path("tests/cases.json").read_text())
# Run prompt against test cases, score with LLM judge
# ... (customize for your LLM + scoring logic)
total = sum(score_case(PROMPT, case) for case in TEST_CASES)
score = total / len(TEST_CASES) * 100
print(f"eval_score: {score:.2f}")
EOF
# Initialize
python scripts/setup_experiment.py --domain prompt --tag mar13
python scripts/setup_experiment.py \
--domain marketing \
--name medium-ctr \
--target content/titles.md \
--eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
--metric ctr_score \
--direction higher \
--evaluator llm_judge_content
```
**Metric:** `eval_score` (0-100). Higher = better prompt.
Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.
**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.
**Cost:** Uses your CLI subscription (Claude Max = unlimited).
**Speed:** ~2 min/experiment, ~30/hour.
### Social Media Copy
```bash
python scripts/setup_experiment.py \
--domain marketing \
--name twitter-engagement \
--target social/tweets.md \
--eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
--metric engagement_score \
--direction higher \
--evaluator llm_judge_copy
```
Edit `evaluate.py`: set `PLATFORM = "twitter"` (or linkedin, instagram).
### Email Subject Lines
```bash
python scripts/setup_experiment.py \
--domain marketing \
--name email-open-rate \
--target emails/subjects.md \
--eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
--metric engagement_score \
--direction higher \
--evaluator llm_judge_copy
```
Edit `evaluate.py`: set `PLATFORM = "email"`.
### Ad Copy
```bash
python scripts/setup_experiment.py \
--domain marketing \
--name ad-copy-q2 \
--target ads/google-search.md \
--eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
--metric engagement_score \
--direction higher \
--evaluator llm_judge_copy
```
Edit `evaluate.py`: set `PLATFORM = "ad"`.
---
## Domain 3: Code Performance
## Domain: Content
**Best for:** Optimizing a specific hot module for speed
### Article Structure & Readability
**Requirements:**
- A Python module with measurable performance
- Existing tests (correctness must not regress)
- A benchmark harness
**Setup:**
```bash
cd my-project
# Create benchmark.py (fixed — never modify)
cat > benchmark.py << 'EOF'
#!/usr/bin/env python3
# Fixed benchmark — DO NOT MODIFY
import time, statistics
from src.module import your_function
from tests.test_module import run_tests
# Correctness check first
if not run_tests():
print("TESTS FAILED")
sys.exit(1)
# Benchmark
data = generate_test_data(n=10000)
times = []
for _ in range(10):
t0 = time.perf_counter()
your_function(data)
times.append((time.perf_counter() - t0) * 1000)
p50 = statistics.median(times)
print(f"p50_ms: {p50:.2f}")
print(f"p95_ms: {statistics.quantiles(times, n=20)[18]:.2f}")
EOF
python scripts/setup_experiment.py --domain code \
--target src/module.py \
--tag mar13
python scripts/setup_experiment.py \
--domain content \
--name article-structure \
--target drafts/my-article.md \
--eval "python .autoresearch/content/article-structure/evaluate.py" \
--metric ctr_score \
--direction higher \
--evaluator llm_judge_content
```
**Metric:** `p50_ms` — median latency. Lower = faster.
### SEO Descriptions
```bash
python scripts/setup_experiment.py \
--domain content \
--name seo-meta \
--target seo/descriptions.md \
--eval "python .autoresearch/content/seo-meta/evaluate.py" \
--metric ctr_score \
--direction higher \
--evaluator llm_judge_content
```
---
## Domain 4: Agent Skill Optimization
## Domain: Prompts
**Best for:** Improving the quality of claude-skills SKILL.md files
### System Prompt Optimization
**Requirements:**
- A SKILL.md to optimize
- A task evaluation suite (15-20 standardized tasks)
- An LLM judge for scoring
**Setup:**
```bash
# Create a new skill research project
mkdir skill-research-{skill-name} && cd skill-research-{skill-name}
git init
# Copy the skill to optimize
cp ~/.claude/skills/{skill-name}/SKILL.md .
# Create evaluate.py
cat > scripts/skill_evaluator.py << 'EOF'
#!/usr/bin/env python3
# Fixed evaluator — DO NOT MODIFY
# Runs SKILL.md against 15 standardized tasks using LLM judge
# Outputs: pass_rate: 0.80 (etc.)
EOF
python scripts/setup_experiment.py --domain skill --tag mar13
python scripts/setup_experiment.py \
--domain prompts \
--name support-bot \
--target prompts/support-system.md \
--eval "python .autoresearch/prompts/support-bot/evaluate.py" \
--metric quality_score \
--direction higher \
--evaluator llm_judge_prompt
```
**Metric:** `pass_rate` (0-1). Higher = better skill.
Requires `tests/cases.json` with test inputs and expected outputs:
```json
[
{
"input": "I can't log in to my account",
"expected": "Ask for email, check account status, offer password reset"
},
{
"input": "How do I cancel my subscription?",
"expected": "Empathetic response, explain cancellation steps, offer retention"
}
]
```
### Agent Skill Optimization
```bash
python scripts/setup_experiment.py \
--domain prompts \
--name skill-improvement \
--target SKILL.md \
--eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
--metric quality_score \
--direction higher \
--evaluator llm_judge_prompt
```
---
## Choosing Your Domain
| Question | Recommendation |
|----------|---------------|
| Do I have a GPU and want to improve an LLM? | ML Training |
| Do I want to improve a prompt/system message? | Prompt Engineering |
| Do I have slow Python code I want to speed up? | Code Performance |
| Do I want to improve one of my claude-skills? | Skill Optimization |
| I want to... | Domain | Evaluator | Cost |
|-------------|--------|-----------|------|
| Speed up my code | engineering | benchmark_speed | Free |
| Shrink my bundle | engineering | benchmark_size | Free |
| Fix flaky tests | engineering | test_pass_rate | Free |
| Speed up Docker builds | engineering | build_speed | Free |
| Reduce memory usage | engineering | memory_usage | Free |
| Train ML models | engineering | (custom) | Free + GPU |
| Write better headlines | marketing | llm_judge_content | Subscription |
| Improve social posts | marketing | llm_judge_copy | Subscription |
| Optimize email subjects | marketing | llm_judge_copy | Subscription |
| Improve ad copy | marketing | llm_judge_copy | Subscription |
| Optimize article structure | content | llm_judge_content | Subscription |
| Improve SEO descriptions | content | llm_judge_content | Subscription |
| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
| Improve agent skills | prompts | llm_judge_prompt | Subscription |
**First time?** Start with **Prompt Engineering** — no GPU required, fast experiments (2 min each), immediately applicable results.
**First time?** Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.

View File

@@ -1,125 +1,389 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Results Logger
autoresearch-agent: Results Viewer
View and analyze experiment results from results.tsv.
View experiment results in multiple formats: terminal, CSV, Markdown.
Supports single experiment, domain, or cross-experiment dashboard.
Usage:
python scripts/log_results.py --summary # Print progress table
python scripts/log_results.py --best # Show best result
python scripts/log_results.py --history # Full experiment history
python scripts/log_results.py --record commit val status desc # Add entry manually
python scripts/log_results.py --experiment engineering/api-speed
python scripts/log_results.py --domain engineering
python scripts/log_results.py --dashboard
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
"""
import argparse
import csv
import io
import sys
from pathlib import Path
def load_results(path):
tsv = Path(path) / "results.tsv"
def find_autoresearch_root():
"""Find .autoresearch/ in project or user home."""
project_root = Path(".").resolve() / ".autoresearch"
if project_root.exists():
return project_root
user_root = Path.home() / ".autoresearch"
if user_root.exists():
return user_root
return None
def load_config(experiment_dir):
"""Load config.cfg."""
cfg_file = experiment_dir / "config.cfg"
config = {}
if cfg_file.exists():
for line in cfg_file.read_text().splitlines():
if ":" in line:
k, v = line.split(":", 1)
config[k.strip()] = v.strip()
return config
def load_results(experiment_dir):
"""Load results.tsv into list of dicts."""
tsv = experiment_dir / "results.tsv"
if not tsv.exists():
return []
lines = tsv.read_text().splitlines()[1:] # skip header
results = []
for line in lines:
for line in tsv.read_text().splitlines()[1:]:
parts = line.split("\t")
if len(parts) >= 4:
try:
metric_val = float(parts[1]) if parts[1] != "N/A" else None
metric = float(parts[1]) if parts[1] != "N/A" else None
except ValueError:
metric_val = None
metric = None
results.append({
"commit": parts[0],
"metric": metric_val,
"metric": metric,
"status": parts[2],
"description": parts[3]
"description": parts[3],
})
return results
def print_summary(results, metric_name="metric", direction="lower"):
if not results:
print("No experiments logged yet.")
return
def compute_stats(results, direction):
"""Compute statistics from results."""
keeps = [r for r in results if r["status"] == "keep"]
discards = [r for r in results if r["status"] == "discard"]
crashes = [r for r in results if r["status"] == "crash"]
print(f"\n{''*60}")
print(f" autoresearch-agent — Results Summary")
print(f"{''*60}")
print(f" Total experiments: {len(results)}")
print(f" ✅ Keep: {len(keeps):3d} ({len(keeps)/max(len(results),1)*100:.0f}%)")
print(f" ❌ Discard: {len(discards):3d} ({len(discards)/max(len(results),1)*100:.0f}%)")
print(f" 💥 Crash: {len(crashes):3d} ({len(crashes)/max(len(results),1)*100:.0f}%)")
valid_keeps = [r for r in keeps if r["metric"] is not None]
baseline = valid_keeps[0]["metric"] if valid_keeps else None
if valid_keeps:
best = min(r["metric"] for r in valid_keeps) if direction == "lower" else max(r["metric"] for r in valid_keeps)
else:
best = None
if keeps:
valid = [r for r in keeps if r["metric"] is not None]
if valid:
baseline = valid[0]["metric"]
best = min(r["metric"] for r in valid) if direction == "lower" else max(r["metric"] for r in valid)
best_run = next(r for r in valid if r["metric"] == best)
improvement = ((baseline - best) / baseline * 100) if direction == "lower" else ((best - baseline) / baseline * 100)
pct_change = None
if baseline and best and baseline != 0:
if direction == "lower":
pct_change = (baseline - best) / baseline * 100
else:
pct_change = (best - baseline) / baseline * 100
print(f"\n {metric_name}:")
print(f" Baseline: {baseline:.6f}")
print(f" Best: {best:.6f} (commit: {best_run['commit']})")
print(f" Change: {improvement:+.2f}%")
print(f"{''*60}\n")
return {
"total": len(results),
"keeps": len(keeps),
"discards": len(discards),
"crashes": len(crashes),
"baseline": baseline,
"best": best,
"pct_change": pct_change,
}
def print_history(results):
# --- Terminal Output ---
def print_experiment(experiment_dir, experiment_path):
"""Print single experiment results to terminal."""
config = load_config(experiment_dir)
results = load_results(experiment_dir)
direction = config.get("metric_direction", "lower")
metric_name = config.get("metric", "metric")
if not results:
print("No experiments logged yet.")
print(f"No results for {experiment_path}")
return
print(f"\n{'COMMIT':8} {'METRIC':10} {'STATUS':8} DESCRIPTION")
print("" * 60)
for r in results:
metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash "
status_icon = {"keep": "", "discard": "", "crash": "💥"}.get(r["status"], "?")
print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}")
stats = compute_stats(results, direction)
print(f"\n{'' * 65}")
print(f" {experiment_path}")
print(f" Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
print(f"{'' * 65}")
print(f" Total: {stats['total']} | Keep: {stats['keeps']} | Discard: {stats['discards']} | Crash: {stats['crashes']}")
if stats["baseline"] is not None and stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
print(f" Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")
print(f"\n {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
print(f" {'' * 60}")
for r in results:
m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A "
icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
print(f" {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
print()
def print_dashboard(root):
"""Print cross-experiment dashboard."""
experiments = []
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
config = load_config(exp_dir)
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
# Determine status
status = "idle"
if stats["total"] > 0:
tsv = exp_dir / "results.tsv"
if tsv.exists():
import time
age_hours = (time.time() - tsv.stat().st_mtime) / 3600
status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"
best_str = f"{stats['best']:.4f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
experiments.append({
"domain": domain_dir.name,
"name": exp_dir.name,
"runs": stats["total"],
"kept": stats["keeps"],
"best": best_str,
"change": pct_str,
"status": status,
"metric": config.get("metric", "?"),
})
if not experiments:
print("No experiments found.")
return experiments
print(f"\n{'' * 90}")
print(f" autoresearch — Dashboard")
print(f"{'' * 90}")
print(f" {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} {'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
print(f" {'' * 85}")
for e in experiments:
print(f" {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} {e['best']:>12} {e['change']:>10} {e['status']:<8}")
print()
return experiments
# --- CSV Export ---
def export_experiment_csv(experiment_dir, experiment_path):
"""Export single experiment as CSV string."""
config = load_config(experiment_dir)
results = load_results(experiment_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
buf = io.StringIO()
writer = csv.writer(buf)
# Header with metadata
writer.writerow(["# Experiment", experiment_path])
writer.writerow(["# Target", config.get("target", "")])
writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
if stats["baseline"] is not None:
writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
if stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
writer.writerow(["# Total", stats["total"]])
writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
writer.writerow([])
writer.writerow(["Commit", "Metric", "Status", "Description"])
for r in results:
m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
writer.writerow([r["commit"], m, r["status"], r["description"]])
return buf.getvalue()
def export_dashboard_csv(root):
"""Export dashboard as CSV string."""
experiments = []
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
config = load_config(exp_dir)
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best_str = f"{stats['best']:.6f}" if stats["best"] else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
experiments.append([
domain_dir.name, exp_dir.name, config.get("metric", ""),
stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
best_str, pct_str
])
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept", "Discarded", "Crashed", "Best", "Change"])
for e in experiments:
writer.writerow(e)
return buf.getvalue()
# --- Markdown Export ---
def export_experiment_markdown(experiment_dir, experiment_path):
"""Export single experiment as Markdown string."""
config = load_config(experiment_dir)
results = load_results(experiment_dir)
direction = config.get("metric_direction", "lower")
metric_name = config.get("metric", "metric")
stats = compute_stats(results, direction)
lines = []
lines.append(f"# Autoresearch: {experiment_path}\n")
lines.append(f"**Target:** `{config.get('target', '?')}` ")
lines.append(f"**Metric:** `{metric_name}` ({direction} is better) ")
lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")
if stats["baseline"] is not None and stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
lines.append(f"| Commit | Metric | Status | Description |")
lines.append(f"|--------|--------|--------|-------------|")
for r in results:
m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
lines.append("")
return "\n".join(lines)
def export_dashboard_markdown(root):
"""Export dashboard as Markdown string."""
lines = []
lines.append("# Autoresearch Dashboard\n")
lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
lines.append("|--------|-----------|--------|------|------|------|--------|--------|")
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
config = load_config(exp_dir)
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best = f"`{stats['best']:.4f}`" if stats["best"] else ""
pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
import time
tsv = exp_dir / "results.tsv"
status = "idle"
if tsv.exists() and stats["total"] > 0:
age_h = (time.time() - tsv.stat().st_mtime) / 3600
status = "active" if age_h < 1 else "paused" if age_h < 24 else "done"
lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | {stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")
lines.append("")
return "\n".join(lines)
# --- Main ---
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--summary", action="store_true")
parser.add_argument("--best", action="store_true")
parser.add_argument("--history", action="store_true")
parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC"))
parser.add_argument("--path", default=".")
parser.add_argument("--metric", default="metric")
parser.add_argument("--direction", default="lower", choices=["lower", "higher"])
parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
parser.add_argument("--experiment", help="Show one experiment: domain/name")
parser.add_argument("--domain", help="Show all experiments in a domain")
parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
help="Output format (default: terminal)")
parser.add_argument("--output", "-o", help="Write to file instead of stdout")
parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
args = parser.parse_args()
path = Path(args.path).resolve()
root = find_autoresearch_root()
if root is None:
print("No .autoresearch/ found. Run setup_experiment.py first.")
sys.exit(1)
if args.record:
commit, metric, status, desc = args.record
tsv = path / "results.tsv"
if not tsv.exists():
tsv.write_text("commit\tmetric\tstatus\tdescription\n")
with open(tsv, "a") as f:
f.write(f"{commit}\t{metric}\t{status}\t{desc}\n")
print(f"✓ Logged: {commit} {metric} {status}")
return
output_text = None
results = load_results(path)
# Single experiment
if args.experiment:
experiment_dir = root / args.experiment
if not experiment_dir.exists():
print(f"Experiment not found: {args.experiment}")
sys.exit(1)
if args.history:
print_history(results)
elif args.best:
keeps = [r for r in results if r["status"] == "keep" and r["metric"]]
if not keeps:
print("No successful experiments yet.")
if args.format == "csv":
output_text = export_experiment_csv(experiment_dir, args.experiment)
elif args.format == "markdown":
output_text = export_experiment_markdown(experiment_dir, args.experiment)
else:
print_experiment(experiment_dir, args.experiment)
return
best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"])
print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}")
# Domain
elif args.domain:
domain_dir = root / args.domain
if not domain_dir.exists():
print(f"Domain not found: {args.domain}")
sys.exit(1)
for exp_dir in sorted(domain_dir.iterdir()):
if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
if args.format == "terminal":
print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
# For CSV/MD, fall through to dashboard with domain filter
if args.format != "terminal":
# Use dashboard export filtered to domain
output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
else:
return
# Dashboard
elif args.dashboard or args.all:
if args.format == "csv":
output_text = export_dashboard_csv(root)
elif args.format == "markdown":
output_text = export_dashboard_markdown(root)
else:
print_dashboard(root)
return
else:
print_summary(results, args.metric, args.direction)
# Default: dashboard
if args.format == "terminal":
print_dashboard(root)
return
output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
# Write output
if output_text:
if args.output:
Path(args.output).write_text(output_text)
print(f"Written to {args.output}")
else:
print(output_text)
if __name__ == "__main__":

View File

@@ -2,17 +2,15 @@
"""
autoresearch-agent: Experiment Runner
Executes the autonomous experiment loop:
- Reads .autoresearch.cfg for project config
- Runs the target evaluation
- Keeps improvements (git commit) or discards failures (git reset)
- Logs everything to results.tsv
- Loops indefinitely until interrupted
Executes the autonomous experiment loop for a specific experiment.
Reads config from .autoresearch/{domain}/{name}/config.cfg.
Usage:
python scripts/run_experiment.py --loop # Run forever
python scripts/run_experiment.py --single # Run one experiment
python scripts/run_experiment.py --dry-run # Show what would happen
python scripts/run_experiment.py --experiment engineering/api-speed --loop
python scripts/run_experiment.py --experiment engineering/api-speed --single
python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
python scripts/run_experiment.py --resume --loop
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
"""
import argparse
@@ -25,11 +23,22 @@ from datetime import datetime
from pathlib import Path
def load_config(path):
"""Load .autoresearch.cfg"""
cfg_file = Path(path) / ".autoresearch.cfg"
def find_autoresearch_root():
"""Find .autoresearch/ in project or user home."""
project_root = Path(".").resolve() / ".autoresearch"
if project_root.exists():
return project_root
user_root = Path.home() / ".autoresearch"
if user_root.exists():
return user_root
return None
def load_config(experiment_dir):
"""Load config.cfg from experiment directory."""
cfg_file = experiment_dir / "config.cfg"
if not cfg_file.exists():
print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.")
print(f" Error: no config.cfg in {experiment_dir}")
sys.exit(1)
config = {}
for line in cfg_file.read_text().splitlines():
@@ -49,239 +58,293 @@ def run_cmd(cmd, cwd=None, timeout=None):
def get_current_commit(path):
"""Get short hash of current HEAD."""
_, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
return commit
def get_current_metric(path, metric_grep):
"""Read the last recorded metric from results.tsv."""
tsv = Path(path) / "results.tsv"
def get_best_metric(experiment_dir, direction):
"""Read the best metric from results.tsv."""
tsv = experiment_dir / "results.tsv"
if not tsv.exists():
return None
lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l]
lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
if not lines:
return None
last = lines[-1].split("\t")
try:
return float(last[1])
except (ValueError, IndexError):
metrics = []
for line in lines:
parts = line.split("\t")
try:
if parts[1] != "N/A":
metrics.append(float(parts[1]))
except (ValueError, IndexError):
continue
if not metrics:
return None
return min(metrics) if direction == "lower" else max(metrics)
def run_evaluation(path, evaluate_cmd, time_budget_minutes):
"""Run evaluation with time limit."""
hard_limit = time_budget_minutes * 60 * 2.5 # 2.5x as hard timeout
def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
"""Run evaluation with time limit. Output goes to log_file."""
hard_limit = time_budget_minutes * 60 * 2.5
t0 = time.time()
try:
code, _, _ = run_cmd(
f"{evaluate_cmd} > run.log 2>&1",
cwd=path,
f"{eval_cmd} > {log_file} 2>&1",
cwd=str(project_root),
timeout=hard_limit
)
elapsed = time.time() - t0
return code, elapsed
except subprocess.TimeoutExpired:
elapsed = time.time() - t0
return -1, elapsed # -1 = timeout
return -1, elapsed
def extract_metric(path, metric_grep):
"""Extract metric value from run.log."""
code, out, _ = run_cmd(
f"grep '{metric_grep}' run.log | tail -1",
cwd=path
)
if not out:
return None
try:
return float(out.split(":")[-1].strip())
except ValueError:
def extract_metric(log_file, metric_grep):
"""Extract metric value from log file."""
log_path = Path(log_file)
if not log_path.exists():
return None
for line in reversed(log_path.read_text().splitlines()):
stripped = line.strip()
if stripped.startswith(metric_grep.lstrip("^")):
try:
return float(stripped.split(":")[-1].strip())
except ValueError:
continue
return None
def is_improvement(new_val, old_val, direction):
"""Check if new result is better than old."""
if old_val is None:
return True # First run always "improves"
return True
if direction == "lower":
return new_val < old_val
else:
return new_val > old_val
return new_val > old_val
def log_result(path, commit, metric_val, status, description):
def log_result(experiment_dir, commit, metric_val, status, description):
"""Append result to results.tsv."""
tsv = Path(path) / "results.tsv"
tsv = experiment_dir / "results.tsv"
metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
with open(tsv, "a") as f:
f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")
def get_experiment_count(path):
def get_experiment_count(experiment_dir):
"""Count experiments run so far."""
tsv = Path(path) / "results.tsv"
tsv = experiment_dir / "results.tsv"
if not tsv.exists():
return 0
lines = tsv.read_text().splitlines()
return max(0, len(lines) - 1) # subtract header
return max(0, len(tsv.read_text().splitlines()) - 1)
def run_single_experiment(path, config, exp_num, dry_run=False):
def get_last_active(root):
"""Find the most recently modified experiment."""
latest = None
latest_time = 0
for domain_dir in root.iterdir():
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in domain_dir.iterdir():
if not exp_dir.is_dir():
continue
cfg = exp_dir / "config.cfg"
if cfg.exists() and cfg.stat().st_mtime > latest_time:
latest_time = cfg.stat().st_mtime
latest = f"{domain_dir.name}/{exp_dir.name}"
return latest
def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
"""Run one experiment iteration."""
direction = config.get("metric_direction", "lower")
metric_grep = config.get("metric_grep", "^metric:")
evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py")
eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
time_budget = int(config.get("time_budget_minutes", 5))
metric_name = config.get("metric", "metric")
log_file = str(experiment_dir / "run.log")
best_so_far = get_current_metric(path, metric_grep)
best = get_best_metric(experiment_dir, direction)
ts = datetime.now().strftime("%H:%M:%S")
print(f"\n[{ts}] Experiment #{exp_num}")
print(f" Best {metric_name} so far: {best_so_far}")
print(f" Best {metric_name}: {best}")
if dry_run:
print(" [DRY RUN] Would run evaluation and check metric")
return "dry_run"
# Save pre-experiment state for rollback
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path)
# Save state for rollback
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
if code != 0:
print(" ✗ Can't get git state. Is this a git repo with commits?")
print(" Error: can't get git state")
return "error"
# Run evaluation
print(f" Running: {evaluate_cmd} (budget: {time_budget} min)")
ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget)
print(f" Running: {eval_cmd} (budget: {time_budget}m)")
ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)
# Handle timeout
commit = get_current_commit(str(project_root))
# Timeout
if ret_code == -1:
print(f" TIMEOUT after {elapsed:.0f}s — discarding")
run_cmd("git checkout -- .", cwd=path) # revert uncommitted changes
# Commit was already made by the agent before evaluation
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s")
print(f" TIMEOUT after {elapsed:.0f}s — discarding")
run_cmd("git checkout -- .", cwd=str(project_root))
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
return "crash"
# Handle non-zero exit
# Crash
if ret_code != 0:
# Check if it crashed
code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path)
print(f" ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s")
_, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
print(f" CRASH (exit {ret_code}) after {elapsed:.0f}s")
print(f" Last output: {tail[:200]}")
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
return "crash"
# Extract metric
metric_val = extract_metric(path, metric_grep)
metric_val = extract_metric(log_file, metric_grep)
if metric_val is None:
print(f" Could not parse metric from run.log")
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, None, "crash", "metric_parse_failed")
print(f" Could not parse {metric_name} from run.log")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
return "crash"
curr_commit = get_current_commit(path)
delta = ""
if best_so_far is not None:
diff = metric_val - best_so_far
delta = f" (Δ{diff:+.4f})"
if best is not None:
diff = metric_val - best
delta = f" (delta {diff:+.4f})"
print(f" {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")
# Keep or discard
if is_improvement(metric_val, best_so_far, direction):
print(f" KEEP — improvement confirmed")
log_result(path, curr_commit, metric_val, "keep",
f"improvement_{metric_name}_{metric_val:.4f}")
if is_improvement(metric_val, best, direction):
print(f" KEEP — improvement")
log_result(experiment_dir, commit, metric_val, "keep",
f"improved_{metric_name}_{metric_val:.4f}")
return "keep"
else:
print(f" DISCARD — no improvement")
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
curr_commit = get_current_commit(path)
log_result(path, curr_commit, metric_val, "discard",
f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}")
print(f" DISCARD — no improvement")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
best_str = f"{best:.4f}" if best else "?"
log_result(experiment_dir, commit, metric_val, "discard",
f"no_improvement_{metric_val:.4f}_vs_{best_str}")
return "discard"
def print_summary(path):
"""Print experiment summary."""
tsv = Path(path) / "results.tsv"
def print_summary(experiment_dir, config):
"""Print session summary."""
tsv = experiment_dir / "results.tsv"
if not tsv.exists():
return
lines = tsv.read_text().splitlines()[1:] # skip header
lines = tsv.read_text().splitlines()[1:]
if not lines:
return
keeps = [l for l in lines if "\tkeep\t" in l]
discards = [l for l in lines if "\tdiscard\t" in l]
crashes = [l for l in lines if "\tcrash\t" in l]
metric_name = config.get("metric", "metric")
direction = config.get("metric_direction", "lower")
print(f"\n{'='*50}")
print(f" Session Summary")
print(f"\n{'=' * 55}")
print(f" autoresearch — Session Summary")
print(f" Experiments: {len(lines)} total")
print(f" Keep: {len(keeps)} | Discard: {len(discards)} | 💥 Crash: {len(crashes)}")
print(f" Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")
if keeps:
try:
first_metric = float(keeps[0].split("\t")[1])
last_metric = float(keeps[-1].split("\t")[1])
direction = "" if last_metric < first_metric else ""
print(f" Best progress: {first_metric:.6f}{last_metric:.6f} {direction}")
valid = []
for l in keeps:
parts = l.split("\t")
if parts[1] != "N/A":
valid.append(float(parts[1]))
if len(valid) >= 2:
first, last = valid[0], valid[-1]
best = min(valid) if direction == "lower" else max(valid)
pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
print(f" {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
except (ValueError, IndexError):
pass
print(f"{'='*50}\n")
print(f"{'=' * 55}\n")
def main():
parser = argparse.ArgumentParser(description="autoresearch-agent runner")
parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
parser.add_argument("--loop", action="store_true", help="Run forever")
parser.add_argument("--single", action="store_true", help="Run one experiment")
parser.add_argument("--dry-run", action="store_true", help="Dry run only")
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
parser.add_argument("--path", default=".", help="Project root")
parser.add_argument("--max-experiments", type=int, default=0,
help="Max experiments (0 = unlimited)")
args = parser.parse_args()
path = Path(args.path).resolve()
config = load_config(path)
project_root = Path(args.path).resolve()
root = find_autoresearch_root()
print(f"\n🔬 autoresearch-agent")
print(f" Project: {path}")
print(f" Target: {config.get('target', '?')}")
print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment")
print(f" Mode: {'loop' if args.loop else 'single'}")
if root is None:
print("No .autoresearch/ found. Run setup_experiment.py first.")
sys.exit(1)
if args.single:
exp_num = get_experiment_count(path) + 1
run_single_experiment(path, config, exp_num, args.dry_run)
# Resolve experiment
experiment_path = args.experiment
if args.resume:
experiment_path = get_last_active(root)
if not experiment_path:
print("No experiments found to resume.")
sys.exit(1)
print(f"Resuming: {experiment_path}")
if not experiment_path:
print("Specify --experiment domain/name or --resume")
sys.exit(1)
experiment_dir = root / experiment_path
if not experiment_dir.exists():
print(f"Experiment not found: {experiment_dir}")
print("Run: python scripts/setup_experiment.py --list")
sys.exit(1)
config = load_config(experiment_dir)
domain, name = experiment_path.split("/", 1)
print(f"\n autoresearch-agent")
print(f" Experiment: {experiment_path}")
print(f" Target: {config.get('target', '?')}")
print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment")
print(f" Mode: {'loop' if args.loop else 'single'}")
if args.single or args.dry_run:
exp_num = get_experiment_count(experiment_dir) + 1
run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
return
if not args.loop and not args.dry_run:
if not args.loop:
print("\nSpecify --loop (forever) or --single (one experiment)")
sys.exit(1)
# Setup graceful shutdown
# Graceful shutdown
def handle_interrupt(sig, frame):
print_summary(path)
print("\nStopped by user.")
print_summary(experiment_dir, config)
print("\nStopped by user.")
sys.exit(0)
signal.signal(signal.SIGINT, handle_interrupt)
signal.signal(signal.SIGTERM, handle_interrupt)
# Main loop
consecutive_crashes = 0
exp_num = get_experiment_count(path) + 1
exp_num = get_experiment_count(experiment_dir) + 1
print(f"\nStarting loop. Ctrl+C to stop and print summary.\n")
print(f"\nStarting loop. Ctrl+C to stop.\n")
while True:
result = run_single_experiment(path, config, exp_num, args.dry_run)
result = run_single(project_root, experiment_dir, config, exp_num, False)
exp_num += 1
if result == "crash":
@@ -289,21 +352,16 @@ def main():
else:
consecutive_crashes = 0
# Bail if 5 consecutive crashes
if consecutive_crashes >= 5:
print("\n 5 consecutive crashes. Pausing for investigation.")
print(" Check run.log for the last error.")
print("\n 5 consecutive crashes. Pausing.")
print(" Check .autoresearch/{}/run.log".format(experiment_path))
break
# Check max experiments
if args.max_experiments > 0 and exp_num > args.max_experiments:
print(f"\n✓ Reached max experiments ({args.max_experiments})")
if 0 < args.max_experiments < exp_num:
print(f"\n Reached max experiments ({args.max_experiments})")
break
if args.single:
break
print_summary(path)
print_summary(experiment_dir, config)
if __name__ == "__main__":

View File

@@ -1,65 +1,52 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Setup Wizard
autoresearch-agent: Setup Experiment
Initializes a new research run:
1. Validates the project structure
2. Creates a git branch
3. Runs the baseline experiment
4. Initializes results.tsv
Initialize a new experiment with domain, target, evaluator, and git branch.
Creates the .autoresearch/{domain}/{name}/ directory structure.
Usage:
python scripts/setup_experiment.py [--config experiment.yaml]
python scripts/setup_experiment.py --domain ml|prompt|code|skill
python scripts/setup_experiment.py --domain engineering --name api-speed \
--target src/api/search.py --eval "pytest bench.py" \
--metric p50_ms --direction lower
python scripts/setup_experiment.py --domain marketing --name medium-ctr \
--target content/titles.md --eval "python evaluate.py" \
--metric ctr_score --direction higher --evaluator llm_judge_content
python scripts/setup_experiment.py --list # List all experiments
python scripts/setup_experiment.py --list-evaluators # List available evaluators
"""
import argparse
import os
import shutil
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]
DOMAINS = {
"ml": {
"target": "train.py",
"evaluate_cmd": "uv run train.py",
"metric": "val_bpb",
"metric_direction": "lower",
"time_budget_minutes": 5,
"metric_grep": "^val_bpb:",
},
"prompt": {
"target": "prompt.md",
"evaluate_cmd": "python evaluate.py",
"metric": "eval_score",
"metric_direction": "higher",
"time_budget_minutes": 2,
"metric_grep": "^eval_score:",
},
"code": {
"target": "src/module.py",
"evaluate_cmd": "python benchmark.py",
"metric": "p50_ms",
"metric_direction": "lower",
"time_budget_minutes": 10,
"metric_grep": "^p50_ms:",
},
"skill": {
"target": "SKILL.md",
"evaluate_cmd": "python scripts/skill_evaluator.py",
"metric": "pass_rate",
"metric_direction": "higher",
"time_budget_minutes": 5,
"metric_grep": "^pass_rate:",
},
}
EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"
DEFAULT_CONFIG = """# autoresearch global config
default_time_budget_minutes: 5
default_scope: project
dashboard_format: markdown
"""
GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
**/results.tsv
**/run.log
**/run.*.log
config.yaml
"""
def run_cmd(cmd, cwd=None, timeout=None):
"""Run a shell command and return (returncode, stdout, stderr)."""
"""Run shell command, return (returncode, stdout, stderr)."""
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True,
cwd=cwd, timeout=timeout
@@ -67,188 +54,315 @@ def run_cmd(cmd, cwd=None, timeout=None):
return result.returncode, result.stdout.strip(), result.stderr.strip()
def check_git_repo(path):
"""Verify we're in a git repo."""
code, out, err = run_cmd("git rev-parse --is-inside-work-tree", cwd=path)
if code != 0:
print("✗ Not a git repository. Run: git init && git add . && git commit -m 'initial'")
def get_autoresearch_root(scope, project_root=None):
"""Get the .autoresearch root directory based on scope."""
if scope == "user":
return Path.home() / ".autoresearch"
return Path(project_root or ".") / ".autoresearch"
def init_root(root):
"""Initialize .autoresearch root if it doesn't exist."""
created = False
if not root.exists():
root.mkdir(parents=True)
created = True
print(f" Created {root}/")
config_file = root / "config.yaml"
if not config_file.exists():
config_file.write_text(DEFAULT_CONFIG)
print(f" Created {config_file}")
gitignore = root / ".gitignore"
if not gitignore.exists():
gitignore.write_text(GITIGNORE_CONTENT)
print(f" Created {gitignore}")
return created
def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
"""Generate a program.md template for the experiment."""
direction_word = "Minimize" if direction == "lower" else "Maximize"
content = f"""# autoresearch — {name}
## Goal
{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.
## What the Agent Can Change
- Only `{target}` — this is the single file being optimized.
- Everything inside that file is fair game unless constrained below.
## What the Agent Cannot Change
- The evaluation script (`evaluate.py` or the eval command). It is read-only.
- Dependencies — do not add new packages or imports that aren't already available.
- Any other files in the project unless explicitly noted here.
{f"- Additional constraints: {constraints}" if constraints else ""}
## Strategy
1. First run: establish baseline. Do not change anything.
2. Profile/analyze the current state — understand why the metric is what it is.
3. Try the most obvious improvement first (low-hanging fruit).
4. If that works, push further in the same direction.
5. If stuck, try something orthogonal or radical.
6. Read the git log of previous experiments. Don't repeat failed approaches.
## Simplicity Rule
A small improvement that adds ugly complexity is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.
## Stop When
You don't stop. The human will interrupt you when they're satisfied.
If no improvement in 20+ consecutive runs, change strategy drastically.
"""
(experiment_dir / "program.md").write_text(content)
def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget):
"""Write experiment config."""
content = f"""target: {target}
evaluate_cmd: {eval_cmd}
metric: {metric}
metric_direction: {direction}
metric_grep: ^{metric}:
time_budget_minutes: {time_budget}
created: {datetime.now().strftime('%Y-%m-%d %H:%M')}
"""
(experiment_dir / "config.cfg").write_text(content)
def init_results_tsv(experiment_dir):
"""Create results.tsv with header."""
tsv = experiment_dir / "results.tsv"
if tsv.exists():
print(f" results.tsv already exists ({tsv.stat().st_size} bytes)")
return
tsv.write_text("commit\tmetric\tstatus\tdescription\n")
print(" Created results.tsv")
def copy_evaluator(experiment_dir, evaluator_name):
"""Copy a built-in evaluator to the experiment directory."""
source = EVALUATOR_DIR / f"{evaluator_name}.py"
if not source.exists():
print(f" Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}")
print(f" Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}")
return False
print("✓ Git repository found")
dest = experiment_dir / "evaluate.py"
shutil.copy2(source, dest)
print(f" Copied evaluator: {evaluator_name}.py -> evaluate.py")
return True
def check_program_md(path):
"""Check program.md exists and has content."""
pm = Path(path) / "program.md"
if not pm.exists():
print("⚠ program.md not found. Creating template...")
return False
content = pm.read_text()
if len(content) < 100:
print("⚠ program.md looks empty. Fill it out before running experiments.")
return False
print(f"✓ program.md found ({len(content)} chars)")
return True
def check_target_file(path, target):
"""Check target file exists."""
tf = Path(path) / target
if not tf.exists():
print(f"✗ Target file not found: {target}")
return False
print(f"✓ Target file found: {target}")
return True
def check_evaluate_script(path):
"""Check evaluate.py exists."""
ev = Path(path) / "evaluate.py"
if not ev.exists():
print("⚠ evaluate.py not found. You need a fixed evaluation function.")
print(" Create evaluate.py that outputs: metric_name: <value>")
return False
print("✓ evaluate.py found")
return True
def create_branch(path, tag):
def create_branch(path, domain, name):
"""Create and checkout the experiment branch."""
branch = f"autoresearch/{tag}"
code, out, err = run_cmd(f"git checkout -b {branch}", cwd=path)
branch = f"autoresearch/{domain}/{name}"
code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
if code != 0:
if "already exists" in err:
print(f" Branch '{branch}' already exists. Use a different tag.")
else:
print(f"✗ Failed to create branch: {err}")
print(f" Branch '{branch}' already exists. Checking out...")
run_cmd(f"git checkout {branch}", cwd=path)
return branch
print(f" Warning: could not create branch: {err}")
return None
print(f" Created branch: {branch}")
print(f" Created branch: {branch}")
return branch
def init_results_tsv(path):
"""Create results.tsv with header."""
tsv = Path(path) / "results.tsv"
if tsv.exists():
print(f"✓ results.tsv already exists ({tsv.stat().st_size} bytes)")
def list_experiments(root):
"""List all experiments across all domains."""
if not root.exists():
print("No experiments found. Run setup to create your first experiment.")
return
tsv.write_text("commit\tmetric\tstatus\tdescription\n")
print("✓ Created results.tsv")
experiments = []
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir():
continue
cfg_file = exp_dir / "config.cfg"
if not cfg_file.exists():
continue
config = {}
for line in cfg_file.read_text().splitlines():
if ":" in line:
k, v = line.split(":", 1)
config[k.strip()] = v.strip()
# Count results
tsv = exp_dir / "results.tsv"
runs = 0
if tsv.exists():
runs = max(0, len(tsv.read_text().splitlines()) - 1)
experiments.append({
"domain": domain_dir.name,
"name": exp_dir.name,
"target": config.get("target", "?"),
"metric": config.get("metric", "?"),
"runs": runs,
})
if not experiments:
print("No experiments found.")
return
print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}")
print("-" * 95)
for e in experiments:
print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}")
print(f"\nTotal: {len(experiments)} experiments")
def run_baseline(path, evaluate_cmd, metric_grep, time_budget_minutes):
"""Run the baseline experiment."""
print(f"\nRunning baseline experiment (~{time_budget_minutes} min)...")
timeout = time_budget_minutes * 60 * 2.5 # 2.5x budget as hard limit
def list_evaluators():
"""List available built-in evaluators."""
if not EVALUATOR_DIR.exists():
print("No evaluators directory found.")
return
t0 = time.time()
code, out, err = run_cmd(
f"{evaluate_cmd} > run.log 2>&1",
cwd=path,
timeout=timeout
)
elapsed = time.time() - t0
if code != 0:
print(f"✗ Baseline run failed after {elapsed:.0f}s. Check run.log")
return None
# Extract metric
grep_code, grep_out, _ = run_cmd(
f"grep '{metric_grep}' run.log | tail -1",
cwd=path
)
if not grep_out:
print("✗ Could not extract metric from run.log. Check metric_grep pattern.")
return None
metric_value = grep_out.split(":")[-1].strip()
print(f"✓ Baseline complete in {elapsed:.0f}s — metric: {metric_value}")
return metric_value
print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n")
for f in sorted(EVALUATOR_DIR.glob("*.py")):
# Read first docstring line
desc = ""
for line in f.read_text().splitlines():
if line.strip().startswith('"""') or line.strip().startswith("'''"):
continue
if line.strip() and not line.startswith("#!"):
desc = line.strip().strip('"').strip("'")
break
print(f" {f.stem:<25} {desc}")
def main():
parser = argparse.ArgumentParser(description="autoresearch-agent setup")
parser.add_argument("--domain", choices=list(DOMAINS.keys()), help="Experiment domain")
parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain")
parser.add_argument("--name", help="Experiment name (e.g. api-speed, medium-ctr)")
parser.add_argument("--target", help="Target file to optimize")
parser.add_argument("--evaluate-cmd", help="Evaluation command")
parser.add_argument("--metric", help="Metric name")
parser.add_argument("--direction", choices=["lower", "higher"], default="lower")
parser.add_argument("--budget", type=int, default=5, help="Time budget in minutes")
parser.add_argument("--tag", help="Run tag (used in branch name)")
parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
help="Is lower or higher better?")
parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
parser.add_argument("--scope", choices=["project", "user"], default="project",
help="Where to store experiments: project (./) or user (~/)")
parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
parser.add_argument("--path", default=".", help="Project root path")
parser.add_argument("--skip-baseline", action="store_true")
parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
parser.add_argument("--list", action="store_true", help="List all experiments")
parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
args = parser.parse_args()
path = Path(args.path).resolve()
print(f"\n🔬 autoresearch-agent setup")
print(f" Project: {path}")
print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
project_root = Path(args.path).resolve()
# Get config from domain or args
if args.domain:
config = DOMAINS[args.domain].copy()
# List mode
if args.list:
root = get_autoresearch_root("project", project_root)
list_experiments(root)
user_root = get_autoresearch_root("user")
if user_root.exists() and user_root != root:
print(f"\n--- User-level experiments ({user_root}) ---")
list_experiments(user_root)
return
if args.list_evaluators:
list_evaluators()
return
# Validate required args for setup
if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
parser.error("Required: --domain, --name, --target, --eval, --metric")
root = get_autoresearch_root(args.scope, project_root)
print(f"\n autoresearch-agent setup")
print(f" Project: {project_root}")
print(f" Scope: {args.scope}")
print(f" Domain: {args.domain}")
print(f" Experiment: {args.name}")
print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
# Check git
code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
if code != 0:
print(" Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
sys.exit(1)
print(" Git repository found")
# Check target file
target_path = project_root / args.target
if not target_path.exists():
print(f" Error: target file not found: {args.target}")
sys.exit(1)
print(f" Target file found: {args.target}")
# Init root
init_root(root)
# Create experiment directory
experiment_dir = root / args.domain / args.name
if experiment_dir.exists():
print(f" Warning: experiment '{args.domain}/{args.name}' already exists.")
print(f" Use --name with a different name, or delete {experiment_dir}")
sys.exit(1)
experiment_dir.mkdir(parents=True)
print(f" Created {experiment_dir}/")
# Create files
create_program_md(experiment_dir, args.domain, args.name,
args.target, args.metric, args.direction, args.constraints)
print(" Created program.md")
create_config(experiment_dir, args.target, args.eval_cmd,
args.metric, args.direction, args.time_budget)
print(" Created config.cfg")
init_results_tsv(experiment_dir)
# Copy evaluator if specified
if args.evaluator:
copy_evaluator(experiment_dir, args.evaluator)
# Create git branch
if not args.skip_branch:
create_branch(str(project_root), args.domain, args.name)
# Test evaluation command
print(f"\n Testing evaluation: {args.eval_cmd}")
code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
if code != 0:
print(f" Warning: eval command failed (exit {code})")
if err:
print(f" stderr: {err[:200]}")
print(" Fix the eval command before running the experiment loop.")
else:
config = {
"target": args.target or "target.py",
"evaluate_cmd": args.evaluate_cmd or "python evaluate.py",
"metric": args.metric or "score",
"metric_direction": args.direction,
"time_budget_minutes": args.budget,
"metric_grep": f"^{args.metric or 'score'}:",
}
# Check metric is parseable
full_output = out + "\n" + err
metric_found = False
for line in full_output.splitlines():
if line.strip().startswith(f"{args.metric}:"):
metric_found = True
print(f" Eval works. Baseline: {line.strip()}")
break
if not metric_found:
print(f" Warning: eval ran but '{args.metric}:' not found in output.")
print(f" Make sure your eval command outputs: {args.metric}: <value>")
tag = args.tag or datetime.now().strftime("%b%d").lower()
# Validation checks
checks = [
check_git_repo(path),
check_program_md(path),
check_target_file(path, config["target"]),
check_evaluate_script(path),
]
if not all(checks):
print("\n⚠ Fix the above issues before running experiments.")
sys.exit(1)
# Create branch
branch = create_branch(path, tag)
if not branch:
sys.exit(1)
# Init results TSV
init_results_tsv(path)
# Save config for run_experiment.py
config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
(path / ".autoresearch.cfg").write_text(config_content + "\n")
print("✓ Saved .autoresearch.cfg")
# Run baseline
if not args.skip_baseline:
baseline = run_baseline(
path,
config["evaluate_cmd"],
config["metric_grep"],
config["time_budget_minutes"]
)
if baseline:
# Log baseline to TSV
code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
with open(path / "results.tsv", "a") as f:
f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
print(f"✓ Baseline logged to results.tsv")
print(f"\n✅ Setup complete!")
print(f" Branch: {branch}")
print(f" Target: {config['target']}")
print(f" Metric: {config['metric']} ({config['metric_direction']} is better)")
print(f" Budget: {config['time_budget_minutes']} min/experiment")
print(f"\nTo start the autonomous loop:")
print(f" python scripts/run_experiment.py --loop")
print(f"\nOr run a single experiment:")
print(f" python scripts/run_experiment.py --single")
# Summary
print(f"\n Setup complete!")
print(f" Experiment: {args.domain}/{args.name}")
print(f" Target: {args.target}")
print(f" Metric: {args.metric} ({args.direction} is better)")
print(f" Budget: {args.time_budget} min/experiment")
if not args.skip_branch:
print(f" Branch: autoresearch/{args.domain}/{args.name}")
print(f"\n To start:")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")
if __name__ == "__main__":