diff --git a/engineering/autoresearch-agent/SKILL.md b/engineering/autoresearch-agent/SKILL.md new file mode 100644 index 0000000..9193185 --- /dev/null +++ b/engineering/autoresearch-agent/SKILL.md @@ -0,0 +1,246 @@ +--- +name: "autoresearch-agent" +description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo." +license: MIT +metadata: + version: 1.0.0 + author: Alireza Rezvani + category: engineering + updated: 2026-03-13 +--- + +# Autoresearch Agent + +> You sleep. The agent experiments. You wake up to results. + +Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop. + +**Works for any domain with a measurable metric:** +- ML training optimization (original use case — optimize `train.py` by `val_bpb`) +- Prompt engineering (optimize system prompts by LLM-eval quality score) +- Code performance (optimize a module by benchmark runtime/score) +- Agent skill improvement (optimize `SKILL.md` by task completion rate) + +--- + +## Before Starting + +Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing. + +If no `program.md` exists, run the **Setup Wizard** below. + +--- + +## Setup Wizard + +Answer these 5 questions to configure the experiment: + +### 1. What are we optimizing? +The **target** — one file the agent is allowed to modify: +- `train.py` — ML training loop (Karpathy-style) +- `prompt.md` — system prompt for an LLM +- `src/module.py` — a specific code module +- `SKILL.md` — an agent skill definition + +### 2. What's the metric? +A **single number** that defines success. Lower or higher = better: +- `val_bpb` — validation bits per byte (ML, lower is better) +- `eval_score` — LLM quality score 0-100 (higher is better) +- `p50_ms` — median latency in milliseconds (lower is better) +- `pass_rate` — test pass rate 0-1 (higher is better) + +### 3. What's the time budget per experiment? +How long one experiment run takes: +- `5m` — fast iteration (Karpathy default, ~12 experiments/hour) +- `10m` — moderate (6/hour) +- `30m` — slow but thorough (2/hour) + +### 4. What can the agent change? +Constraints on the target file: +- Architecture only? Hyperparameters only? Everything? +- What packages/imports are available? +- What's explicitly off-limits? + +### 5. What's the evaluation function? +How we score each experiment: +- Fixed script that outputs the metric (e.g. `python evaluate.py`) +- API call that returns a score +- Test suite with a pass rate + +Once answered, run: `python scripts/setup_experiment.py` to initialize. + +--- + +## The Three Files + +Every autoresearch project has the same structure: + +``` +project/ +├── program.md ← Human writes this: objectives, constraints, strategy +├── target.* ← Agent modifies this: the thing being optimized +├── evaluate.py ← Fixed: the measurement function (never touch) +├── results.tsv ← Auto-generated: experiment log (git-tracked for continuity) +└── scripts/ + ├── setup_experiment.py ← Initialize a new run + ├── run_experiment.py ← Execute one experiment iteration + └── log_results.py ← Record results to TSV +``` + +### `program.md` — Your Research Directions +Write this once. The agent reads it before every experiment. It should contain: +- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code) +- **Strategy:** What directions to explore first +- **Constraints:** What the agent cannot change +- **Stopping criteria:** When a result is "good enough" + +See `references/program-template.md` for domain-specific templates. + +### Target File — The Only File the Agent Edits +Whatever you're optimizing. Strict scope: **one file, one metric**. + +### `evaluate.py` — Fixed Evaluation (Never Modified) +The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured. + +--- + +## The Experiment Loop + +Run: `python scripts/run_experiment.py --loop` + +``` +LOOP FOREVER: + +1. Read program.md for current strategy +2. Review git history: what has been tried? What worked? +3. Propose ONE change to the target file +4. Apply the change +5. git commit (with descriptive message) +6. Run evaluation: python evaluate.py > run.log 2>&1 +7. Parse metric from run.log +8. If metric improved → ADVANCE (keep commit, log "keep") +9. If metric equal/worse → REVERT (git reset, log "discard") +10. If crash → attempt fix, if unfixable log "crash" and revert +11. Update results.tsv +12. Go to 1 +``` + +### Rules (from Karpathy's original) + +- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted. +- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win. +- **One change per experiment** — don't change 5 things at once. You won't know what worked. +- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on. +- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash. +- **No new dependencies** — only use what's already available. + +--- + +## Results Log + +`results.tsv` (tab-separated, not git-tracked): + +``` +commit metric status description +a1b2c3d 0.9979 keep baseline +b2c3d4e 0.9932 keep increased learning rate +c3d4e5f 1.0050 discard switched to GeLU (worse) +d4e5f6g 0.0000 crash doubled model width (OOM) +``` + +Run `python scripts/log_results.py --summary` for a visual summary. + +--- + +## Domain-Specific Configurations + +### ML Training (Karpathy-style) +```yaml +target: train.py +evaluate: uv run prepare.py --eval-only +metric: val_bpb (lower is better) +time_budget: 5m +git_branch: autoresearch/{date}-{tag} +``` + +### Prompt Engineering +```yaml +target: prompt.md +evaluate: python evaluate.py --model gpt-4o --test-cases tests/ +metric: eval_score (0-100, higher is better) +time_budget: 2m +git_branch: prompt-research/{date} +``` + +### Code Performance +```yaml +target: src/hot_module.py +evaluate: python benchmark.py --runs 5 --warmup 1 +metric: p50_ms (lower is better) +time_budget: 10m +git_branch: perf-research/{date} +``` + +### Agent Skill Optimization +```yaml +target: SKILL.md +evaluate: python scripts/skill_evaluator.py --task-suite tests/ +metric: pass_rate (0-1, higher is better) +time_budget: 5m +git_branch: skill-research/{date} +``` + +See `references/experiment-domains.md` for full setup guides per domain. + +--- + +## Scripts + +| Script | Purpose | +|--------|---------| +| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run | +| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) | +| `log_results.py` | Record results to TSV; `--summary` prints progress table | + +--- + +## Installation + +### One-liner (any tool) +```bash +git clone https://github.com/alirezarezvani/claude-skills.git +cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/ +``` + +### Multi-tool install +```bash +# Clone the repo, then use the convert script for your tool: +./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw +``` + +### OpenClaw +```bash +clawhub install autoresearch-agent +``` + +--- + +## Proactive Triggers + +Flag these issues without being asked: + +- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template. +- **Target file has no git history** → `git init` and commit baseline first. +- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting. +- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash. +- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions. +- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons. + +--- + +## Related Skills + +- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics. +- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops. +- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops. +- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function. diff --git a/engineering/autoresearch-agent/references/experiment-domains.md b/engineering/autoresearch-agent/references/experiment-domains.md new file mode 100644 index 0000000..3b79dc3 --- /dev/null +++ b/engineering/autoresearch-agent/references/experiment-domains.md @@ -0,0 +1,175 @@ +# Experiment Domains Guide + +## Domain 1: ML Training (Karpathy-style) + +**Best for:** LLM/neural net training optimization on a single GPU + +**Requirements:** +- NVIDIA GPU (H100 recommended, A100/RTX also work) +- CUDA + PyTorch +- `uv` package manager +- ~50GB disk (training data) + +**Setup:** +```bash +# Clone autoresearch repo (the ML training environment) +git clone https://github.com/karpathy/autoresearch my-ml-research +cd my-ml-research +uv sync +uv run prepare.py # one-time data download + tokenizer (~2 min) + +# Initialize autoresearch skill +cp -r ~/.claude/skills/autoresearch-agent/scripts ./scripts + +# Configure +python scripts/setup_experiment.py --domain ml --tag mar13 +``` + +**Metric:** `val_bpb` — validation bits per byte. Lower = better model. + +**What the agent can change in `train.py`:** +- Model depth, width, attention heads +- Learning rate, scheduler, warmup +- Optimizer (Muon, AdamW, variants) +- Batch size, gradient accumulation +- Architecture (attention patterns, FFN type) + +**Tip for smaller GPUs (Mac M-series, RTX 3090 etc):** +Karpathy recommends forks for non-H100 hardware. Lower `DEPTH` to 4, use TinyStories dataset, lower `MAX_SEQ_LEN` to 256. + +--- + +## Domain 2: Prompt Engineering + +**Best for:** Optimizing system prompts for quality/accuracy/tone + +**Requirements:** +- LLM API access (OpenAI, Anthropic, etc.) +- Test cases with expected outputs +- An LLM judge for scoring (can be same model) + +**Setup:** +```bash +mkdir my-prompt-research && cd my-prompt-research +git init + +# Create prompt.md (the thing being optimized) +echo "You are a helpful assistant." > prompt.md + +# Create evaluate.py (fixed — never modify) +cat > evaluate.py << 'EOF' +#!/usr/bin/env python3 +# Fixed evaluation harness — DO NOT MODIFY +import json, sys +from pathlib import Path + +PROMPT = Path("prompt.md").read_text() +# Load test cases +TEST_CASES = json.loads(Path("tests/cases.json").read_text()) + +# Run prompt against test cases, score with LLM judge +# ... (customize for your LLM + scoring logic) +total = sum(score_case(PROMPT, case) for case in TEST_CASES) +score = total / len(TEST_CASES) * 100 +print(f"eval_score: {score:.2f}") +EOF + +# Initialize +python scripts/setup_experiment.py --domain prompt --tag mar13 +``` + +**Metric:** `eval_score` (0-100). Higher = better prompt. + +--- + +## Domain 3: Code Performance + +**Best for:** Optimizing a specific hot module for speed + +**Requirements:** +- A Python module with measurable performance +- Existing tests (correctness must not regress) +- A benchmark harness + +**Setup:** +```bash +cd my-project + +# Create benchmark.py (fixed — never modify) +cat > benchmark.py << 'EOF' +#!/usr/bin/env python3 +# Fixed benchmark — DO NOT MODIFY +import time, statistics +from src.module import your_function +from tests.test_module import run_tests + +# Correctness check first +if not run_tests(): + print("TESTS FAILED") + sys.exit(1) + +# Benchmark +data = generate_test_data(n=10000) +times = [] +for _ in range(10): + t0 = time.perf_counter() + your_function(data) + times.append((time.perf_counter() - t0) * 1000) + +p50 = statistics.median(times) +print(f"p50_ms: {p50:.2f}") +print(f"p95_ms: {statistics.quantiles(times, n=20)[18]:.2f}") +EOF + +python scripts/setup_experiment.py --domain code \ + --target src/module.py \ + --tag mar13 +``` + +**Metric:** `p50_ms` — median latency. Lower = faster. + +--- + +## Domain 4: Agent Skill Optimization + +**Best for:** Improving the quality of claude-skills SKILL.md files + +**Requirements:** +- A SKILL.md to optimize +- A task evaluation suite (15-20 standardized tasks) +- An LLM judge for scoring + +**Setup:** +```bash +# Create a new skill research project +mkdir skill-research-{skill-name} && cd skill-research-{skill-name} +git init + +# Copy the skill to optimize +cp ~/.claude/skills/{skill-name}/SKILL.md . + +# Create evaluate.py +cat > scripts/skill_evaluator.py << 'EOF' +#!/usr/bin/env python3 +# Fixed evaluator — DO NOT MODIFY +# Runs SKILL.md against 15 standardized tasks using LLM judge +# Outputs: pass_rate: 0.80 (etc.) +EOF + +python scripts/setup_experiment.py --domain skill --tag mar13 +``` + +**Metric:** `pass_rate` (0-1). Higher = better skill. + +--- + +## Choosing Your Domain + +| Question | Recommendation | +|----------|---------------| +| Do I have a GPU and want to improve an LLM? | ML Training | +| Do I want to improve a prompt/system message? | Prompt Engineering | +| Do I have slow Python code I want to speed up? | Code Performance | +| Do I want to improve one of my claude-skills? | Skill Optimization | + +**First time?** Start with **Prompt Engineering** — no GPU required, fast experiments (2 min each), immediately applicable results. diff --git a/engineering/autoresearch-agent/references/program-template.md b/engineering/autoresearch-agent/references/program-template.md new file mode 100644 index 0000000..03498d4 --- /dev/null +++ b/engineering/autoresearch-agent/references/program-template.md @@ -0,0 +1,170 @@ +# program.md Templates + +Copy the template for your domain and paste into your project root as `program.md`. + +--- + +## ML Training (Karpathy-style) + +```markdown +# autoresearch — ML Training + +## Goal +Minimize val_bpb on the validation set. Lower is better. + +## What You Can Change (train.py only) +- Model architecture (depth, width, attention heads, FFN ratio) +- Optimizer (learning rate, warmup, scheduler, weight decay) +- Training loop (batch size, gradient accumulation, clipping) +- Regularization (dropout, weight tying, etc.) +- Any self-contained improvement that doesn't require new packages + +## What You Cannot Change +- prepare.py (fixed — contains evaluation harness) +- Dependencies (pyproject.toml is locked) +- Time budget (always 5 minutes, wall clock) +- Evaluation metric (val_bpb is the ground truth) + +## Strategy +1. First run: establish baseline. Do not change anything. +2. Explore learning rate range (try 2x and 0.5x current) +3. Try depth changes (±2 layers) +4. Try optimizer changes (Muon vs. AdamW variants) +5. If things improve, double down. If stuck, try something radical. + +## Simplicity Rule +A small improvement with ugly code is NOT worth it. +Equal performance with simpler code IS worth it. +Removing code that gets same results is the best outcome. + +## Stop When +val_bpb < 0.95 OR after 100 experiments, whichever comes first. +``` + +--- + +## Prompt Engineering + +```markdown +# autoresearch — Prompt Optimization + +## Goal +Maximize eval_score on the test suite. Higher is better (0-100). + +## What You Can Change (prompt.md only) +- System prompt instructions +- Examples and few-shot demonstrations +- Output format specifications +- Chain-of-thought instructions +- Persona and tone +- Task decomposition strategies + +## What You Cannot Change +- evaluate.py (fixed evaluation harness) +- Test cases in tests/ (ground truth) +- Model being evaluated (specified in evaluate.py) +- Scoring criteria (defined in evaluate.py) + +## Strategy +1. First run: baseline with current prompt (or empty) +2. Add clear role/persona definition +3. Add output format specification +4. Add chain-of-thought instruction +5. Add 2-3 diverse examples +6. Refine based on failure modes from run.log + +## Evaluation +- evaluate.py runs the prompt against 20 test cases +- Each test case is scored 0-5 by GPT-4o +- eval_score = average * 20 (maps to 0-100) +- Run log shows which test cases failed + +## Stop When +eval_score >= 85 OR after 50 experiments. +``` + +--- + +## Code Performance + +```markdown +# autoresearch — Performance Optimization + +## Goal +Minimize p50_ms (median latency). Lower is better. + +## What You Can Change (src/module.py only) +- Algorithm implementation +- Data structures (use faster alternatives) +- Caching and memoization +- Vectorization (NumPy, etc.) +- Loop optimization +- I/O patterns +- Memory allocation patterns + +## What You Cannot Change +- benchmark.py (fixed benchmark harness) +- Public API (function signatures must stay the same) +- External dependencies (add nothing new) +- Correctness tests (tests/ must still pass) + +## Constraints +- Correctness is non-negotiable. benchmark.py runs tests first. +- If tests fail → immediate crash status, no metric recorded. +- Memory usage: p99 < 2x baseline acceptable, hard limit at 4x. + +## Strategy +1. Baseline: profile first, don't guess +2. Check if there's any O(n²) → O(n log n) opportunity +3. Try caching repeated computations +4. Try NumPy vectorization for loops +5. Try algorithm-level changes last (higher risk) + +## Stop When +p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments. +``` + +--- + +## Agent Skill Optimization + +```markdown +# autoresearch — Skill Optimization + +## Goal +Maximize pass_rate on the task evaluation suite. Higher is better (0-1). + +## What You Can Change (SKILL.md only) +- Skill description and trigger phrases +- Core workflow steps and ordering +- Decision frameworks and rules +- Output format specifications +- Example inputs/outputs +- Related skills disambiguation +- Proactive trigger conditions + +## What You Cannot Change +- scripts/skill_evaluator.py (fixed evaluation) +- Test tasks in tests/ (ground truth benchmark) +- Skill name (used for routing) +- License or metadata + +## Evaluation +- skill_evaluator.py runs SKILL.md against 15 standardized tasks +- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass) +- pass_rate = sum(scores) / 15 + +## Strategy +1. Baseline: run as-is +2. Improve trigger description (better routing = more passes) +3. Sharpen the core workflow (clearer = better execution) +4. Add missing edge cases to the rules section +5. Improve disambiguation (reduce false-positive routing) + +## Simplicity Rule +A shorter SKILL.md that achieves the same score is better. +Aim for 200-400 lines total. + +## Stop When +pass_rate >= 0.90 OR after 30 experiments. +``` diff --git a/engineering/autoresearch-agent/scripts/log_results.py b/engineering/autoresearch-agent/scripts/log_results.py new file mode 100644 index 0000000..0186805 --- /dev/null +++ b/engineering/autoresearch-agent/scripts/log_results.py @@ -0,0 +1,126 @@ +#!/usr/bin/env python3 +""" +autoresearch-agent: Results Logger + +View and analyze experiment results from results.tsv. + +Usage: + python scripts/log_results.py --summary # Print progress table + python scripts/log_results.py --best # Show best result + python scripts/log_results.py --history # Full experiment history + python scripts/log_results.py --record commit val status desc # Add entry manually +""" + +import argparse +import sys +from pathlib import Path + + +def load_results(path): + tsv = Path(path) / "results.tsv" + if not tsv.exists(): + return [] + lines = tsv.read_text().splitlines()[1:] # skip header + results = [] + for line in lines: + parts = line.split("\t") + if len(parts) >= 4: + try: + metric_val = float(parts[1]) if parts[1] != "N/A" else None + except ValueError: + metric_val = None + results.append({ + "commit": parts[0], + "metric": metric_val, + "status": parts[2], + "description": parts[3] + }) + return results + + +def print_summary(results, metric_name="metric", direction="lower"): + if not results: + print("No experiments logged yet.") + return + + keeps = [r for r in results if r["status"] == "keep"] + discards = [r for r in results if r["status"] == "discard"] + crashes = [r for r in results if r["status"] == "crash"] + + print(f"\n{'─'*60}") + print(f" autoresearch-agent — Results Summary") + print(f"{'─'*60}") + print(f" Total experiments: {len(results)}") + print(f" ✅ Keep: {len(keeps):3d} ({len(keeps)/max(len(results),1)*100:.0f}%)") + print(f" ❌ Discard: {len(discards):3d} ({len(discards)/max(len(results),1)*100:.0f}%)") + print(f" 💥 Crash: {len(crashes):3d} ({len(crashes)/max(len(results),1)*100:.0f}%)") + + if keeps: + valid = [r for r in keeps if r["metric"] is not None] + if valid: + baseline = valid[0]["metric"] + best = min(r["metric"] for r in valid) if direction == "lower" else max(r["metric"] for r in valid) + best_run = next(r for r in valid if r["metric"] == best) + improvement = ((baseline - best) / baseline * 100) if direction == "lower" else ((best - baseline) / baseline * 100) + + print(f"\n {metric_name}:") + print(f" Baseline: {baseline:.6f}") + print(f" Best: {best:.6f} (commit: {best_run['commit']})") + print(f" Change: {improvement:+.2f}%") + + print(f"{'─'*60}\n") + + +def print_history(results): + if not results: + print("No experiments logged yet.") + return + + print(f"\n{'COMMIT':8} {'METRIC':10} {'STATUS':8} DESCRIPTION") + print("─" * 60) + for r in results: + metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash " + status_icon = {"keep": "✅", "discard": "❌", "crash": "💥"}.get(r["status"], "?") + print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}") + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--summary", action="store_true") + parser.add_argument("--best", action="store_true") + parser.add_argument("--history", action="store_true") + parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC")) + parser.add_argument("--path", default=".") + parser.add_argument("--metric", default="metric") + parser.add_argument("--direction", default="lower", choices=["lower", "higher"]) + args = parser.parse_args() + + path = Path(args.path).resolve() + + if args.record: + commit, metric, status, desc = args.record + tsv = path / "results.tsv" + if not tsv.exists(): + tsv.write_text("commit\tmetric\tstatus\tdescription\n") + with open(tsv, "a") as f: + f.write(f"{commit}\t{metric}\t{status}\t{desc}\n") + print(f"✓ Logged: {commit} {metric} {status}") + return + + results = load_results(path) + + if args.history: + print_history(results) + elif args.best: + keeps = [r for r in results if r["status"] == "keep" and r["metric"]] + if not keeps: + print("No successful experiments yet.") + return + best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"]) + print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}") + else: + print_summary(results, args.metric, args.direction) + + +if __name__ == "__main__": + main() diff --git a/engineering/autoresearch-agent/scripts/run_experiment.py b/engineering/autoresearch-agent/scripts/run_experiment.py new file mode 100644 index 0000000..eb6a93d --- /dev/null +++ b/engineering/autoresearch-agent/scripts/run_experiment.py @@ -0,0 +1,310 @@ +#!/usr/bin/env python3 +""" +autoresearch-agent: Experiment Runner + +Executes the autonomous experiment loop: +- Reads .autoresearch.cfg for project config +- Runs the target evaluation +- Keeps improvements (git commit) or discards failures (git reset) +- Logs everything to results.tsv +- Loops indefinitely until interrupted + +Usage: + python scripts/run_experiment.py --loop # Run forever + python scripts/run_experiment.py --single # Run one experiment + python scripts/run_experiment.py --dry-run # Show what would happen +""" + +import argparse +import os +import signal +import subprocess +import sys +import time +from datetime import datetime +from pathlib import Path + + +def load_config(path): + """Load .autoresearch.cfg""" + cfg_file = Path(path) / ".autoresearch.cfg" + if not cfg_file.exists(): + print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.") + sys.exit(1) + config = {} + for line in cfg_file.read_text().splitlines(): + if ":" in line: + k, v = line.split(":", 1) + config[k.strip()] = v.strip() + return config + + +def run_cmd(cmd, cwd=None, timeout=None): + """Run shell command, return (returncode, stdout, stderr).""" + result = subprocess.run( + cmd, shell=True, capture_output=True, text=True, + cwd=cwd, timeout=timeout + ) + return result.returncode, result.stdout.strip(), result.stderr.strip() + + +def get_current_commit(path): + _, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path) + return commit + + +def get_current_metric(path, metric_grep): + """Read the last recorded metric from results.tsv.""" + tsv = Path(path) / "results.tsv" + if not tsv.exists(): + return None + lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l] + if not lines: + return None + last = lines[-1].split("\t") + try: + return float(last[1]) + except (ValueError, IndexError): + return None + + +def run_evaluation(path, evaluate_cmd, time_budget_minutes): + """Run evaluation with time limit.""" + hard_limit = time_budget_minutes * 60 * 2.5 # 2.5x as hard timeout + t0 = time.time() + try: + code, _, _ = run_cmd( + f"{evaluate_cmd} > run.log 2>&1", + cwd=path, + timeout=hard_limit + ) + elapsed = time.time() - t0 + return code, elapsed + except subprocess.TimeoutExpired: + elapsed = time.time() - t0 + return -1, elapsed # -1 = timeout + + +def extract_metric(path, metric_grep): + """Extract metric value from run.log.""" + code, out, _ = run_cmd( + f"grep '{metric_grep}' run.log | tail -1", + cwd=path + ) + if not out: + return None + try: + return float(out.split(":")[-1].strip()) + except ValueError: + return None + + +def is_improvement(new_val, old_val, direction): + """Check if new result is better than old.""" + if old_val is None: + return True # First run always "improves" + if direction == "lower": + return new_val < old_val + else: + return new_val > old_val + + +def log_result(path, commit, metric_val, status, description): + """Append result to results.tsv.""" + tsv = Path(path) / "results.tsv" + metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A" + with open(tsv, "a") as f: + f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n") + + +def get_experiment_count(path): + """Count experiments run so far.""" + tsv = Path(path) / "results.tsv" + if not tsv.exists(): + return 0 + lines = tsv.read_text().splitlines() + return max(0, len(lines) - 1) # subtract header + + +def run_single_experiment(path, config, exp_num, dry_run=False): + """Run one experiment iteration.""" + direction = config.get("metric_direction", "lower") + metric_grep = config.get("metric_grep", "^metric:") + evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py") + time_budget = int(config.get("time_budget_minutes", 5)) + metric_name = config.get("metric", "metric") + + best_so_far = get_current_metric(path, metric_grep) + ts = datetime.now().strftime("%H:%M:%S") + + print(f"\n[{ts}] Experiment #{exp_num}") + print(f" Best {metric_name} so far: {best_so_far}") + + if dry_run: + print(" [DRY RUN] Would run evaluation and check metric") + return "dry_run" + + # Save pre-experiment state for rollback + code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path) + if code != 0: + print(" ✗ Can't get git state. Is this a git repo with commits?") + return "error" + + # Run evaluation + print(f" Running: {evaluate_cmd} (budget: {time_budget} min)") + ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget) + + # Handle timeout + if ret_code == -1: + print(f" ✗ TIMEOUT after {elapsed:.0f}s — discarding") + run_cmd("git checkout -- .", cwd=path) # revert uncommitted changes + # Commit was already made by the agent before evaluation + run_cmd(f"git reset --hard {pre_commit}", cwd=path) + curr_commit = get_current_commit(path) + log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s") + return "crash" + + # Handle non-zero exit + if ret_code != 0: + # Check if it crashed + code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path) + print(f" ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s") + print(f" Last output: {tail[:200]}") + run_cmd(f"git reset --hard {pre_commit}", cwd=path) + curr_commit = get_current_commit(path) + log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}") + return "crash" + + # Extract metric + metric_val = extract_metric(path, metric_grep) + if metric_val is None: + print(f" ✗ Could not parse metric from run.log") + run_cmd(f"git reset --hard {pre_commit}", cwd=path) + curr_commit = get_current_commit(path) + log_result(path, curr_commit, None, "crash", "metric_parse_failed") + return "crash" + + curr_commit = get_current_commit(path) + delta = "" + if best_so_far is not None: + diff = metric_val - best_so_far + delta = f" (Δ{diff:+.4f})" + + print(f" {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s") + + # Keep or discard + if is_improvement(metric_val, best_so_far, direction): + print(f" ✅ KEEP — improvement confirmed") + log_result(path, curr_commit, metric_val, "keep", + f"improvement_{metric_name}_{metric_val:.4f}") + return "keep" + else: + print(f" ❌ DISCARD — no improvement") + run_cmd(f"git reset --hard {pre_commit}", cwd=path) + curr_commit = get_current_commit(path) + log_result(path, curr_commit, metric_val, "discard", + f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}") + return "discard" + + +def print_summary(path): + """Print experiment summary.""" + tsv = Path(path) / "results.tsv" + if not tsv.exists(): + return + lines = tsv.read_text().splitlines()[1:] # skip header + if not lines: + return + + keeps = [l for l in lines if "\tkeep\t" in l] + discards = [l for l in lines if "\tdiscard\t" in l] + crashes = [l for l in lines if "\tcrash\t" in l] + + print(f"\n{'='*50}") + print(f" Session Summary") + print(f" Experiments: {len(lines)} total") + print(f" ✅ Keep: {len(keeps)} | ❌ Discard: {len(discards)} | 💥 Crash: {len(crashes)}") + + if keeps: + try: + first_metric = float(keeps[0].split("\t")[1]) + last_metric = float(keeps[-1].split("\t")[1]) + direction = "↓" if last_metric < first_metric else "↑" + print(f" Best progress: {first_metric:.6f} → {last_metric:.6f} {direction}") + except (ValueError, IndexError): + pass + print(f"{'='*50}\n") + + +def main(): + parser = argparse.ArgumentParser(description="autoresearch-agent runner") + parser.add_argument("--loop", action="store_true", help="Run forever") + parser.add_argument("--single", action="store_true", help="Run one experiment") + parser.add_argument("--dry-run", action="store_true", help="Dry run only") + parser.add_argument("--path", default=".", help="Project root") + parser.add_argument("--max-experiments", type=int, default=0, + help="Max experiments (0 = unlimited)") + args = parser.parse_args() + + path = Path(args.path).resolve() + config = load_config(path) + + print(f"\n🔬 autoresearch-agent") + print(f" Project: {path}") + print(f" Target: {config.get('target', '?')}") + print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)") + print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment") + print(f" Mode: {'loop' if args.loop else 'single'}") + + if args.single: + exp_num = get_experiment_count(path) + 1 + run_single_experiment(path, config, exp_num, args.dry_run) + return + + if not args.loop and not args.dry_run: + print("\nSpecify --loop (forever) or --single (one experiment)") + sys.exit(1) + + # Setup graceful shutdown + def handle_interrupt(sig, frame): + print_summary(path) + print("\n⏹ Stopped by user.") + sys.exit(0) + + signal.signal(signal.SIGINT, handle_interrupt) + signal.signal(signal.SIGTERM, handle_interrupt) + + # Main loop + consecutive_crashes = 0 + exp_num = get_experiment_count(path) + 1 + + print(f"\nStarting loop. Ctrl+C to stop and print summary.\n") + + while True: + result = run_single_experiment(path, config, exp_num, args.dry_run) + exp_num += 1 + + if result == "crash": + consecutive_crashes += 1 + else: + consecutive_crashes = 0 + + # Bail if 5 consecutive crashes + if consecutive_crashes >= 5: + print("\n⚠ 5 consecutive crashes. Pausing for investigation.") + print(" Check run.log for the last error.") + break + + # Check max experiments + if args.max_experiments > 0 and exp_num > args.max_experiments: + print(f"\n✓ Reached max experiments ({args.max_experiments})") + break + + if args.single: + break + + print_summary(path) + + +if __name__ == "__main__": + main() diff --git a/engineering/autoresearch-agent/scripts/setup_experiment.py b/engineering/autoresearch-agent/scripts/setup_experiment.py new file mode 100644 index 0000000..2898f13 --- /dev/null +++ b/engineering/autoresearch-agent/scripts/setup_experiment.py @@ -0,0 +1,255 @@ +#!/usr/bin/env python3 +""" +autoresearch-agent: Setup Wizard + +Initializes a new research run: +1. Validates the project structure +2. Creates a git branch +3. Runs the baseline experiment +4. Initializes results.tsv + +Usage: + python scripts/setup_experiment.py [--config experiment.yaml] + python scripts/setup_experiment.py --domain ml|prompt|code|skill +""" + +import argparse +import os +import subprocess +import sys +import time +from datetime import datetime +from pathlib import Path + + +DOMAINS = { + "ml": { + "target": "train.py", + "evaluate_cmd": "uv run train.py", + "metric": "val_bpb", + "metric_direction": "lower", + "time_budget_minutes": 5, + "metric_grep": "^val_bpb:", + }, + "prompt": { + "target": "prompt.md", + "evaluate_cmd": "python evaluate.py", + "metric": "eval_score", + "metric_direction": "higher", + "time_budget_minutes": 2, + "metric_grep": "^eval_score:", + }, + "code": { + "target": "src/module.py", + "evaluate_cmd": "python benchmark.py", + "metric": "p50_ms", + "metric_direction": "lower", + "time_budget_minutes": 10, + "metric_grep": "^p50_ms:", + }, + "skill": { + "target": "SKILL.md", + "evaluate_cmd": "python scripts/skill_evaluator.py", + "metric": "pass_rate", + "metric_direction": "higher", + "time_budget_minutes": 5, + "metric_grep": "^pass_rate:", + }, +} + + +def run_cmd(cmd, cwd=None, timeout=None): + """Run a shell command and return (returncode, stdout, stderr).""" + result = subprocess.run( + cmd, shell=True, capture_output=True, text=True, + cwd=cwd, timeout=timeout + ) + return result.returncode, result.stdout.strip(), result.stderr.strip() + + +def check_git_repo(path): + """Verify we're in a git repo.""" + code, out, err = run_cmd("git rev-parse --is-inside-work-tree", cwd=path) + if code != 0: + print("✗ Not a git repository. Run: git init && git add . && git commit -m 'initial'") + return False + print("✓ Git repository found") + return True + + +def check_program_md(path): + """Check program.md exists and has content.""" + pm = Path(path) / "program.md" + if not pm.exists(): + print("⚠ program.md not found. Creating template...") + return False + content = pm.read_text() + if len(content) < 100: + print("⚠ program.md looks empty. Fill it out before running experiments.") + return False + print(f"✓ program.md found ({len(content)} chars)") + return True + + +def check_target_file(path, target): + """Check target file exists.""" + tf = Path(path) / target + if not tf.exists(): + print(f"✗ Target file not found: {target}") + return False + print(f"✓ Target file found: {target}") + return True + + +def check_evaluate_script(path): + """Check evaluate.py exists.""" + ev = Path(path) / "evaluate.py" + if not ev.exists(): + print("⚠ evaluate.py not found. You need a fixed evaluation function.") + print(" Create evaluate.py that outputs: metric_name: ") + return False + print("✓ evaluate.py found") + return True + + +def create_branch(path, tag): + """Create and checkout the experiment branch.""" + branch = f"autoresearch/{tag}" + code, out, err = run_cmd(f"git checkout -b {branch}", cwd=path) + if code != 0: + if "already exists" in err: + print(f"✗ Branch '{branch}' already exists. Use a different tag.") + else: + print(f"✗ Failed to create branch: {err}") + return None + print(f"✓ Created branch: {branch}") + return branch + + +def init_results_tsv(path): + """Create results.tsv with header.""" + tsv = Path(path) / "results.tsv" + if tsv.exists(): + print(f"✓ results.tsv already exists ({tsv.stat().st_size} bytes)") + return + tsv.write_text("commit\tmetric\tstatus\tdescription\n") + print("✓ Created results.tsv") + + +def run_baseline(path, evaluate_cmd, metric_grep, time_budget_minutes): + """Run the baseline experiment.""" + print(f"\nRunning baseline experiment (~{time_budget_minutes} min)...") + timeout = time_budget_minutes * 60 * 2.5 # 2.5x budget as hard limit + + t0 = time.time() + code, out, err = run_cmd( + f"{evaluate_cmd} > run.log 2>&1", + cwd=path, + timeout=timeout + ) + elapsed = time.time() - t0 + + if code != 0: + print(f"✗ Baseline run failed after {elapsed:.0f}s. Check run.log") + return None + + # Extract metric + grep_code, grep_out, _ = run_cmd( + f"grep '{metric_grep}' run.log | tail -1", + cwd=path + ) + if not grep_out: + print("✗ Could not extract metric from run.log. Check metric_grep pattern.") + return None + + metric_value = grep_out.split(":")[-1].strip() + print(f"✓ Baseline complete in {elapsed:.0f}s — metric: {metric_value}") + return metric_value + + +def main(): + parser = argparse.ArgumentParser(description="autoresearch-agent setup") + parser.add_argument("--domain", choices=list(DOMAINS.keys()), help="Experiment domain") + parser.add_argument("--target", help="Target file to optimize") + parser.add_argument("--evaluate-cmd", help="Evaluation command") + parser.add_argument("--metric", help="Metric name") + parser.add_argument("--direction", choices=["lower", "higher"], default="lower") + parser.add_argument("--budget", type=int, default=5, help="Time budget in minutes") + parser.add_argument("--tag", help="Run tag (used in branch name)") + parser.add_argument("--path", default=".", help="Project root path") + parser.add_argument("--skip-baseline", action="store_true") + args = parser.parse_args() + + path = Path(args.path).resolve() + print(f"\n🔬 autoresearch-agent setup") + print(f" Project: {path}") + print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n") + + # Get config from domain or args + if args.domain: + config = DOMAINS[args.domain].copy() + else: + config = { + "target": args.target or "target.py", + "evaluate_cmd": args.evaluate_cmd or "python evaluate.py", + "metric": args.metric or "score", + "metric_direction": args.direction, + "time_budget_minutes": args.budget, + "metric_grep": f"^{args.metric or 'score'}:", + } + + tag = args.tag or datetime.now().strftime("%b%d").lower() + + # Validation checks + checks = [ + check_git_repo(path), + check_program_md(path), + check_target_file(path, config["target"]), + check_evaluate_script(path), + ] + + if not all(checks): + print("\n⚠ Fix the above issues before running experiments.") + sys.exit(1) + + # Create branch + branch = create_branch(path, tag) + if not branch: + sys.exit(1) + + # Init results TSV + init_results_tsv(path) + + # Save config for run_experiment.py + config_content = "\n".join(f"{k}: {v}" for k, v in config.items()) + (path / ".autoresearch.cfg").write_text(config_content + "\n") + print("✓ Saved .autoresearch.cfg") + + # Run baseline + if not args.skip_baseline: + baseline = run_baseline( + path, + config["evaluate_cmd"], + config["metric_grep"], + config["time_budget_minutes"] + ) + if baseline: + # Log baseline to TSV + code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path) + with open(path / "results.tsv", "a") as f: + f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n") + print(f"✓ Baseline logged to results.tsv") + + print(f"\n✅ Setup complete!") + print(f" Branch: {branch}") + print(f" Target: {config['target']}") + print(f" Metric: {config['metric']} ({config['metric_direction']} is better)") + print(f" Budget: {config['time_budget_minutes']} min/experiment") + print(f"\nTo start the autonomous loop:") + print(f" python scripts/run_experiment.py --loop") + print(f"\nOr run a single experiment:") + print(f" python scripts/run_experiment.py --single") + + +if __name__ == "__main__": + main()