diff --git a/engineering/autoresearch-agent/SKILL.md b/engineering/autoresearch-agent/SKILL.md index 9193185..40f1779 100644 --- a/engineering/autoresearch-agent/SKILL.md +++ b/engineering/autoresearch-agent/SKILL.md @@ -1,9 +1,9 @@ --- name: "autoresearch-agent" -description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo." +description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo." license: MIT metadata: - version: 1.0.0 + version: 2.0.0 author: Alireza Rezvani category: engineering updated: 2026-03-13 @@ -13,194 +13,233 @@ metadata: > You sleep. The agent experiments. You wake up to results. -Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop. +Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). 
The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely. -**Works for any domain with a measurable metric:** -- ML training optimization (original use case — optimize `train.py` by `val_bpb`) -- Prompt engineering (optimize system prompts by LLM-eval quality score) -- Code performance (optimize a module by benchmark runtime/score) -- Agent skill improvement (optimize `SKILL.md` by task completion rate) +Not one guess — fifty measured attempts, compounding. --- -## Before Starting +## When This Skill Activates -Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing. +Recognize these patterns from the user: -If no `program.md` exists, run the **Setup Wizard** below. +- "Make this faster / smaller / better" +- "Optimize [file] for [metric]" +- "Improve my [headlines / copy / prompts]" +- "Run experiments overnight" +- "I want to get [metric] from X to Y" +- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch + +If the user describes a target file + a way to measure success → this skill applies. --- -## Setup Wizard +## Setup -Answer these 5 questions to configure the experiment: +### First Time — Create the Experiment -### 1. What are we optimizing? -The **target** — one file the agent is allowed to modify: -- `train.py` — ML training loop (Karpathy-style) -- `prompt.md` — system prompt for an LLM -- `src/module.py` — a specific code module -- `SKILL.md` — an agent skill definition +Run the setup script. The user decides where experiments live: -### 2. What's the metric? -A **single number** that defines success. 
Lower or higher = better: -- `val_bpb` — validation bits per byte (ML, lower is better) -- `eval_score` — LLM quality score 0-100 (higher is better) -- `p50_ms` — median latency in milliseconds (lower is better) -- `pass_rate` — test pass rate 0-1 (higher is better) - -### 3. What's the time budget per experiment? -How long one experiment run takes: -- `5m` — fast iteration (Karpathy default, ~12 experiments/hour) -- `10m` — moderate (6/hour) -- `30m` — slow but thorough (2/hour) - -### 4. What can the agent change? -Constraints on the target file: -- Architecture only? Hyperparameters only? Everything? -- What packages/imports are available? -- What's explicitly off-limits? - -### 5. What's the evaluation function? -How we score each experiment: -- Fixed script that outputs the metric (e.g. `python evaluate.py`) -- API call that returns a score -- Test suite with a pass rate - -Once answered, run: `python scripts/setup_experiment.py` to initialize. - ---- - -## The Three Files - -Every autoresearch project has the same structure: - -``` -project/ -├── program.md ← Human writes this: objectives, constraints, strategy -├── target.* ← Agent modifies this: the thing being optimized -├── evaluate.py ← Fixed: the measurement function (never touch) -├── results.tsv ← Auto-generated: experiment log (git-tracked for continuity) -└── scripts/ - ├── setup_experiment.py ← Initialize a new run - ├── run_experiment.py ← Execute one experiment iteration - └── log_results.py ← Record results to TSV +**Project-level** (inside repo, git-tracked, shareable with team): +```bash +python scripts/setup_experiment.py \ + --domain engineering \ + --name api-speed \ + --target src/api/search.py \ + --eval "pytest bench.py --tb=no -q" \ + --metric p50_ms \ + --direction lower \ + --scope project ``` -### `program.md` — Your Research Directions -Write this once. The agent reads it before every experiment. 
It should contain: -- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code) -- **Strategy:** What directions to explore first -- **Constraints:** What the agent cannot change -- **Stopping criteria:** When a result is "good enough" +**User-level** (personal, in `~/.autoresearch/`): +```bash +python scripts/setup_experiment.py \ + --domain marketing \ + --name medium-ctr \ + --target content/titles.md \ + --eval "python evaluate.py" \ + --metric ctr_score \ + --direction higher \ + --evaluator llm_judge_content \ + --scope user +``` -See `references/program-template.md` for domain-specific templates. +The `--scope` flag determines where `.autoresearch/` lives: +- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored. +- `user` → `~/.autoresearch/` in the home directory. Everything is personal. -### Target File — The Only File the Agent Edits -Whatever you're optimizing. Strict scope: **one file, one metric**. +### What Setup Creates -### `evaluate.py` — Fixed Evaluation (Never Modified) -The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured. 
+``` +.autoresearch/ +├── config.yaml ← Global settings +├── .gitignore ← Ignores results.tsv, *.log +└── {domain}/{experiment-name}/ + ├── program.md ← Objectives, constraints, strategy + ├── config.cfg ← Target, eval cmd, metric, direction + ├── results.tsv ← Experiment log (gitignored) + └── evaluate.py ← Evaluation script (if --evaluator used) +``` + +### Domains + +| Domain | Use Cases | +|--------|-----------| +| `engineering` | Code speed, memory, bundle size, test pass rate, build time | +| `marketing` | Headlines, social copy, email subjects, ad copy, engagement | +| `content` | Article structure, SEO descriptions, readability, CTR | +| `prompts` | System prompts, chatbot tone, agent instructions | +| `custom` | Anything else with a measurable metric | + +### If `program.md` Already Exists + +The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing. --- ## The Experiment Loop -Run: `python scripts/run_experiment.py --loop` +### Starting an Experiment + +```bash +# Run specific experiment +python scripts/run_experiment.py --experiment engineering/api-speed --loop + +# Single iteration (test setup) +python scripts/run_experiment.py --experiment engineering/api-speed --single + +# Resume last active experiment +python scripts/run_experiment.py --resume --loop + +# Dry run (show what would happen) +python scripts/run_experiment.py --experiment engineering/api-speed --dry-run +``` + +### The Loop Protocol ``` LOOP FOREVER: -1. Read program.md for current strategy -2. Review git history: what has been tried? What worked? -3. Propose ONE change to the target file -4. Apply the change -5. git commit (with descriptive message) -6. Run evaluation: python evaluate.py > run.log 2>&1 -7. Parse metric from run.log -8. If metric improved → ADVANCE (keep commit, log "keep") -9. If metric equal/worse → REVERT (git reset, log "discard") -10. 
If crash → attempt fix, if unfixable log "crash" and revert
-11. Update results.tsv
-12. Go to 1
+1. Read program.md for current strategy and constraints
+2. Review git log: what has been tried? What worked? What crashed?
+3. Review results.tsv: current best metric, trend, recent failures
+4. Propose ONE change to the target file
+5. Apply the change
+6. git commit -m "experiment: [short description of what changed]"
+7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
+8. Parse metric from run.log (grep for metric_name: value)
+9. Decision:
+   - Metric improved → KEEP (advance branch, log "keep")
+   - Metric equal or worse → REVERT (git reset --hard HEAD~1, log "discard")
+   - Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
+10. Append result to results.tsv
+11. Go to 1
```
-### Rules (from Karpathy's original)
+### Rules
-- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted.
-- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win.
-- **One change per experiment** — don't change 5 things at once. You won't know what worked.
-- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on.
-- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash.
-- **No new dependencies** — only use what's already available.
+- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
+- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
+- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets the same results is the best outcome.
+- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this. +- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat as crash. +- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert. +- **No new dependencies.** Only use what's already available in the project. --- -## Results Log +## Evaluators -`results.tsv` (tab-separated, not git-tracked): +Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`. +### Free Evaluators (no API cost) + +| Evaluator | Metric | Use Case | +|-----------|--------|----------| +| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time | +| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size | +| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage | +| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time | +| `memory_usage` | `peak_mb` (lower) | Peak memory during execution | + +### LLM Judge Evaluators (uses your subscription) + +| Evaluator | Metric | Use Case | +|-----------|--------|----------| +| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions | +| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions | +| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails | + +LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator. 
+
+The user's existing subscription covers the cost:
+- Claude Code Max → unlimited Claude calls for evaluation
+- Codex CLI (ChatGPT Pro) → unlimited Codex calls
+- Gemini CLI (free tier) → free evaluation calls
+
+### Custom Evaluators
+
+If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.
+
+```python
+#!/usr/bin/env python3
+# My custom evaluator — DO NOT MODIFY after experiment starts
+import json
+import subprocess
+
+result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
+# Parse the benchmark's output and print the metric line the loop greps for
+score = json.loads(result.stdout)["score"]  # adapt the key to your benchmark's JSON
+print(f"my_metric: {score}")
+```
-commit metric status description
-a1b2c3d 0.9979 keep baseline
-b2c3d4e 0.9932 keep increased learning rate
-c3d4e5f 1.0050 discard switched to GeLU (worse)
-d4e5f6g 0.0000 crash doubled model width (OOM)
-```
-
-Run `python scripts/log_results.py --summary` for a visual summary.
---
-## Domain-Specific Configurations
+## Viewing Results
-### ML Training (Karpathy-style)
-```yaml
-target: train.py
-evaluate: uv run prepare.py --eval-only
-metric: val_bpb (lower is better)
-time_budget: 5m
-git_branch: autoresearch/{date}-{tag}
+```bash
+# Single experiment
+python scripts/log_results.py --experiment engineering/api-speed
+
+# All experiments in a domain
+python scripts/log_results.py --domain engineering
+
+# Cross-experiment dashboard
+python scripts/log_results.py --dashboard
+
+# Export formats
+python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
+python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
+python scripts/log_results.py --dashboard --format markdown --output dashboard.md
+```
-### Prompt Engineering
-```yaml
-target: prompt.md
-evaluate: python evaluate.py --model gpt-4o --test-cases tests/
-metric: eval_score (0-100, higher is better)
-time_budget: 2m
-git_branch: prompt-research/{date}
+### Dashboard Output
+

+``` +DOMAIN EXPERIMENT RUNS KEPT BEST Δ FROM START STATUS +engineering api-speed 47 14 185ms -76.9% active +engineering bundle-size 23 8 412KB -58.3% paused +marketing medium-ctr 31 11 8.4/10 +68.0% active +prompts support-tone 15 6 82/100 +46.4% done ``` -### Code Performance -```yaml -target: src/hot_module.py -evaluate: python benchmark.py --runs 5 --warmup 1 -metric: p50_ms (lower is better) -time_budget: 10m -git_branch: perf-research/{date} -``` +### Export Formats -### Agent Skill Optimization -```yaml -target: SKILL.md -evaluate: python scripts/skill_evaluator.py --task-suite tests/ -metric: pass_rate (0-1, higher is better) -time_budget: 5m -git_branch: skill-research/{date} -``` - -See `references/experiment-domains.md` for full setup guides per domain. +- **TSV** — default, tab-separated (compatible with spreadsheets) +- **CSV** — comma-separated, with proper quoting +- **Markdown** — formatted table, readable in GitHub/docs --- -## Scripts +## Proactive Triggers -| Script | Purpose | -|--------|---------| -| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run | -| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) | -| `log_results.py` | Record results to TSV; `--summary` prints progress table | +Flag these without being asked: + +- **No evaluation command works** → Test it before starting the loop. Run once, verify output. +- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first. +- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting. +- **Time budget too short** → If eval takes longer than budget, every run crashes. +- **Agent modifying evaluate.py** → Hard stop. This invalidates all comparisons. +- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles. +- **No improvement in 20+ runs** → Suggest changing strategy in program.md or trying a different approach. 
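The first trigger — an untested evaluation command — is cheap to automate. A preflight sketch, run once before entering the loop (the command and metric below are placeholders, not project values):

```python
import re
import subprocess

def preflight(eval_cmd: str, metric: str, timeout: int = 600) -> bool:
    """Run the evaluation once; confirm it exits 0 and prints 'metric: value'."""
    proc = subprocess.run(eval_cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    output = proc.stdout + proc.stderr
    has_metric = re.search(rf"^{re.escape(metric)}:\s*[-+]?[0-9.]+", output, re.MULTILINE)
    return proc.returncode == 0 and has_metric is not None

# Placeholder command standing in for a real evaluator
assert preflight('python3 -c "print(\'p50_ms: 1.0\')"', "p50_ms")
```

If this returns `False` with the real eval command, fix the command (or the metric name) before starting `run_experiment.py --loop`.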
--- @@ -214,7 +253,6 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/ ### Multi-tool install ```bash -# Clone the repo, then use the convert script for your tool: ./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw ``` @@ -225,22 +263,9 @@ clawhub install autoresearch-agent --- -## Proactive Triggers - -Flag these issues without being asked: - -- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template. -- **Target file has no git history** → `git init` and commit baseline first. -- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting. -- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash. -- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions. -- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons. - ---- - ## Related Skills -- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics. -- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops. -- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops. -- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function. +- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops. +- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization. +- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function. +- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops. 
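Across all of these workflows, loop step 10 — appending one row per experiment to `results.tsv` — is the shared plumbing. A minimal sketch (the `log_result` helper name is illustrative; columns follow the commit/metric/status/description log format):

```python
import csv
from pathlib import Path

def log_result(tsv_path: str, commit: str, metric: float, status: str, description: str) -> None:
    """Append one experiment row to results.tsv, writing the header on first use."""
    path = Path(tsv_path)
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if write_header:
            writer.writerow(["commit", "metric", "status", "description"])
        writer.writerow([commit, f"{metric:.4f}", status, description])

# Demo rows mirroring the log format
demo = Path("/tmp/autoresearch_demo.tsv")
demo.unlink(missing_ok=True)
log_result(str(demo), "a1b2c3d", 0.9979, "keep", "baseline")
log_result(str(demo), "c3d4e5f", 1.0050, "discard", "switched to GeLU (worse)")
rows = demo.read_text().strip().splitlines()
assert rows[0] == "commit\tmetric\tstatus\tdescription"
```

Appending (rather than rewriting) keeps the log intact across sessions, which is what lets step 3 of the loop review past trends.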
diff --git a/engineering/autoresearch-agent/evaluators/benchmark_size.py b/engineering/autoresearch-agent/evaluators/benchmark_size.py new file mode 100644 index 0000000..1e5bfb1 --- /dev/null +++ b/engineering/autoresearch-agent/evaluators/benchmark_size.py @@ -0,0 +1,56 @@ +#!/usr/bin/env python3 +"""Measure file, bundle, or Docker image size. +DO NOT MODIFY after experiment starts — this is the fixed evaluator.""" + +import os +import subprocess +import sys + +# --- CONFIGURE ONE OF THESE --- +# Option 1: File size +TARGET_FILE = "dist/main.js" + +# Option 2: Directory size (uncomment to use) +# TARGET_DIR = "dist/" + +# Option 3: Docker image (uncomment to use) +# DOCKER_IMAGE = "myapp:latest" +# DOCKER_BUILD_CMD = "docker build -t myapp:latest ." + +# Option 4: Build first, then measure (uncomment to use) +# BUILD_CMD = "npm run build" +# --- END CONFIG --- + +# Build if needed +if "BUILD_CMD" in dir() or "BUILD_CMD" in globals(): + result = subprocess.run(BUILD_CMD, shell=True, capture_output=True) + if result.returncode != 0: + print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr) + sys.exit(1) + +# Measure +if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals(): + if "DOCKER_BUILD_CMD" in dir(): + subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True) + result = subprocess.run( + f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'", + shell=True, capture_output=True, text=True + ) + size_bytes = int(result.stdout.strip()) +elif "TARGET_DIR" in dir() or "TARGET_DIR" in globals(): + size_bytes = sum( + os.path.getsize(os.path.join(dp, f)) + for dp, _, fns in os.walk(TARGET_DIR) for f in fns + ) +elif os.path.exists(TARGET_FILE): + size_bytes = os.path.getsize(TARGET_FILE) +else: + print(f"Target not found: {TARGET_FILE}", file=sys.stderr) + sys.exit(1) + +size_kb = size_bytes / 1024 +size_mb = size_bytes / (1024 * 1024) + +print(f"size_bytes: {size_bytes}") +print(f"size_kb: {size_kb:.1f}") +print(f"size_mb: 
{size_mb:.2f}") diff --git a/engineering/autoresearch-agent/evaluators/benchmark_speed.py b/engineering/autoresearch-agent/evaluators/benchmark_speed.py new file mode 100644 index 0000000..9da6966 --- /dev/null +++ b/engineering/autoresearch-agent/evaluators/benchmark_speed.py @@ -0,0 +1,40 @@ +#!/usr/bin/env python3 +"""Measure execution speed of a target function or command. +DO NOT MODIFY after experiment starts — this is the fixed evaluator.""" + +import statistics +import subprocess +import sys +import time + +# --- CONFIGURE THESE --- +COMMAND = "python src/module.py" # Command to benchmark +RUNS = 5 # Number of runs +WARMUP = 1 # Warmup runs (not counted) +# --- END CONFIG --- + +times = [] + +# Warmup +for _ in range(WARMUP): + subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120) + +# Benchmark +for i in range(RUNS): + t0 = time.perf_counter() + result = subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120) + elapsed = (time.perf_counter() - t0) * 1000 # ms + + if result.returncode != 0: + print(f"Run {i+1} failed (exit {result.returncode})", file=sys.stderr) + print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr) + sys.exit(1) + + times.append(elapsed) + +p50 = statistics.median(times) +p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 5 else max(times) + +print(f"p50_ms: {p50:.2f}") +print(f"p95_ms: {p95:.2f}") +print(f"runs: {RUNS}") diff --git a/engineering/autoresearch-agent/evaluators/build_speed.py b/engineering/autoresearch-agent/evaluators/build_speed.py new file mode 100644 index 0000000..71dbaa0 --- /dev/null +++ b/engineering/autoresearch-agent/evaluators/build_speed.py @@ -0,0 +1,39 @@ +#!/usr/bin/env python3 +"""Measure build/compile time. +DO NOT MODIFY after experiment starts — this is the fixed evaluator.""" + +import subprocess +import sys +import time + +# --- CONFIGURE THESE --- +BUILD_CMD = "npm run build" # or: docker build -t test . 
+CLEAN_CMD = "" # optional: npm run clean (run before each build) +RUNS = 3 # Number of builds to average +# --- END CONFIG --- + +times = [] + +for i in range(RUNS): + # Clean if configured + if CLEAN_CMD: + subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60) + + t0 = time.perf_counter() + result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600) + elapsed = time.perf_counter() - t0 + + if result.returncode != 0: + print(f"Build {i+1} failed (exit {result.returncode})", file=sys.stderr) + print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr) + sys.exit(1) + + times.append(elapsed) + +import statistics +avg = statistics.mean(times) +median = statistics.median(times) + +print(f"build_seconds: {median:.2f}") +print(f"build_avg: {avg:.2f}") +print(f"runs: {RUNS}") diff --git a/engineering/autoresearch-agent/evaluators/llm_judge_content.py b/engineering/autoresearch-agent/evaluators/llm_judge_content.py new file mode 100644 index 0000000..795cbec --- /dev/null +++ b/engineering/autoresearch-agent/evaluators/llm_judge_content.py @@ -0,0 +1,72 @@ +#!/usr/bin/env python3 +"""LLM judge for content quality (headlines, titles, descriptions). +Uses the user's existing CLI tool (claude, codex, gemini) for evaluation. +DO NOT MODIFY after experiment starts — this is the fixed evaluator.""" + +import subprocess +import sys +from pathlib import Path + +# --- CONFIGURE THESE --- +TARGET_FILE = "content/titles.md" # File being optimized +CLI_TOOL = "claude" # or: codex, gemini +# --- END CONFIG --- + +# The judge prompt is FIXED — the agent cannot change how it's evaluated +JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly. + +Criteria (each scored 1-10): + +1. CURIOSITY GAP — Does this make you want to click? Is there an information gap + that can only be resolved by reading? Generic titles score 1-3. Specific, + intriguing titles score 7-10. + +2. 
SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved + performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9. + +3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out, + or recognition? Flat titles score 1-3. Emotionally charged score 7-10. + +4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or + search results? Would they pause on this headline? Rate honestly. + +5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally? + Keyword-stuffed = 3. Natural integration of search terms = 8-10. + +Output EXACTLY this format (nothing else): +curiosity: +specificity: +emotional: +scroll_stop: +seo: +ctr_score: + +Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+.""" + +content = Path(TARGET_FILE).read_text() +full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}" + +# Call the user's CLI tool +result = subprocess.run( + [CLI_TOOL, "-p", full_prompt], + capture_output=True, text=True, timeout=120 +) + +if result.returncode != 0: + print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr) + sys.exit(1) + +# Parse output — look for ctr_score line +output = result.stdout +for line in output.splitlines(): + line = line.strip() + if line.startswith("ctr_score:"): + print(line) + elif line.startswith(("curiosity:", "specificity:", "emotional:", "scroll_stop:", "seo:")): + print(line) + +# Verify ctr_score was found +if "ctr_score:" not in output: + print("Could not parse ctr_score from LLM output", file=sys.stderr) + print(f"Raw output: {output[:500]}", file=sys.stderr) + sys.exit(1) diff --git a/engineering/autoresearch-agent/evaluators/llm_judge_copy.py b/engineering/autoresearch-agent/evaluators/llm_judge_copy.py new file mode 100644 index 0000000..c074feb --- /dev/null +++ b/engineering/autoresearch-agent/evaluators/llm_judge_copy.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python3 +"""LLM judge for marketing copy 
(social posts, ads, emails). +Uses the user's existing CLI tool for evaluation. +DO NOT MODIFY after experiment starts — this is the fixed evaluator.""" + +import subprocess +import sys +from pathlib import Path + +# --- CONFIGURE THESE --- +TARGET_FILE = "posts.md" # Copy being optimized +CLI_TOOL = "claude" # or: codex, gemini +PLATFORM = "twitter" # twitter, linkedin, instagram, email, ad +# --- END CONFIG --- + +JUDGE_PROMPTS = { + "twitter": """Score this Twitter/X post strictly: +1. HOOK (1-10) — Does the first line stop the scroll? +2. VALUE (1-10) — Does it provide insight, entertainment, or utility? +3. ENGAGEMENT (1-10) — Would people reply, retweet, or like? +4. BREVITY (1-10) — Is every word earning its place? No filler? +5. CTA (1-10) — Is there a clear next action (even implicit)?""", + + "linkedin": """Score this LinkedIn post strictly: +1. HOOK (1-10) — Does the first line make you click "see more"? +2. STORYTELLING (1-10) — Is there a narrative arc or just statements? +3. CREDIBILITY (1-10) — Does it demonstrate expertise without bragging? +4. ENGAGEMENT (1-10) — Would professionals comment or share? +5. CTA (1-10) — Does it invite discussion or action?""", + + "instagram": """Score this Instagram caption strictly: +1. HOOK (1-10) — Does the first line grab attention? +2. RELATABILITY (1-10) — Does the audience see themselves in this? +3. VISUAL MATCH (1-10) — Does the copy complement visual content? +4. HASHTAG STRATEGY (1-10) — Are hashtags relevant and not spammy? +5. CTA (1-10) — Does it encourage saves, shares, or comments?""", + + "email": """Score this email subject + preview strictly: +1. OPEN INCENTIVE (1-10) — Would you open this in a crowded inbox? +2. SPECIFICITY (1-10) — Is it concrete or vague? +3. URGENCY (1-10) — Is there a reason to open now vs later? +4. PERSONALIZATION (1-10) — Does it feel written for someone, not everyone? +5. 
+PREVIEW SYNC (1-10) — Does the preview text complement the subject?""",
+
+    "ad": """Score this ad copy strictly:
+1. ATTENTION (1-10) — Does it stop someone scrolling past ads?
+2. DESIRE (1-10) — Does it create want for the product/service?
+3. PROOF (1-10) — Is there credibility (numbers, social proof)?
+4. ACTION (1-10) — Is the CTA clear and compelling?
+5. OBJECTION HANDLING (1-10) — Does it preempt "why not"?""",
+}
+
+platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
+
+JUDGE_PROMPT = f"""{platform_prompt}
+
+Output EXACTLY this format:
+criterion_1: <score>
+criterion_2: <score>
+criterion_3: <score>
+criterion_4: <score>
+criterion_5: <score>
+engagement_score: <score>
+
+Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""
+
+content = Path(TARGET_FILE).read_text()
+full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"
+
+result = subprocess.run(
+    [CLI_TOOL, "-p", full_prompt],
+    capture_output=True, text=True, timeout=120
+)
+
+if result.returncode != 0:
+    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
+    sys.exit(1)
+
+output = result.stdout
+for line in output.splitlines():
+    line = line.strip()
+    if line.startswith("engagement_score:") or line.startswith("criterion_"):
+        print(line)
+
+if "engagement_score:" not in output:
+    print("Could not parse engagement_score from LLM output", file=sys.stderr)
+    print(f"Raw: {output[:500]}", file=sys.stderr)
+    sys.exit(1)
diff --git a/engineering/autoresearch-agent/evaluators/llm_judge_prompt.py b/engineering/autoresearch-agent/evaluators/llm_judge_prompt.py
new file mode 100644
index 0000000..79dfbc5
--- /dev/null
+++ b/engineering/autoresearch-agent/evaluators/llm_judge_prompt.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python3
+"""LLM judge for prompt/instruction quality.
+Uses the user's existing CLI tool for evaluation.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+# --- CONFIGURE THESE ---
+TARGET_FILE = "prompt.md"             # Prompt being optimized
+TEST_CASES_FILE = "tests/cases.json"  # Test cases: [{"input": "...", "expected": "..."}]
+CLI_TOOL = "claude"                   # or: codex, gemini
+# --- END CONFIG ---
+
+JUDGE_PROMPT_TEMPLATE = """You are evaluating a system prompt's effectiveness.
+
+SYSTEM PROMPT BEING TESTED:
+{prompt}
+
+TEST INPUT:
+{input}
+
+EXPECTED OUTPUT (reference):
+{expected}
+
+ACTUAL OUTPUT:
+{actual}
+
+Score the actual output on these criteria (each 1-10):
+1. ACCURACY — Does it match the expected output's intent and facts?
+2. COMPLETENESS — Does it cover all required elements?
+3. CLARITY — Is it well-structured and easy to understand?
+4. INSTRUCTION_FOLLOWING — Does it follow the system prompt's guidelines?
+
+Output EXACTLY: quality_score: <score>
+Nothing else."""
+
+prompt = Path(TARGET_FILE).read_text()
+test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
+
+scores = []
+
+for i, case in enumerate(test_cases):
+    # Generate output using the prompt
+    gen_prompt = f"{prompt}\n\n{case['input']}"
+    gen_result = subprocess.run(
+        [CLI_TOOL, "-p", gen_prompt],
+        capture_output=True, text=True, timeout=60
+    )
+    if gen_result.returncode != 0:
+        print(f"Generation failed for case {i+1}", file=sys.stderr)
+        scores.append(0)
+        continue
+
+    actual = gen_result.stdout.strip()
+
+    # Judge the output
+    judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
+        prompt=prompt[:500],
+        input=case["input"],
+        expected=case.get("expected", "N/A"),
+        actual=actual[:500]
+    )
+
+    judge_result = subprocess.run(
+        [CLI_TOOL, "-p", judge_prompt],
+        capture_output=True, text=True, timeout=60
+    )
+
+    if judge_result.returncode != 0:
+        scores.append(0)
+        continue
+
+    # Parse score
+    for line in judge_result.stdout.splitlines():
+        if "quality_score:" in line:
+            try:
+                score = float(line.split(":")[-1].strip())
+                scores.append(score)
+            except ValueError:
+                scores.append(0)
+            break
+    else:
+        scores.append(0)
+
+    print(f"  Case {i+1}/{len(test_cases)}: {scores[-1]:.1f}", file=sys.stderr)
+
+if not scores:
+    print("No test cases evaluated", file=sys.stderr)
+    sys.exit(1)
+
+avg = sum(scores) / len(scores)
+quality = avg * 10  # Scale to 0-100
+
+print(f"quality_score: {quality:.2f}")
+print(f"cases_tested: {len(scores)}")
+print(f"avg_per_case: {avg:.2f}")
diff --git a/engineering/autoresearch-agent/evaluators/memory_usage.py b/engineering/autoresearch-agent/evaluators/memory_usage.py
new file mode 100644
index 0000000..6a19649
--- /dev/null
+++ b/engineering/autoresearch-agent/evaluators/memory_usage.py
@@ -0,0 +1,52 @@
+#!/usr/bin/env python3
+"""Measure peak memory usage of a command.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import os
+import platform
+import subprocess
+import sys
+
+# --- CONFIGURE THESE ---
+COMMAND = "python src/module.py"  # Command to measure
+# --- END CONFIG ---
+
+system = platform.system()
+
+if system == "Linux":
+    # Use /usr/bin/time for peak RSS
+    result = subprocess.run(
+        f"/usr/bin/time -v {COMMAND}",
+        shell=True, capture_output=True, text=True, timeout=300
+    )
+    output = result.stderr
+    for line in output.splitlines():
+        if "Maximum resident set size" in line:
+            kb = int(line.split(":")[-1].strip())
+            mb = kb / 1024
+            print(f"peak_mb: {mb:.1f}")
+            print(f"peak_kb: {kb}")
+            sys.exit(0)
+    print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
+    sys.exit(1)
+
+elif system == "Darwin":
+    # macOS: use /usr/bin/time -l
+    result = subprocess.run(
+        f"/usr/bin/time -l {COMMAND}",
+        shell=True, capture_output=True, text=True, timeout=300
+    )
+    output = result.stderr
+    for line in output.splitlines():
+        if "maximum resident set size" in line.lower():
+            # macOS reports in bytes
+            val = int(line.strip().split()[0])
+            mb = val / (1024 * 1024)
+            print(f"peak_mb: {mb:.1f}")
+            sys.exit(0)
+    print("Could not parse memory from time output", file=sys.stderr)
+    sys.exit(1)
+
+else:
+    print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
+    sys.exit(1)
diff --git a/engineering/autoresearch-agent/evaluators/test_pass_rate.py b/engineering/autoresearch-agent/evaluators/test_pass_rate.py
new file mode 100644
index 0000000..de421bc
--- /dev/null
+++ b/engineering/autoresearch-agent/evaluators/test_pass_rate.py
@@ -0,0 +1,55 @@
+#!/usr/bin/env python3
+"""Measure test suite pass rate.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import re
+import subprocess
+import sys
+
+# --- CONFIGURE THESE ---
+TEST_CMD = "pytest tests/ --tb=no -q"  # Test command
+# --- END CONFIG ---
+
+result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
+output = result.stdout + "\n" + result.stderr
+
+# Try to parse pytest output: "X passed, Y failed, Z errors"
+passed = failed = errors = 0
+
+# pytest short format: "5 passed, 2 failed in 1.23s"
+match = re.search(r"(\d+) passed", output)
+if match:
+    passed = int(match.group(1))
+match = re.search(r"(\d+) failed", output)
+if match:
+    failed = int(match.group(1))
+match = re.search(r"(\d+) error", output)
+if match:
+    errors = int(match.group(1))
+
+total = passed + failed + errors
+if total == 0:
+    # Try unittest format: "Ran X tests"
+    match = re.search(r"Ran (\d+) test", output)
+    if match:
+        total = int(match.group(1))
+        if result.returncode == 0:
+            passed = total
+        else:
+            # Count failures from output
+            fail_match = re.search(r"FAILED \(failures=(\d+)", output)
+            if fail_match:
+                failed = int(fail_match.group(1))
+                passed = total - failed
+
+if total == 0:
+    print("Could not parse test results", file=sys.stderr)
+    print(f"Output: {output[:500]}", file=sys.stderr)
+    sys.exit(1)
+
+rate = passed / total
+
+print(f"pass_rate: {rate:.4f}")
+print(f"passed: {passed}")
+print(f"failed: {failed}")
+print(f"total: {total}")
diff --git a/engineering/autoresearch-agent/references/experiment-domains.md b/engineering/autoresearch-agent/references/experiment-domains.md
index 3b79dc3..1ac50aa 100644
--- a/engineering/autoresearch-agent/references/experiment-domains.md
+++ b/engineering/autoresearch-agent/references/experiment-domains.md
@@ -1,175 +1,255 @@
 # Experiment Domains Guide
 
-## Domain 1: ML Training (Karpathy-style)
+## Domain: Engineering
 
-**Best for:** LLM/neural net training optimization on a single GPU
+### Code Speed Optimization
 
-**Requirements:**
-- NVIDIA GPU (H100 recommended, A100/RTX also work)
-- CUDA + PyTorch
-- `uv` package manager
-- ~50GB disk (training data)
-
-**Setup:**
 ```bash
-# Clone autoresearch repo (the ML training environment)
-git clone https://github.com/karpathy/autoresearch my-ml-research
-cd my-ml-research
-uv sync
-uv run prepare.py  # one-time data download + tokenizer (~2 min)
-
-# Initialize autoresearch skill
-cp -r ~/.claude/skills/autoresearch-agent/scripts ./scripts
-
-# Configure
-python scripts/setup_experiment.py --domain ml --tag mar13
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name api-speed \
+  --target src/api/search.py \
+  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
+  --metric p50_ms \
+  --direction lower \
+  --evaluator benchmark_speed
 ```
 
-**Metric:** `val_bpb` — validation bits per byte. Lower = better model.
+**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.
+**Cost:** Free — just runs benchmarks.
+**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.
 
-**What the agent can change in `train.py`:**
-- Model depth, width, attention heads
-- Learning rate, scheduler, warmup
-- Optimizer (Muon, AdamW, variants)
-- Batch size, gradient accumulation
-- Architecture (attention patterns, FFN type)
+### Bundle Size Reduction
 
-**Tip for smaller GPUs (Mac M-series, RTX 3090 etc):**
-Karpathy recommends forks for non-H100 hardware. Lower `DEPTH` to 4, use TinyStories dataset, lower `MAX_SEQ_LEN` to 256.
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name bundle-size \
+  --target webpack.config.js \
+  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
+  --metric size_bytes \
+  --direction lower \
+  --evaluator benchmark_size
+```
+
+Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.
+
+### Test Pass Rate
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name fix-flaky-tests \
+  --target src/utils/parser.py \
+  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
+  --metric pass_rate \
+  --direction higher \
+  --evaluator test_pass_rate
+```
+
+### Docker Build Speed
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name docker-build \
+  --target Dockerfile \
+  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
+  --metric build_seconds \
+  --direction lower \
+  --evaluator build_speed
+```
+
+### Memory Optimization
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name memory-usage \
+  --target src/processor.py \
+  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
+  --metric peak_mb \
+  --direction lower \
+  --evaluator memory_usage
+```
+
+### ML Training (Karpathy-style)
+
+Requires an NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).
+
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name ml-training \
+  --target train.py \
+  --eval "uv run train.py" \
+  --metric val_bpb \
+  --direction lower \
+  --time-budget 5
+```
 
 ---
 
-## Domain 2: Prompt Engineering
+## Domain: Marketing
 
-**Best for:** Optimizing system prompts for quality/accuracy/tone
+### Medium Article Headlines
 
-**Requirements:**
-- LLM API access (OpenAI, Anthropic, etc.)
-- Test cases with expected outputs
-- An LLM judge for scoring (can be same model)
-
-**Setup:**
 ```bash
-mkdir my-prompt-research && cd my-prompt-research
-git init
-
-# Create prompt.md (the thing being optimized)
-echo "You are a helpful assistant." > prompt.md
-
-# Create evaluate.py (fixed — never modify)
-cat > evaluate.py << 'EOF'
-#!/usr/bin/env python3
-# Fixed evaluation harness — DO NOT MODIFY
-import json, sys
-from pathlib import Path
-
-PROMPT = Path("prompt.md").read_text()
-# Load test cases
-TEST_CASES = json.loads(Path("tests/cases.json").read_text())
-
-# Run prompt against test cases, score with LLM judge
-# ... (customize for your LLM + scoring logic)
-total = sum(score_case(PROMPT, case) for case in TEST_CASES)
-score = total / len(TEST_CASES) * 100
-print(f"eval_score: {score:.2f}")
-EOF
-
-# Initialize
-python scripts/setup_experiment.py --domain prompt --tag mar13
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name medium-ctr \
+  --target content/titles.md \
+  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content
 ```
 
-**Metric:** `eval_score` (0-100). Higher = better prompt.
+Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.
+
+**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.
+**Cost:** Uses your CLI subscription (Claude Max = unlimited).
+**Speed:** ~2 min/experiment, ~30/hour.
+
+### Social Media Copy
+
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name twitter-engagement \
+  --target social/tweets.md \
+  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
+  --metric engagement_score \
+  --direction higher \
+  --evaluator llm_judge_copy
+```
+
+Edit `evaluate.py`: set `PLATFORM = "twitter"` (or `linkedin`, `instagram`).
+
+### Email Subject Lines
+
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name email-open-rate \
+  --target emails/subjects.md \
+  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
+  --metric engagement_score \
+  --direction higher \
+  --evaluator llm_judge_copy
+```
+
+Edit `evaluate.py`: set `PLATFORM = "email"`.
+
+### Ad Copy
+
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name ad-copy-q2 \
+  --target ads/google-search.md \
+  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
+  --metric engagement_score \
+  --direction higher \
+  --evaluator llm_judge_copy
+```
+
+Edit `evaluate.py`: set `PLATFORM = "ad"`.
 
 ---
 
-## Domain 3: Code Performance
+## Domain: Content
 
-**Best for:** Optimizing a specific hot module for speed
+### Article Structure & Readability
 
-**Requirements:**
-- A Python module with measurable performance
-- Existing tests (correctness must not regress)
-- A benchmark harness
-
-**Setup:**
 ```bash
-cd my-project
-
-# Create benchmark.py (fixed — never modify)
-cat > benchmark.py << 'EOF'
-#!/usr/bin/env python3
-# Fixed benchmark — DO NOT MODIFY
-import time, statistics
-from src.module import your_function
-from tests.test_module import run_tests
-
-# Correctness check first
-if not run_tests():
-    print("TESTS FAILED")
-    sys.exit(1)
-
-# Benchmark
-data = generate_test_data(n=10000)
-times = []
-for _ in range(10):
-    t0 = time.perf_counter()
-    your_function(data)
-    times.append((time.perf_counter() - t0) * 1000)
-
-p50 = statistics.median(times)
-print(f"p50_ms: {p50:.2f}")
-print(f"p95_ms: {statistics.quantiles(times, n=20)[18]:.2f}")
-EOF
-
-python scripts/setup_experiment.py --domain code \
-  --target src/module.py \
-  --tag mar13
+python scripts/setup_experiment.py \
+  --domain content \
+  --name article-structure \
+  --target drafts/my-article.md \
+  --eval "python .autoresearch/content/article-structure/evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content
 ```
 
-**Metric:** `p50_ms` — median latency. Lower = faster.
+### SEO Descriptions
+
+```bash
+python scripts/setup_experiment.py \
+  --domain content \
+  --name seo-meta \
+  --target seo/descriptions.md \
+  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content
+```
 
 ---
 
-## Domain 4: Agent Skill Optimization
+## Domain: Prompts
 
-**Best for:** Improving the quality of claude-skills SKILL.md files
+### System Prompt Optimization
 
-**Requirements:**
-- A SKILL.md to optimize
-- A task evaluation suite (15-20 standardized tasks)
-- An LLM judge for scoring
-
-**Setup:**
 ```bash
-# Create a new skill research project
-mkdir skill-research-{skill-name} && cd skill-research-{skill-name}
-git init
-
-# Copy the skill to optimize
-cp ~/.claude/skills/{skill-name}/SKILL.md .
-
-# Create evaluate.py
-cat > scripts/skill_evaluator.py << 'EOF'
-#!/usr/bin/env python3
-# Fixed evaluator — DO NOT MODIFY
-# Runs SKILL.md against 15 standardized tasks using LLM judge
-# Outputs: pass_rate: 0.80 (etc.)
-EOF
-
-python scripts/setup_experiment.py --domain skill --tag mar13
+python scripts/setup_experiment.py \
+  --domain prompts \
+  --name support-bot \
+  --target prompts/support-system.md \
+  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
+  --metric quality_score \
+  --direction higher \
+  --evaluator llm_judge_prompt
 ```
 
-**Metric:** `pass_rate` (0-1). Higher = better skill.
+Requires `tests/cases.json` with test inputs and expected outputs:
+
+```json
+[
+  {
+    "input": "I can't log in to my account",
+    "expected": "Ask for email, check account status, offer password reset"
+  },
+  {
+    "input": "How do I cancel my subscription?",
+    "expected": "Empathetic response, explain cancellation steps, offer retention"
+  }
+]
+```
+
+### Agent Skill Optimization
+
+```bash
+python scripts/setup_experiment.py \
+  --domain prompts \
+  --name skill-improvement \
+  --target SKILL.md \
+  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
+  --metric quality_score \
+  --direction higher \
+  --evaluator llm_judge_prompt
+```
 
 ---
 
 ## Choosing Your Domain
 
-| Question | Recommendation |
-|----------|---------------|
-| Do I have a GPU and want to improve an LLM? | ML Training |
-| Do I want to improve a prompt/system message? | Prompt Engineering |
-| Do I have slow Python code I want to speed up? | Code Performance |
-| Do I want to improve one of my claude-skills? | Skill Optimization |
+| I want to... | Domain | Evaluator | Cost |
+|--------------|--------|-----------|------|
+| Speed up my code | engineering | benchmark_speed | Free |
+| Shrink my bundle | engineering | benchmark_size | Free |
+| Fix flaky tests | engineering | test_pass_rate | Free |
+| Speed up Docker builds | engineering | build_speed | Free |
+| Reduce memory usage | engineering | memory_usage | Free |
+| Train ML models | engineering | (custom) | Free + GPU |
+| Write better headlines | marketing | llm_judge_content | Subscription |
+| Improve social posts | marketing | llm_judge_copy | Subscription |
+| Optimize email subjects | marketing | llm_judge_copy | Subscription |
+| Improve ad copy | marketing | llm_judge_copy | Subscription |
+| Optimize article structure | content | llm_judge_content | Subscription |
+| Improve SEO descriptions | content | llm_judge_content | Subscription |
+| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
+| Improve agent skills | prompts | llm_judge_prompt | Subscription |
 
-**First time?** Start with **Prompt Engineering** — no GPU required, fast experiments (2 min each), immediately applicable results.
+**First time?** Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.
diff --git a/engineering/autoresearch-agent/scripts/log_results.py b/engineering/autoresearch-agent/scripts/log_results.py
index 0186805..51b4d10 100644
--- a/engineering/autoresearch-agent/scripts/log_results.py
+++ b/engineering/autoresearch-agent/scripts/log_results.py
@@ -1,125 +1,389 @@
 #!/usr/bin/env python3
 """
-autoresearch-agent: Results Logger
+autoresearch-agent: Results Viewer
 
-View and analyze experiment results from results.tsv.
+View experiment results in multiple formats: terminal, CSV, Markdown.
+Supports single experiment, domain, or cross-experiment dashboard.
 
 Usage:
-    python scripts/log_results.py --summary    # Print progress table
-    python scripts/log_results.py --best       # Show best result
-    python scripts/log_results.py --history    # Full experiment history
-    python scripts/log_results.py --record commit val status desc  # Add entry manually
+    python scripts/log_results.py --experiment engineering/api-speed
+    python scripts/log_results.py --domain engineering
+    python scripts/log_results.py --dashboard
+    python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
+    python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
+    python scripts/log_results.py --dashboard --format markdown --output dashboard.md
 """
 
 import argparse
+import csv
+import io
 import sys
 from pathlib import Path
 
 
-def load_results(path):
-    tsv = Path(path) / "results.tsv"
+def find_autoresearch_root():
+    """Find .autoresearch/ in project or user home."""
+    project_root = Path(".").resolve() / ".autoresearch"
+    if project_root.exists():
+        return project_root
+    user_root = Path.home() / ".autoresearch"
+    if user_root.exists():
+        return user_root
+    return None
+
+
+def load_config(experiment_dir):
+    """Load config.cfg."""
+    cfg_file = experiment_dir / "config.cfg"
+    config = {}
+    if cfg_file.exists():
+        for line in cfg_file.read_text().splitlines():
+            if ":" in line:
+                k, v = line.split(":", 1)
+                config[k.strip()] = v.strip()
+    return config
+
+
+def load_results(experiment_dir):
+    """Load results.tsv into list of dicts."""
+    tsv = experiment_dir / "results.tsv"
     if not tsv.exists():
         return []
-    lines = tsv.read_text().splitlines()[1:]  # skip header
     results = []
-    for line in lines:
+    for line in tsv.read_text().splitlines()[1:]:
         parts = line.split("\t")
         if len(parts) >= 4:
             try:
-                metric_val = float(parts[1]) if parts[1] != "N/A" else None
+                metric = float(parts[1]) if parts[1] != "N/A" else None
             except ValueError:
-                metric_val = None
+                metric = None
             results.append({
                 "commit": parts[0],
-                "metric": metric_val,
+                "metric": metric,
                 "status": parts[2],
-                "description": parts[3]
+                "description": parts[3],
             })
     return results
 
 
-def print_summary(results, metric_name="metric", direction="lower"):
-    if not results:
-        print("No experiments logged yet.")
-        return
-
+def compute_stats(results, direction):
+    """Compute statistics from results."""
     keeps = [r for r in results if r["status"] == "keep"]
     discards = [r for r in results if r["status"] == "discard"]
     crashes = [r for r in results if r["status"] == "crash"]
 
-    print(f"\n{'─'*60}")
-    print(f"  autoresearch-agent — Results Summary")
-    print(f"{'─'*60}")
-    print(f"  Total experiments: {len(results)}")
-    print(f"  ✅ Keep:    {len(keeps):3d} ({len(keeps)/max(len(results),1)*100:.0f}%)")
-    print(f"  ❌ Discard: {len(discards):3d} ({len(discards)/max(len(results),1)*100:.0f}%)")
-    print(f"  💥 Crash:   {len(crashes):3d} ({len(crashes)/max(len(results),1)*100:.0f}%)")
+    valid_keeps = [r for r in keeps if r["metric"] is not None]
+    baseline = valid_keeps[0]["metric"] if valid_keeps else None
+    if valid_keeps:
+        best = min(r["metric"] for r in valid_keeps) if direction == "lower" else max(r["metric"] for r in valid_keeps)
+    else:
+        best = None
 
-    if keeps:
-        valid = [r for r in keeps if r["metric"] is not None]
-        if valid:
-            baseline = valid[0]["metric"]
-            best = min(r["metric"] for r in valid) if direction == "lower" else max(r["metric"] for r in valid)
-            best_run = next(r for r in valid if r["metric"] == best)
-            improvement = ((baseline - best) / baseline * 100) if direction == "lower" else ((best - baseline) / baseline * 100)
+    # Explicit None checks so a legitimate 0.0 metric is not skipped
+    pct_change = None
+    if baseline is not None and best is not None and baseline != 0:
+        if direction == "lower":
+            pct_change = (baseline - best) / baseline * 100
+        else:
+            pct_change = (best - baseline) / baseline * 100
 
-            print(f"\n  {metric_name}:")
-            print(f"    Baseline: {baseline:.6f}")
-            print(f"    Best:     {best:.6f} (commit: {best_run['commit']})")
-            print(f"    Change:   {improvement:+.2f}%")
-
-    print(f"{'─'*60}\n")
+    return {
+        "total": len(results),
+        "keeps": len(keeps),
+        "discards": len(discards),
+        "crashes": len(crashes),
+        "baseline": baseline,
+        "best": best,
+        "pct_change": pct_change,
+    }
 
 
-def print_history(results):
+# --- Terminal Output ---
+
+def print_experiment(experiment_dir, experiment_path):
+    """Print single experiment results to terminal."""
+    config = load_config(experiment_dir)
+    results = load_results(experiment_dir)
+    direction = config.get("metric_direction", "lower")
+    metric_name = config.get("metric", "metric")
+
     if not results:
-        print("No experiments logged yet.")
+        print(f"No results for {experiment_path}")
         return
 
-    print(f"\n{'COMMIT':8} {'METRIC':10} {'STATUS':8} DESCRIPTION")
-    print("─" * 60)
-    for r in results:
-        metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash   "
-        status_icon = {"keep": "✅", "discard": "❌", "crash": "💥"}.get(r["status"], "?")
-        print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}")
+    stats = compute_stats(results, direction)
+    print(f"\n{'─' * 65}")
+    print(f"  {experiment_path}")
+    print(f"  Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
+    print(f"{'─' * 65}")
+    print(f"  Total: {stats['total']} | Keep: {stats['keeps']} | Discard: {stats['discards']} | Crash: {stats['crashes']}")
+
+    if stats["baseline"] is not None and stats["best"] is not None:
+        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
+        print(f"  Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")
+
+    print(f"\n  {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
+    print(f"  {'─' * 60}")
+    for r in results:
+        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A      "
+        icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
+        print(f"  {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
+    print()
+
+
+def print_dashboard(root):
+    """Print cross-experiment dashboard."""
+    experiments = []
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
+                continue
+            config = load_config(exp_dir)
+            results = load_results(exp_dir)
+            direction = config.get("metric_direction", "lower")
+            stats = compute_stats(results, direction)
+
+            # Determine status from how recently results.tsv was touched
+            status = "idle"
+            if stats["total"] > 0:
+                tsv = exp_dir / "results.tsv"
+                if tsv.exists():
+                    import time
+                    age_hours = (time.time() - tsv.stat().st_mtime) / 3600
+                    status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"
+
+            best_str = f"{stats['best']:.4f}" if stats["best"] is not None else "—"
+            pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "—"
+
+            experiments.append({
+                "domain": domain_dir.name,
+                "name": exp_dir.name,
+                "runs": stats["total"],
+                "kept": stats["keeps"],
+                "best": best_str,
+                "change": pct_str,
+                "status": status,
+                "metric": config.get("metric", "?"),
+            })
+
+    if not experiments:
+        print("No experiments found.")
+        return experiments
+
+    print(f"\n{'─' * 90}")
+    print(f"  autoresearch — Dashboard")
+    print(f"{'─' * 90}")
+    print(f"  {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} {'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
+    print(f"  {'─' * 85}")
+    for e in experiments:
+        print(f"  {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} {e['best']:>12} {e['change']:>10} {e['status']:<8}")
+    print()
+    return experiments
+
+
+# --- CSV Export ---
+
+def export_experiment_csv(experiment_dir, experiment_path):
+    """Export single experiment as CSV string."""
+    config = load_config(experiment_dir)
+    results = load_results(experiment_dir)
+    direction = config.get("metric_direction", "lower")
+    stats = compute_stats(results, direction)
+
+    buf = io.StringIO()
+    writer = csv.writer(buf)
+
+    # Header with metadata
+    writer.writerow(["# Experiment", experiment_path])
+    writer.writerow(["# Target", config.get("target", "")])
+    writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
+    if stats["baseline"] is not None:
+        writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
+    if stats["best"] is not None:
+        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
+        writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
+    writer.writerow(["# Total", stats["total"]])
+    writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
+    writer.writerow([])
+
+    writer.writerow(["Commit", "Metric", "Status", "Description"])
+    for r in results:
+        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
+        writer.writerow([r["commit"], m, r["status"], r["description"]])
+
+    return buf.getvalue()
+
+
+def export_dashboard_csv(root):
+    """Export dashboard as CSV string."""
+    experiments = []
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
+                continue
+            config = load_config(exp_dir)
+            results = load_results(exp_dir)
+            direction = config.get("metric_direction", "lower")
+            stats = compute_stats(results, direction)
+            best_str = f"{stats['best']:.6f}" if stats["best"] is not None else ""
+            pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
+            experiments.append([
+                domain_dir.name, exp_dir.name, config.get("metric", ""),
+                stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
+                best_str, pct_str
+            ])
+
+    buf = io.StringIO()
+    writer = csv.writer(buf)
+    writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept", "Discarded", "Crashed", "Best", "Change"])
+    for e in experiments:
+        writer.writerow(e)
+    return buf.getvalue()
+
+
+# --- Markdown Export ---
+
+def export_experiment_markdown(experiment_dir, experiment_path):
+    """Export single experiment as Markdown string."""
+    config = load_config(experiment_dir)
+    results = load_results(experiment_dir)
+    direction = config.get("metric_direction", "lower")
+    metric_name = config.get("metric", "metric")
+    stats = compute_stats(results, direction)
+
+    lines = []
+    lines.append(f"# Autoresearch: {experiment_path}\n")
+    lines.append(f"**Target:** `{config.get('target', '?')}`  ")
+    lines.append(f"**Metric:** `{metric_name}` ({direction} is better)  ")
+    lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")
+
+    if stats["baseline"] is not None and stats["best"] is not None:
+        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
+        lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
+
+    lines.append("| Commit | Metric | Status | Description |")
+    lines.append("|--------|--------|--------|-------------|")
+    for r in results:
+        m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
+        lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
+    lines.append("")
+
+    return "\n".join(lines)
+
+
+def export_dashboard_markdown(root):
+    """Export dashboard as Markdown string."""
+    lines = []
+    lines.append("# Autoresearch Dashboard\n")
+    lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
+    lines.append("|--------|-----------|--------|------|------|------|--------|--------|")
+
+    for domain_dir in sorted(root.iterdir()):
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
+                continue
+            config = load_config(exp_dir)
+            results = load_results(exp_dir)
+            direction = config.get("metric_direction", "lower")
+            stats = compute_stats(results, direction)
+            best = f"`{stats['best']:.4f}`" if stats["best"] is not None else "—"
+            pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "—"
+
+            import time
+            tsv = exp_dir / "results.tsv"
+            status = "idle"
+            if tsv.exists() and stats["total"] > 0:
+                age_h = (time.time() - tsv.stat().st_mtime) / 3600
+                status = "active" if age_h < 1 else "paused" if age_h < 24 else "done"
+
+            lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | {stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")
+
+    lines.append("")
+    return "\n".join(lines)
+
+
+# --- Main ---
 
 def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--summary", action="store_true")
-    parser.add_argument("--best", action="store_true")
-    parser.add_argument("--history", action="store_true")
-    parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC"))
-    parser.add_argument("--path", default=".")
-    parser.add_argument("--metric", default="metric")
-    parser.add_argument("--direction", default="lower", choices=["lower", "higher"])
+    parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
+    parser.add_argument("--experiment", help="Show one experiment: domain/name")
+    parser.add_argument("--domain", help="Show all experiments in a domain")
+    parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
+    parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
+                        help="Output format (default: terminal)")
+    parser.add_argument("--output", "-o", help="Write to file instead of stdout")
+    parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
     args = parser.parse_args()
 
-    path = Path(args.path).resolve()
+    root = find_autoresearch_root()
+    if root is None:
+        print("No .autoresearch/ found. Run setup_experiment.py first.")
+        sys.exit(1)
 
-    if args.record:
-        commit, metric, status, desc = args.record
-        tsv = path / "results.tsv"
-        if not tsv.exists():
-            tsv.write_text("commit\tmetric\tstatus\tdescription\n")
-        with open(tsv, "a") as f:
-            f.write(f"{commit}\t{metric}\t{status}\t{desc}\n")
-        print(f"✓ Logged: {commit} {metric} {status}")
-        return
+    output_text = None
 
-    results = load_results(path)
+    # Single experiment
+    if args.experiment:
+        experiment_dir = root / args.experiment
+        if not experiment_dir.exists():
+            print(f"Experiment not found: {args.experiment}")
+            sys.exit(1)
 
-    if args.history:
-        print_history(results)
-    elif args.best:
-        keeps = [r for r in results if r["status"] == "keep" and r["metric"]]
-        if not keeps:
-            print("No successful experiments yet.")
+        if args.format == "csv":
+            output_text = export_experiment_csv(experiment_dir, args.experiment)
+        elif args.format == "markdown":
+            output_text = export_experiment_markdown(experiment_dir, args.experiment)
+        else:
+            print_experiment(experiment_dir, args.experiment)
             return
-        best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"])
-        print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}")
+
+    # Domain
+    elif args.domain:
+        domain_dir = root / args.domain
+        if not domain_dir.exists():
+            print(f"Domain not found: {args.domain}")
+            sys.exit(1)
+        for exp_dir in sorted(domain_dir.iterdir()):
+            if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
+                if args.format == "terminal":
+                    print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
+        if args.format != "terminal":
+            # CSV/MD: export the full dashboard (not yet filtered to this domain)
+            output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
+        else:
+            return
+
+    # Dashboard
+    elif args.dashboard or args.all:
+        if args.format == "csv":
+            output_text = export_dashboard_csv(root)
+        elif args.format == "markdown":
+            output_text = export_dashboard_markdown(root)
+        else:
+            print_dashboard(root)
+            return
 
     else:
-        print_summary(results, args.metric, args.direction)
+        # Default: dashboard
+        if args.format == "terminal":
+            print_dashboard(root)
+            return
+        output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
+
+    # Write output
+    if output_text:
+        if args.output:
+            Path(args.output).write_text(output_text)
+            print(f"Written to {args.output}")
+        else:
+            print(output_text)
 
 
 if __name__ == "__main__":
diff --git a/engineering/autoresearch-agent/scripts/run_experiment.py b/engineering/autoresearch-agent/scripts/run_experiment.py
index eb6a93d..b8264ea 100644
--- a/engineering/autoresearch-agent/scripts/run_experiment.py
+++ b/engineering/autoresearch-agent/scripts/run_experiment.py
@@ -2,17 +2,15 @@
 """
 autoresearch-agent: Experiment Runner
 
-Executes the autonomous experiment loop:
-- Reads .autoresearch.cfg for project config
-- Runs the target evaluation
-- Keeps improvements (git commit) or discards failures (git reset)
-- Logs everything to results.tsv
-- Loops indefinitely until interrupted
+Executes the autonomous experiment loop for a specific experiment.
+Reads config from .autoresearch/{domain}/{name}/config.cfg.
 

 Usage:
-    python scripts/run_experiment.py --loop       # Run forever
-    python scripts/run_experiment.py --single     # Run one experiment
-    python scripts/run_experiment.py --dry-run    # Show what would happen
+    python scripts/run_experiment.py --experiment engineering/api-speed --loop
+    python scripts/run_experiment.py --experiment engineering/api-speed --single
+    python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
+    python scripts/run_experiment.py --resume --loop
+    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
 """

 import argparse
@@ -25,11 +23,22 @@ from datetime import datetime
 from pathlib import Path


-def load_config(path):
-    """Load .autoresearch.cfg"""
-    cfg_file = Path(path) / ".autoresearch.cfg"
+def find_autoresearch_root():
+    """Find .autoresearch/ in project or user home."""
+    project_root = Path(".").resolve() / ".autoresearch"
+    if project_root.exists():
+        return project_root
+    user_root = Path.home() / ".autoresearch"
+    if user_root.exists():
+        return user_root
+    return None
+
+
+def load_config(experiment_dir):
+    """Load config.cfg from experiment directory."""
+    cfg_file = experiment_dir / "config.cfg"
     if not cfg_file.exists():
-        print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.")
+        print(f"  Error: no config.cfg in {experiment_dir}")
         sys.exit(1)
     config = {}
     for line in cfg_file.read_text().splitlines():
@@ -49,239 +58,293 @@ def run_cmd(cmd, cwd=None, timeout=None):


 def get_current_commit(path):
+    """Get short hash of current HEAD."""
     _, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
     return commit


-def get_current_metric(path, metric_grep):
-    """Read the last recorded metric from results.tsv."""
-    tsv = Path(path) / "results.tsv"
+def get_best_metric(experiment_dir, direction):
+    """Read the best metric from results.tsv."""
+    tsv = experiment_dir / "results.tsv"
     if not tsv.exists():
         return None
-    lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l]
+    lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
     if not lines:
         return None
-    last = lines[-1].split("\t")
-    try:
-        return float(last[1])
-    except (ValueError, IndexError):
+    metrics = []
+    for line in lines:
+        parts = line.split("\t")
+        try:
+            if parts[1] != "N/A":
+                metrics.append(float(parts[1]))
+        except (ValueError, IndexError):
+            continue
+    if not metrics:
         return None
+    return min(metrics) if direction == "lower" else max(metrics)


-def run_evaluation(path, evaluate_cmd, time_budget_minutes):
-    """Run evaluation with time limit."""
-    hard_limit = time_budget_minutes * 60 * 2.5  # 2.5x as hard timeout
+def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
+    """Run evaluation with time limit. Output goes to log_file."""
+    hard_limit = time_budget_minutes * 60 * 2.5
     t0 = time.time()
     try:
         code, _, _ = run_cmd(
-            f"{evaluate_cmd} > run.log 2>&1",
-            cwd=path,
+            f"{eval_cmd} > {log_file} 2>&1",
+            cwd=str(project_root),
             timeout=hard_limit
         )
         elapsed = time.time() - t0
         return code, elapsed
     except subprocess.TimeoutExpired:
         elapsed = time.time() - t0
-        return -1, elapsed  # -1 = timeout
+        return -1, elapsed


-def extract_metric(path, metric_grep):
-    """Extract metric value from run.log."""
-    code, out, _ = run_cmd(
-        f"grep '{metric_grep}' run.log | tail -1",
-        cwd=path
-    )
-    if not out:
-        return None
-    try:
-        return float(out.split(":")[-1].strip())
-    except ValueError:
+def extract_metric(log_file, metric_grep):
+    """Extract metric value from log file."""
+    log_path = Path(log_file)
+    if not log_path.exists():
         return None
+    for line in reversed(log_path.read_text().splitlines()):
+        stripped = line.strip()
+        if stripped.startswith(metric_grep.lstrip("^")):
+            try:
+                return float(stripped.split(":")[-1].strip())
+            except ValueError:
+                continue
+    return None


 def is_improvement(new_val, old_val, direction):
     """Check if new result is better than old."""
     if old_val is None:
-        return True  # First run always "improves"
+        return True
     if direction == "lower":
         return new_val < old_val
-    else:
-        return new_val > old_val
+    return new_val > old_val


-def log_result(path, commit, metric_val, status, description):
+def log_result(experiment_dir, commit, metric_val, status, description):
     """Append result to results.tsv."""
-    tsv = Path(path) / "results.tsv"
+    tsv = experiment_dir / "results.tsv"
     metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
     with open(tsv, "a") as f:
         f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")


-def get_experiment_count(path):
+def get_experiment_count(experiment_dir):
     """Count experiments run so far."""
-    tsv = Path(path) / "results.tsv"
+    tsv = experiment_dir / "results.tsv"
     if not tsv.exists():
         return 0
-    lines = tsv.read_text().splitlines()
-    return max(0, len(lines) - 1)  # subtract header
+    return max(0, len(tsv.read_text().splitlines()) - 1)


-def run_single_experiment(path, config, exp_num, dry_run=False):
+def get_last_active(root):
+    """Find the most recently modified experiment."""
+    latest = None
+    latest_time = 0
+    for domain_dir in root.iterdir():
+        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
+            continue
+        for exp_dir in domain_dir.iterdir():
+            if not exp_dir.is_dir():
+                continue
+            cfg = exp_dir / "config.cfg"
+            if cfg.exists() and cfg.stat().st_mtime > latest_time:
+                latest_time = cfg.stat().st_mtime
+                latest = f"{domain_dir.name}/{exp_dir.name}"
+    return latest
+
+
+def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
     """Run one experiment iteration."""
     direction = config.get("metric_direction", "lower")
     metric_grep = config.get("metric_grep", "^metric:")
-    evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py")
+    eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
     time_budget = int(config.get("time_budget_minutes", 5))
     metric_name = config.get("metric", "metric")
+    log_file = str(experiment_dir / "run.log")

-    best_so_far = get_current_metric(path, metric_grep)
+    best = get_best_metric(experiment_dir, direction)
     ts = datetime.now().strftime("%H:%M:%S")

     print(f"\n[{ts}] Experiment #{exp_num}")
-    print(f"  Best {metric_name} so far: {best_so_far}")
+    print(f"  Best {metric_name}: {best}")

     if dry_run:
         print("  [DRY RUN] Would run evaluation and check metric")
         return "dry_run"

-    # Save pre-experiment state for rollback
-    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path)
+    # Save state for rollback
+    code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
     if code != 0:
-        print("  ✗ Can't get git state. Is this a git repo with commits?")
+        print("  Error: can't get git state")
         return "error"

     # Run evaluation
-    print(f"  Running: {evaluate_cmd} (budget: {time_budget} min)")
-    ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget)
+    print(f"  Running: {eval_cmd} (budget: {time_budget}m)")
+    ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)

-    # Handle timeout
+    commit = get_current_commit(str(project_root))
+
+    # Timeout
     if ret_code == -1:
-        print(f"  ✗ TIMEOUT after {elapsed:.0f}s — discarding")
-        run_cmd("git checkout -- .", cwd=path)  # revert uncommitted changes
-        # Commit was already made by the agent before evaluation
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s")
+        print(f"  TIMEOUT after {elapsed:.0f}s — discarding")
+        run_cmd("git checkout -- .", cwd=str(project_root))
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
         return "crash"

-    # Handle non-zero exit
+    # Crash
     if ret_code != 0:
-        # Check if it crashed
-        code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path)
-        print(f"  ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s")
+        _, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
+        print(f"  CRASH (exit {ret_code}) after {elapsed:.0f}s")
         print(f"  Last output: {tail[:200]}")
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
         return "crash"

     # Extract metric
-    metric_val = extract_metric(path, metric_grep)
+    metric_val = extract_metric(log_file, metric_grep)
     if metric_val is None:
-        print(f"  ✗ Could not parse metric from run.log")
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, None, "crash", "metric_parse_failed")
+        print(f"  Could not parse {metric_name} from run.log")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
         return "crash"

-    curr_commit = get_current_commit(path)
     delta = ""
-    if best_so_far is not None:
-        diff = metric_val - best_so_far
-        delta = f" (Δ{diff:+.4f})"
+    if best is not None:
+        diff = metric_val - best
+        delta = f" (delta {diff:+.4f})"
     print(f"  {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")

     # Keep or discard
-    if is_improvement(metric_val, best_so_far, direction):
-        print(f"  ✅ KEEP — improvement confirmed")
-        log_result(path, curr_commit, metric_val, "keep",
-                   f"improvement_{metric_name}_{metric_val:.4f}")
+    if is_improvement(metric_val, best, direction):
+        print(f"  KEEP — improvement")
+        log_result(experiment_dir, commit, metric_val, "keep",
+                   f"improved_{metric_name}_{metric_val:.4f}")
         return "keep"
     else:
-        print(f"  ❌ DISCARD — no improvement")
-        run_cmd(f"git reset --hard {pre_commit}", cwd=path)
-        curr_commit = get_current_commit(path)
-        log_result(path, curr_commit, metric_val, "discard",
-                   f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}")
+        print(f"  DISCARD — no improvement")
+        run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
+        best_str = f"{best:.4f}" if best is not None else "?"
+        log_result(experiment_dir, commit, metric_val, "discard",
+                   f"no_improvement_{metric_val:.4f}_vs_{best_str}")
         return "discard"


-def print_summary(path):
-    """Print experiment summary."""
-    tsv = Path(path) / "results.tsv"
+def print_summary(experiment_dir, config):
+    """Print session summary."""
+    tsv = experiment_dir / "results.tsv"
     if not tsv.exists():
         return
-    lines = tsv.read_text().splitlines()[1:]  # skip header
+    lines = tsv.read_text().splitlines()[1:]
     if not lines:
         return

     keeps = [l for l in lines if "\tkeep\t" in l]
     discards = [l for l in lines if "\tdiscard\t" in l]
     crashes = [l for l in lines if "\tcrash\t" in l]
+    metric_name = config.get("metric", "metric")
+    direction = config.get("metric_direction", "lower")

-    print(f"\n{'='*50}")
-    print(f"  Session Summary")
+    print(f"\n{'=' * 55}")
+    print(f"  autoresearch — Session Summary")
     print(f"  Experiments: {len(lines)} total")
-    print(f"  ✅ Keep: {len(keeps)} | ❌ Discard: {len(discards)} | 💥 Crash: {len(crashes)}")
+    print(f"  Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")

     if keeps:
         try:
-            first_metric = float(keeps[0].split("\t")[1])
-            last_metric = float(keeps[-1].split("\t")[1])
-            direction = "↓" if last_metric < first_metric else "↑"
-            print(f"  Best progress: {first_metric:.6f} → {last_metric:.6f} {direction}")
+            valid = []
+            for l in keeps:
+                parts = l.split("\t")
+                if parts[1] != "N/A":
+                    valid.append(float(parts[1]))
+            if len(valid) >= 2:
+                first, last = valid[0], valid[-1]
+                best = min(valid) if direction == "lower" else max(valid)
+                pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
+                print(f"  {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
         except (ValueError, IndexError):
             pass
-    print(f"{'='*50}\n")
+    print(f"{'=' * 55}\n")


 def main():
     parser = argparse.ArgumentParser(description="autoresearch-agent runner")
+    parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
+    parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
     parser.add_argument("--loop", action="store_true", help="Run forever")
     parser.add_argument("--single", action="store_true", help="Run one experiment")
-    parser.add_argument("--dry-run", action="store_true", help="Dry run only")
+    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
+    parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
     parser.add_argument("--path", default=".", help="Project root")
-    parser.add_argument("--max-experiments", type=int, default=0,
-                        help="Max experiments (0 = unlimited)")
     args = parser.parse_args()

-    path = Path(args.path).resolve()
-    config = load_config(path)
+    project_root = Path(args.path).resolve()
+    root = find_autoresearch_root()

-    print(f"\n🔬 autoresearch-agent")
-    print(f"  Project: {path}")
-    print(f"  Target: {config.get('target', '?')}")
-    print(f"  Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
-    print(f"  Budget: {config.get('time_budget_minutes', '?')} min/experiment")
-    print(f"  Mode: {'loop' if args.loop else 'single'}")
+    if root is None:
+        print("No .autoresearch/ found. Run setup_experiment.py first.")
+        sys.exit(1)

-    if args.single:
-        exp_num = get_experiment_count(path) + 1
-        run_single_experiment(path, config, exp_num, args.dry_run)
+    # Resolve experiment
+    experiment_path = args.experiment
+    if args.resume:
+        experiment_path = get_last_active(root)
+        if not experiment_path:
+            print("No experiments found to resume.")
+            sys.exit(1)
+        print(f"Resuming: {experiment_path}")
+
+    if not experiment_path:
+        print("Specify --experiment domain/name or --resume")
+        sys.exit(1)
+
+    experiment_dir = root / experiment_path
+    if not experiment_dir.exists():
+        print(f"Experiment not found: {experiment_dir}")
+        print("Run: python scripts/setup_experiment.py --list")
+        sys.exit(1)
+
+    config = load_config(experiment_dir)
+
+    domain, name = experiment_path.split("/", 1)
+    print(f"\n autoresearch-agent")
+    print(f"  Experiment: {experiment_path}")
+    print(f"  Target: {config.get('target', '?')}")
+    print(f"  Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
+    print(f"  Budget: {config.get('time_budget_minutes', '?')} min/experiment")
+    print(f"  Mode: {'loop' if args.loop else 'single'}")
+
+    if args.single or args.dry_run:
+        exp_num = get_experiment_count(experiment_dir) + 1
+        run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
         return

-    if not args.loop and not args.dry_run:
+    if not args.loop:
         print("\nSpecify --loop (forever) or --single (one experiment)")
         sys.exit(1)

-    # Setup graceful shutdown
+    # Graceful shutdown
     def handle_interrupt(sig, frame):
-        print_summary(path)
-        print("\n⏹ Stopped by user.")
+        print_summary(experiment_dir, config)
+        print("\nStopped by user.")
         sys.exit(0)

     signal.signal(signal.SIGINT, handle_interrupt)
     signal.signal(signal.SIGTERM, handle_interrupt)

-    # Main loop
     consecutive_crashes = 0
-    exp_num = get_experiment_count(path) + 1
+    exp_num = get_experiment_count(experiment_dir) + 1

-    print(f"\nStarting loop. Ctrl+C to stop and print summary.\n")
+    print(f"\nStarting loop. Ctrl+C to stop.\n")

     while True:
-        result = run_single_experiment(path, config, exp_num, args.dry_run)
+        result = run_single(project_root, experiment_dir, config, exp_num, False)
         exp_num += 1

         if result == "crash":
@@ -289,21 +352,16 @@ def main():
         else:
             consecutive_crashes = 0

-        # Bail if 5 consecutive crashes
         if consecutive_crashes >= 5:
-            print("\n⚠ 5 consecutive crashes. Pausing for investigation.")
-            print("  Check run.log for the last error.")
+            print("\n  5 consecutive crashes. Pausing.")
+            print("  Check .autoresearch/{}/run.log".format(experiment_path))
             break

-        # Check max experiments
-        if args.max_experiments > 0 and exp_num > args.max_experiments:
-            print(f"\n✓ Reached max experiments ({args.max_experiments})")
+        if 0 < args.max_experiments < exp_num:
+            print(f"\n  Reached max experiments ({args.max_experiments})")
             break

-        if args.single:
-            break
-
-    print_summary(path)
+    print_summary(experiment_dir, config)


 if __name__ == "__main__":
diff --git a/engineering/autoresearch-agent/scripts/setup_experiment.py b/engineering/autoresearch-agent/scripts/setup_experiment.py
index 2898f13..2029775 100644
--- a/engineering/autoresearch-agent/scripts/setup_experiment.py
+++ b/engineering/autoresearch-agent/scripts/setup_experiment.py
@@ -1,65 +1,52 @@
 #!/usr/bin/env python3
 """
-autoresearch-agent: Setup Wizard
+autoresearch-agent: Setup Experiment

-Initializes a new research run:
-1. Validates the project structure
-2. Creates a git branch
-3. Runs the baseline experiment
-4. Initializes results.tsv
+Initialize a new experiment with domain, target, evaluator, and git branch.
+Creates the .autoresearch/{domain}/{name}/ directory structure.

 Usage:
-    python scripts/setup_experiment.py [--config experiment.yaml]
-    python scripts/setup_experiment.py --domain ml|prompt|code|skill
+    python scripts/setup_experiment.py --domain engineering --name api-speed \
+        --target src/api/search.py --eval "pytest bench.py" \
+        --metric p50_ms --direction lower
+
+    python scripts/setup_experiment.py --domain marketing --name medium-ctr \
+        --target content/titles.md --eval "python evaluate.py" \
+        --metric ctr_score --direction higher --evaluator llm_judge_content
+
+    python scripts/setup_experiment.py --list              # List all experiments
+    python scripts/setup_experiment.py --list-evaluators   # List available evaluators
 """

 import argparse
 import os
+import shutil
 import subprocess
 import sys
 import time
 from datetime import datetime
 from pathlib import Path

+DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]

-DOMAINS = {
-    "ml": {
-        "target": "train.py",
-        "evaluate_cmd": "uv run train.py",
-        "metric": "val_bpb",
-        "metric_direction": "lower",
-        "time_budget_minutes": 5,
-        "metric_grep": "^val_bpb:",
-    },
-    "prompt": {
-        "target": "prompt.md",
-        "evaluate_cmd": "python evaluate.py",
-        "metric": "eval_score",
-        "metric_direction": "higher",
-        "time_budget_minutes": 2,
-        "metric_grep": "^eval_score:",
-    },
-    "code": {
-        "target": "src/module.py",
-        "evaluate_cmd": "python benchmark.py",
-        "metric": "p50_ms",
-        "metric_direction": "lower",
-        "time_budget_minutes": 10,
-        "metric_grep": "^p50_ms:",
-    },
-    "skill": {
-        "target": "SKILL.md",
-        "evaluate_cmd": "python scripts/skill_evaluator.py",
-        "metric": "pass_rate",
-        "metric_direction": "higher",
-        "time_budget_minutes": 5,
-        "metric_grep": "^pass_rate:",
-    },
-}
+EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"
+
+DEFAULT_CONFIG = """# autoresearch global config
+default_time_budget_minutes: 5
+default_scope: project
+dashboard_format: markdown
+"""
+
+GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
+**/results.tsv
+**/run.log
+**/run.*.log
+config.yaml
+"""


 def run_cmd(cmd, cwd=None, timeout=None):
-    """Run a shell command and return (returncode, stdout, stderr)."""
+    """Run shell command, return (returncode, stdout, stderr)."""
     result = subprocess.run(
         cmd, shell=True, capture_output=True,
         text=True, cwd=cwd, timeout=timeout
@@ -67,188 +54,315 @@ def run_cmd(cmd, cwd=None, timeout=None):
     return result.returncode, result.stdout.strip(), result.stderr.strip()


-def check_git_repo(path):
-    """Verify we're in a git repo."""
-    code, out, err = run_cmd("git rev-parse --is-inside-work-tree", cwd=path)
-    if code != 0:
-        print("✗ Not a git repository. Run: git init && git add . && git commit -m 'initial'")
+def get_autoresearch_root(scope, project_root=None):
+    """Get the .autoresearch root directory based on scope."""
+    if scope == "user":
+        return Path.home() / ".autoresearch"
+    return Path(project_root or ".") / ".autoresearch"
+
+
+def init_root(root):
+    """Initialize .autoresearch root if it doesn't exist."""
+    created = False
+    if not root.exists():
+        root.mkdir(parents=True)
+        created = True
+        print(f"  Created {root}/")
+
+    config_file = root / "config.yaml"
+    if not config_file.exists():
+        config_file.write_text(DEFAULT_CONFIG)
+        print(f"  Created {config_file}")
+
+    gitignore = root / ".gitignore"
+    if not gitignore.exists():
+        gitignore.write_text(GITIGNORE_CONTENT)
+        print(f"  Created {gitignore}")
+
+    return created
+
+
+def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
+    """Generate a program.md template for the experiment."""
+    direction_word = "Minimize" if direction == "lower" else "Maximize"
+    content = f"""# autoresearch — {name}
+
+## Goal
+{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.
+
+## What the Agent Can Change
+- Only `{target}` — this is the single file being optimized.
+- Everything inside that file is fair game unless constrained below.
+
+## What the Agent Cannot Change
+- The evaluation script (`evaluate.py` or the eval command). It is read-only.
+- Dependencies — do not add new packages or imports that aren't already available.
+- Any other files in the project unless explicitly noted here.
+{f"- Additional constraints: {constraints}" if constraints else ""}
+
+## Strategy
+1. First run: establish baseline. Do not change anything.
+2. Profile/analyze the current state — understand why the metric is what it is.
+3. Try the most obvious improvement first (low-hanging fruit).
+4. If that works, push further in the same direction.
+5. If stuck, try something orthogonal or radical.
+6. Read the git log of previous experiments. Don't repeat failed approaches.
+
+## Simplicity Rule
+A small improvement that adds ugly complexity is NOT worth it.
+Equal performance with simpler code IS worth it.
+Removing code that gets same results is the best outcome.
+
+## Stop When
+You don't stop. The human will interrupt you when they're satisfied.
+If no improvement in 20+ consecutive runs, change strategy drastically.
+""" + (experiment_dir / "program.md").write_text(content) + + +def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget): + """Write experiment config.""" + content = f"""target: {target} +evaluate_cmd: {eval_cmd} +metric: {metric} +metric_direction: {direction} +metric_grep: ^{metric}: +time_budget_minutes: {time_budget} +created: {datetime.now().strftime('%Y-%m-%d %H:%M')} +""" + (experiment_dir / "config.cfg").write_text(content) + + +def init_results_tsv(experiment_dir): + """Create results.tsv with header.""" + tsv = experiment_dir / "results.tsv" + if tsv.exists(): + print(f" results.tsv already exists ({tsv.stat().st_size} bytes)") + return + tsv.write_text("commit\tmetric\tstatus\tdescription\n") + print(" Created results.tsv") + + +def copy_evaluator(experiment_dir, evaluator_name): + """Copy a built-in evaluator to the experiment directory.""" + source = EVALUATOR_DIR / f"{evaluator_name}.py" + if not source.exists(): + print(f" Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}") + print(f" Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}") return False - print("✓ Git repository found") + dest = experiment_dir / "evaluate.py" + shutil.copy2(source, dest) + print(f" Copied evaluator: {evaluator_name}.py -> evaluate.py") return True -def check_program_md(path): - """Check program.md exists and has content.""" - pm = Path(path) / "program.md" - if not pm.exists(): - print("⚠ program.md not found. Creating template...") - return False - content = pm.read_text() - if len(content) < 100: - print("⚠ program.md looks empty. 
Fill it out before running experiments.") - return False - print(f"✓ program.md found ({len(content)} chars)") - return True - - -def check_target_file(path, target): - """Check target file exists.""" - tf = Path(path) / target - if not tf.exists(): - print(f"✗ Target file not found: {target}") - return False - print(f"✓ Target file found: {target}") - return True - - -def check_evaluate_script(path): - """Check evaluate.py exists.""" - ev = Path(path) / "evaluate.py" - if not ev.exists(): - print("⚠ evaluate.py not found. You need a fixed evaluation function.") - print(" Create evaluate.py that outputs: metric_name: ") - return False - print("✓ evaluate.py found") - return True - - -def create_branch(path, tag): +def create_branch(path, domain, name): """Create and checkout the experiment branch.""" - branch = f"autoresearch/{tag}" - code, out, err = run_cmd(f"git checkout -b {branch}", cwd=path) + branch = f"autoresearch/{domain}/{name}" + code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path) if code != 0: if "already exists" in err: - print(f"✗ Branch '{branch}' already exists. Use a different tag.") - else: - print(f"✗ Failed to create branch: {err}") + print(f" Branch '{branch}' already exists. Checking out...") + run_cmd(f"git checkout {branch}", cwd=path) + return branch + print(f" Warning: could not create branch: {err}") return None - print(f"✓ Created branch: {branch}") + print(f" Created branch: {branch}") return branch -def init_results_tsv(path): - """Create results.tsv with header.""" - tsv = Path(path) / "results.tsv" - if tsv.exists(): - print(f"✓ results.tsv already exists ({tsv.stat().st_size} bytes)") +def list_experiments(root): + """List all experiments across all domains.""" + if not root.exists(): + print("No experiments found. 
Run setup to create your first experiment.") return - tsv.write_text("commit\tmetric\tstatus\tdescription\n") - print("✓ Created results.tsv") + + experiments = [] + for domain_dir in sorted(root.iterdir()): + if not domain_dir.is_dir() or domain_dir.name.startswith("."): + continue + for exp_dir in sorted(domain_dir.iterdir()): + if not exp_dir.is_dir(): + continue + cfg_file = exp_dir / "config.cfg" + if not cfg_file.exists(): + continue + config = {} + for line in cfg_file.read_text().splitlines(): + if ":" in line: + k, v = line.split(":", 1) + config[k.strip()] = v.strip() + + # Count results + tsv = exp_dir / "results.tsv" + runs = 0 + if tsv.exists(): + runs = max(0, len(tsv.read_text().splitlines()) - 1) + + experiments.append({ + "domain": domain_dir.name, + "name": exp_dir.name, + "target": config.get("target", "?"), + "metric": config.get("metric", "?"), + "runs": runs, + }) + + if not experiments: + print("No experiments found.") + return + + print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}") + print("-" * 95) + for e in experiments: + print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}") + print(f"\nTotal: {len(experiments)} experiments") -def run_baseline(path, evaluate_cmd, metric_grep, time_budget_minutes): - """Run the baseline experiment.""" - print(f"\nRunning baseline experiment (~{time_budget_minutes} min)...") - timeout = time_budget_minutes * 60 * 2.5 # 2.5x budget as hard limit +def list_evaluators(): + """List available built-in evaluators.""" + if not EVALUATOR_DIR.exists(): + print("No evaluators directory found.") + return - t0 = time.time() - code, out, err = run_cmd( - f"{evaluate_cmd} > run.log 2>&1", - cwd=path, - timeout=timeout - ) - elapsed = time.time() - t0 - - if code != 0: - print(f"✗ Baseline run failed after {elapsed:.0f}s. 
Check run.log") - return None - - # Extract metric - grep_code, grep_out, _ = run_cmd( - f"grep '{metric_grep}' run.log | tail -1", - cwd=path - ) - if not grep_out: - print("✗ Could not extract metric from run.log. Check metric_grep pattern.") - return None - - metric_value = grep_out.split(":")[-1].strip() - print(f"✓ Baseline complete in {elapsed:.0f}s — metric: {metric_value}") - return metric_value + print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n") + for f in sorted(EVALUATOR_DIR.glob("*.py")): + # Read first docstring line + desc = "" + for line in f.read_text().splitlines(): + if line.strip().startswith('"""') or line.strip().startswith("'''"): + continue + if line.strip() and not line.startswith("#!"): + desc = line.strip().strip('"').strip("'") + break + print(f" {f.stem:<25} {desc}") def main(): parser = argparse.ArgumentParser(description="autoresearch-agent setup") - parser.add_argument("--domain", choices=list(DOMAINS.keys()), help="Experiment domain") + parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain") + parser.add_argument("--name", help="Experiment name (e.g. 
api-speed, medium-ctr)")
     parser.add_argument("--target", help="Target file to optimize")
-    parser.add_argument("--evaluate-cmd", help="Evaluation command")
-    parser.add_argument("--metric", help="Metric name")
-    parser.add_argument("--direction", choices=["lower", "higher"], default="lower")
-    parser.add_argument("--budget", type=int, default=5, help="Time budget in minutes")
-    parser.add_argument("--tag", help="Run tag (used in branch name)")
+    parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
+    parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
+    parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
+                        help="Is lower or higher better?")
+    parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
+    parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
+    parser.add_argument("--scope", choices=["project", "user"], default="project",
+                        help="Where to store experiments: project (./) or user (~/)")
+    parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
     parser.add_argument("--path", default=".", help="Project root path")
-    parser.add_argument("--skip-baseline", action="store_true")
+    parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
+    parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
+    parser.add_argument("--list", action="store_true", help="List all experiments")
+    parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
     args = parser.parse_args()
 
-    path = Path(args.path).resolve()
-    print(f"\n🔬 autoresearch-agent setup")
-    print(f" Project: {path}")
-    print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
+    project_root = Path(args.path).resolve()
 
-    # Get config from domain or args
-    if args.domain:
-        config = DOMAINS[args.domain].copy()
+    # List mode
+    if args.list:
+        root = get_autoresearch_root("project", project_root)
+        list_experiments(root)
+        user_root = get_autoresearch_root("user")
+        if user_root.exists() and user_root != root:
+            print(f"\n--- User-level experiments ({user_root}) ---")
+            list_experiments(user_root)
+        return
+
+    if args.list_evaluators:
+        list_evaluators()
+        return
+
+    # Validate required args for setup
+    if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
+        parser.error("Required: --domain, --name, --target, --eval, --metric")
+
+    root = get_autoresearch_root(args.scope, project_root)
+
+    print(f"\n autoresearch-agent setup")
+    print(f" Project: {project_root}")
+    print(f" Scope: {args.scope}")
+    print(f" Domain: {args.domain}")
+    print(f" Experiment: {args.name}")
+    print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
+
+    # Check git
+    code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
+    if code != 0:
+        print(" Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
+        sys.exit(1)
+    print(" Git repository found")
+
+    # Check target file
+    target_path = project_root / args.target
+    if not target_path.exists():
+        print(f" Error: target file not found: {args.target}")
+        sys.exit(1)
+    print(f" Target file found: {args.target}")
+
+    # Init root
+    init_root(root)
+
+    # Create experiment directory
+    experiment_dir = root / args.domain / args.name
+    if experiment_dir.exists():
+        print(f" Warning: experiment '{args.domain}/{args.name}' already exists.")
+        print(f" Use --name with a different name, or delete {experiment_dir}")
+        sys.exit(1)
+    experiment_dir.mkdir(parents=True)
+    print(f" Created {experiment_dir}/")
+
+    # Create files
+    create_program_md(experiment_dir, args.domain, args.name,
+                      args.target, args.metric, args.direction, args.constraints)
+    print(" Created program.md")
+
+    create_config(experiment_dir, args.target, args.eval_cmd,
+                  args.metric, args.direction, args.time_budget)
+    print(" Created config.cfg")
+
+    init_results_tsv(experiment_dir)
+
+    # Copy evaluator if specified
+    if args.evaluator:
+        copy_evaluator(experiment_dir, args.evaluator)
+
+    # Create git branch
+    if not args.skip_branch:
+        create_branch(str(project_root), args.domain, args.name)
+
+    # Test evaluation command
+    print(f"\n Testing evaluation: {args.eval_cmd}")
+    code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
+    if code != 0:
+        print(f" Warning: eval command failed (exit {code})")
+        if err:
+            print(f" stderr: {err[:200]}")
+        print(" Fix the eval command before running the experiment loop.")
     else:
-        config = {
-            "target": args.target or "target.py",
-            "evaluate_cmd": args.evaluate_cmd or "python evaluate.py",
-            "metric": args.metric or "score",
-            "metric_direction": args.direction,
-            "time_budget_minutes": args.budget,
-            "metric_grep": f"^{args.metric or 'score'}:",
-        }
+        # Check metric is parseable
+        full_output = out + "\n" + err
+        metric_found = False
+        for line in full_output.splitlines():
+            if line.strip().startswith(f"{args.metric}:"):
+                metric_found = True
+                print(f" Eval works. Baseline: {line.strip()}")
+                break
+        if not metric_found:
+            print(f" Warning: eval ran but '{args.metric}:' not found in output.")
+            print(f" Make sure your eval command outputs: {args.metric}: <value>")
 
-    tag = args.tag or datetime.now().strftime("%b%d").lower()
-
-    # Validation checks
-    checks = [
-        check_git_repo(path),
-        check_program_md(path),
-        check_target_file(path, config["target"]),
-        check_evaluate_script(path),
-    ]
-
-    if not all(checks):
-        print("\n⚠ Fix the above issues before running experiments.")
-        sys.exit(1)
-
-    # Create branch
-    branch = create_branch(path, tag)
-    if not branch:
-        sys.exit(1)
-
-    # Init results TSV
-    init_results_tsv(path)
-
-    # Save config for run_experiment.py
-    config_content = "\n".join(f"{k}: {v}" for k, v in config.items())
-    (path / ".autoresearch.cfg").write_text(config_content + "\n")
-    print("✓ Saved .autoresearch.cfg")
-
-    # Run baseline
-    if not args.skip_baseline:
-        baseline = run_baseline(
-            path,
-            config["evaluate_cmd"],
-            config["metric_grep"],
-            config["time_budget_minutes"]
-        )
-        if baseline:
-            # Log baseline to TSV
-            code, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
-            with open(path / "results.tsv", "a") as f:
-                f.write(f"{commit}\t{baseline}\tkeep\tbaseline\n")
-            print(f"✓ Baseline logged to results.tsv")
-
-    print(f"\n✅ Setup complete!")
-    print(f" Branch: {branch}")
-    print(f" Target: {config['target']}")
-    print(f" Metric: {config['metric']} ({config['metric_direction']} is better)")
-    print(f" Budget: {config['time_budget_minutes']} min/experiment")
-    print(f"\nTo start the autonomous loop:")
-    print(f" python scripts/run_experiment.py --loop")
-    print(f"\nOr run a single experiment:")
-    print(f" python scripts/run_experiment.py --single")
+    # Summary
+    print(f"\n Setup complete!")
+    print(f" Experiment: {args.domain}/{args.name}")
+    print(f" Target: {args.target}")
+    print(f" Metric: {args.metric} ({args.direction} is better)")
+    print(f" Budget: {args.time_budget} min/experiment")
+    if not args.skip_branch:
+        print(f" Branch: autoresearch/{args.domain}/{args.name}")
+    print(f"\n To start:")
+    print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")
 
 if __name__ == "__main__":
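
The setup script above validates the eval command by grepping its output for a line beginning with `<metric>:`. A minimal evaluator sketch of that contract — the file name `evaluate.py`, the `runtime_ms` metric, and the dummy workload are all illustrative assumptions, not part of the skill:

```python
# evaluate.py — hypothetical minimal evaluator for the loop above.
# The only hard requirement: print the metric as "name: value" on its
# own line, since setup/run scripts parse stdout for that prefix.
import time

def evaluate() -> float:
    """Time a stand-in workload and return elapsed milliseconds."""
    start = time.perf_counter()
    sum(i * i for i in range(100_000))  # replace with the real workload
    return (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    # Matches e.g.: --metric runtime_ms --direction lower
    print(f"runtime_ms: {evaluate():.2f}")
```

Any command works the same way — a shell script timing a build, or an LLM judge printing `quality: 8.5` — as long as the metric name passed to `--metric` appears in its output in that `name: value` form.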