refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators
Major rewrite based on deep study of Karpathy's autoresearch repo.
Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation
New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed
Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output
Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view
SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
@@ -1,9 +1,9 @@
 ---
 name: "autoresearch-agent"
-description: "Autonomous experiment loop that runs overnight research without human intervention. Inspired by Karpathy's autoresearch: agent modifies a target file, runs an evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when the user wants to: autonomously optimize ML training code, improve prompts by eval score, benchmark-drive code performance, or run any experiment loop with a measurable metric. Requires: a target file to modify, a fixed evaluation function, and a git repo."
+description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
 license: MIT
 metadata:
-  version: 1.0.0
+  version: 2.0.0
   author: Alireza Rezvani
   category: engineering
   updated: 2026-03-13
@@ -13,194 +13,233 @@ metadata:
 
 > You sleep. The agent experiments. You wake up to results.
 
-Autonomous experiment loop inspired by Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch). The agent proposes changes, runs a fixed-time evaluation, keeps improvements via git, discards failures, and loops indefinitely — no human in the loop.
+Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
 
-**Works for any domain with a measurable metric:**
-- ML training optimization (original use case — optimize `train.py` by `val_bpb`)
-- Prompt engineering (optimize system prompts by LLM-eval quality score)
-- Code performance (optimize a module by benchmark runtime/score)
-- Agent skill improvement (optimize `SKILL.md` by task completion rate)
+Not one guess — fifty measured attempts, compounding.
 
 ---
 
-## Before Starting
+## When This Skill Activates
 
-Check for `program.md` in the project root. If it exists, read it — it defines the experiment objectives, constraints, and what the agent should optimize. Only ask for what's missing.
-
-If no `program.md` exists, run the **Setup Wizard** below.
+Recognize these patterns from the user:
+
+- "Make this faster / smaller / better"
+- "Optimize [file] for [metric]"
+- "Improve my [headlines / copy / prompts]"
+- "Run experiments overnight"
+- "I want to get [metric] from X to Y"
+- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
+
+If the user describes a target file + a way to measure success → this skill applies.
 
 ---
 
-## Setup Wizard
+## Setup
 
-Answer these 5 questions to configure the experiment:
+### First Time — Create the Experiment
 
-### 1. What are we optimizing?
-The **target** — one file the agent is allowed to modify:
-- `train.py` — ML training loop (Karpathy-style)
-- `prompt.md` — system prompt for an LLM
-- `src/module.py` — a specific code module
-- `SKILL.md` — an agent skill definition
-
-### 2. What's the metric?
-A **single number** that defines success. Lower or higher = better:
-- `val_bpb` — validation bits per byte (ML, lower is better)
-- `eval_score` — LLM quality score 0-100 (higher is better)
-- `p50_ms` — median latency in milliseconds (lower is better)
-- `pass_rate` — test pass rate 0-1 (higher is better)
-
-### 3. What's the time budget per experiment?
-How long one experiment run takes:
-- `5m` — fast iteration (Karpathy default, ~12 experiments/hour)
-- `10m` — moderate (6/hour)
-- `30m` — slow but thorough (2/hour)
-
-### 4. What can the agent change?
-Constraints on the target file:
-- Architecture only? Hyperparameters only? Everything?
-- What packages/imports are available?
-- What's explicitly off-limits?
-
-### 5. What's the evaluation function?
-How we score each experiment:
-- Fixed script that outputs the metric (e.g. `python evaluate.py`)
-- API call that returns a score
-- Test suite with a pass rate
-
-Once answered, run: `python scripts/setup_experiment.py` to initialize.
-
----
-
-## The Three Files
-
-Every autoresearch project has the same structure:
-
-```
-project/
-├── program.md   ← Human writes this: objectives, constraints, strategy
-├── target.*     ← Agent modifies this: the thing being optimized
-├── evaluate.py  ← Fixed: the measurement function (never touch)
-├── results.tsv  ← Auto-generated: experiment log (git-tracked for continuity)
-└── scripts/
-    ├── setup_experiment.py  ← Initialize a new run
-    ├── run_experiment.py    ← Execute one experiment iteration
-    └── log_results.py       ← Record results to TSV
-```
+Run the setup script. The user decides where experiments live:
+
+**Project-level** (inside repo, git-tracked, shareable with team):
+```bash
+python scripts/setup_experiment.py \
+  --domain engineering \
+  --name api-speed \
+  --target src/api/search.py \
+  --eval "pytest bench.py --tb=no -q" \
+  --metric p50_ms \
+  --direction lower \
+  --scope project
+```
-### `program.md` — Your Research Directions
-Write this once. The agent reads it before every experiment. It should contain:
-- **Goal:** What you want to achieve (minimize loss, maximize score, simplify code)
-- **Strategy:** What directions to explore first
-- **Constraints:** What the agent cannot change
-- **Stopping criteria:** When a result is "good enough"
-
-See `references/program-template.md` for domain-specific templates.
-
-### Target File — The Only File the Agent Edits
-Whatever you're optimizing. Strict scope: **one file, one metric**.
-
-### `evaluate.py` — Fixed Evaluation (Never Modified)
-The measurement function. Outputs the metric value to stdout. The agent reads this output — it cannot change how it's measured.
+**User-level** (personal, in `~/.autoresearch/`):
+```bash
+python scripts/setup_experiment.py \
+  --domain marketing \
+  --name medium-ctr \
+  --target content/titles.md \
+  --eval "python evaluate.py" \
+  --metric ctr_score \
+  --direction higher \
+  --evaluator llm_judge_content \
+  --scope user
+```
+
+The `--scope` flag determines where `.autoresearch/` lives:
+
+- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
+- `user` → `~/.autoresearch/` in the home directory. Everything is personal.
+
+### What Setup Creates
+
+```
+.autoresearch/
+├── config.yaml       ← Global settings
+├── .gitignore        ← Ignores results.tsv, *.log
+└── {domain}/{experiment-name}/
+    ├── program.md    ← Objectives, constraints, strategy
+    ├── config.cfg    ← Target, eval cmd, metric, direction
+    ├── results.tsv   ← Experiment log (gitignored)
+    └── evaluate.py   ← Evaluation script (if --evaluator used)
+```
+
+### Domains
+
+| Domain | Use Cases |
+|--------|-----------|
+| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
+| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
+| `content` | Article structure, SEO descriptions, readability, CTR |
+| `prompts` | System prompts, chatbot tone, agent instructions |
+| `custom` | Anything else with a measurable metric |
+
+### If `program.md` Already Exists
+
+The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
 ---
 
 ## The Experiment Loop
 
-Run: `python scripts/run_experiment.py --loop`
+### Starting an Experiment
+
+```bash
+# Run specific experiment
+python scripts/run_experiment.py --experiment engineering/api-speed --loop
+
+# Single iteration (test setup)
+python scripts/run_experiment.py --experiment engineering/api-speed --single
+
+# Resume last active experiment
+python scripts/run_experiment.py --resume --loop
+
+# Dry run (show what would happen)
+python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
+```
+
+### The Loop Protocol
 
 ```
 LOOP FOREVER:
 
-1. Read program.md for current strategy
-2. Review git history: what has been tried? What worked?
-3. Propose ONE change to the target file
-4. Apply the change
-5. git commit (with descriptive message)
-6. Run evaluation: python evaluate.py > run.log 2>&1
-7. Parse metric from run.log
-8. If metric improved → ADVANCE (keep commit, log "keep")
-9. If metric equal/worse → REVERT (git reset, log "discard")
-10. If crash → attempt fix, if unfixable log "crash" and revert
-11. Update results.tsv
-12. Go to 1
+1. Read program.md for current strategy and constraints
+2. Review git log: what has been tried? What worked? What crashed?
+3. Review results.tsv: current best metric, trend, recent failures
+4. Propose ONE change to the target file
+5. Apply the change
+6. git commit -m "experiment: [short description of what changed]"
+7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
+8. Parse metric from run.log (grep for metric_name: value)
+9. Decision:
+   - Metric improved → KEEP (advance branch, log "keep")
+   - Metric equal or worse → REVERT (git reset --hard, log "discard")
+   - Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
+10. Append result to results.tsv
+11. Go to 1
 ```
 
-### Rules (from Karpathy's original)
+### Rules
 
-- **NEVER STOP** — once the loop starts, do not ask the human if you should continue. They may be asleep. Run until manually interrupted.
-- **Simplicity criterion** — a small improvement that adds ugly complexity is not worth it. Removing code and getting equal results is a win.
-- **One change per experiment** — don't change 5 things at once. You won't know what worked.
-- **Crash = discard** — OOM, error, timeout → log "crash", revert, move on.
-- **Time limit** — if run exceeds 2.5× the time budget, kill it and treat as crash.
-- **No new dependencies** — only use what's already available.
+- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
+- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
+- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets the same results is the best outcome.
+- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
+- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat as crash.
+- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
+- **No new dependencies.** Only use what's already available in the project.
 ---
 
-## Results Log
+## Evaluators
 
-`results.tsv` (tab-separated, not git-tracked):
-
-```
-commit   metric  status   description
-a1b2c3d  0.9979  keep     baseline
-b2c3d4e  0.9932  keep     increased learning rate
-c3d4e5f  1.0050  discard  switched to GeLU (worse)
-d4e5f6g  0.0000  crash    doubled model width (OOM)
-```
-
-Run `python scripts/log_results.py --summary` for a visual summary.
+Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.
+
+### Free Evaluators (no API cost)
+
+| Evaluator | Metric | Use Case |
+|-----------|--------|----------|
+| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
+| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
+| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
+| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
+| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
+
+### LLM Judge Evaluators (use your subscription)
+
+| Evaluator | Metric | Use Case |
+|-----------|--------|----------|
+| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
+| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
+| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |
+
+LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
+
+The user's existing subscription covers the cost:
+
+- Claude Code Max → unlimited Claude calls for evaluation
+- Codex CLI (ChatGPT Pro) → unlimited Codex calls
+- Gemini CLI (free tier) → free evaluation calls
+
+### Custom Evaluators
+
+If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.
+
+```python
+#!/usr/bin/env python3
+# My custom evaluator — DO NOT MODIFY after experiment starts
+import subprocess
+result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
+# Parse and output
+print(f"my_metric: {parse_score(result.stdout)}")
+```
 
 ---
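As a concrete instance of the `metric_name: value` contract above, here is a complete toy evaluator that scores the target by non-blank line count. The file name and metric name are hypothetical; the only real requirement is the final `print`:

```python
#!/usr/bin/env python3
# Toy custom evaluator: fewer non-blank lines = simpler = better.
# DO NOT MODIFY after experiment starts — evaluators are fixed ground truth.
from pathlib import Path

TARGET_FILE = "demo_target.py"  # hypothetical target path

def line_count_metric(path: str) -> int:
    # Blank lines don't count toward complexity
    return sum(1 for ln in Path(path).read_text().splitlines() if ln.strip())

# Demo against a scratch file so the contract is visible end to end
Path(TARGET_FILE).write_text("x = 1\n\ny = x + 1\n")
print(f"line_count: {line_count_metric(TARGET_FILE)}")  # → line_count: 2
```

The runner only greps stdout for `line_count: <value>`; everything else the script prints should go to stderr.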
 
-## Domain-Specific Configurations
+## Viewing Results
 
-### ML Training (Karpathy-style)
-```yaml
-target: train.py
-evaluate: uv run prepare.py --eval-only
-metric: val_bpb (lower is better)
-time_budget: 5m
-git_branch: autoresearch/{date}-{tag}
-```
-
-### Prompt Engineering
-```yaml
-target: prompt.md
-evaluate: python evaluate.py --model gpt-4o --test-cases tests/
-metric: eval_score (0-100, higher is better)
-time_budget: 2m
-git_branch: prompt-research/{date}
-```
-
-### Code Performance
-```yaml
-target: src/hot_module.py
-evaluate: python benchmark.py --runs 5 --warmup 1
-metric: p50_ms (lower is better)
-time_budget: 10m
-git_branch: perf-research/{date}
-```
-
-### Agent Skill Optimization
-```yaml
-target: SKILL.md
-evaluate: python scripts/skill_evaluator.py --task-suite tests/
-metric: pass_rate (0-1, higher is better)
-time_budget: 5m
-git_branch: skill-research/{date}
-```
-
-See `references/experiment-domains.md` for full setup guides per domain.
+```bash
+# Single experiment
+python scripts/log_results.py --experiment engineering/api-speed
+
+# All experiments in a domain
+python scripts/log_results.py --domain engineering
+
+# Cross-experiment dashboard
+python scripts/log_results.py --dashboard
+
+# Export formats
+python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
+python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
+python scripts/log_results.py --dashboard --format markdown --output dashboard.md
+```
+
+### Dashboard Output
+
+```
+DOMAIN       EXPERIMENT    RUNS  KEPT  BEST    Δ FROM START  STATUS
+engineering  api-speed     47    14    185ms   -76.9%        active
+engineering  bundle-size   23    8     412KB   -58.3%        paused
+marketing    medium-ctr    31    11    8.4/10  +68.0%        active
+prompts      support-tone  15    6     82/100  +46.4%        done
+```
+
+### Export Formats
+
+- **TSV** — default, tab-separated (compatible with spreadsheets)
+- **CSV** — comma-separated, with proper quoting
+- **Markdown** — formatted table, readable in GitHub/docs
 
 ---
 
-## Scripts
+## Proactive Triggers
 
-| Script | Purpose |
-|--------|---------|
-| `setup_experiment.py` | Initialize a new research run: create branch, verify setup, baseline run |
-| `run_experiment.py` | Execute the autonomous loop (single run or `--loop` for infinite) |
-| `log_results.py` | Record results to TSV; `--summary` prints progress table |
+Flag these without being asked:
+
+- **No evaluation command works** → Test it before starting the loop. Run once, verify output.
+- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
+- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
+- **Time budget too short** → If eval takes longer than budget, every run crashes.
+- **Agent modifying evaluate.py** → Hard stop. This invalidates all comparisons.
+- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
+- **No improvement in 20+ runs** → Suggest changing strategy in program.md or trying a different approach.
 
 ---
 
@@ -214,7 +253,6 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
 
 ### Multi-tool install
 ```bash
-# Clone the repo, then use the convert script for your tool:
 ./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
 ```
 
@@ -225,22 +263,9 @@ clawhub install autoresearch-agent
 
 ---
 
-## Proactive Triggers
-
-Flag these issues without being asked:
-
-- **No `evaluate.py` exists** → Experiment can't run. Offer to create one from a template.
-- **Target file has no git history** → `git init` and commit baseline first.
-- **Metric direction unclear** → Ask: is lower or higher better? Agent must know before starting.
-- **Time budget too short** → If evaluation takes longer than budget, experiments will always crash.
-- **`results.tsv` in `.gitignore`** → It shouldn't be. The log must persist across sessions.
-- **Agent modifying `evaluate.py`** → Hard stop. This invalidates all comparisons.
-
----
-
 ## Related Skills
 
-- **self-improving-agent**: Use when improving an agent's own memory/rules over time. NOT for structured experiment loops with metrics.
-- **senior-ml-engineer**: Use for ML architecture decisions and training setup. NOT for autonomous overnight loops.
-- **skill-security-auditor**: Use to audit skills before publishing. NOT for optimization loops.
-- **tdd-guide**: Use when you want tests to drive development. Complementary — can use tests as the evaluation function.
+- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops.
+- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
+- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
+- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.
engineering/autoresearch-agent/evaluators/benchmark_size.py — new file (56 lines):
```python
#!/usr/bin/env python3
"""Measure file, bundle, or Docker image size.

DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import os
import subprocess
import sys

# --- CONFIGURE ONE OF THESE ---
# Option 1: File size
TARGET_FILE = "dist/main.js"

# Option 2: Directory size (uncomment to use)
# TARGET_DIR = "dist/"

# Option 3: Docker image (uncomment to use)
# DOCKER_IMAGE = "myapp:latest"
# DOCKER_BUILD_CMD = "docker build -t myapp:latest ."

# Option 4: Build first, then measure (uncomment to use)
# BUILD_CMD = "npm run build"
# --- END CONFIG ---

# Build if needed
if "BUILD_CMD" in globals():
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True)
    if result.returncode != 0:
        print(f"Build failed: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)

# Measure
if "DOCKER_IMAGE" in globals():
    if "DOCKER_BUILD_CMD" in globals():
        subprocess.run(DOCKER_BUILD_CMD, shell=True, capture_output=True)
    result = subprocess.run(
        f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
        shell=True, capture_output=True, text=True
    )
    size_bytes = int(result.stdout.strip())
elif "TARGET_DIR" in globals():
    size_bytes = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, fns in os.walk(TARGET_DIR) for f in fns
    )
elif os.path.exists(TARGET_FILE):
    size_bytes = os.path.getsize(TARGET_FILE)
else:
    print(f"Target not found: {TARGET_FILE}", file=sys.stderr)
    sys.exit(1)

size_kb = size_bytes / 1024
size_mb = size_bytes / (1024 * 1024)

print(f"size_bytes: {size_bytes}")
print(f"size_kb: {size_kb:.1f}")
print(f"size_mb: {size_mb:.2f}")
```
engineering/autoresearch-agent/evaluators/benchmark_speed.py — new file (40 lines):
```python
#!/usr/bin/env python3
"""Measure execution speed of a target function or command.

DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import statistics
import subprocess
import sys
import time

# --- CONFIGURE THESE ---
COMMAND = "python src/module.py"  # Command to benchmark
RUNS = 5                          # Number of runs
WARMUP = 1                        # Warmup runs (not counted)
# --- END CONFIG ---

times = []

# Warmup
for _ in range(WARMUP):
    subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)

# Benchmark
for i in range(RUNS):
    t0 = time.perf_counter()
    result = subprocess.run(COMMAND, shell=True, capture_output=True, timeout=120)
    elapsed = (time.perf_counter() - t0) * 1000  # ms

    if result.returncode != 0:
        print(f"Run {i+1} failed (exit {result.returncode})", file=sys.stderr)
        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)

    times.append(elapsed)

p50 = statistics.median(times)
p95 = sorted(times)[int(len(times) * 0.95)] if len(times) >= 5 else max(times)

print(f"p50_ms: {p50:.2f}")
print(f"p95_ms: {p95:.2f}")
print(f"runs: {RUNS}")
```
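One subtlety worth knowing about the listing above: with the default `RUNS = 5`, the p95 index `int(len(times) * 0.95)` is 4, so `p95_ms` is simply the slowest run — coarse but honest at small sample sizes:

```python
# Five hypothetical timings in milliseconds
times = [100.0, 102.0, 98.0, 110.0, 101.0]
idx = int(len(times) * 0.95)   # int(4.75) → 4, the last index of the sorted list
p95 = sorted(times)[idx]
assert idx == 4
assert p95 == max(times) == 110.0
```

Raising `RUNS` is the straightforward way to make the percentile meaningful.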
engineering/autoresearch-agent/evaluators/build_speed.py — new file (39 lines):
```python
#!/usr/bin/env python3
"""Measure build/compile time.

DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import statistics
import subprocess
import sys
import time

# --- CONFIGURE THESE ---
BUILD_CMD = "npm run build"  # or: docker build -t test .
CLEAN_CMD = ""               # optional: npm run clean (run before each build)
RUNS = 3                     # Number of builds to average
# --- END CONFIG ---

times = []

for i in range(RUNS):
    # Clean if configured
    if CLEAN_CMD:
        subprocess.run(CLEAN_CMD, shell=True, capture_output=True, timeout=60)

    t0 = time.perf_counter()
    result = subprocess.run(BUILD_CMD, shell=True, capture_output=True, timeout=600)
    elapsed = time.perf_counter() - t0

    if result.returncode != 0:
        print(f"Build {i+1} failed (exit {result.returncode})", file=sys.stderr)
        print(f"stderr: {result.stderr.decode()[:200]}", file=sys.stderr)
        sys.exit(1)

    times.append(elapsed)

avg = statistics.mean(times)
median = statistics.median(times)

print(f"build_seconds: {median:.2f}")
print(f"build_avg: {avg:.2f}")
print(f"runs: {RUNS}")
```
engineering/autoresearch-agent/evaluators/llm_judge_content.py — new file (72 lines):
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""LLM judge for content quality (headlines, titles, descriptions).
|
||||||
|
Uses the user's existing CLI tool (claude, codex, gemini) for evaluation.
|
||||||
|
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
|
||||||
|
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# --- CONFIGURE THESE ---
|
||||||
|
TARGET_FILE = "content/titles.md" # File being optimized
|
||||||
|
CLI_TOOL = "claude" # or: codex, gemini
|
||||||
|
# --- END CONFIG ---
|
||||||
|
|
||||||
|
# The judge prompt is FIXED — the agent cannot change how it's evaluated
|
||||||
|
JUDGE_PROMPT = """You are a content quality evaluator. Score the following content strictly.
|
||||||
|
|
||||||
|
Criteria (each scored 1-10):
|
||||||
|
|
||||||
|
1. CURIOSITY GAP — Does this make you want to click? Is there an information gap
that can only be resolved by reading? Generic titles score 1-3. Specific,
intriguing titles score 7-10.

2. SPECIFICITY — Are there concrete numbers, tools, or details? "How I improved
performance" = 2. "How I reduced API latency from 800ms to 185ms" = 9.

3. EMOTIONAL PULL — Does it trigger curiosity, surprise, fear of missing out,
or recognition? Flat titles score 1-3. Emotionally charged titles score 7-10.

4. SCROLL-STOP POWER — Would this stop someone scrolling through a feed or
search results? Would they pause on this headline? Rate honestly.

5. SEO KEYWORD PRESENCE — Are searchable, high-intent terms present naturally?
Keyword-stuffed = 3. Natural integration of search terms = 8-10.

Output EXACTLY this format (nothing else):
curiosity: <score>
specificity: <score>
emotional: <score>
scroll_stop: <score>
seo: <score>
ctr_score: <average of all 5 scores>

Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""

content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"

# Call the user's CLI tool
result = subprocess.run(
    [CLI_TOOL, "-p", full_prompt],
    capture_output=True, text=True, timeout=120
)

if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

# Parse output — echo the per-criterion scores and the ctr_score line
output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith(("curiosity:", "specificity:", "emotional:", "scroll_stop:", "seo:", "ctr_score:")):
        print(line)

# Verify ctr_score was found
if "ctr_score:" not in output:
    print("Could not parse ctr_score from LLM output", file=sys.stderr)
    print(f"Raw output: {output[:500]}", file=sys.stderr)
    sys.exit(1)
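Downstream, the loop runner only needs the `ctr_score:` line from this stdout. A minimal sketch of that parsing step, assuming the runner reads the evaluator's stdout (`parse_metric` is an illustrative helper, not the actual run_experiment.py API):

```python
import re

def parse_metric(stdout: str, metric: str) -> float:
    """Extract '<metric>: <value>' from evaluator stdout.

    Hypothetical helper for illustration; the real runner may parse differently.
    """
    match = re.search(
        rf"^{re.escape(metric)}:\s*([-+]?\d+(?:\.\d+)?)", stdout, re.MULTILINE
    )
    if match is None:
        raise ValueError(f"metric {metric!r} not found in evaluator output")
    return float(match.group(1))

# Example: output in the format the CTR judge is asked to emit
sample = "curiosity: 7\nspecificity: 9\nemotional: 6\nscroll_stop: 7\nseo: 8\nctr_score: 7.4\n"
print(parse_metric(sample, "ctr_score"))  # 7.4
```

The same helper works for any evaluator in this commit, since they all print `metric_name: value` lines.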
88 engineering/autoresearch-agent/evaluators/llm_judge_copy.py Normal file
@@ -0,0 +1,88 @@
#!/usr/bin/env python3
"""LLM judge for marketing copy (social posts, ads, emails).
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "posts.md"  # Copy being optimized
CLI_TOOL = "claude"       # or: codex, gemini
PLATFORM = "twitter"      # twitter, linkedin, instagram, email, ad
# --- END CONFIG ---

JUDGE_PROMPTS = {
    "twitter": """Score this Twitter/X post strictly:
1. HOOK (1-10) — Does the first line stop the scroll?
2. VALUE (1-10) — Does it provide insight, entertainment, or utility?
3. ENGAGEMENT (1-10) — Would people reply, retweet, or like?
4. BREVITY (1-10) — Is every word earning its place? No filler?
5. CTA (1-10) — Is there a clear next action (even implicit)?""",

    "linkedin": """Score this LinkedIn post strictly:
1. HOOK (1-10) — Does the first line make you click "see more"?
2. STORYTELLING (1-10) — Is there a narrative arc or just statements?
3. CREDIBILITY (1-10) — Does it demonstrate expertise without bragging?
4. ENGAGEMENT (1-10) — Would professionals comment or share?
5. CTA (1-10) — Does it invite discussion or action?""",

    "instagram": """Score this Instagram caption strictly:
1. HOOK (1-10) — Does the first line grab attention?
2. RELATABILITY (1-10) — Does the audience see themselves in this?
3. VISUAL MATCH (1-10) — Does the copy complement visual content?
4. HASHTAG STRATEGY (1-10) — Are hashtags relevant and not spammy?
5. CTA (1-10) — Does it encourage saves, shares, or comments?""",

    "email": """Score this email subject + preview strictly:
1. OPEN INCENTIVE (1-10) — Would you open this in a crowded inbox?
2. SPECIFICITY (1-10) — Is it concrete or vague?
3. URGENCY (1-10) — Is there a reason to open now vs later?
4. PERSONALIZATION (1-10) — Does it feel written for someone, not everyone?
5. PREVIEW SYNC (1-10) — Does the preview text complement the subject?""",

    "ad": """Score this ad copy strictly:
1. ATTENTION (1-10) — Does it stop someone scrolling past ads?
2. DESIRE (1-10) — Does it create want for the product/service?
3. PROOF (1-10) — Is there credibility (numbers, social proof)?
4. ACTION (1-10) — Is the CTA clear and compelling?
5. OBJECTION HANDLING (1-10) — Does it preempt "why not"?""",
}

platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])

JUDGE_PROMPT = f"""{platform_prompt}

Output EXACTLY this format:
criterion_1: <score>
criterion_2: <score>
criterion_3: <score>
criterion_4: <score>
criterion_5: <score>
engagement_score: <average of all 5>

Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""

content = Path(TARGET_FILE).read_text()
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"

result = subprocess.run(
    [CLI_TOOL, "-p", full_prompt],
    capture_output=True, text=True, timeout=120
)

if result.returncode != 0:
    print(f"LLM judge failed: {result.stderr[:200]}", file=sys.stderr)
    sys.exit(1)

output = result.stdout
for line in output.splitlines():
    line = line.strip()
    if line.startswith("engagement_score:") or line.startswith("criterion_"):
        print(line)

if "engagement_score:" not in output:
    print("Could not parse engagement_score from LLM output", file=sys.stderr)
    print(f"Raw: {output[:500]}", file=sys.stderr)
    sys.exit(1)
@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""LLM judge for prompt/instruction quality.
Uses the user's existing CLI tool for evaluation.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import json
import subprocess
import sys
from pathlib import Path

# --- CONFIGURE THESE ---
TARGET_FILE = "prompt.md"             # Prompt being optimized
TEST_CASES_FILE = "tests/cases.json"  # Test cases: [{"input": "...", "expected": "..."}]
CLI_TOOL = "claude"                   # or: codex, gemini
# --- END CONFIG ---

JUDGE_PROMPT_TEMPLATE = """You are evaluating a system prompt's effectiveness.

SYSTEM PROMPT BEING TESTED:
{prompt}

TEST INPUT:
{input}

EXPECTED OUTPUT (reference):
{expected}

ACTUAL OUTPUT:
{actual}

Score the actual output on these criteria (each 1-10):
1. ACCURACY — Does it match the expected output's intent and facts?
2. COMPLETENESS — Does it cover all required elements?
3. CLARITY — Is it well-structured and easy to understand?
4. INSTRUCTION_FOLLOWING — Does it follow the system prompt's guidelines?

Output EXACTLY: quality_score: <average of all 4>
Nothing else."""

prompt = Path(TARGET_FILE).read_text()
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())

scores = []

for i, case in enumerate(test_cases):
    # Generate output using the prompt
    gen_prompt = f"{prompt}\n\n{case['input']}"
    gen_result = subprocess.run(
        [CLI_TOOL, "-p", gen_prompt],
        capture_output=True, text=True, timeout=60
    )
    if gen_result.returncode != 0:
        print(f"Generation failed for case {i+1}", file=sys.stderr)
        scores.append(0)
        continue

    actual = gen_result.stdout.strip()

    # Judge the output
    judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
        prompt=prompt[:500],
        input=case["input"],
        expected=case.get("expected", "N/A"),
        actual=actual[:500]
    )

    judge_result = subprocess.run(
        [CLI_TOOL, "-p", judge_prompt],
        capture_output=True, text=True, timeout=60
    )

    if judge_result.returncode != 0:
        scores.append(0)
        continue

    # Parse score
    for line in judge_result.stdout.splitlines():
        if "quality_score:" in line:
            try:
                score = float(line.split(":")[-1].strip())
                scores.append(score)
            except ValueError:
                scores.append(0)
            break
    else:
        scores.append(0)

    print(f"  Case {i+1}/{len(test_cases)}: {scores[-1]:.1f}", file=sys.stderr)

if not scores:
    print("No test cases evaluated", file=sys.stderr)
    sys.exit(1)

avg = sum(scores) / len(scores)
quality = avg * 10  # Scale to 0-100

print(f"quality_score: {quality:.2f}")
print(f"cases_tested: {len(scores)}")
print(f"avg_per_case: {avg:.2f}")
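The final aggregation in this evaluator scales the 1-10 per-case average to a 0-100 `quality_score`. A standalone worked example of that arithmetic (the scores are synthetic, for illustration only):

```python
scores = [8.0, 6.5, 9.0, 0.0]  # per-case judge scores (0 = failed case)
avg = sum(scores) / len(scores)
quality = avg * 10  # scale the 1-10 judge range to 0-100

print(f"quality_score: {quality:.2f}")  # quality_score: 58.75
print(f"cases_tested: {len(scores)}")
print(f"avg_per_case: {avg:.2f}")
```

Note that failed cases count as zeros rather than being skipped, so crashes drag the score down instead of silently shrinking the test set.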
52 engineering/autoresearch-agent/evaluators/memory_usage.py Normal file
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import platform
import subprocess
import sys

# --- CONFIGURE THESE ---
COMMAND = "python src/module.py"  # Command to measure
# --- END CONFIG ---

system = platform.system()

if system == "Linux":
    # Use /usr/bin/time -v for peak RSS (reported in kilobytes)
    result = subprocess.run(
        f"/usr/bin/time -v {COMMAND}",
        shell=True, capture_output=True, text=True, timeout=300
    )
    output = result.stderr
    for line in output.splitlines():
        if "Maximum resident set size" in line:
            kb = int(line.split(":")[-1].strip())
            mb = kb / 1024
            print(f"peak_mb: {mb:.1f}")
            print(f"peak_kb: {kb}")
            sys.exit(0)
    print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
    sys.exit(1)

elif system == "Darwin":
    # macOS: use /usr/bin/time -l (reports bytes)
    result = subprocess.run(
        f"/usr/bin/time -l {COMMAND}",
        shell=True, capture_output=True, text=True, timeout=300
    )
    output = result.stderr
    for line in output.splitlines():
        if "maximum resident set size" in line.lower():
            val = int(line.strip().split()[0])
            mb = val / (1024 * 1024)
            print(f"peak_mb: {mb:.1f}")
            sys.exit(0)
    print("Could not parse memory from time output", file=sys.stderr)
    sys.exit(1)

else:
    print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
    sys.exit(1)
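When the measured workload is itself Python, an alternative that avoids shelling out to `/usr/bin/time` is the stdlib `resource` module. A Unix-only sketch (note `ru_maxrss` is kilobytes on Linux but bytes on macOS — the same unit discrepancy the evaluator above handles):

```python
import platform
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    divisor = 1024 * 1024 if platform.system() == "Darwin" else 1024
    return peak / divisor

data = [bytearray(1024) for _ in range(10_000)]  # allocate ~10 MB to make the peak visible
print(f"peak_mb: {peak_rss_mb():.1f}")
```

The evaluator shells out instead so it can measure any command (Node builds, Go binaries, etc.), not just Python.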
55 engineering/autoresearch-agent/evaluators/test_pass_rate.py Normal file
@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""Measure test suite pass rate.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import re
import subprocess
import sys

# --- CONFIGURE THESE ---
TEST_CMD = "pytest tests/ --tb=no -q"  # Test command
# --- END CONFIG ---

result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
output = result.stdout + "\n" + result.stderr

# Parse pytest's short summary, e.g. "5 passed, 2 failed in 1.23s"
passed = failed = errors = 0

match = re.search(r"(\d+) passed", output)
if match:
    passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
    failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
    errors = int(match.group(1))

total = passed + failed + errors
if total == 0:
    # Fall back to unittest format: "Ran X tests"
    match = re.search(r"Ran (\d+) test", output)
    if match:
        total = int(match.group(1))
        if result.returncode == 0:
            passed = total
        else:
            # Count failures from output
            fail_match = re.search(r"FAILED \(failures=(\d+)", output)
            if fail_match:
                failed = int(fail_match.group(1))
                passed = total - failed

if total == 0:
    print("Could not parse test results", file=sys.stderr)
    print(f"Output: {output[:500]}", file=sys.stderr)
    sys.exit(1)

rate = passed / total
print(f"pass_rate: {rate:.4f}")
print(f"passed: {passed}")
print(f"failed: {failed}")
print(f"total: {total}")
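The regexes in this evaluator can be checked against a typical pytest summary line; a standalone sketch with a made-up summary string:

```python
import re

# The kind of short summary `pytest --tb=no -q` prints (example string)
summary = "5 passed, 2 failed, 1 error in 1.23s"

passed = failed = errors = 0
if m := re.search(r"(\d+) passed", summary):
    passed = int(m.group(1))
if m := re.search(r"(\d+) failed", summary):
    failed = int(m.group(1))
if m := re.search(r"(\d+) error", summary):
    errors = int(m.group(1))

total = passed + failed + errors
print(f"pass_rate: {passed / total:.4f}")  # pass_rate: 0.6250
```

Because `(\d+) error` also matches "errors", both singular and plural summaries parse correctly.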
@@ -1,175 +1,255 @@
# Experiment Domains Guide

## Domain: Engineering

### Code Speed Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --evaluator benchmark_speed
```

**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.
**Cost:** Free — just runs benchmarks.
**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.

### Bundle Size Reduction

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name bundle-size \
  --target webpack.config.js \
  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
  --metric size_bytes \
  --direction lower \
  --evaluator benchmark_size
```

Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.

### Test Pass Rate

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name fix-flaky-tests \
  --target src/utils/parser.py \
  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
  --metric pass_rate \
  --direction higher \
  --evaluator test_pass_rate
```

### Docker Build Speed

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name docker-build \
  --target Dockerfile \
  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
  --metric build_seconds \
  --direction lower \
  --evaluator build_speed
```

### Memory Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name memory-usage \
  --target src/processor.py \
  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
  --metric peak_mb \
  --direction lower \
  --evaluator memory_usage
```

### ML Training (Karpathy-style)

Requires NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name ml-training \
  --target train.py \
  --eval "uv run train.py" \
  --metric val_bpb \
  --direction lower \
  --time-budget 5
```

---

## Domain: Marketing

### Medium Article Headlines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.

**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.
**Cost:** Uses your CLI subscription (Claude Max = unlimited).
**Speed:** ~2 min/experiment, ~30/hour.

### Social Media Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name twitter-engagement \
  --target social/tweets.md \
  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "twitter"` (or linkedin, instagram).

### Email Subject Lines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name email-open-rate \
  --target emails/subjects.md \
  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "email"`.

### Ad Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name ad-copy-q2 \
  --target ads/google-search.md \
  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "ad"`.

---

## Domain: Content

### Article Structure & Readability

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name article-structure \
  --target drafts/my-article.md \
  --eval "python .autoresearch/content/article-structure/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

### SEO Descriptions

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name seo-meta \
  --target seo/descriptions.md \
  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

---

## Domain: Prompts

### System Prompt Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name support-bot \
  --target prompts/support-system.md \
  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```

Requires `tests/cases.json` with test inputs and expected outputs:

```json
[
  {
    "input": "I can't log in to my account",
    "expected": "Ask for email, check account status, offer password reset"
  },
  {
    "input": "How do I cancel my subscription?",
    "expected": "Empathetic response, explain cancellation steps, offer retention"
  }
]
```

### Agent Skill Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name skill-improvement \
  --target SKILL.md \
  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```

---

## Choosing Your Domain

| I want to... | Domain | Evaluator | Cost |
|-------------|--------|-----------|------|
| Speed up my code | engineering | benchmark_speed | Free |
| Shrink my bundle | engineering | benchmark_size | Free |
| Fix flaky tests | engineering | test_pass_rate | Free |
| Speed up Docker builds | engineering | build_speed | Free |
| Reduce memory usage | engineering | memory_usage | Free |
| Train ML models | engineering | (custom) | Free + GPU |
| Write better headlines | marketing | llm_judge_content | Subscription |
| Improve social posts | marketing | llm_judge_copy | Subscription |
| Optimize email subjects | marketing | llm_judge_copy | Subscription |
| Improve ad copy | marketing | llm_judge_copy | Subscription |
| Optimize article structure | content | llm_judge_content | Subscription |
| Improve SEO descriptions | content | llm_judge_content | Subscription |
| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
| Improve agent skills | prompts | llm_judge_prompt | Subscription |

**First time?** Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.
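Every domain plugs into the same keep/discard loop: run the fixed evaluator, compare the metric against the best so far, commit improvements, reset failures. A minimal sketch of that decision logic (illustrative names, not the actual run_experiment.py internals):

```python
import subprocess

def run_eval(eval_cmd: str, metric: str) -> float:
    """Run the fixed evaluator and extract the metric from its stdout."""
    out = subprocess.run(eval_cmd, shell=True, capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith(f"{metric}:"):
            return float(line.split(":", 1)[1])
    raise RuntimeError(f"metric {metric!r} not in evaluator output")

def keep_or_discard(new: float, best: float, direction: str) -> bool:
    """Karpathy-style decision: keep only strict improvements."""
    return new < best if direction == "lower" else new > best

# In the loop: if keep_or_discard(score, best, direction):
#     git commit -am "<description>"   # keep
# else:
#     git reset --hard                 # discard
```

The agent never touches the evaluator itself — only the target file — which is what makes the metric comparisons meaningful across iterations.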
@@ -1,125 +1,389 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Results Viewer

View experiment results in multiple formats: terminal, CSV, Markdown.
Supports single experiment, domain, or cross-experiment dashboard.

Usage:
    python scripts/log_results.py --experiment engineering/api-speed
    python scripts/log_results.py --domain engineering
    python scripts/log_results.py --dashboard
    python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
    python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
    python scripts/log_results.py --dashboard --format markdown --output dashboard.md
"""

import argparse
import csv
import io
import sys
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg."""
    cfg_file = experiment_dir / "config.cfg"
    config = {}
    if cfg_file.exists():
        for line in cfg_file.read_text().splitlines():
            if ":" in line:
                k, v = line.split(":", 1)
                config[k.strip()] = v.strip()
    return config


def load_results(experiment_dir):
    """Load results.tsv into a list of dicts."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return []
    results = []
    for line in tsv.read_text().splitlines()[1:]:  # skip header
        parts = line.split("\t")
        if len(parts) >= 4:
            try:
                metric = float(parts[1]) if parts[1] != "N/A" else None
            except ValueError:
                metric = None
            results.append({
                "commit": parts[0],
                "metric": metric,
                "status": parts[2],
                "description": parts[3],
            })
    return results


def compute_stats(results, direction):
    """Compute statistics from results."""
    keeps = [r for r in results if r["status"] == "keep"]
    discards = [r for r in results if r["status"] == "discard"]
    crashes = [r for r in results if r["status"] == "crash"]

    valid_keeps = [r for r in keeps if r["metric"] is not None]
    baseline = valid_keeps[0]["metric"] if valid_keeps else None
    if valid_keeps:
        best = min(r["metric"] for r in valid_keeps) if direction == "lower" else max(r["metric"] for r in valid_keeps)
    else:
        best = None

    pct_change = None
    if baseline and best and baseline != 0:
        if direction == "lower":
            pct_change = (baseline - best) / baseline * 100
        else:
            pct_change = (best - baseline) / baseline * 100

    return {
        "total": len(results),
        "keeps": len(keeps),
        "discards": len(discards),
        "crashes": len(crashes),
        "baseline": baseline,
        "best": best,
        "pct_change": pct_change,
    }


# --- Terminal Output ---

def print_experiment(experiment_dir, experiment_path):
    """Print single experiment results to terminal."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")

    if not results:
        print(f"No results for {experiment_path}")
        return
|
stats = compute_stats(results, direction)
|
||||||
print("─" * 60)
|
|
||||||
for r in results:
|
|
||||||
metric_str = f"{r['metric']:.6f}" if r['metric'] is not None else "crash "
|
|
||||||
status_icon = {"keep": "✅", "discard": "❌", "crash": "💥"}.get(r["status"], "?")
|
|
||||||
print(f"{r['commit']:8} {metric_str:10} {status_icon} {r['description'][:40]}")
|
|
||||||
|
|
||||||
|
print(f"\n{'─' * 65}")
|
||||||
|
print(f" {experiment_path}")
|
||||||
|
print(f" Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
|
||||||
|
print(f"{'─' * 65}")
|
||||||
|
print(f" Total: {stats['total']} | Keep: {stats['keeps']} | Discard: {stats['discards']} | Crash: {stats['crashes']}")
|
||||||
|
|
||||||
|
if stats["baseline"] is not None and stats["best"] is not None:
|
||||||
|
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
|
||||||
|
print(f" Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")
|
||||||
|
|
||||||
|
print(f"\n {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
|
||||||
|
print(f" {'─' * 60}")
|
||||||
|
for r in results:
|
||||||
|
m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A "
|
||||||
|
icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
|
||||||
|
print(f" {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
|
||||||
|
def print_dashboard(root):
|
||||||
|
"""Print cross-experiment dashboard."""
|
||||||
|
experiments = []
|
||||||
|
for domain_dir in sorted(root.iterdir()):
|
||||||
|
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
|
||||||
|
continue
|
||||||
|
for exp_dir in sorted(domain_dir.iterdir()):
|
||||||
|
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
|
||||||
|
continue
|
||||||
|
config = load_config(exp_dir)
|
||||||
|
results = load_results(exp_dir)
|
||||||
|
direction = config.get("metric_direction", "lower")
|
||||||
|
stats = compute_stats(results, direction)
|
||||||
|
|
||||||
|
# Determine status
|
||||||
|
status = "idle"
|
||||||
|
if stats["total"] > 0:
|
||||||
|
tsv = exp_dir / "results.tsv"
|
||||||
|
if tsv.exists():
|
||||||
|
import time
|
||||||
|
age_hours = (time.time() - tsv.stat().st_mtime) / 3600
|
||||||
|
status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"
|
||||||
|
|
||||||
|
best_str = f"{stats['best']:.4f}" if stats["best"] is not None else "—"
|
||||||
|
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "—"
|
||||||
|
|
||||||
|
experiments.append({
|
||||||
|
"domain": domain_dir.name,
|
||||||
|
"name": exp_dir.name,
|
||||||
|
"runs": stats["total"],
|
||||||
|
"kept": stats["keeps"],
|
||||||
|
"best": best_str,
|
||||||
|
"change": pct_str,
|
||||||
|
"status": status,
|
||||||
|
"metric": config.get("metric", "?"),
|
||||||
|
})
|
||||||
|
|
||||||
|
if not experiments:
|
||||||
|
print("No experiments found.")
|
||||||
|
return experiments
|
||||||
|
|
||||||
|
print(f"\n{'─' * 90}")
|
||||||
|
print(f" autoresearch — Dashboard")
|
||||||
|
print(f"{'─' * 90}")
|
||||||
|
print(f" {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} {'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
|
||||||
|
print(f" {'─' * 85}")
|
||||||
|
for e in experiments:
|
||||||
|
print(f" {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} {e['best']:>12} {e['change']:>10} {e['status']:<8}")
|
||||||
|
print()
|
||||||
|
return experiments
|
||||||
|
|
||||||
|
|
||||||
|
# --- CSV Export ---
|
||||||
|
|
||||||
|
def export_experiment_csv(experiment_dir, experiment_path):
|
||||||
|
"""Export single experiment as CSV string."""
|
||||||
|
config = load_config(experiment_dir)
|
||||||
|
results = load_results(experiment_dir)
|
||||||
|
direction = config.get("metric_direction", "lower")
|
||||||
|
stats = compute_stats(results, direction)
|
||||||
|
|
||||||
|
buf = io.StringIO()
|
||||||
|
writer = csv.writer(buf)
|
||||||
|
|
||||||
|
# Header with metadata
|
||||||
|
writer.writerow(["# Experiment", experiment_path])
|
||||||
|
writer.writerow(["# Target", config.get("target", "")])
|
||||||
|
writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
|
||||||
|
if stats["baseline"] is not None:
|
||||||
|
writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
|
||||||
|
if stats["best"] is not None:
|
||||||
|
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
|
||||||
|
writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
|
||||||
|
writer.writerow(["# Total", stats["total"]])
|
||||||
|
writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
|
||||||
|
writer.writerow([])
|
||||||
|
|
||||||
|
writer.writerow(["Commit", "Metric", "Status", "Description"])
|
||||||
|
for r in results:
|
||||||
|
m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
|
||||||
|
writer.writerow([r["commit"], m, r["status"], r["description"]])
|
||||||
|
|
||||||
|
return buf.getvalue()
|
||||||
|
|
||||||
|
|
||||||
|
def export_dashboard_csv(root):
|
||||||
|
"""Export dashboard as CSV string."""
|
||||||
|
experiments = []
|
||||||
|
for domain_dir in sorted(root.iterdir()):
|
||||||
|
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
|
||||||
|
continue
|
||||||
|
for exp_dir in sorted(domain_dir.iterdir()):
|
||||||
|
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
|
||||||
|
continue
|
||||||
|
config = load_config(exp_dir)
|
||||||
|
results = load_results(exp_dir)
|
||||||
|
direction = config.get("metric_direction", "lower")
|
||||||
|
stats = compute_stats(results, direction)
|
||||||
|
best_str = f"{stats['best']:.6f}" if stats["best"] else ""
|
||||||
|
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
|
||||||
|
experiments.append([
|
||||||
|
domain_dir.name, exp_dir.name, config.get("metric", ""),
|
||||||
|
stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
|
||||||
|
best_str, pct_str
|
||||||
|
])
|
||||||
|
|
||||||
|
buf = io.StringIO()
|
||||||
|
writer = csv.writer(buf)
|
||||||
|
writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept", "Discarded", "Crashed", "Best", "Change"])
|
||||||
|
for e in experiments:
|
||||||
|
writer.writerow(e)
|
||||||
|
return buf.getvalue()
|
||||||
|
|
||||||
|
|
||||||
|
# --- Markdown Export ---
|
||||||
|
|
||||||
|
def export_experiment_markdown(experiment_dir, experiment_path):
|
||||||
|
"""Export single experiment as Markdown string."""
|
||||||
|
config = load_config(experiment_dir)
|
||||||
|
results = load_results(experiment_dir)
|
||||||
|
direction = config.get("metric_direction", "lower")
|
||||||
|
metric_name = config.get("metric", "metric")
|
||||||
|
stats = compute_stats(results, direction)
|
||||||
|
|
||||||
|
lines = []
|
||||||
|
lines.append(f"# Autoresearch: {experiment_path}\n")
|
||||||
|
lines.append(f"**Target:** `{config.get('target', '?')}` ")
|
||||||
|
lines.append(f"**Metric:** `{metric_name}` ({direction} is better) ")
|
||||||
|
lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")
|
||||||
|
|
||||||
|
if stats["baseline"] is not None and stats["best"] is not None:
|
||||||
|
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
|
||||||
|
lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
|
||||||
|
|
||||||
|
lines.append(f"| Commit | Metric | Status | Description |")
|
||||||
|
lines.append(f"|--------|--------|--------|-------------|")
|
||||||
|
for r in results:
|
||||||
|
m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
|
||||||
|
lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
|
||||||
|
lines.append("")
|
||||||
|
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def export_dashboard_markdown(root):
|
||||||
|
"""Export dashboard as Markdown string."""
|
||||||
|
lines = []
|
||||||
|
lines.append("# Autoresearch Dashboard\n")
|
||||||
|
lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
|
||||||
|
lines.append("|--------|-----------|--------|------|------|------|--------|--------|")
|
||||||
|
|
||||||
|
for domain_dir in sorted(root.iterdir()):
|
||||||
|
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
|
||||||
|
continue
|
||||||
|
for exp_dir in sorted(domain_dir.iterdir()):
|
||||||
|
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
|
||||||
|
continue
|
||||||
|
config = load_config(exp_dir)
|
||||||
|
results = load_results(exp_dir)
|
||||||
|
direction = config.get("metric_direction", "lower")
|
||||||
|
stats = compute_stats(results, direction)
|
||||||
|
best = f"`{stats['best']:.4f}`" if stats["best"] else "—"
|
||||||
|
pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else "—"
|
||||||
|
|
||||||
|
import time
|
||||||
|
tsv = exp_dir / "results.tsv"
|
||||||
|
status = "idle"
|
||||||
|
if tsv.exists() and stats["total"] > 0:
|
||||||
|
age_h = (time.time() - tsv.stat().st_mtime) / 3600
|
||||||
|
status = "active" if age_h < 1 else "paused" if age_h < 24 else "done"
|
||||||
|
|
||||||
|
lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | {stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")
|
||||||
|
|
||||||
|
lines.append("")
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
# --- Main ---
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
parser = argparse.ArgumentParser()
|
parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
|
||||||
parser.add_argument("--summary", action="store_true")
|
parser.add_argument("--experiment", help="Show one experiment: domain/name")
|
||||||
parser.add_argument("--best", action="store_true")
|
parser.add_argument("--domain", help="Show all experiments in a domain")
|
||||||
parser.add_argument("--history", action="store_true")
|
parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
|
||||||
parser.add_argument("--record", nargs=4, metavar=("COMMIT", "METRIC", "STATUS", "DESC"))
|
parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
|
||||||
parser.add_argument("--path", default=".")
|
help="Output format (default: terminal)")
|
||||||
parser.add_argument("--metric", default="metric")
|
parser.add_argument("--output", "-o", help="Write to file instead of stdout")
|
||||||
parser.add_argument("--direction", default="lower", choices=["lower", "higher"])
|
parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
path = Path(args.path).resolve()
|
root = find_autoresearch_root()
|
||||||
|
if root is None:
|
||||||
|
print("No .autoresearch/ found. Run setup_experiment.py first.")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
if args.record:
|
output_text = None
|
||||||
commit, metric, status, desc = args.record
|
|
||||||
tsv = path / "results.tsv"
|
|
||||||
if not tsv.exists():
|
|
||||||
tsv.write_text("commit\tmetric\tstatus\tdescription\n")
|
|
||||||
with open(tsv, "a") as f:
|
|
||||||
f.write(f"{commit}\t{metric}\t{status}\t{desc}\n")
|
|
||||||
print(f"✓ Logged: {commit} {metric} {status}")
|
|
||||||
return
|
|
||||||
|
|
||||||
results = load_results(path)
|
# Single experiment
|
||||||
|
if args.experiment:
|
||||||
|
experiment_dir = root / args.experiment
|
||||||
|
if not experiment_dir.exists():
|
||||||
|
print(f"Experiment not found: {args.experiment}")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
if args.history:
|
if args.format == "csv":
|
||||||
print_history(results)
|
output_text = export_experiment_csv(experiment_dir, args.experiment)
|
||||||
elif args.best:
|
elif args.format == "markdown":
|
||||||
keeps = [r for r in results if r["status"] == "keep" and r["metric"]]
|
output_text = export_experiment_markdown(experiment_dir, args.experiment)
|
||||||
if not keeps:
|
else:
|
||||||
print("No successful experiments yet.")
|
print_experiment(experiment_dir, args.experiment)
|
||||||
return
|
return
|
||||||
best = min(keeps, key=lambda r: r["metric"]) if args.direction == "lower" else max(keeps, key=lambda r: r["metric"])
|
|
||||||
print(f"Best: {best['metric']:.6f} (commit: {best['commit']}) — {best['description']}")
|
# Domain
|
||||||
|
elif args.domain:
|
||||||
|
domain_dir = root / args.domain
|
||||||
|
if not domain_dir.exists():
|
||||||
|
print(f"Domain not found: {args.domain}")
|
||||||
|
sys.exit(1)
|
||||||
|
for exp_dir in sorted(domain_dir.iterdir()):
|
||||||
|
if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
|
||||||
|
if args.format == "terminal":
|
||||||
|
print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
|
||||||
|
# For CSV/MD, fall through to dashboard with domain filter
|
||||||
|
if args.format != "terminal":
|
||||||
|
# Use dashboard export filtered to domain
|
||||||
|
output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
|
||||||
|
else:
|
||||||
|
return
|
||||||
|
|
||||||
|
# Dashboard
|
||||||
|
elif args.dashboard or args.all:
|
||||||
|
if args.format == "csv":
|
||||||
|
output_text = export_dashboard_csv(root)
|
||||||
|
elif args.format == "markdown":
|
||||||
|
output_text = export_dashboard_markdown(root)
|
||||||
|
else:
|
||||||
|
print_dashboard(root)
|
||||||
|
return
|
||||||
|
|
||||||
else:
|
else:
|
||||||
print_summary(results, args.metric, args.direction)
|
# Default: dashboard
|
||||||
|
if args.format == "terminal":
|
||||||
|
print_dashboard(root)
|
||||||
|
return
|
||||||
|
output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
|
||||||
|
|
||||||
|
# Write output
|
||||||
|
if output_text:
|
||||||
|
if args.output:
|
||||||
|
Path(args.output).write_text(output_text)
|
||||||
|
print(f"Written to {args.output}")
|
||||||
|
else:
|
||||||
|
print(output_text)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||
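The results viewer above parses a four-column `results.tsv` (commit, metric, status, description) and derives baseline/best/percent-change from the kept rows. A minimal standalone sketch of that parsing and stats logic, assuming the TSV layout shown in the diff (the sample rows here are made up for illustration):

```python
# Sketch of the results.tsv parsing and stats flow from the log_results.py diff.
# The four-column layout and the "lower is better" default come from the diff;
# the sample data is fabricated.

def load_results_from_text(text):
    results = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split("\t")
        if len(parts) >= 4:
            try:
                metric = float(parts[1]) if parts[1] != "N/A" else None
            except ValueError:
                metric = None
            results.append({"commit": parts[0], "metric": metric,
                            "status": parts[2], "description": parts[3]})
    return results

def compute_stats(results, direction="lower"):
    keeps = [r for r in results if r["status"] == "keep"]
    valid = [r["metric"] for r in keeps if r["metric"] is not None]
    baseline = valid[0] if valid else None
    best = (min(valid) if direction == "lower" else max(valid)) if valid else None
    pct = None
    if baseline and best:
        pct = ((baseline - best) if direction == "lower" else (best - baseline)) / baseline * 100
    return {"total": len(results), "keeps": len(keeps),
            "baseline": baseline, "best": best, "pct_change": pct}

sample = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t2.500000\tkeep\tbaseline\n"
    "e4f5a6b\t2.600000\tdiscard\tno_improvement\n"
    "c7d8e9f\t2.000000\tkeep\tfaster_batching\n"
    "0a1b2c3\tN/A\tcrash\ttimeout_300s\n"
)
stats = compute_stats(load_results_from_text(sample))
```

Crashed rows carry `N/A` as their metric, so they are excluded from the baseline/best math but still counted in the totals.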
@@ -2,17 +2,15 @@
 """
 autoresearch-agent: Experiment Runner

-Executes the autonomous experiment loop:
-- Reads .autoresearch.cfg for project config
-- Runs the target evaluation
-- Keeps improvements (git commit) or discards failures (git reset)
-- Logs everything to results.tsv
-- Loops indefinitely until interrupted
+Executes the autonomous experiment loop for a specific experiment.
+Reads config from .autoresearch/{domain}/{name}/config.cfg.

 Usage:
-    python scripts/run_experiment.py --loop      # Run forever
-    python scripts/run_experiment.py --single    # Run one experiment
-    python scripts/run_experiment.py --dry-run   # Show what would happen
+    python scripts/run_experiment.py --experiment engineering/api-speed --loop
+    python scripts/run_experiment.py --experiment engineering/api-speed --single
+    python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
+    python scripts/run_experiment.py --resume --loop
+    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
 """

 import argparse
@@ -25,11 +23,22 @@ from datetime import datetime
 from pathlib import Path


-def load_config(path):
-    """Load .autoresearch.cfg"""
-    cfg_file = Path(path) / ".autoresearch.cfg"
+def find_autoresearch_root():
+    """Find .autoresearch/ in project or user home."""
+    project_root = Path(".").resolve() / ".autoresearch"
+    if project_root.exists():
+        return project_root
+    user_root = Path.home() / ".autoresearch"
+    if user_root.exists():
+        return user_root
+    return None
+
+
+def load_config(experiment_dir):
+    """Load config.cfg from experiment directory."""
+    cfg_file = experiment_dir / "config.cfg"
     if not cfg_file.exists():
-        print("✗ No .autoresearch.cfg found. Run setup_experiment.py first.")
+        print(f"  Error: no config.cfg in {experiment_dir}")
         sys.exit(1)
     config = {}
     for line in cfg_file.read_text().splitlines():
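The `load_config` hunk ends just as the parse loop begins, so the exact line format is not visible here. A hypothetical sketch of the key=value parsing that loop implies, using keys the runner reads elsewhere (`metric`, `metric_direction`, `evaluate_cmd`, `time_budget_minutes`); the comment syntax and spacing rules are assumptions:

```python
# Hypothetical config.cfg parser matching load_config's loop in the diff.
# Assumes one "key = value" pair per line, with '#' comment lines; the
# keys shown are the ones the runner actually reads.

def parse_config_text(text):
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=" in line:
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

cfg = parse_config_text("""
# experiment: engineering/api-speed
metric = p95_latency_ms
metric_direction = lower
evaluate_cmd = python evaluate.py
time_budget_minutes = 5
""")
```

Values stay as strings; the runner casts where needed (e.g. `int(config.get("time_budget_minutes", 5))`).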
@@ -49,239 +58,293 @@ def run_cmd(cmd, cwd=None, timeout=None):
|
|||||||
|
|
||||||
|
|
||||||
def get_current_commit(path):
|
def get_current_commit(path):
|
||||||
|
"""Get short hash of current HEAD."""
|
||||||
_, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
|
_, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
|
||||||
return commit
|
return commit
|
||||||
|
|
||||||
|
|
||||||
def get_current_metric(path, metric_grep):
|
def get_best_metric(experiment_dir, direction):
|
||||||
"""Read the last recorded metric from results.tsv."""
|
"""Read the best metric from results.tsv."""
|
||||||
tsv = Path(path) / "results.tsv"
|
tsv = experiment_dir / "results.tsv"
|
||||||
if not tsv.exists():
|
if not tsv.exists():
|
||||||
return None
|
return None
|
||||||
lines = [l for l in tsv.read_text().splitlines() if "\tkeep\t" in l]
|
lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
|
||||||
if not lines:
|
if not lines:
|
||||||
return None
|
return None
|
||||||
last = lines[-1].split("\t")
|
metrics = []
|
||||||
try:
|
for line in lines:
|
||||||
return float(last[1])
|
parts = line.split("\t")
|
||||||
except (ValueError, IndexError):
|
try:
|
||||||
|
if parts[1] != "N/A":
|
||||||
|
metrics.append(float(parts[1]))
|
||||||
|
except (ValueError, IndexError):
|
||||||
|
continue
|
||||||
|
if not metrics:
|
||||||
return None
|
return None
|
||||||
|
return min(metrics) if direction == "lower" else max(metrics)
|
||||||
|
|
||||||
|
|
||||||
def run_evaluation(path, evaluate_cmd, time_budget_minutes):
|
def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
|
||||||
"""Run evaluation with time limit."""
|
"""Run evaluation with time limit. Output goes to log_file."""
|
||||||
hard_limit = time_budget_minutes * 60 * 2.5 # 2.5x as hard timeout
|
hard_limit = time_budget_minutes * 60 * 2.5
|
||||||
t0 = time.time()
|
t0 = time.time()
|
||||||
try:
|
try:
|
||||||
code, _, _ = run_cmd(
|
code, _, _ = run_cmd(
|
||||||
f"{evaluate_cmd} > run.log 2>&1",
|
f"{eval_cmd} > {log_file} 2>&1",
|
||||||
cwd=path,
|
cwd=str(project_root),
|
||||||
timeout=hard_limit
|
timeout=hard_limit
|
||||||
)
|
)
|
||||||
elapsed = time.time() - t0
|
elapsed = time.time() - t0
|
||||||
return code, elapsed
|
return code, elapsed
|
||||||
except subprocess.TimeoutExpired:
|
except subprocess.TimeoutExpired:
|
||||||
elapsed = time.time() - t0
|
elapsed = time.time() - t0
|
||||||
return -1, elapsed # -1 = timeout
|
return -1, elapsed
|
||||||
|
|
||||||
|
|
||||||
def extract_metric(path, metric_grep):
|
def extract_metric(log_file, metric_grep):
|
||||||
"""Extract metric value from run.log."""
|
"""Extract metric value from log file."""
|
||||||
code, out, _ = run_cmd(
|
log_path = Path(log_file)
|
||||||
f"grep '{metric_grep}' run.log | tail -1",
|
if not log_path.exists():
|
||||||
cwd=path
|
|
||||||
)
|
|
||||||
if not out:
|
|
||||||
return None
|
|
||||||
try:
|
|
||||||
return float(out.split(":")[-1].strip())
|
|
||||||
except ValueError:
|
|
||||||
return None
|
return None
|
||||||
|
for line in reversed(log_path.read_text().splitlines()):
|
||||||
|
stripped = line.strip()
|
||||||
|
if stripped.startswith(metric_grep.lstrip("^")):
|
||||||
|
try:
|
||||||
|
return float(stripped.split(":")[-1].strip())
|
||||||
|
except ValueError:
|
||||||
|
continue
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
def is_improvement(new_val, old_val, direction):
|
def is_improvement(new_val, old_val, direction):
|
||||||
"""Check if new result is better than old."""
|
"""Check if new result is better than old."""
|
||||||
if old_val is None:
|
if old_val is None:
|
||||||
return True # First run always "improves"
|
return True
|
||||||
if direction == "lower":
|
if direction == "lower":
|
||||||
return new_val < old_val
|
return new_val < old_val
|
||||||
else:
|
return new_val > old_val
|
||||||
return new_val > old_val
|
|
||||||
|
|
||||||
|
|
||||||
def log_result(path, commit, metric_val, status, description):
|
def log_result(experiment_dir, commit, metric_val, status, description):
|
||||||
"""Append result to results.tsv."""
|
"""Append result to results.tsv."""
|
||||||
tsv = Path(path) / "results.tsv"
|
tsv = experiment_dir / "results.tsv"
|
||||||
metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
|
metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
|
||||||
with open(tsv, "a") as f:
|
with open(tsv, "a") as f:
|
||||||
f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")
|
f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")
|
||||||
|
|
||||||
|
|
||||||
def get_experiment_count(path):
|
def get_experiment_count(experiment_dir):
|
||||||
"""Count experiments run so far."""
|
"""Count experiments run so far."""
|
||||||
tsv = Path(path) / "results.tsv"
|
tsv = experiment_dir / "results.tsv"
|
||||||
if not tsv.exists():
|
if not tsv.exists():
|
||||||
return 0
|
return 0
|
||||||
lines = tsv.read_text().splitlines()
|
return max(0, len(tsv.read_text().splitlines()) - 1)
|
||||||
return max(0, len(lines) - 1) # subtract header
|
|
||||||
|
|
||||||
|
|
||||||
def run_single_experiment(path, config, exp_num, dry_run=False):
|
def get_last_active(root):
|
||||||
|
"""Find the most recently modified experiment."""
|
||||||
|
latest = None
|
||||||
|
latest_time = 0
|
||||||
|
for domain_dir in root.iterdir():
|
||||||
|
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
|
||||||
|
continue
|
||||||
|
for exp_dir in domain_dir.iterdir():
|
||||||
|
if not exp_dir.is_dir():
|
||||||
|
continue
|
||||||
|
cfg = exp_dir / "config.cfg"
|
||||||
|
if cfg.exists() and cfg.stat().st_mtime > latest_time:
|
||||||
|
latest_time = cfg.stat().st_mtime
|
||||||
|
latest = f"{domain_dir.name}/{exp_dir.name}"
|
||||||
|
return latest
|
||||||
|
|
||||||
|
|
||||||
|
def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
|
||||||
"""Run one experiment iteration."""
|
"""Run one experiment iteration."""
|
||||||
direction = config.get("metric_direction", "lower")
|
direction = config.get("metric_direction", "lower")
|
||||||
metric_grep = config.get("metric_grep", "^metric:")
|
metric_grep = config.get("metric_grep", "^metric:")
|
||||||
evaluate_cmd = config.get("evaluate_cmd", "python evaluate.py")
|
eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
|
||||||
time_budget = int(config.get("time_budget_minutes", 5))
|
time_budget = int(config.get("time_budget_minutes", 5))
|
||||||
metric_name = config.get("metric", "metric")
|
metric_name = config.get("metric", "metric")
|
||||||
|
log_file = str(experiment_dir / "run.log")
|
||||||
|
|
||||||
best_so_far = get_current_metric(path, metric_grep)
|
best = get_best_metric(experiment_dir, direction)
|
||||||
ts = datetime.now().strftime("%H:%M:%S")
|
ts = datetime.now().strftime("%H:%M:%S")
|
||||||
|
|
||||||
print(f"\n[{ts}] Experiment #{exp_num}")
|
print(f"\n[{ts}] Experiment #{exp_num}")
|
||||||
print(f" Best {metric_name} so far: {best_so_far}")
|
print(f" Best {metric_name}: {best}")
|
||||||
|
|
||||||
if dry_run:
|
if dry_run:
|
||||||
print(" [DRY RUN] Would run evaluation and check metric")
|
print(" [DRY RUN] Would run evaluation and check metric")
|
||||||
return "dry_run"
|
return "dry_run"
|
||||||
|
|
||||||
# Save pre-experiment state for rollback
|
# Save state for rollback
|
||||||
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=path)
|
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
|
||||||
if code != 0:
|
if code != 0:
|
||||||
print(" ✗ Can't get git state. Is this a git repo with commits?")
|
print(" Error: can't get git state")
|
||||||
return "error"
|
return "error"
|
||||||
|
|
||||||
# Run evaluation
|
# Run evaluation
|
||||||
print(f" Running: {evaluate_cmd} (budget: {time_budget} min)")
|
print(f" Running: {eval_cmd} (budget: {time_budget}m)")
|
||||||
ret_code, elapsed = run_evaluation(path, evaluate_cmd, time_budget)
|
ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)
|
||||||
|
|
||||||
# Handle timeout
|
commit = get_current_commit(str(project_root))
|
||||||
|
|
||||||
|
# Timeout
|
||||||
if ret_code == -1:
|
if ret_code == -1:
|
||||||
print(f" ✗ TIMEOUT after {elapsed:.0f}s — discarding")
|
print(f" TIMEOUT after {elapsed:.0f}s — discarding")
|
||||||
run_cmd("git checkout -- .", cwd=path) # revert uncommitted changes
|
run_cmd("git checkout -- .", cwd=str(project_root))
|
||||||
# Commit was already made by the agent before evaluation
|
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
|
||||||
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
|
log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
|
||||||
curr_commit = get_current_commit(path)
|
|
||||||
log_result(path, curr_commit, None, "crash", f"timeout after {elapsed:.0f}s")
|
|
||||||
return "crash"
|
return "crash"
|
||||||
|
|
||||||
# Handle non-zero exit
|
# Crash
|
||||||
if ret_code != 0:
|
if ret_code != 0:
|
||||||
# Check if it crashed
|
_, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
|
||||||
code, tail, _ = run_cmd("tail -n 5 run.log", cwd=path)
|
print(f" CRASH (exit {ret_code}) after {elapsed:.0f}s")
|
||||||
print(f" ✗ CRASH (exit {ret_code}) after {elapsed:.0f}s")
|
|
||||||
print(f" Last output: {tail[:200]}")
|
print(f" Last output: {tail[:200]}")
|
||||||
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
|
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
|
||||||
curr_commit = get_current_commit(path)
|
log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
|
||||||
log_result(path, curr_commit, None, "crash", f"exit_code_{ret_code}")
|
|
||||||
return "crash"
|
return "crash"
|
||||||
|
|
||||||
# Extract metric
|
# Extract metric
|
||||||
metric_val = extract_metric(path, metric_grep)
|
metric_val = extract_metric(log_file, metric_grep)
|
||||||
if metric_val is None:
|
if metric_val is None:
|
||||||
print(f" ✗ Could not parse metric from run.log")
|
print(f" Could not parse {metric_name} from run.log")
|
||||||
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
|
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
|
||||||
curr_commit = get_current_commit(path)
|
log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
|
||||||
log_result(path, curr_commit, None, "crash", "metric_parse_failed")
|
|
||||||
return "crash"
|
return "crash"
|
||||||
|
|
||||||
curr_commit = get_current_commit(path)
|
|
||||||
delta = ""
|
delta = ""
|
||||||
if best_so_far is not None:
|
if best is not None:
|
||||||
diff = metric_val - best_so_far
|
diff = metric_val - best
|
||||||
delta = f" (Δ{diff:+.4f})"
|
delta = f" (delta {diff:+.4f})"
|
||||||
|
|
||||||
print(f" {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")
|
print(f" {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")
|
||||||
|
|
||||||
# Keep or discard
|
# Keep or discard
|
||||||
if is_improvement(metric_val, best_so_far, direction):
|
if is_improvement(metric_val, best, direction):
|
||||||
print(f" ✅ KEEP — improvement confirmed")
|
print(f" KEEP — improvement")
|
||||||
log_result(path, curr_commit, metric_val, "keep",
|
log_result(experiment_dir, commit, metric_val, "keep",
|
||||||
f"improvement_{metric_name}_{metric_val:.4f}")
|
f"improved_{metric_name}_{metric_val:.4f}")
|
||||||
return "keep"
|
return "keep"
|
||||||
else:
|
else:
|
||||||
print(f" ❌ DISCARD — no improvement")
|
print(f" DISCARD — no improvement")
|
||||||
run_cmd(f"git reset --hard {pre_commit}", cwd=path)
|
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
|
||||||
curr_commit = get_current_commit(path)
|
best_str = f"{best:.4f}" if best else "?"
|
||||||
log_result(path, curr_commit, metric_val, "discard",
|
log_result(experiment_dir, commit, metric_val, "discard",
|
||||||
f"no_improvement_{metric_val:.4f}_vs_{best_so_far:.4f}")
|
f"no_improvement_{metric_val:.4f}_vs_{best_str}")
|
||||||
return "discard"
|
return "discard"
|
||||||
|
|
||||||
|
|
||||||
def print_summary(experiment_dir, config):
    """Print session summary."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return
    lines = tsv.read_text().splitlines()[1:]  # skip header
    if not lines:
        return

    keeps = [l for l in lines if "\tkeep\t" in l]
    discards = [l for l in lines if "\tdiscard\t" in l]
    crashes = [l for l in lines if "\tcrash\t" in l]
    metric_name = config.get("metric", "metric")
    direction = config.get("metric_direction", "lower")

    print(f"\n{'=' * 55}")
    print(f" autoresearch — Session Summary")
    print(f" Experiments: {len(lines)} total")
    print(f" Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")

    if keeps:
        try:
            valid = []
            for l in keeps:
                parts = l.split("\t")
                if parts[1] != "N/A":
                    valid.append(float(parts[1]))
            if len(valid) >= 2:
                first, last = valid[0], valid[-1]
                best = min(valid) if direction == "lower" else max(valid)
                pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
                print(f" {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
        except (ValueError, IndexError):
            pass
    print(f"{'=' * 55}\n")

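The runner numbers runs via `get_experiment_count`, whose body is outside this diff. Given the results.tsv layout that `print_summary` reads, it plausibly just counts data rows; a sketch under that assumption (`count_runs` is an illustrative name):

```python
import tempfile
from pathlib import Path

# Sketch of a run counter in the spirit of get_experiment_count (not shown in
# this diff): one data row per run in results.tsv, minus the header line.
def count_runs(experiment_dir):
    tsv = Path(experiment_dir) / "results.tsv"
    if not tsv.exists():
        return 0
    return max(0, len(tsv.read_text().splitlines()) - 1)

with tempfile.TemporaryDirectory() as d:
    Path(d, "results.tsv").write_text(
        "commit\tmetric\tstatus\tdescription\n"
        "a1b2c3d\t1.0712\tkeep\tbaseline\n"
        "e4f5a6b\t0.9981\tkeep\timproved_val_bpb_0.9981\n"
    )
    print(count_runs(d))  # → 2
```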
def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent runner")
    parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
    parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
    parser.add_argument("--loop", action="store_true", help="Run forever")
    parser.add_argument("--single", action="store_true", help="Run one experiment")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
    parser.add_argument("--path", default=".", help="Project root")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()
    root = find_autoresearch_root()

    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    # Resolve experiment
    experiment_path = args.experiment
    if args.resume:
        experiment_path = get_last_active(root)
        if not experiment_path:
            print("No experiments found to resume.")
            sys.exit(1)
        print(f"Resuming: {experiment_path}")

    if not experiment_path:
        print("Specify --experiment domain/name or --resume")
        sys.exit(1)

    experiment_dir = root / experiment_path
    if not experiment_dir.exists():
        print(f"Experiment not found: {experiment_dir}")
        print("Run: python scripts/setup_experiment.py --list")
        sys.exit(1)

    config = load_config(experiment_dir)

    domain, name = experiment_path.split("/", 1)
    print(f"\n autoresearch-agent")
    print(f" Experiment: {experiment_path}")
    print(f" Target: {config.get('target', '?')}")
    print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
    print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment")
    print(f" Mode: {'loop' if args.loop else 'single'}")

    if args.single or args.dry_run:
        exp_num = get_experiment_count(experiment_dir) + 1
        run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
        return

    if not args.loop:
        print("\nSpecify --loop (forever) or --single (one experiment)")
        sys.exit(1)

    # Graceful shutdown
    def handle_interrupt(sig, frame):
        print_summary(experiment_dir, config)
        print("\nStopped by user.")
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_interrupt)
    signal.signal(signal.SIGTERM, handle_interrupt)

    consecutive_crashes = 0
    exp_num = get_experiment_count(experiment_dir) + 1

    print(f"\nStarting loop. Ctrl+C to stop.\n")

    while True:
        result = run_single(project_root, experiment_dir, config, exp_num, False)
        exp_num += 1

        if result == "crash":
@@ -289,21 +352,16 @@ def main():
        else:
            consecutive_crashes = 0

        if consecutive_crashes >= 5:
            print("\n 5 consecutive crashes. Pausing.")
            print(" Check .autoresearch/{}/run.log".format(experiment_path))
            break

        if 0 < args.max_experiments < exp_num:
            print(f"\n Reached max experiments ({args.max_experiments})")
            break

    print_summary(experiment_dir, config)


if __name__ == "__main__":
    main()
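The runner's entry point depends on `find_autoresearch_root()`, which this diff does not show. Given the project/user scope split in the commit message, it plausibly walks up from the working directory looking for a project-level `.autoresearch/` and falls back to the user-level `~/.autoresearch`; a sketch under that assumption:

```python
import tempfile
from pathlib import Path

# Sketch of find_autoresearch_root() (body not shown in this diff): prefer a
# project-level .autoresearch/ found by walking up, else the user-level one.
def find_autoresearch_root(start="."):
    start = Path(start).resolve()
    for d in [start, *start.parents]:
        candidate = d / ".autoresearch"
        if candidate.is_dir():
            return candidate
    user_root = Path.home() / ".autoresearch"
    return user_root if user_root.is_dir() else None

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / ".autoresearch").mkdir(parents=True)
    (project / "src").mkdir()
    # found from a subdirectory of the project
    print(find_autoresearch_root(project / "src"))
```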
@@ -1,65 +1,52 @@
#!/usr/bin/env python3
"""
autoresearch-agent: Setup Experiment

Initialize a new experiment with domain, target, evaluator, and git branch.
Creates the .autoresearch/{domain}/{name}/ directory structure.

Usage:
    python scripts/setup_experiment.py --domain engineering --name api-speed \
        --target src/api/search.py --eval "pytest bench.py" \
        --metric p50_ms --direction lower

    python scripts/setup_experiment.py --domain marketing --name medium-ctr \
        --target content/titles.md --eval "python evaluate.py" \
        --metric ctr_score --direction higher --evaluator llm_judge_content

    python scripts/setup_experiment.py --list              # List all experiments
    python scripts/setup_experiment.py --list-evaluators   # List available evaluators
"""

import argparse
import os
import shutil
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path

DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]

EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"

DEFAULT_CONFIG = """# autoresearch global config
default_time_budget_minutes: 5
default_scope: project
dashboard_format: markdown
"""

GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
**/results.tsv
**/run.log
**/run.*.log
config.yaml
"""


def run_cmd(cmd, cwd=None, timeout=None):
    """Run shell command, return (returncode, stdout, stderr)."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=cwd, timeout=timeout
    )
@@ -67,188 +54,315 @@ def run_cmd(cmd, cwd=None, timeout=None):
    return result.returncode, result.stdout.strip(), result.stderr.strip()


def get_autoresearch_root(scope, project_root=None):
    """Get the .autoresearch root directory based on scope."""
    if scope == "user":
        return Path.home() / ".autoresearch"
    return Path(project_root or ".") / ".autoresearch"


def init_root(root):
    """Initialize .autoresearch root if it doesn't exist."""
    created = False
    if not root.exists():
        root.mkdir(parents=True)
        created = True
        print(f" Created {root}/")

    config_file = root / "config.yaml"
    if not config_file.exists():
        config_file.write_text(DEFAULT_CONFIG)
        print(f" Created {config_file}")

    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text(GITIGNORE_CONTENT)
        print(f" Created {gitignore}")

    return created


def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
    """Generate a program.md template for the experiment."""
    direction_word = "Minimize" if direction == "lower" else "Maximize"
    content = f"""# autoresearch — {name}

## Goal
{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.

## What the Agent Can Change
- Only `{target}` — this is the single file being optimized.
- Everything inside that file is fair game unless constrained below.

## What the Agent Cannot Change
- The evaluation script (`evaluate.py` or the eval command). It is read-only.
- Dependencies — do not add new packages or imports that aren't already available.
- Any other files in the project unless explicitly noted here.
{f"- Additional constraints: {constraints}" if constraints else ""}

## Strategy
1. First run: establish baseline. Do not change anything.
2. Profile/analyze the current state — understand why the metric is what it is.
3. Try the most obvious improvement first (low-hanging fruit).
4. If that works, push further in the same direction.
5. If stuck, try something orthogonal or radical.
6. Read the git log of previous experiments. Don't repeat failed approaches.

## Simplicity Rule
A small improvement that adds ugly complexity is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets the same results is the best outcome.

## Stop When
You don't stop. The human will interrupt you when they're satisfied.
If no improvement in 20+ consecutive runs, change strategy drastically.
"""
    (experiment_dir / "program.md").write_text(content)


def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget):
    """Write experiment config."""
    content = f"""target: {target}
evaluate_cmd: {eval_cmd}
metric: {metric}
metric_direction: {direction}
metric_grep: ^{metric}:
time_budget_minutes: {time_budget}
created: {datetime.now().strftime('%Y-%m-%d %H:%M')}
"""
    (experiment_dir / "config.cfg").write_text(content)

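`config.cfg` is plain `key: value` lines, so the runner can read it back without a YAML dependency. A minimal reader sketch (how `load_config` might plausibly work; the name `parse_cfg` is illustrative). Note the single split: values like `metric_grep: ^p50_ms:` contain a colon themselves.

```python
# Minimal "key: value" parser sketch for the config.cfg format written above.
def parse_cfg(text):
    config = {}
    for line in text.splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            key, value = line.split(":", 1)  # split once: values may contain ':'
            config[key.strip()] = value.strip()
    return config

sample = "target: src/api/search.py\nmetric: p50_ms\nmetric_grep: ^p50_ms:\n"
print(parse_cfg(sample)["metric_grep"])  # → ^p50_ms:
```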
def init_results_tsv(experiment_dir):
    """Create results.tsv with header."""
    tsv = experiment_dir / "results.tsv"
    if tsv.exists():
        print(f" results.tsv already exists ({tsv.stat().st_size} bytes)")
        return
    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
    print(" Created results.tsv")


def copy_evaluator(experiment_dir, evaluator_name):
    """Copy a built-in evaluator to the experiment directory."""
    source = EVALUATOR_DIR / f"{evaluator_name}.py"
    if not source.exists():
        print(f" Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}")
        print(f" Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}")
        return False
    dest = experiment_dir / "evaluate.py"
    shutil.copy2(source, dest)
    print(f" Copied evaluator: {evaluator_name}.py -> evaluate.py")
    return True


def create_branch(path, domain, name):
    """Create and checkout the experiment branch."""
    branch = f"autoresearch/{domain}/{name}"
    code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
    if code != 0:
        if "already exists" in err:
            print(f" Branch '{branch}' already exists. Checking out...")
            run_cmd(f"git checkout {branch}", cwd=path)
            return branch
        print(f" Warning: could not create branch: {err}")
        return None
    print(f" Created branch: {branch}")
    return branch


def list_experiments(root):
    """List all experiments across all domains."""
    if not root.exists():
        print("No experiments found. Run setup to create your first experiment.")
        return

    experiments = []
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir():
                continue
            cfg_file = exp_dir / "config.cfg"
            if not cfg_file.exists():
                continue
            config = {}
            for line in cfg_file.read_text().splitlines():
                if ":" in line:
                    k, v = line.split(":", 1)
                    config[k.strip()] = v.strip()

            # Count results
            tsv = exp_dir / "results.tsv"
            runs = 0
            if tsv.exists():
                runs = max(0, len(tsv.read_text().splitlines()) - 1)

            experiments.append({
                "domain": domain_dir.name,
                "name": exp_dir.name,
                "target": config.get("target", "?"),
                "metric": config.get("metric", "?"),
                "runs": runs,
            })

    if not experiments:
        print("No experiments found.")
        return

    print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}")
    print("-" * 95)
    for e in experiments:
        print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}")
    print(f"\nTotal: {len(experiments)} experiments")


def list_evaluators():
    """List available built-in evaluators."""
    if not EVALUATOR_DIR.exists():
        print("No evaluators directory found.")
        return

    print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n")
    for f in sorted(EVALUATOR_DIR.glob("*.py")):
        # Read first docstring line
        desc = ""
        for line in f.read_text().splitlines():
            if line.strip().startswith('"""') or line.strip().startswith("'''"):
                continue
            if line.strip() and not line.startswith("#!"):
                desc = line.strip().strip('"').strip("'")
                break
        print(f" {f.stem:<25} {desc}")


def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent setup")
    parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain")
    parser.add_argument("--name", help="Experiment name (e.g. api-speed, medium-ctr)")
    parser.add_argument("--target", help="Target file to optimize")
    parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
    parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
    parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
                        help="Is lower or higher better?")
    parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
    parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
    parser.add_argument("--scope", choices=["project", "user"], default="project",
                        help="Where to store experiments: project (./) or user (~/)")
    parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
    parser.add_argument("--path", default=".", help="Project root path")
    parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
    parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
    parser.add_argument("--list", action="store_true", help="List all experiments")
    parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()

    # List mode
    if args.list:
        root = get_autoresearch_root("project", project_root)
        list_experiments(root)
        user_root = get_autoresearch_root("user")
        if user_root.exists() and user_root != root:
            print(f"\n--- User-level experiments ({user_root}) ---")
            list_experiments(user_root)
        return

    if args.list_evaluators:
        list_evaluators()
        return

    # Validate required args for setup
    if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
        parser.error("Required: --domain, --name, --target, --eval, --metric")

    root = get_autoresearch_root(args.scope, project_root)

    print(f"\n autoresearch-agent setup")
    print(f" Project: {project_root}")
    print(f" Scope: {args.scope}")
    print(f" Domain: {args.domain}")
    print(f" Experiment: {args.name}")
    print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")

    # Check git
    code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
    if code != 0:
        print(" Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
        sys.exit(1)
    print(" Git repository found")

    # Check target file
    target_path = project_root / args.target
    if not target_path.exists():
        print(f" Error: target file not found: {args.target}")
        sys.exit(1)
    print(f" Target file found: {args.target}")

    # Init root
    init_root(root)

    # Create experiment directory
    experiment_dir = root / args.domain / args.name
    if experiment_dir.exists():
        print(f" Warning: experiment '{args.domain}/{args.name}' already exists.")
        print(f" Use --name with a different name, or delete {experiment_dir}")
        sys.exit(1)
    experiment_dir.mkdir(parents=True)
    print(f" Created {experiment_dir}/")

    # Create files
    create_program_md(experiment_dir, args.domain, args.name,
                      args.target, args.metric, args.direction, args.constraints)
    print(" Created program.md")

    create_config(experiment_dir, args.target, args.eval_cmd,
                  args.metric, args.direction, args.time_budget)
    print(" Created config.cfg")

    init_results_tsv(experiment_dir)

    # Copy evaluator if specified
    if args.evaluator:
        copy_evaluator(experiment_dir, args.evaluator)

    # Create git branch
    if not args.skip_branch:
        create_branch(str(project_root), args.domain, args.name)

    # Test evaluation command
    print(f"\n Testing evaluation: {args.eval_cmd}")
    code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
    if code != 0:
        print(f" Warning: eval command failed (exit {code})")
        if err:
            print(f" stderr: {err[:200]}")
        print(" Fix the eval command before running the experiment loop.")
    else:
        # Check metric is parseable
        full_output = out + "\n" + err
        metric_found = False
        for line in full_output.splitlines():
            if line.strip().startswith(f"{args.metric}:"):
                metric_found = True
                print(f" Eval works. Baseline: {line.strip()}")
                break
        if not metric_found:
            print(f" Warning: eval ran but '{args.metric}:' not found in output.")
            print(f" Make sure your eval command outputs: {args.metric}: <value>")

    # Summary
    print(f"\n Setup complete!")
    print(f" Experiment: {args.domain}/{args.name}")
    print(f" Target: {args.target}")
    print(f" Metric: {args.metric} ({args.direction} is better)")
    print(f" Budget: {args.time_budget} min/experiment")
    if not args.skip_branch:
        print(f" Branch: autoresearch/{args.domain}/{args.name}")
    print(f"\n To start:")
    print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")


if __name__ == "__main__":
    main()
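The only contract the setup enforces on an evaluator is that its output contains a line of the form `<metric>: <value>`. A hypothetical custom evaluator honoring that contract (the timed workload is a stand-in for a real benchmark):

```python
import time

# Hypothetical custom evaluator: setup_experiment.py only checks that a line
# "<metric>: <value>" appears in the output; everything else is up to you.
def evaluate():
    """Time a stand-in workload and return elapsed milliseconds."""
    t0 = time.perf_counter()
    sum(i * i for i in range(100_000))  # replace with the real workload
    return (time.perf_counter() - t0) * 1000

if __name__ == "__main__":
    # Matches the generated metric_grep pattern ^p50_ms:
    print(f"p50_ms: {evaluate():.4f}")
```

Anything else the script prints is ignored; only the last matching `metric:` line is extracted by the runner's grep.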