feat(autoresearch-agent): fix critical bugs, package as plugin with 5 slash commands

**Bug fixes (run_experiment.py):**
- Fix broken revert logic: was saving HEAD as pre_commit (no-op revert),
  now uses git reset --hard HEAD~1 for correct rollback
- Remove broken --loop mode (agent IS the loop, script handles one iteration)
- Fix shell injection: all git commands use subprocess list form
- Replace shell tail with Python file read

**Bug fixes (other scripts):**
- setup_experiment.py: fix shell injection in git branch creation,
  remove dead --skip-baseline flag, fix evaluator docstring parsing
- log_results.py: fix 6 falsy-zero bugs (baseline=0 treated as None),
  add domain_filter to CSV/markdown export, move import time to top
- evaluators: add FileNotFoundError handling, fix output format mismatch
  in llm_judge_copy, add peak_kb on macOS, add ValueError handling

**Plugin packaging (NEW):**
- plugin.json, settings.json, CLAUDE.md for plugin registry
- 5 slash commands: /ar:setup, /ar:run, /ar:loop, /ar:status, /ar:resume
- /ar:loop supports user-selected intervals (10m, 1h, daily, weekly, monthly)
- experiment-runner agent for autonomous loop iterations
- Registered in marketplace.json as plugin #20

**SKILL.md rewrite:**
- Replace ambiguous "Loop Protocol" with clear "Agent Protocol"
- Add results.tsv format spec, strategy escalation, self-improvement
- Replace "NEVER STOP" with resumable stopping logic

**Docs & sync:**
- Codex (157 skills), Gemini (229 items), and convert.sh all pick up the skill
- 6 new MkDocs pages, mkdocs.yml nav updated
- Counts updated: 17 agents, 22 slash commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reza Rezvani
2026-03-13 14:38:59 +01:00
parent 6dc25df8fa
commit 7911cf957a
39 changed files with 1779 additions and 234 deletions


@@ -0,0 +1,13 @@
{
"name": "autoresearch-agent",
"description": "Autonomous experiment loop that optimizes any file by a measurable metric. 5 slash commands, 8 evaluators, configurable loop intervals (10min to monthly).",
"version": "2.1.2",
"author": {
"name": "Alireza Rezvani",
"url": "https://alirezarezvani.com"
},
"homepage": "https://github.com/alirezarezvani/claude-skills/tree/main/engineering/autoresearch-agent",
"repository": "https://github.com/alirezarezvani/claude-skills",
"license": "MIT",
"skills": "./"
}


@@ -0,0 +1,66 @@
# Autoresearch Agent — Claude Code Instructions
This plugin runs autonomous experiment loops that optimize any file by a measurable metric.
## Commands
Use the `/ar:` namespace for all commands:
- `/ar:setup` — Set up a new experiment interactively
- `/ar:run` — Run a single experiment iteration
- `/ar:loop` — Start an autonomous loop with user-selected interval
- `/ar:status` — Show dashboard and results
- `/ar:resume` — Resume a paused experiment
## How it works
You (the AI agent) are the experiment loop. The scripts handle evaluation and git rollback.
1. You edit the target file with ONE change
2. You commit it
3. You call `run_experiment.py --single` — it evaluates and prints KEEP/DISCARD/CRASH
4. You repeat
Results persist in `results.tsv` and git log. Sessions can be resumed.
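The KEEP/DISCARD decision hinges on the script finding a `metric_name: value` line in the eval output. A minimal sketch of that parsing step (the function name here is illustrative, not the script's actual implementation):

```python
import re

def extract_metric(log_text, metric_name):
    """Return the last 'metric_name: value' match in eval output, or None."""
    pattern = re.compile(
        rf"^{re.escape(metric_name)}:\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE
    )
    matches = pattern.findall(log_text)
    # Take the last occurrence so warm-up prints don't win
    return float(matches[-1]) if matches else None
```

If the evaluator never prints such a line, the iteration is logged as a crash rather than a discard.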
## When to use each command
### Starting fresh
```
/ar:setup
```
Creates the experiment directory, config, program.md, results.tsv, and git branch.
### Running one iteration at a time
```
/ar:run engineering/api-speed
```
Read history, make one change, evaluate, report result.
### Autonomous background loop
```
/ar:loop engineering/api-speed
```
Prompts for interval (10m, 1h, daily, weekly, monthly), then creates a recurring job.
### Checking progress
```
/ar:status
```
Shows the dashboard across all experiments with metrics and trends.
### Resuming after context limit or break
```
/ar:resume engineering/api-speed
```
Reads results history, checks out the branch, and continues where you left off.
## Agents
- **experiment-runner**: Spawned for each loop iteration. Reads config, results history, decides what to try, edits target, commits, evaluates.
## Key principle
**One change per experiment. Measure everything. Compound improvements.**
The agent never modifies the evaluator. The evaluator is ground truth.


@@ -19,6 +19,18 @@ Not one guess — fifty measured attempts, compounding.
---
## Slash Commands
| Command | What it does |
|---------|-------------|
| `/ar:setup` | Set up a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| `/ar:status` | Show dashboard and results |
| `/ar:resume` | Resume a paused experiment |
---
## When This Skill Activates
Recognize these patterns from the user:
@@ -82,6 +94,12 @@ The `--scope` flag determines where `.autoresearch/` lives:
└── evaluate.py ← Evaluation script (if --evaluator used)
```
**results.tsv columns:** `commit | metric | status | description`
- `commit` — short git hash
- `metric` — float value or "N/A" for crashes
- `status` — keep | discard | crash
- `description` — what changed or why it crashed
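A sketch of reading `results.tsv` per this spec (the `load_results` helper is illustrative; the shipped scripts have their own loader):

```python
import csv

def load_results(tsv_path):
    """Parse results.tsv: commit, metric, status, description (tab-separated)."""
    rows = []
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # "N/A" marks crashes where no metric could be parsed
            raw = row.get("metric", "N/A")
            row["metric"] = None if raw == "N/A" else float(raw)
            rows.append(row)
    return rows
```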
### Domains
| Domain | Use Cases |
@@ -98,48 +116,67 @@ The user may have written their own `program.md`. If found in the experiment dir
---
## The Experiment Loop
## Agent Protocol
You are the loop. The scripts handle setup and evaluation — you handle the creative work.
### Before Starting
1. Read `.autoresearch/{domain}/{name}/config.cfg` to get:
- `target` — the file you edit
- `evaluate_cmd` — the command that measures your changes
- `metric` — the metric name to look for in eval output
- `metric_direction` — "lower" or "higher" is better
- `time_budget_minutes` — max time per evaluation
2. Read `program.md` for strategy, constraints, and what you can/cannot change
3. Read `results.tsv` for experiment history (columns: commit, metric, status, description)
4. Checkout the experiment branch: `git checkout autoresearch/{domain}/{name}`
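A minimal config reader, assuming `config.cfg` uses plain `key = value` lines (the name mirrors the script's `load_config`, but this is only a sketch of that assumed format):

```python
from pathlib import Path

def load_config(path):
    """Parse config.cfg, assumed to be simple 'key = value' lines."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that isn't key=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config
```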
### Each Iteration
1. Review results.tsv — what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: `git add {target} && git commit -m "experiment: {description}"`
5. Evaluate: `python scripts/run_experiment.py --experiment {domain}/{name} --single`
6. Read the output — it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1
### What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (`git reset --hard HEAD~1`)
- Logging the result to results.tsv
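The keep/discard comparison respects `metric_direction` from the config. A sketch of that logic (names illustrative, not the script's exact code):

```python
def is_improvement(value, best, direction):
    """Strict improvement over the current best; an equal metric is a discard."""
    if best is None:
        return True  # first measurement becomes the baseline
    return value < best if direction == "lower" else value > best
```

Note the strict inequality: a change that merely matches the best is reverted, which keeps the branch monotonically improving.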
### Starting an Experiment
```bash
# Run specific experiment
python scripts/run_experiment.py --experiment engineering/api-speed --loop
# Single iteration (test setup)
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single
# Resume last active experiment
python scripts/run_experiment.py --resume --loop
# Dry run (show what would happen)
# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```
### The Loop Protocol
### Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 30+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
```
LOOP FOREVER:
### Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the
Strategy section of program.md with what you learned (e.g., "caching changes
consistently improve by 5-10%", "refactoring attempts never improve the metric").
Future iterations benefit from this accumulated knowledge.
1. Read program.md for current strategy and constraints
2. Review git log: what has been tried? What worked? What crashed?
3. Review results.tsv: current best metric, trend, recent failures
4. Propose ONE change to the target file
5. Apply the change
6. git commit -m "experiment: [short description of what changed]"
7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
8. Parse metric from run.log (grep for metric_name: value)
9. Decision:
- Metric improved → KEEP (advance branch, log "keep")
- Metric equal or worse → REVERT (git reset --hard, log "discard")
- Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
10. Append result to results.tsv
11. Go to 1
```
### Stopping
- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume — results.tsv and git log persist
### Rules
- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
@@ -258,7 +295,7 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
### OpenClaw
```bash
clawhub install autoresearch-agent
clawhub install cs-autoresearch-agent
```
---


@@ -0,0 +1,87 @@
# Experiment Runner Agent
You are an autonomous experimenter. Your job is to optimize a target file by a measurable metric, one change at a time.
## Your Role
You are spawned for each iteration of an autoresearch experiment loop. You:
1. Read the experiment state (config, strategy, results history)
2. Decide what to try based on accumulated evidence
3. Make ONE change to the target file
4. Commit and evaluate
5. Report the result
## Process
### 1. Read experiment state
```bash
# Config: what to optimize and how to measure
cat .autoresearch/{domain}/{name}/config.cfg
# Strategy: what you can/cannot change, current approach
cat .autoresearch/{domain}/{name}/program.md
# History: every experiment ever run, with outcomes
cat .autoresearch/{domain}/{name}/results.tsv
# Recent changes: what the code looks like now
git log --oneline -10
git diff HEAD~1 --stat # last change if any
```
### 2. Analyze results history
From results.tsv, identify:
- **What worked** (status=keep): What do these changes have in common?
- **What failed** (status=discard): What approaches should you avoid?
- **What crashed** (status=crash): Are there fragile areas to be careful with?
- **Trends**: Is the metric plateauing? Accelerating? Oscillating?
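One way to sketch this analysis in code, including the 20-run plateau check that triggers strategy escalation (an assumed helper, not part of the shipped scripts):

```python
def summarize(rows, window=20):
    """rows: chronological list of (metric, status) tuples from results.tsv."""
    keeps = sum(1 for _, status in rows if status == "keep")
    crashes = sum(1 for _, status in rows if status == "crash")
    recent = rows[-window:]
    # No keep in the last `window` runs -> the metric has plateaued
    plateaued = len(recent) >= window and not any(s == "keep" for _, s in recent)
    return {"total": len(rows), "keeps": keeps,
            "crashes": crashes, "plateaued": plateaued}
```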
### 3. Select strategy based on experiment count
| Run Count | Strategy | Risk Level |
|-----------|----------|------------|
| 1-5 | Low-hanging fruit: obvious improvements, simple optimizations | Low |
| 6-15 | Systematic exploration: vary one parameter at a time | Medium |
| 16-30 | Structural changes: algorithm swaps, architecture shifts | High |
| 30+ | Radical experiments: completely different approaches | Very High |
If no improvement in the last 20 runs, it's time to update the Strategy section of program.md and try something fundamentally different.
### 4. Make ONE change
- Edit only the target file (from config.cfg)
- Change one variable, one approach, one parameter
- Keep it simple — equal results with simpler code is a win
- No new dependencies
### 5. Commit and evaluate
```bash
git add {target}
git commit -m "experiment: {description}"
python {skill_path}/scripts/run_experiment.py --experiment {domain}/{name} --single
```
### 6. Self-improvement
After every 10th experiment, update program.md's Strategy section:
- Which approaches consistently work? Double down.
- Which approaches consistently fail? Stop trying.
- Any new hypotheses based on the data?
## Hard Rules
- **ONE change per experiment.** Multiple changes = you won't know what worked.
- **NEVER modify the evaluator.** evaluate.py is the ground truth. Modifying it invalidates all comparisons. If you catch yourself doing this, stop immediately.
- **5 consecutive crashes → stop.** Alert the user. Don't burn cycles on a broken setup.
- **Simplicity criterion.** A small improvement that adds ugly complexity is NOT worth it. Removing code that gets same results is the best outcome.
- **No new dependencies.** Only use what's already available.
## Constraints
- Never read or modify files outside the target file and program.md
- Never push to remote — all work stays local
- Never skip the evaluation step — every change must be measured
- Be concise in commit messages — they become the experiment log


@@ -36,7 +36,11 @@ if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals():
f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
shell=True, capture_output=True, text=True
)
size_bytes = int(result.stdout.strip())
try:
size_bytes = int(result.stdout.strip())
except ValueError:
print(f"Could not parse size from: {result.stdout[:100]}", file=sys.stderr)
sys.exit(1)
elif "TARGET_DIR" in dir() or "TARGET_DIR" in globals():
size_bytes = sum(
os.path.getsize(os.path.join(dp, f))


@@ -43,7 +43,12 @@ ctr_score: <average of all 5 scores>
Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""
content = Path(TARGET_FILE).read_text()
try:
content = Path(TARGET_FILE).read_text()
except FileNotFoundError:
print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"
# Call the user's CLI tool


@@ -54,6 +54,9 @@ platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
JUDGE_PROMPT = f"""{platform_prompt}
IMPORTANT: You MUST use criterion_1 through criterion_5 as labels, NOT the criterion names.
Do NOT output "hook: 7" — output "criterion_1: 7".
Output EXACTLY this format:
criterion_1: <score>
criterion_2: <score>
@@ -64,7 +67,12 @@ engagement_score: <average of all 5>
Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""
content = Path(TARGET_FILE).read_text()
try:
content = Path(TARGET_FILE).read_text()
except FileNotFoundError:
print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"
result = subprocess.run(
@@ -77,12 +85,29 @@ if result.returncode != 0:
sys.exit(1)
output = result.stdout
found_scores = False
for line in output.splitlines():
line = line.strip()
if line.startswith("engagement_score:") or line.startswith("criterion_"):
print(line)
found_scores = True
if "engagement_score:" not in output:
# Fallback: if no criterion_ lines found, try parsing any "word: digit" lines
if not found_scores:
import re
fallback_scores = []
for line in output.splitlines():
line = line.strip()
match = re.match(r'^(\w[\w\s]*?):\s*(\d+(?:\.\d+)?)\s*$', line)
if match and match.group(1).lower() not in ("engagement_score",):
fallback_scores.append(float(match.group(2)))
print(f"criterion_{len(fallback_scores)}: {match.group(2)}")
if fallback_scores:
avg = sum(fallback_scores) / len(fallback_scores)
print(f"engagement_score: {avg:.1f}")
found_scores = True
if "engagement_score:" not in output and not found_scores:
print("Could not parse engagement_score from LLM output", file=sys.stderr)
print(f"Raw: {output[:500]}", file=sys.stderr)
sys.exit(1)


@@ -37,8 +37,17 @@ Score the actual output on these criteria (each 1-10):
Output EXACTLY: quality_score: <average of all 4>
Nothing else."""
prompt = Path(TARGET_FILE).read_text()
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
try:
prompt = Path(TARGET_FILE).read_text()
except FileNotFoundError:
print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
try:
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
except FileNotFoundError:
print(f"Test cases file not found: {TEST_CASES_FILE}", file=sys.stderr)
sys.exit(1)
scores = []
@@ -92,7 +101,7 @@ if not scores:
sys.exit(1)
avg = sum(scores) / len(scores)
quality = avg * 10 # Scale to 0-100
quality = avg * 10 # 1-10 scores → 10-100 range
print(f"quality_score: {quality:.2f}")
print(f"cases_tested: {len(scores)}")


@@ -2,7 +2,6 @@
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import platform
import subprocess
import sys
@@ -41,8 +40,10 @@ elif system == "Darwin":
if "maximum resident set size" in line.lower():
# macOS reports in bytes
val = int(line.strip().split()[0])
kb = val / 1024
mb = val / (1024 * 1024)
print(f"peak_mb: {mb:.1f}")
print(f"peak_kb: {int(kb)}")
sys.exit(0)
print("Could not parse memory from time output", file=sys.stderr)
sys.exit(1)


@@ -75,8 +75,8 @@ Maximize eval_score on the test suite. Higher is better (0-100).
## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 0-5 by GPT-4o
- eval_score = average * 20 (maps to 0-100)
- Each test case is scored 1-10 by your CLI tool (Claude, Codex, or Gemini)
- quality_score = average * 10 (maps to 10-100)
- Run log shows which test cases failed
## Stop When
@@ -144,14 +144,14 @@ Maximize pass_rate on the task evaluation suite. Higher is better (0-1).
- Proactive trigger conditions
## What You Cannot Change
- scripts/skill_evaluator.py (fixed evaluation)
- your custom evaluate.py (see Custom Evaluators in SKILL.md)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata
## Evaluation
- skill_evaluator.py runs SKILL.md against 15 standardized tasks
- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- evaluate.py runs SKILL.md against 15 standardized tasks
- Your CLI tool scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15
## Strategy


@@ -18,6 +18,7 @@ import argparse
import csv
import io
import sys
import time
from pathlib import Path
@@ -80,7 +81,7 @@ def compute_stats(results, direction):
best = None
pct_change = None
if baseline and best and baseline != 0:
if baseline is not None and best is not None and baseline != 0:
if direction == "lower":
pct_change = (baseline - best) / baseline * 100
else:
@@ -145,18 +146,17 @@ def print_dashboard(root):
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best_str = f"{stats['best']:.4f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
# Determine status
status = "idle"
if stats["total"] > 0:
tsv = exp_dir / "results.tsv"
if tsv.exists():
import time
age_hours = (time.time() - tsv.stat().st_mtime) / 3600
status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"
best_str = f"{stats['best']:.4f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
experiments.append({
"domain": domain_dir.name,
"name": exp_dir.name,
@@ -202,7 +202,7 @@ def export_experiment_csv(experiment_dir, experiment_path):
if stats["baseline"] is not None:
writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
if stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
writer.writerow(["# Total", stats["total"]])
writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
@@ -216,12 +216,14 @@ def export_experiment_csv(experiment_dir, experiment_path):
return buf.getvalue()
def export_dashboard_csv(root):
def export_dashboard_csv(root, domain_filter=None):
"""Export dashboard as CSV string."""
experiments = []
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
if domain_filter and domain_dir.name != domain_filter:
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
@@ -229,8 +231,8 @@ def export_dashboard_csv(root):
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best_str = f"{stats['best']:.6f}" if stats["best"] else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
best_str = f"{stats['best']:.6f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
experiments.append([
domain_dir.name, exp_dir.name, config.get("metric", ""),
stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
@@ -262,7 +264,7 @@ def export_experiment_markdown(experiment_dir, experiment_path):
lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")
if stats["baseline"] is not None and stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
lines.append(f"| Commit | Metric | Status | Description |")
@@ -275,7 +277,7 @@ def export_experiment_markdown(experiment_dir, experiment_path):
return "\n".join(lines)
def export_dashboard_markdown(root):
def export_dashboard_markdown(root, domain_filter=None):
"""Export dashboard as Markdown string."""
lines = []
lines.append("# Autoresearch Dashboard\n")
@@ -285,6 +287,8 @@ def export_dashboard_markdown(root):
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
if domain_filter and domain_dir.name != domain_filter:
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
@@ -292,10 +296,9 @@ def export_dashboard_markdown(root):
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best = f"`{stats['best']:.4f}`" if stats["best"] else ""
pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
best = f"`{stats['best']:.4f}`" if stats["best"] is not None else ""
pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
import time
tsv = exp_dir / "results.tsv"
status = "idle"
if tsv.exists() and stats["total"] > 0:
@@ -356,7 +359,7 @@ def main():
# For CSV/MD, fall through to dashboard with domain filter
if args.format != "terminal":
# Use dashboard export filtered to domain
output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
output_text = export_dashboard_csv(root, domain_filter=args.domain) if args.format == "csv" else export_dashboard_markdown(root, domain_filter=args.domain)
else:
return


@@ -2,20 +2,17 @@
"""
autoresearch-agent: Experiment Runner
Executes the autonomous experiment loop for a specific experiment.
Reads config from .autoresearch/{domain}/{name}/config.cfg.
Executes a single experiment iteration. The AI agent is the loop —
it calls this script repeatedly. The script handles evaluation,
metric parsing, keep/discard decisions, and git rollback on failure.
Usage:
python scripts/run_experiment.py --experiment engineering/api-speed --loop
python scripts/run_experiment.py --experiment engineering/api-speed --single
python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
python scripts/run_experiment.py --resume --loop
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
python scripts/run_experiment.py --experiment engineering/api-speed --single --description "added caching"
"""
import argparse
import os
import signal
import subprocess
import sys
import time
@@ -48,10 +45,11 @@ def load_config(experiment_dir):
return config
def run_cmd(cmd, cwd=None, timeout=None):
"""Run shell command, return (returncode, stdout, stderr)."""
def run_git(args, cwd=None, timeout=30):
"""Run a git command safely (no shell injection). Returns (returncode, stdout, stderr)."""
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True,
["git"] + args,
capture_output=True, text=True,
cwd=cwd, timeout=timeout
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
@@ -59,7 +57,7 @@ def run_cmd(cmd, cwd=None, timeout=None):
def get_current_commit(path):
"""Get short hash of current HEAD."""
_, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
_, commit, _ = run_git(["rev-parse", "--short", "HEAD"], cwd=path)
return commit
@@ -85,17 +83,23 @@ def get_best_metric(experiment_dir, direction):
def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
"""Run evaluation with time limit. Output goes to log_file."""
"""Run evaluation with time limit. Output goes to log_file.
Note: shell=True is intentional here — eval_cmd is user-provided and
may contain pipes, redirects, or chained commands.
"""
hard_limit = time_budget_minutes * 60 * 2.5
t0 = time.time()
try:
code, _, _ = run_cmd(
f"{eval_cmd} > {log_file} 2>&1",
cwd=str(project_root),
timeout=hard_limit
)
with open(log_file, "w") as lf:
result = subprocess.run(
eval_cmd, shell=True,
stdout=lf, stderr=subprocess.STDOUT,
cwd=str(project_root),
timeout=hard_limit
)
elapsed = time.time() - t0
return code, elapsed
return result.returncode, elapsed
except subprocess.TimeoutExpired:
elapsed = time.time() - t0
return -1, elapsed
@@ -141,24 +145,24 @@ def get_experiment_count(experiment_dir):
return max(0, len(tsv.read_text().splitlines()) - 1)
def get_last_active(root):
"""Find the most recently modified experiment."""
latest = None
latest_time = 0
for domain_dir in root.iterdir():
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in domain_dir.iterdir():
if not exp_dir.is_dir():
continue
cfg = exp_dir / "config.cfg"
if cfg.exists() and cfg.stat().st_mtime > latest_time:
latest_time = cfg.stat().st_mtime
latest = f"{domain_dir.name}/{exp_dir.name}"
return latest
def get_description_from_diff(project_root):
"""Auto-generate a description from git diff --stat HEAD~1."""
code, diff_stat, _ = run_git(["diff", "--stat", "HEAD~1"], cwd=str(project_root))
if code == 0 and diff_stat:
return diff_stat.split("\n")[0][:50]
return "experiment"
def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
def read_last_lines(filepath, n=5):
"""Read last n lines of a file (replaces tail shell command)."""
path = Path(filepath)
if not path.exists():
return ""
lines = path.read_text().splitlines()
return "\n".join(lines[-n:])
def run_single(project_root, experiment_dir, config, exp_num, dry_run=False, description=None):
"""Run one experiment iteration."""
direction = config.get("metric_direction", "lower")
metric_grep = config.get("metric_grep", "^metric:")
@@ -177,11 +181,9 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
print(" [DRY RUN] Would run evaluation and check metric")
return "dry_run"
# Save state for rollback
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
if code != 0:
print(" Error: can't get git state")
return "error"
# Auto-generate description if not provided
if not description:
description = get_description_from_diff(str(project_root))
# Run evaluation
print(f" Running: {eval_cmd} (budget: {time_budget}m)")
@@ -192,17 +194,17 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
# Timeout
if ret_code == -1:
print(f" TIMEOUT after {elapsed:.0f}s — discarding")
run_cmd("git checkout -- .", cwd=str(project_root))
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
run_git(["checkout", "--", "."], cwd=str(project_root))
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
return "crash"
# Crash
if ret_code != 0:
_, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
tail = read_last_lines(log_file, 5)
print(f" CRASH (exit {ret_code}) after {elapsed:.0f}s")
print(f" Last output: {tail[:200]}")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
return "crash"
@@ -210,7 +212,7 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
metric_val = extract_metric(log_file, metric_grep)
if metric_val is None:
print(f" Could not parse {metric_name} from run.log")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
return "crash"
@@ -224,63 +226,23 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
# Keep or discard
if is_improvement(metric_val, best, direction):
print(f" KEEP — improvement")
log_result(experiment_dir, commit, metric_val, "keep",
f"improved_{metric_name}_{metric_val:.4f}")
log_result(experiment_dir, commit, metric_val, "keep", description)
return "keep"
else:
print(f" DISCARD — no improvement")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
best_str = f"{best:.4f}" if best else "?"
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
best_str = f"{best:.4f}" if best is not None else "?"
log_result(experiment_dir, commit, metric_val, "discard",
f"no_improvement_{metric_val:.4f}_vs_{best_str}")
return "discard"
def print_summary(experiment_dir, config):
"""Print session summary."""
tsv = experiment_dir / "results.tsv"
if not tsv.exists():
return
lines = tsv.read_text().splitlines()[1:]
if not lines:
return
keeps = [l for l in lines if "\tkeep\t" in l]
discards = [l for l in lines if "\tdiscard\t" in l]
crashes = [l for l in lines if "\tcrash\t" in l]
metric_name = config.get("metric", "metric")
direction = config.get("metric_direction", "lower")
print(f"\n{'=' * 55}")
print(f" autoresearch — Session Summary")
print(f" Experiments: {len(lines)} total")
print(f" Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")
if keeps:
try:
valid = []
for l in keeps:
parts = l.split("\t")
if parts[1] != "N/A":
valid.append(float(parts[1]))
if len(valid) >= 2:
first, last = valid[0], valid[-1]
best = min(valid) if direction == "lower" else max(valid)
pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
print(f" {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
except (ValueError, IndexError):
pass
print(f"{'=' * 55}\n")
def main():
parser = argparse.ArgumentParser(description="autoresearch-agent runner")
parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
parser.add_argument("--loop", action="store_true", help="Run forever")
parser.add_argument("--single", action="store_true", help="Run one experiment")
parser.add_argument("--single", action="store_true", help="Run one experiment iteration")
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
parser.add_argument("--description", help="Description of the change (auto-generated from git diff if omitted)")
parser.add_argument("--path", default=".", help="Project root")
args = parser.parse_args()
@@ -291,20 +253,11 @@ def main():
print("No .autoresearch/ found. Run setup_experiment.py first.")
sys.exit(1)
# Resolve experiment
experiment_path = args.experiment
if args.resume:
experiment_path = get_last_active(root)
if not experiment_path:
print("No experiments found to resume.")
sys.exit(1)
print(f"Resuming: {experiment_path}")
if not experiment_path:
print("Specify --experiment domain/name or --resume")
if not args.experiment:
print("Specify --experiment domain/name")
sys.exit(1)
experiment_dir = root / experiment_path
experiment_dir = root / args.experiment
if not experiment_dir.exists():
print(f"Experiment not found: {experiment_dir}")
print("Run: python scripts/setup_experiment.py --list")
@@ -312,56 +265,15 @@ def main():
config = load_config(experiment_dir)
domain, name = experiment_path.split("/", 1)
print(f"\n autoresearch-agent")
print(f" Experiment: {experiment_path}")
print(f" Experiment: {args.experiment}")
print(f" Target: {config.get('target', '?')}")
print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment")
print(f" Mode: {'loop' if args.loop else 'single'}")
print(f" Mode: {'dry-run' if args.dry_run else 'single'}")
if args.single or args.dry_run:
exp_num = get_experiment_count(experiment_dir) + 1
run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
return
if not args.loop:
print("\nSpecify --loop (forever) or --single (one experiment)")
sys.exit(1)
# Graceful shutdown
def handle_interrupt(sig, frame):
print_summary(experiment_dir, config)
print("\nStopped by user.")
sys.exit(0)
signal.signal(signal.SIGINT, handle_interrupt)
signal.signal(signal.SIGTERM, handle_interrupt)
consecutive_crashes = 0
exp_num = get_experiment_count(experiment_dir) + 1
print(f"\nStarting loop. Ctrl+C to stop.\n")
while True:
result = run_single(project_root, experiment_dir, config, exp_num, False)
exp_num += 1
if result == "crash":
consecutive_crashes += 1
else:
consecutive_crashes = 0
if consecutive_crashes >= 5:
print("\n 5 consecutive crashes. Pausing.")
print(" Check .autoresearch/{}/run.log".format(experiment_path))
break
if 0 < args.max_experiments < exp_num:
print(f"\n Reached max experiments ({args.max_experiments})")
break
print_summary(experiment_dir, config)
run_single(project_root, experiment_dir, config, exp_num, args.dry_run, args.description)
if __name__ == "__main__":


@@ -19,11 +19,9 @@ Usage:
"""
import argparse
import os
import shutil
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
@@ -159,13 +157,19 @@ def copy_evaluator(experiment_dir, evaluator_name):
def create_branch(path, domain, name):
"""Create and checkout the experiment branch."""
branch = f"autoresearch/{domain}/{name}"
code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
if code != 0:
if "already exists" in err:
result = subprocess.run(
["git", "checkout", "-b", branch],
cwd=path, capture_output=True, text=True
)
if result.returncode != 0:
if "already exists" in result.stderr:
print(f" Branch '{branch}' already exists. Checking out...")
run_cmd(f"git checkout {branch}", cwd=path)
subprocess.run(
["git", "checkout", branch],
cwd=path, capture_output=True, text=True
)
return branch
print(f" Warning: could not create branch: {err}")
print(f" Warning: could not create branch: {result.stderr}")
return None
print(f" Created branch: {branch}")
return branch
@@ -229,10 +233,17 @@ def list_evaluators():
# Read first docstring line
desc = ""
for line in f.read_text().splitlines():
if line.strip().startswith('"""') or line.strip().startswith("'''"):
stripped = line.strip()
if stripped.startswith('"""') or stripped.startswith("'''"):
quote = stripped[:3]
# Single-line docstring: """Description."""
after_quote = stripped[3:]
if after_quote and after_quote.rstrip(quote[0]).strip():
desc = after_quote.rstrip('"').rstrip("'").strip()
break
continue
if line.strip() and not line.startswith("#!"):
desc = line.strip().strip('"').strip("'")
if stripped and not line.startswith("#!"):
desc = stripped.strip('"').strip("'")
break
print(f" {f.stem:<25} {desc}")
@@ -252,7 +263,6 @@ def main():
help="Where to store experiments: project (./) or user (~/)")
parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
parser.add_argument("--path", default=".", help="Project root path")
parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
parser.add_argument("--list", action="store_true", help="List all experiments")
parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
@@ -288,7 +298,11 @@ def main():
print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
# Check git
code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
result = subprocess.run(
["git", "rev-parse", "--is-inside-work-tree"],
cwd=str(project_root), capture_output=True, text=True
)
code = result.returncode
if code != 0:
print(" Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
sys.exit(1)
@@ -362,7 +376,7 @@ def main():
if not args.skip_branch:
print(f" Branch: autoresearch/{args.domain}/{args.name}")
print(f"\n To start:")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --single")
if __name__ == "__main__":


@@ -0,0 +1,22 @@
{
"name": "autoresearch-agent",
"displayName": "Autoresearch Agent",
"version": "2.1.2",
"description": "Autonomous experiment loop — optimize any file by a measurable metric.",
"author": "Alireza Rezvani",
"license": "MIT",
"platforms": ["claude-code", "openclaw", "codex"],
"category": "engineering",
"tags": ["optimization", "experiments", "benchmarks", "autoresearch", "loop", "metrics"],
"repository": "https://github.com/alirezarezvani/claude-skills",
"commands": {
"setup": "/ar:setup",
"run": "/ar:run",
"loop": "/ar:loop",
"status": "/ar:status",
"resume": "/ar:resume"
},
"agents": [
"experiment-runner"
]
}


@@ -0,0 +1,122 @@
---
name: "loop"
description: "Start an autonomous experiment loop with user-selected interval (10min, 1h, daily, weekly, monthly). Uses CronCreate for scheduling."
command: /ar:loop
---
# /ar:loop — Autonomous Experiment Loop
Start a recurring experiment loop that runs at a user-selected interval.
## Usage
```
/ar:loop engineering/api-speed # Start loop (prompts for interval)
/ar:loop engineering/api-speed 10m # Every 10 minutes
/ar:loop engineering/api-speed 1h # Every hour
/ar:loop engineering/api-speed daily # Daily at ~9am
/ar:loop engineering/api-speed weekly # Weekly on Monday ~9am
/ar:loop engineering/api-speed monthly # Monthly on 1st ~9am
/ar:loop stop engineering/api-speed # Stop an active loop
```
## What It Does
### Step 1: Resolve experiment
If no experiment specified, list experiments and let user pick.
### Step 2: Select interval
If interval not provided as argument, present options:
```
Select loop interval:
1. Every 10 minutes (rapid — stay and watch)
2. Every hour (background — check back later)
3. Daily at ~9am (overnight experiments)
4. Weekly on Monday (long-running experiments)
5. Monthly on 1st (slow experiments)
```
Map to cron expressions:
| Interval | Cron Expression | Shorthand |
|----------|----------------|-----------|
| 10 minutes | `*/10 * * * *` | `10m` |
| 1 hour | `7 * * * *` | `1h` |
| Daily | `57 8 * * *` | `daily` |
| Weekly | `57 8 * * 1` | `weekly` |
| Monthly | `57 8 1 * *` | `monthly` |
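The interval-to-cron mapping above can be sketched as a small lookup (the off-the-hour minutes like `7` and `57 8` mirror the table; presumably chosen to avoid top-of-the-hour load spikes):

```python
# Shorthand -> cron expression, mirroring the table above.
INTERVAL_CRON = {
    "10m":     "*/10 * * * *",
    "1h":      "7 * * * *",      # minute 7 of every hour
    "daily":   "57 8 * * *",     # ~9am every day
    "weekly":  "57 8 * * 1",     # ~9am on Mondays
    "monthly": "57 8 1 * *",     # ~9am on the 1st
}

def cron_for(interval: str) -> str:
    """Resolve a user-supplied shorthand to its cron expression."""
    try:
        return INTERVAL_CRON[interval]
    except KeyError:
        raise ValueError(
            f"Unknown interval {interval!r}; expected one of {sorted(INTERVAL_CRON)}"
        )
```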
### Step 3: Create the recurring job
Use `CronCreate` with this prompt (fill in the experiment details):
```
You are running autoresearch experiment "{domain}/{name}".
1. Read .autoresearch/{domain}/{name}/config.cfg for: target, evaluate_cmd, metric, metric_direction
2. Read .autoresearch/{domain}/{name}/program.md for strategy and constraints
3. Read .autoresearch/{domain}/{name}/results.tsv for experiment history
4. Run: git checkout autoresearch/{domain}/{name}
Then do exactly ONE iteration:
- Review results.tsv: what worked, what failed, what hasn't been tried
- Edit the target file with ONE change (strategy escalation based on run count)
- Commit: git add {target} && git commit -m "experiment: {description}"
- Evaluate: python {skill_path}/scripts/run_experiment.py --experiment {domain}/{name} --single
- Read the output (KEEP/DISCARD/CRASH)
Rules:
- ONE change per experiment
- NEVER modify the evaluator
- If 5 consecutive crashes in results.tsv, delete this cron job (CronDelete) and alert
- After every 10 experiments, update Strategy section of program.md
Current best metric: {read from results.tsv or "no baseline yet"}
Total experiments so far: {count from results.tsv}
```
### Step 4: Store loop metadata
Write to `.autoresearch/{domain}/{name}/loop.json`:
```json
{
"cron_id": "{id from CronCreate}",
"interval": "{user selection}",
"started": "{ISO timestamp}",
"experiment": "{domain}/{name}"
}
```
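Writing that metadata file can be sketched as follows (the `cron_id` value would come back from `CronCreate`; the helper name is illustrative, not part of the scripts):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_loop_metadata(experiment_dir: Path, cron_id: str,
                        interval: str, experiment: str) -> Path:
    """Record the active loop so /ar:status and /ar:loop stop can find it."""
    meta = {
        "cron_id": cron_id,
        "interval": interval,
        "started": datetime.now(timezone.utc).isoformat(),
        "experiment": experiment,
    }
    path = experiment_dir / "loop.json"
    path.write_text(json.dumps(meta, indent=2) + "\n")
    return path
```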
### Step 5: Confirm to user
```
Loop started for {domain}/{name}
Interval: {interval description}
Cron ID: {id}
Auto-expires: 3 days (CronCreate limit)
To check progress: /ar:status
To stop the loop: /ar:loop stop {domain}/{name}
Note: Recurring jobs auto-expire after 3 days.
Run /ar:loop again to restart after expiry.
```
## Stopping a Loop
When user runs `/ar:loop stop {experiment}`:
1. Read `.autoresearch/{domain}/{name}/loop.json` to get the cron ID
2. Call `CronDelete` with that ID
3. Delete `loop.json`
4. Confirm: "Loop stopped for {experiment}. {n} experiments completed."
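The stop sequence above can be sketched like this, with a hypothetical `cron_delete` callable standing in for the `CronDelete` tool call:

```python
import json
from pathlib import Path

def stop_loop(experiment_dir: Path, cron_delete) -> str:
    """Tear down an active loop: read loop.json, delete the cron job, remove the file."""
    meta_path = experiment_dir / "loop.json"
    if not meta_path.exists():
        raise FileNotFoundError(f"No active loop: {meta_path} missing")
    meta = json.loads(meta_path.read_text())
    cron_delete(meta["cron_id"])   # stands in for the CronDelete tool
    meta_path.unlink()             # step 3: remove loop.json
    return meta["cron_id"]
```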
## Important Limitations
- **3-day auto-expiry**: CronCreate jobs expire after 3 days. For longer experiments, the user must re-run `/ar:loop` to restart. Results persist — the new loop picks up where the old one left off.
- **One loop per experiment**: Don't start multiple loops for the same experiment.
- **Concurrent experiments**: Multiple experiments can loop simultaneously ONLY if they're on different git branches (which they are by default — each experiment gets `autoresearch/{domain}/{name}`).


@@ -0,0 +1,77 @@
---
name: "resume"
description: "Resume a paused experiment. Checkout the experiment branch, read results history, continue iterating."
command: /ar:resume
---
# /ar:resume — Resume Experiment
Resume a paused or context-limited experiment. Reads all history and continues where you left off.
## Usage
```
/ar:resume # List experiments, let user pick
/ar:resume engineering/api-speed # Resume specific experiment
```
## What It Does
### Step 1: List experiments if needed
If no experiment specified:
```bash
python {skill_path}/scripts/setup_experiment.py --list
```
Show status for each (active/paused/done based on results.tsv age). Let user pick.
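One way to derive that status from results.tsv age (the 24-hour and 7-day thresholds here are illustrative assumptions, not specified anywhere):

```python
import time
from pathlib import Path

def experiment_status(experiment_dir: Path, now=None) -> str:
    """Classify an experiment by how recently results.tsv was written.

    Thresholds are assumptions: <24h active, <7d paused, older is done.
    """
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return "new"
    age = (now or time.time()) - tsv.stat().st_mtime
    if age < 24 * 3600:
        return "active"
    if age < 7 * 24 * 3600:
        return "paused"
    return "done"
```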
### Step 2: Load full context
```bash
# Checkout the experiment branch
git checkout autoresearch/{domain}/{name}
# Read config
cat .autoresearch/{domain}/{name}/config.cfg
# Read strategy
cat .autoresearch/{domain}/{name}/program.md
# Read full results history
cat .autoresearch/{domain}/{name}/results.tsv
# Read recent git log for the branch
git log --oneline -20
```
### Step 3: Report current state
Summarize for the user:
```
Resuming: engineering/api-speed
Target: src/api/search.py
Metric: p50_ms (lower is better)
Experiments: 23 total — 8 kept, 12 discarded, 3 crashed
Best: 185ms (-42% from baseline of 320ms)
Last experiment: "added response caching" → KEEP (185ms)
Recent patterns:
- Caching changes: 3 kept, 1 discarded (consistently helpful)
- Algorithm changes: 2 discarded, 1 crashed (high risk, low reward so far)
- I/O optimization: 2 kept (promising direction)
```
### Step 4: Ask next action
```
How would you like to continue?
1. Single iteration (/ar:run) — I'll make one change and evaluate
2. Start a loop (/ar:loop) — Autonomous with scheduled interval
3. Just show me the results — I'll review and decide
```
If the user picks loop, hand off to `/ar:loop` with the experiment pre-selected.
If single, hand off to `/ar:run`.


@@ -0,0 +1,84 @@
---
name: "run"
description: "Run a single experiment iteration. Edit the target file, evaluate, keep or discard."
command: /ar:run
---
# /ar:run — Single Experiment Iteration
Run exactly ONE experiment iteration: review history, decide a change, edit, commit, evaluate.
## Usage
```
/ar:run engineering/api-speed # Run one iteration
/ar:run # List experiments, let user pick
```
## What It Does
### Step 1: Resolve experiment
If no experiment specified, run `python {skill_path}/scripts/setup_experiment.py --list` and ask the user to pick.
### Step 2: Load context
```bash
# Read experiment config
cat .autoresearch/{domain}/{name}/config.cfg
# Read strategy and constraints
cat .autoresearch/{domain}/{name}/program.md
# Read experiment history
cat .autoresearch/{domain}/{name}/results.tsv
# Checkout the experiment branch
git checkout autoresearch/{domain}/{name}
```
### Step 3: Decide what to try
Review results.tsv:
- What changes were kept? What pattern do they share?
- What was discarded? Avoid repeating those approaches.
- What crashed? Understand why.
- How many runs so far? (Escalate strategy accordingly)
**Strategy escalation:**
- Runs 1-5: Low-hanging fruit (obvious improvements)
- Runs 6-15: Systematic exploration (vary one parameter)
- Runs 16-30: Structural changes (algorithm swaps)
- Runs 31+: Radical experiments (completely different approaches)
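The escalation schedule can be written down directly, treating runs 16-30 as structural and anything beyond as radical, per the list above:

```python
def strategy_tier(run_count: int) -> str:
    """Map the number of completed runs to the escalation tier above."""
    if run_count <= 5:
        return "low-hanging fruit"
    if run_count <= 15:
        return "systematic exploration"
    if run_count <= 30:
        return "structural changes"
    return "radical experiments"
```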
### Step 4: Make ONE change
Edit only the target file specified in config.cfg. Change one thing. Keep it simple.
### Step 5: Commit and evaluate
```bash
git add {target}
git commit -m "experiment: {short description of what changed}"
python {skill_path}/scripts/run_experiment.py \
--experiment {domain}/{name} --single
```
### Step 6: Report result
Read the script output. Tell the user:
- **KEEP**: "Improvement! {metric}: {value} ({delta} from previous best)"
- **DISCARD**: "No improvement. {metric}: {value} vs best {best}. Reverted."
- **CRASH**: "Evaluation failed: {reason}. Reverted."
### Step 7: Self-improvement check
After every 10th experiment (check results.tsv line count), update the Strategy section of program.md with patterns learned.
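The every-10th check can be sketched as follows (assumes results.tsv has a single header row, matching how the scripts elsewhere in this commit skip line one):

```python
from pathlib import Path

def due_for_strategy_update(results_tsv: Path) -> bool:
    """True on every 10th completed experiment, counted from results.tsv data rows."""
    if not results_tsv.exists():
        return False
    rows = [l for l in results_tsv.read_text().splitlines()[1:] if l.strip()]
    return len(rows) > 0 and len(rows) % 10 == 0
```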
## Rules
- ONE change per iteration. Don't change 5 things at once.
- NEVER modify the evaluator (evaluate.py). It's ground truth.
- Simplicity wins. Equal performance with simpler code is an improvement.
- No new dependencies.


@@ -0,0 +1,77 @@
---
name: "setup"
description: "Set up a new autoresearch experiment interactively. Collects domain, target file, eval command, metric, direction, and evaluator."
command: /ar:setup
---
# /ar:setup — Create New Experiment
Set up a new autoresearch experiment with all required configuration.
## Usage
```
/ar:setup # Interactive mode
/ar:setup engineering api-speed src/api.py "pytest bench.py" p50_ms lower
/ar:setup --list # Show existing experiments
/ar:setup --list-evaluators # Show available evaluators
```
## What It Does
### If arguments provided
Pass them directly to the setup script:
```bash
python {skill_path}/scripts/setup_experiment.py \
--domain {domain} --name {name} \
--target {target} --eval "{eval_cmd}" \
--metric {metric} --direction {direction} \
[--evaluator {evaluator}] [--scope {scope}]
```
### If no arguments (interactive mode)
Collect each parameter one at a time:
1. **Domain** — Ask: "What domain? (engineering, marketing, content, prompts, custom)"
2. **Name** — Ask: "Experiment name? (e.g., api-speed, blog-titles)"
3. **Target file** — Ask: "Which file to optimize?" Verify it exists.
4. **Eval command** — Ask: "How to measure it? (e.g., pytest bench.py, python evaluate.py)"
5. **Metric** — Ask: "What metric does the eval output? (e.g., p50_ms, ctr_score)"
6. **Direction** — Ask: "Is lower or higher better?"
7. **Evaluator** (optional) — Show built-in evaluators. Ask: "Use a built-in evaluator, or your own?"
8. **Scope** — Ask: "Store in project (.autoresearch/) or user (~/.autoresearch/)?"
Then run `setup_experiment.py` with the collected parameters.
### Listing
```bash
# Show existing experiments
python {skill_path}/scripts/setup_experiment.py --list
# Show available evaluators
python {skill_path}/scripts/setup_experiment.py --list-evaluators
```
## Built-in Evaluators
| Name | Metric | Use Case |
|------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
| `llm_judge_content` | `ctr_score` (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` (higher) | Social posts, ad copy, emails |
## After Setup
Report to the user:
- Experiment path and branch name
- Whether the eval command worked and the baseline metric
- Suggest: "Run `/ar:run {domain}/{name}` to start iterating, or `/ar:loop {domain}/{name}` for autonomous mode."


@@ -0,0 +1,71 @@
---
name: "status"
description: "Show experiment dashboard with results, active loops, and progress."
command: /ar:status
---
# /ar:status — Experiment Dashboard
Show experiment results, active loops, and progress across all experiments.
## Usage
```
/ar:status # Full dashboard
/ar:status engineering/api-speed # Single experiment detail
/ar:status --domain engineering # All experiments in a domain
/ar:status --format markdown # Export as markdown
/ar:status --format csv --output results.csv # Export as CSV
```
## What It Does
### Single experiment
```bash
python {skill_path}/scripts/log_results.py --experiment {domain}/{name}
```
Also check for active loop:
```bash
cat .autoresearch/{domain}/{name}/loop.json 2>/dev/null
```
If loop.json exists, show:
```
Active loop: every {interval} (cron ID: {id}, started: {date})
```
### Domain view
```bash
python {skill_path}/scripts/log_results.py --domain {domain}
```
### Full dashboard
```bash
python {skill_path}/scripts/log_results.py --dashboard
```
For each experiment, also check for loop.json and show loop status.
### Export
```bash
# CSV
python {skill_path}/scripts/log_results.py --dashboard --format csv --output {file}
# Markdown
python {skill_path}/scripts/log_results.py --dashboard --format markdown --output {file}
```
## Output Example
```
DOMAIN EXPERIMENT RUNS KEPT BEST CHANGE STATUS LOOP
engineering api-speed 47 14 185ms -76.9% active every 1h
engineering bundle-size 23 8 412KB -58.3% paused —
marketing medium-ctr 31 11 8.4/10 +68.0% active daily
prompts support-tone 15 6 82/100 +46.4% done —
```