# Autoresearch Agent — Claude Code Instructions
This plugin runs autonomous experiment loops that optimize any file by a measurable metric.

## Commands

Use the `/ar:` namespace for all commands:

- `/ar:setup` — Set up a new experiment interactively
- `/ar:run` — Run a single experiment iteration
- `/ar:loop` — Start an autonomous loop with a user-selected interval
- `/ar:status` — Show the dashboard and results
- `/ar:resume` — Resume a paused experiment

## How it works

You (the AI agent) are the experiment loop. The scripts handle evaluation and git rollback.
1. You edit the target file with ONE change
2. You commit it
3. You call `run_experiment.py --single` — it evaluates and prints KEEP/DISCARD/CRASH
4. You repeat

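The four steps can be sketched as a single iteration helper. This is a minimal sketch: the commit message and the exact verdict-line format are assumptions — only `run_experiment.py --single` and the KEEP/DISCARD/CRASH verdicts come from this document.

```python
import subprocess

VERDICTS = ("KEEP", "DISCARD", "CRASH")

def parse_verdict(output: str) -> str:
    """Scan script output from the end for the final verdict line.

    Hypothetical helper: the real script's output format may differ.
    """
    for line in reversed(output.strip().splitlines()):
        for v in VERDICTS:
            if line.strip().startswith(v):
                return v
    return "CRASH"  # no verdict printed: treat the run as crashed

def run_iteration() -> str:
    """Commit the one pending change, then evaluate it."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "experiment: one change"], check=True)
    result = subprocess.run(
        ["python", "run_experiment.py", "--single"],
        capture_output=True, text=True,
    )
    return parse_verdict(result.stdout)
```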
Results persist in `results.tsv` and git log. Sessions can be resumed.
## When to use each command

### Starting fresh
```
/ar:setup
```

Creates the experiment directory, config, program.md, results.tsv, and git branch.

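A sketch of the scaffolding step, assuming the files listed above. The `results.tsv` column header and the `config.json` name are assumptions, not the documented spec, and the real setup also creates the git branch interactively.

```python
from pathlib import Path

def scaffold_experiment(exp_dir: str) -> Path:
    """Create the experiment directory with its starter files.

    Hypothetical sketch: config.json stands in for whatever config
    format setup_experiment.py actually writes.
    """
    d = Path(exp_dir)
    d.mkdir(parents=True, exist_ok=True)
    (d / "program.md").write_text("# Experiment notes\n")
    (d / "config.json").write_text("{}\n")
    # One row per iteration; the columns are an assumed layout.
    (d / "results.tsv").write_text("iteration\tverdict\tmetric\n")
    return d
```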
### Running one iteration at a time
```
/ar:run engineering/api-speed
```

Read history, make one change, evaluate, and report the result.

### Autonomous background loop
```
/ar:loop engineering/api-speed
```

Prompts for an interval (10min, 1h, daily, weekly, or monthly), then creates a recurring job.

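The interval choice could map onto a cron schedule along these lines. This is a sketch: the actual job mechanism and the 09:00 anchor times are assumptions.

```python
# Map the offered interval names to cron expressions (times assumed).
CRON_FOR_INTERVAL = {
    "10min":   "*/10 * * * *",
    "1h":      "0 * * * *",
    "daily":   "0 9 * * *",    # every day at 09:00
    "weekly":  "0 9 * * 1",    # Mondays at 09:00
    "monthly": "0 9 1 * *",    # first of the month at 09:00
}

def crontab_line(interval: str, experiment: str) -> str:
    """Build a crontab entry that re-runs one iteration on schedule."""
    schedule = CRON_FOR_INTERVAL[interval]
    return f"{schedule} python run_experiment.py --single  # {experiment}"
```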
### Checking progress
```
/ar:status
```

Shows the dashboard across all experiments with metrics and trends.

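A dashboard row per experiment can be computed straight from `results.tsv`. Sketch only: the tab-separated `verdict` and `metric` columns are an assumption rather than the documented spec.

```python
import csv

def summarize(results_path: str) -> dict:
    """Count iterations, keeps, and the metric trend for one experiment."""
    with open(results_path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    kept = [float(r["metric"]) for r in rows if r["verdict"] == "KEEP"]
    return {
        "iterations": len(rows),
        "kept": len(kept),
        # Positive trend means the kept metric improved over the run.
        "trend": kept[-1] - kept[0] if len(kept) > 1 else 0.0,
    }
```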
### Resuming after context limit or break
```
/ar:resume engineering/api-speed
```

Reads the results history, checks out the branch, and continues where you left off.

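Resuming reduces to a short, inspectable plan of commands. Hypothetical sketch: only the branch checkout and `run_experiment.py --single` are grounded in this document; the branch name in the usage below is made up.

```python
import subprocess

def resume_plan(branch: str) -> list:
    """Commands a resume step would issue, returned for inspection first."""
    return [
        ["git", "checkout", branch],                  # return to the experiment branch
        ["python", "run_experiment.py", "--single"],  # continue with the next iteration
    ]

def resume(branch: str) -> None:
    for cmd in resume_plan(branch):
        subprocess.run(cmd, check=True)
```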
## Agents
- **experiment-runner**: Spawned for each loop iteration. Reads the config and results history, decides what to try, edits the target, commits, and evaluates.

## Key principle
**One change per experiment. Measure everything. Compound improvements.**

The agent never modifies the evaluator. The evaluator is ground truth.