Inspired by Karpathy's autoresearch. The agent modifies a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely with no human in the loop.

Includes:

- SKILL.md with setup wizard, 4 domain configs, and the experiment loop protocol
- 3 stdlib-only Python scripts (setup, run, log; 687 lines)
- Reference docs: experiment domains guide, program.md templates

Domains: ML training (val_bpb), prompt engineering (eval_score), code performance (p50_ms), agent skill optimization (pass_rate).

Cherry-picked from feat/autoresearch-agent and rebased onto dev.

Fixes:

- Timeout inconsistency (2x to 2.5x)
- results.tsv tracking clarity
- Zero-metric edge case
- Installation section aligned with multi-tool support
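The commit/reset loop described above can be sketched in stdlib-only Python. This is a minimal illustration, not the shipped run script: `propose_change` and `run_eval` are hypothetical callables standing in for the agent's edit step and the domain's fixed harness (train.py, evaluate.py, benchmark.py, or skill_evaluator.py).

```python
import subprocess

def is_improvement(metric, best, lower_is_better=True):
    """A candidate beats the incumbent; crashes (None) never win."""
    if metric is None or best is None:
        # A crashed run never improves; any metric beats a crashed baseline.
        return metric is not None and best is None
    return metric < best if lower_is_better else metric > best

def experiment_loop(propose_change, run_eval, lower_is_better=True,
                    max_experiments=100):
    """Keep improvements (git commit), discard failures (git reset)."""
    best = run_eval()  # first run: establish the baseline, change nothing
    for i in range(1, max_experiments + 1):
        propose_change()     # the agent edits the target file here
        metric = run_eval()  # fixed evaluation harness
        if is_improvement(metric, best, lower_is_better):
            best = metric
            subprocess.run(["git", "commit", "-am",
                            f"exp {i}: metric={metric}"], check=True)
        else:
            subprocess.run(["git", "reset", "--hard"], check=True)
    return best
```

For a maximized metric (eval_score, pass_rate), pass `lower_is_better=False`.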
# program.md Templates
Copy the template for your domain and paste into your project root as program.md.
## ML Training (Karpathy-style)
# autoresearch — ML Training
## Goal
Minimize val_bpb on the validation set. Lower is better.
## What You Can Change (train.py only)
- Model architecture (depth, width, attention heads, FFN ratio)
- Optimizer (learning rate, warmup, scheduler, weight decay)
- Training loop (batch size, gradient accumulation, clipping)
- Regularization (dropout, weight tying, etc.)
- Any self-contained improvement that doesn't require new packages
## What You Cannot Change
- prepare.py (fixed — contains evaluation harness)
- Dependencies (pyproject.toml is locked)
- Time budget (always 5 minutes, wall clock)
- Evaluation metric (val_bpb is the ground truth)
## Strategy
1. First run: establish baseline. Do not change anything.
2. Explore learning rate range (try 2x and 0.5x current)
3. Try depth changes (±2 layers)
4. Try optimizer changes (Muon vs. AdamW variants)
5. If things improve, double down. If stuck, try something radical.
## Simplicity Rule
A small improvement with ugly code is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.
## Stop When
val_bpb < 0.95 OR after 100 experiments, whichever comes first.
## Prompt Engineering
# autoresearch — Prompt Optimization
## Goal
Maximize eval_score on the test suite. Higher is better (0-100).
## What You Can Change (prompt.md only)
- System prompt instructions
- Examples and few-shot demonstrations
- Output format specifications
- Chain-of-thought instructions
- Persona and tone
- Task decomposition strategies
## What You Cannot Change
- evaluate.py (fixed evaluation harness)
- Test cases in tests/ (ground truth)
- Model being evaluated (specified in evaluate.py)
- Scoring criteria (defined in evaluate.py)
## Strategy
1. First run: baseline with current prompt (or empty)
2. Add clear role/persona definition
3. Add output format specification
4. Add chain-of-thought instruction
5. Add 2-3 diverse examples
6. Refine based on failure modes from run.log
## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 0-5 by GPT-4o
- eval_score = average * 20 (maps to 0-100)
- Run log shows which test cases failed
## Stop When
eval_score >= 85 OR after 50 experiments.
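The 0-5 to 0-100 mapping described in the template's Evaluation section can be sketched in a few lines (a hedged illustration; the real computation lives in evaluate.py, and `case_scores` is a hypothetical input list):

```python
def eval_score(case_scores):
    """Map per-case 0-5 judge scores to a 0-100 eval_score."""
    average = sum(case_scores) / len(case_scores)  # mean of 0-5 scores
    return average * 20  # 0-5 average scaled to 0-100
```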
## Code Performance
# autoresearch — Performance Optimization
## Goal
Minimize p50_ms (median latency). Lower is better.
## What You Can Change (src/module.py only)
- Algorithm implementation
- Data structures (use faster alternatives)
- Caching and memoization
- Vectorization (NumPy, etc.)
- Loop optimization
- I/O patterns
- Memory allocation patterns
## What You Cannot Change
- benchmark.py (fixed benchmark harness)
- Public API (function signatures must stay the same)
- External dependencies (add nothing new)
- Correctness tests (tests/ must still pass)
## Constraints
- Correctness is non-negotiable. benchmark.py runs tests first.
- If tests fail → immediate crash status, no metric recorded.
- Memory usage: p99 < 2x baseline is acceptable; hard limit at 4x baseline.
## Strategy
1. Baseline: profile first, don't guess
2. Check if there's any O(n²) → O(n log n) opportunity
3. Try caching repeated computations
4. Try NumPy vectorization for loops
5. Try algorithm-level changes last (higher risk)
## Stop When
p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments.
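For reference, a stdlib-only sketch of how a harness might measure p50_ms. This is an assumption about the measurement, not benchmark.py itself (which is fixed and defines the real metric); `fn` and `runs` are illustrative parameters.

```python
import statistics
import time

def p50_ms(fn, *args, runs=101):
    """Median wall-clock latency of fn(*args) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)  # p50, robust to warm-up outliers
```

An odd `runs` count makes the median an actual observed sample rather than an interpolated value.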
## Agent Skill Optimization
# autoresearch — Skill Optimization
## Goal
Maximize pass_rate on the task evaluation suite. Higher is better (0-1).
## What You Can Change (SKILL.md only)
- Skill description and trigger phrases
- Core workflow steps and ordering
- Decision frameworks and rules
- Output format specifications
- Example inputs/outputs
- Related skills disambiguation
- Proactive trigger conditions
## What You Cannot Change
- scripts/skill_evaluator.py (fixed evaluation)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata
## Evaluation
- skill_evaluator.py runs SKILL.md against 15 standardized tasks
- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15
## Strategy
1. Baseline: run as-is
2. Improve trigger description (better routing = more passes)
3. Sharpen the core workflow (clearer = better execution)
4. Add missing edge cases to the rules section
5. Improve disambiguation (reduce false-positive routing)
## Simplicity Rule
A shorter SKILL.md that achieves the same score is better.
Aim for 200-400 lines total.
## Stop When
pass_rate >= 0.90 OR after 30 experiments.
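The pass_rate formula in the Evaluation section above can be sketched as follows. This is an illustration under the stated scoring scheme, not scripts/skill_evaluator.py itself; `task_scores` is a hypothetical list of the judge's per-task scores.

```python
def pass_rate(task_scores):
    """Fraction passed over the task suite; each score is 0, 0.5, or 1."""
    assert all(s in (0, 0.5, 1) for s in task_scores)  # judge's only values
    return sum(task_scores) / len(task_scores)
```

With the 15-task suite, `pass_rate` is exactly `sum(scores) / 15`.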