
# program.md Templates

Copy the template for your domain and paste it into your project root as `program.md`.


## ML Training (Karpathy-style)

# autoresearch — ML Training

## Goal
Minimize val_bpb on the validation set. Lower is better.
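For intuition, val_bpb is bits per byte: the model's mean cross-entropy loss converted from nats to bits, normalized by how many bytes each token covers. A minimal sketch of that conversion (an assumption about how the metric is derived; `bytes_per_token` would come from your tokenizer statistics, not from this template):

```python
import math

def val_bpb(mean_loss_nats: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy loss (nats per token) to bits per byte."""
    bits_per_token = mean_loss_nats / math.log(2)  # nats -> bits
    return bits_per_token / bytes_per_token
```

This is why lower is better: halving the loss halves the bits needed to encode each byte of validation text.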

## What You Can Change (train.py only)
- Model architecture (depth, width, attention heads, FFN ratio)
- Optimizer (learning rate, warmup, scheduler, weight decay)
- Training loop (batch size, gradient accumulation, clipping)
- Regularization (dropout, weight tying, etc.)
- Any self-contained improvement that doesn't require new packages

## What You Cannot Change
- prepare.py (fixed — contains evaluation harness)
- Dependencies (pyproject.toml is locked)
- Time budget (always 5 minutes, wall clock)
- Evaluation metric (val_bpb is the ground truth)

## Strategy
1. First run: establish baseline. Do not change anything.
2. Explore learning rate range (try 2x and 0.5x current)
3. Try depth changes (±2 layers)
4. Try optimizer changes (Muon vs. AdamW variants)
5. If things improve, double down. If stuck, try something radical.

## Simplicity Rule
A small improvement with ugly code is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code while getting the same results is the best outcome.

## Stop When
val_bpb < 0.95 OR after 100 experiments, whichever comes first.

## Prompt Engineering

# autoresearch — Prompt Optimization

## Goal
Maximize eval_score on the test suite. Higher is better (0-100).

## What You Can Change (prompt.md only)
- System prompt instructions
- Examples and few-shot demonstrations
- Output format specifications
- Chain-of-thought instructions
- Persona and tone
- Task decomposition strategies

## What You Cannot Change
- evaluate.py (fixed evaluation harness)
- Test cases in tests/ (ground truth)
- Model being evaluated (specified in evaluate.py)
- Scoring criteria (defined in evaluate.py)

## Strategy
1. First run: baseline with current prompt (or empty)
2. Add clear role/persona definition
3. Add output format specification
4. Add chain-of-thought instruction
5. Add 2-3 diverse examples
6. Refine based on failure modes from run.log

## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 1-10 by your CLI tool (Claude, Codex, or Gemini)
- quality_score = average * 10 (maps to 10-100)
- Run log shows which test cases failed
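The scoring arithmetic above can be spelled out in a few lines (a sketch, not the actual evaluate.py; `case_scores` is a hypothetical list of the per-case 1-10 ratings):

```python
def quality_score(case_scores: list[float]) -> float:
    """Average the per-case 1-10 ratings and map the result to a 10-100 scale."""
    if not case_scores:
        raise ValueError("no test case scores")
    return sum(case_scores) / len(case_scores) * 10
```

For example, 20 cases averaging 8.5 yields an eval_score of 85.0, which meets the stop condition below.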

## Stop When
eval_score >= 85 OR after 50 experiments.

## Code Performance

# autoresearch — Performance Optimization

## Goal
Minimize p50_ms (median latency). Lower is better.

## What You Can Change (src/module.py only)
- Algorithm implementation
- Data structures (use faster alternatives)
- Caching and memoization
- Vectorization (NumPy, etc.)
- Loop optimization
- I/O patterns
- Memory allocation patterns
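As one illustration of the caching bullet above, a pure function with repeated inputs can often be memoized with the standard library alone (`fib` is a stand-in example; the real candidate is whatever hot function profiling reveals in src/module.py):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Naive exponential recursion becomes linear-time once results are cached."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Because `lru_cache` keys on arguments, this only helps when the same inputs recur and the function has no side effects.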

## What You Cannot Change
- benchmark.py (fixed benchmark harness)
- Public API (function signatures must stay the same)
- External dependencies (add nothing new)
- Correctness tests (tests/ must still pass)

## Constraints
- Correctness is non-negotiable. benchmark.py runs tests first.
- If tests fail → immediate crash status, no metric recorded.
- Memory usage: p99 up to 2x baseline is acceptable; hard limit at 4x baseline.
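The latency half of the harness might look roughly like this (a sketch under assumptions, not the actual benchmark.py; `workload` is a placeholder for the function under test, and the correctness tests would run before any timing):

```python
import statistics
import time

def measure_p50(workload, runs: int = 100) -> float:
    """Median wall-clock latency of `workload` in milliseconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```

The median (p50) is used rather than the mean so that occasional GC pauses or OS scheduling hiccups don't dominate the metric.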

## Strategy
1. Baseline: profile first, don't guess
2. Check if there's any O(n²) → O(n log n) opportunity
3. Try caching repeated computations
4. Try NumPy vectorization for loops
5. Try algorithm-level changes last (higher risk)

## Stop When
p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments.

## Agent Skill Optimization

# autoresearch — Skill Optimization

## Goal
Maximize pass_rate on the task evaluation suite. Higher is better (0-1).

## What You Can Change (SKILL.md only)
- Skill description and trigger phrases
- Core workflow steps and ordering
- Decision frameworks and rules
- Output format specifications
- Example inputs/outputs
- Related skills disambiguation
- Proactive trigger conditions

## What You Cannot Change
- Your custom evaluate.py (see Custom Evaluators in SKILL.md)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata

## Evaluation
- evaluate.py runs SKILL.md against 15 standardized tasks
- Your CLI tool scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15
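The pass_rate computation is simple enough to sketch (`task_scores` is a hypothetical list holding the 0 / 0.5 / 1 values produced by your CLI tool):

```python
def pass_rate(task_scores: list[float]) -> float:
    """Fraction of tasks passed, counting partial passes as half."""
    allowed = {0, 0.5, 1}
    if any(s not in allowed for s in task_scores):
        raise ValueError("scores must be 0, 0.5, or 1")
    return sum(task_scores) / len(task_scores)
```

With the 15-task suite above, 10 passes, 4 partials, and 1 fail gives 12/15 = 0.80, just under the 0.90 stop threshold.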

## Strategy
1. Baseline: run as-is
2. Improve trigger description (better routing = more passes)
3. Sharpen the core workflow (clearer = better execution)
4. Add missing edge cases to the rules section
5. Improve disambiguation (reduce false-positive routing)

## Simplicity Rule
A shorter SKILL.md that achieves the same score is better.
Aim for 200-400 lines total.

## Stop When
pass_rate >= 0.90 OR after 30 experiments.