Inspired by Karpathy's autoresearch. The agent modifies a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely with no human in the loop.

Includes:

- SKILL.md with setup wizard, 4 domain configs, and the experiment loop protocol
- 3 stdlib-only Python scripts (setup, run, log; 687 lines)
- Reference docs: experiment domains guide, program.md templates

Domains: ML training (val_bpb), prompt engineering (eval_score), code performance (p50_ms), agent skill optimization (pass_rate).

Cherry-picked from feat/autoresearch-agent and rebased onto dev.

Fixes:

- Timeout inconsistency (2x to 2.5x)
- results.tsv tracking clarity
- Zero-metric edge case
- Installation section aligned with multi-tool support
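The commit/reset loop described above can be sketched in stdlib-only Python. This is a minimal illustration, not the shipped run script: `propose_change` and `run_eval` are hypothetical callables standing in for the agent's edit step and the domain's fixed harness (train.py, evaluate.py, benchmark.py, or skill_evaluator.py).

```python
import subprocess

def is_improvement(metric, best, lower_is_better=True):
    """A candidate beats the incumbent; crashes (None) never win."""
    if metric is None or best is None:
        # A crashed run never improves; any metric beats a crashed baseline.
        return metric is not None and best is None
    return metric < best if lower_is_better else metric > best

def experiment_loop(propose_change, run_eval, lower_is_better=True,
                    max_experiments=100):
    """Keep improvements (git commit), discard failures (git reset)."""
    best = run_eval()  # first run: establish the baseline, change nothing
    for i in range(1, max_experiments + 1):
        propose_change()     # the agent edits the target file here
        metric = run_eval()  # fixed evaluation harness
        if is_improvement(metric, best, lower_is_better):
            best = metric
            subprocess.run(["git", "commit", "-am",
                            f"exp {i}: metric={metric}"], check=True)
        else:
            subprocess.run(["git", "reset", "--hard"], check=True)
    return best
```

For a maximized metric (eval_score, pass_rate), pass `lower_is_better=False`.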
# program.md Templates
Copy the template for your domain and paste into your project root as program.md.
## ML Training (Karpathy-style)
# autoresearch — ML Training
## Goal
Minimize val_bpb on the validation set. Lower is better.
## What You Can Change (train.py only)
- Model architecture (depth, width, attention heads, FFN ratio)
- Optimizer (learning rate, warmup, scheduler, weight decay)
- Training loop (batch size, gradient accumulation, clipping)
- Regularization (dropout, weight tying, etc.)
- Any self-contained improvement that doesn't require new packages
## What You Cannot Change
- prepare.py (fixed — contains evaluation harness)
- Dependencies (pyproject.toml is locked)
- Time budget (always 5 minutes, wall clock)
- Evaluation metric (val_bpb is the ground truth)
## Strategy
1. First run: establish baseline. Do not change anything.
2. Explore learning rate range (try 2x and 0.5x current)
3. Try depth changes (±2 layers)
4. Try optimizer changes (Muon vs. AdamW variants)
5. If things improve, double down. If stuck, try something radical.
## Simplicity Rule
A small improvement with ugly code is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.
## Stop When
val_bpb < 0.95 OR after 100 experiments, whichever comes first.
## Prompt Engineering
# autoresearch — Prompt Optimization
## Goal
Maximize eval_score on the test suite. Higher is better (0-100).
## What You Can Change (prompt.md only)
- System prompt instructions
- Examples and few-shot demonstrations
- Output format specifications
- Chain-of-thought instructions
- Persona and tone
- Task decomposition strategies
## What You Cannot Change
- evaluate.py (fixed evaluation harness)
- Test cases in tests/ (ground truth)
- Model being evaluated (specified in evaluate.py)
- Scoring criteria (defined in evaluate.py)
## Strategy
1. First run: baseline with current prompt (or empty)
2. Add clear role/persona definition
3. Add output format specification
4. Add chain-of-thought instruction
5. Add 2-3 diverse examples
6. Refine based on failure modes from run.log
## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 0-5 by GPT-4o
- eval_score = average * 20 (maps to 0-100)
- Run log shows which test cases failed
## Stop When
eval_score >= 85 OR after 50 experiments.
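The 0-5 to 0-100 mapping described in the template's Evaluation section can be sketched in a few lines (a hedged illustration; the real computation lives in evaluate.py, and `case_scores` is a hypothetical input list):

```python
def eval_score(case_scores):
    """Map per-case 0-5 judge scores to a 0-100 eval_score."""
    average = sum(case_scores) / len(case_scores)  # mean of 0-5 scores
    return average * 20  # 0-5 average scaled to 0-100
```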
## Code Performance
# autoresearch — Performance Optimization
## Goal
Minimize p50_ms (median latency). Lower is better.
## What You Can Change (src/module.py only)
- Algorithm implementation
- Data structures (use faster alternatives)
- Caching and memoization
- Vectorization (NumPy, etc.)
- Loop optimization
- I/O patterns
- Memory allocation patterns
## What You Cannot Change
- benchmark.py (fixed benchmark harness)
- Public API (function signatures must stay the same)
- External dependencies (add nothing new)
- Correctness tests (tests/ must still pass)
## Constraints
- Correctness is non-negotiable. benchmark.py runs tests first.
- If tests fail → immediate crash status, no metric recorded.
- Memory usage: p99 < 2x baseline is acceptable; hard limit at 4x baseline.
## Strategy
1. Baseline: profile first, don't guess
2. Check if there's any O(n²) → O(n log n) opportunity
3. Try caching repeated computations
4. Try NumPy vectorization for loops
5. Try algorithm-level changes last (higher risk)
## Stop When
p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments.
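For reference, a stdlib-only sketch of how a harness might measure p50_ms. This is an assumption about the measurement, not benchmark.py itself (which is fixed and defines the real metric); `fn` and `runs` are illustrative parameters.

```python
import statistics
import time

def p50_ms(fn, *args, runs=101):
    """Median wall-clock latency of fn(*args) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)  # p50, robust to warm-up outliers
```

An odd `runs` count makes the median an actual observed sample rather than an interpolated value.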
## Agent Skill Optimization
# autoresearch — Skill Optimization
## Goal
Maximize pass_rate on the task evaluation suite. Higher is better (0-1).
## What You Can Change (SKILL.md only)
- Skill description and trigger phrases
- Core workflow steps and ordering
- Decision frameworks and rules
- Output format specifications
- Example inputs/outputs
- Related skills disambiguation
- Proactive trigger conditions
## What You Cannot Change
- scripts/skill_evaluator.py (fixed evaluation)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata
## Evaluation
- skill_evaluator.py runs SKILL.md against 15 standardized tasks
- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15
## Strategy
1. Baseline: run as-is
2. Improve trigger description (better routing = more passes)
3. Sharpen the core workflow (clearer = better execution)
4. Add missing edge cases to the rules section
5. Improve disambiguation (reduce false-positive routing)
## Simplicity Rule
A shorter SKILL.md that achieves the same score is better.
Aim for 200-400 lines total.
## Stop When
pass_rate >= 0.90 OR after 30 experiments.
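The pass_rate formula in the Evaluation section above can be sketched as follows. This is an illustration under the stated scoring scheme, not scripts/skill_evaluator.py itself; `task_scores` is a hypothetical list of the judge's per-task scores.

```python
def pass_rate(task_scores):
    """Fraction passed over the task suite; each score is 0, 0.5, or 1."""
    assert all(s in (0, 0.5, 1) for s in task_scores)  # judge's only values
    return sum(task_scores) / len(task_scores)
```

With the 15-task suite, `pass_rate` is exactly `sum(scores) / 15`.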