Major rewrite based on deep study of Karpathy's autoresearch repo.
Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation
New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed
Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output
Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view
SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
# Experiment Domains Guide
## Domain: Engineering

### Code Speed Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --evaluator benchmark_speed
```
- **What the agent optimizes:** algorithm choice, data structures, caching, query patterns, I/O.
- **Cost:** free; just runs benchmarks.
- **Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.
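The `--eval` command is expected to print the metric named by `--metric`. As a sketch of that contract (the output format here is an assumption; the evaluator template generated by `setup_experiment.py` is authoritative), a minimal `benchmark_speed`-style script could time the hot path and emit the median latency as JSON:

```python
# Minimal sketch of a benchmark_speed eval script. Assumption: the harness
# reads a JSON object like {"p50_ms": <float>} from stdout.
import json
import statistics
import time

def target():
    # Stand-in for the code under test (e.g. a call into src/api/search.py).
    sum(i * i for i in range(10_000))

def main(runs: int = 21) -> None:
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        target()
        samples.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    # Median (p50) is robust to the occasional slow run.
    print(json.dumps({"p50_ms": statistics.median(samples)}))

if __name__ == "__main__":
    main()
```

An odd number of runs keeps the median a real sample rather than an average of two.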
### Bundle Size Reduction

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name bundle-size \
  --target webpack.config.js \
  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
  --metric size_bytes \
  --direction lower \
  --evaluator benchmark_size
```
Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.
### Test Pass Rate

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name fix-flaky-tests \
  --target src/utils/parser.py \
  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
  --metric pass_rate \
  --direction higher \
  --evaluator test_pass_rate
```
### Docker Build Speed

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name docker-build \
  --target Dockerfile \
  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
  --metric build_seconds \
  --direction lower \
  --evaluator build_speed
```
### Memory Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name memory-usage \
  --target src/processor.py \
  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
  --metric peak_mb \
  --direction lower \
  --evaluator memory_usage
```
### ML Training (Karpathy-style)

Requires an NVIDIA GPU. See Karpathy's autoresearch repo.

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name ml-training \
  --target train.py \
  --eval "uv run train.py" \
  --metric val_bpb \
  --direction lower \
  --time-budget 5
```
## Domain: Marketing

### Medium Article Headlines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```
Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.
- **What the agent optimizes:** title phrasing, curiosity gaps, specificity, emotional triggers.
- **Cost:** uses your CLI subscription (Claude Max = unlimited).
- **Speed:** ~2 min/experiment, ~30/hour.
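A sketch of how an `llm_judge_*` evaluator can score content by shelling out to the user's CLI tool. The `-p` (print/non-interactive) flag and the 0-100 scoring prompt are assumptions; adapt both to whichever CLI (claude/codex/gemini) you actually use:

```python
# Sketch of an LLM-judge evaluator that calls the user's CLI tool.
import json
import re
import subprocess

CLI_TOOL = "claude"
TARGET_FILE = "content/titles.md"

def parse_score(raw: str) -> float:
    """Pull the first number out of the judge's reply."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no score in judge output: {raw!r}")
    return float(match.group())

def judge(text: str) -> float:
    prompt = (
        "Rate these article titles for click-through appeal on a 0-100 "
        "scale. Reply with only the number.\n\n" + text
    )
    raw = subprocess.run(
        [CLI_TOOL, "-p", prompt], capture_output=True, text=True, check=True
    ).stdout
    return parse_score(raw)

def main() -> None:
    with open(TARGET_FILE) as fh:
        print(json.dumps({"ctr_score": judge(fh.read())}))
```

Tolerant parsing matters here: even with "reply with only the number" the model may prepend words, so the score is extracted rather than `float()`-ed directly.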
### Social Media Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name twitter-engagement \
  --target social/tweets.md \
  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "twitter"` (or `linkedin`, `instagram`).
### Email Subject Lines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name email-open-rate \
  --target emails/subjects.md \
  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "email"`.
### Ad Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name ad-copy-q2 \
  --target ads/google-search.md \
  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "ad"`.
## Domain: Content

### Article Structure & Readability

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name article-structure \
  --target drafts/my-article.md \
  --eval "python .autoresearch/content/article-structure/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```
### SEO Descriptions

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name seo-meta \
  --target seo/descriptions.md \
  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```
## Domain: Prompts

### System Prompt Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name support-bot \
  --target prompts/support-system.md \
  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```
Requires `tests/cases.json` with test inputs and expected outputs:

```json
[
  {
    "input": "I can't log in to my account",
    "expected": "Ask for email, check account status, offer password reset"
  },
  {
    "input": "How do I cancel my subscription?",
    "expected": "Empathetic response, explain cancellation steps, offer retention"
  }
]
```
### Agent Skill Optimization

```bash
python scripts/setup_experiment.py \
  --domain prompts \
  --name skill-improvement \
  --target SKILL.md \
  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt
```
## Choosing Your Domain
| I want to... | Domain | Evaluator | Cost |
|---|---|---|---|
| Speed up my code | engineering | benchmark_speed | Free |
| Shrink my bundle | engineering | benchmark_size | Free |
| Fix flaky tests | engineering | test_pass_rate | Free |
| Speed up Docker builds | engineering | build_speed | Free |
| Reduce memory usage | engineering | memory_usage | Free |
| Train ML models | engineering | (custom) | Free + GPU |
| Write better headlines | marketing | llm_judge_content | Subscription |
| Improve social posts | marketing | llm_judge_copy | Subscription |
| Optimize email subjects | marketing | llm_judge_copy | Subscription |
| Improve ad copy | marketing | llm_judge_copy | Subscription |
| Optimize article structure | content | llm_judge_content | Subscription |
| Improve SEO descriptions | content | llm_judge_content | Subscription |
| Optimize system prompts | prompts | llm_judge_prompt | Subscription |
| Improve agent skills | prompts | llm_judge_prompt | Subscription |
First time? Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.